Test Branching

One nice thing about H.264 is that its transforms are specified to be bit exact. That makes it straightforward to test. Given a compressed sample and its reference decoded stream, your decoder is expected to produce output that matches the reference, bit for bit. Microsoft VC-1 works the same way, and I suspect that RealVideo 3, RealVideo 4, and Sorenson Video 3 (all proprietary forerunners of the H.264 lineage) exhibit the same characteristic.

So H.264 lends itself well to automated testing, such as the type that FATE subjects it to. Unfortunately, certain other codecs don’t conform well to this model. You may have heard of a few: video codecs such as MPEG-1, MPEG-2, MPEG-4, H.261, H.263, and anything else that uses the same 8×8 discrete cosine transform. Michael thoroughly and visually explains the problem in a blog post.

Here are 3 strategies for automated testing in the face of such adversity:

  1. Maintain one test spec in the database for a particular sample. Test it by comparing the decoded output to a reference video stream with a custom tool that yields a yea or nay answer indicating whether the output falls within an acceptable threshold (perhaps using FFmpeg’s existing tiny_psnr utility, if it can return such concise information); see the sketch after this list. Pros: No modification needed for the fundamental FATE infrastructure. Cons: Requires a lot of raw test media in the formal FATE suite, which is not good considering my future plans to expand the infrastructure to make it easy for other people to rsync the suite and test it locally. Also, I don’t know how to properly compare the data. If I understand the problem correctly, a keyframe with problems will only have various individual samples off by 1, but the error can compound over a sequence of many frames.
  2. Maintain one test spec in the database for a particular sample, but with multiple possible expected stdout text blobs. This would somehow tie expected stdout blobs to certain configurations. Pros: Less storage required for the FATE suite, and the test can maintain bit exactness. Cons: Requires infrastructure modification.
  3. Maintain multiple test specs in the database, each of which corresponds to a different possible output; extend the infrastructure to include test groups. Particular configurations are assigned to run all the tests in a particular group. This is an idea that I think I will need to implement eventually anyway. For example, I can imagine some alternate test methodology for validating FFmpeg (or, more likely, libavcodec and libavformat) on certain embedded environments that won’t be able to run ‘ffmpeg’ directly from any sort of command line. Pros: Same as strategy #2, plus implementing an idea whose time will probably come anyway. Cons: Same as strategy #2.
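
To make strategy #1 concrete, here is a minimal sketch (in Python) of the kind of yea-or-nay wrapper I have in mind. It assumes that tiny_psnr, when handed two raw files, prints a line containing a “PSNR:” token; the exact invocation, output parsing, and the 40 dB threshold are illustrative assumptions rather than anything FATE actually does today.

    # Sketch of a pass/fail wrapper around FFmpeg's tiny_psnr for strategy #1.
    # Assumption: "tiny_psnr <file1> <file2>" prints a "PSNR:<value>" token on
    # stdout; the exact output format may differ between versions.

    import re
    import subprocess
    import sys

    PSNR_THRESHOLD = 40.0  # arbitrary threshold, purely for illustration

    def psnr_check(decoded_file, reference_file):
        """Return True if the decoded output is close enough to the reference."""
        output = subprocess.check_output(
            ["tiny_psnr", decoded_file, reference_file]).decode()
        match = re.search(r"PSNR:\s*([0-9.]+)", output)
        if not match:
            return False  # could not parse the tool's output; treat as failure
        return float(match.group(1)) >= PSNR_THRESHOLD

    if __name__ == "__main__":
        ok = psnr_check(sys.argv[1], sys.argv[2])
        print("PASS" if ok else "FAIL")
        sys.exit(0 if ok else 1)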

I am sort of leaning towards strategy #2 right now. As a possible implementation, the test_spec table might be extended to have 2 expected_stdout fields that are sent to a FATE build/test machine. The machine tests the actual stdout against both sets of expected stdout text and sends the server a code indicating how the test turned out: code 1 or 2 means expected stdout 1 or 2 matched, respectively; code 0 means that neither matched, and the actual stdout will be sent for storage (code 0 is also used when both expected stdout fields were NULL for “don’t care”). So this infrastructure revision can occur at the same time as the one discussed in the previous post.
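
To illustrate, here is a rough sketch of the client-side comparison under that scheme; the function and field names are hypothetical, but the 0/1/2 result codes follow the description above.

    # Sketch of how a FATE build/test machine might judge a result under
    # strategy #2. The two expected_stdout fields and the 0/1/2 codes follow
    # the scheme described above; names are hypothetical.

    def judge_test(actual_stdout, expected_stdout_1, expected_stdout_2):
        """Return (code, stdout_to_send_to_server_or_None)."""
        if expected_stdout_1 is not None and actual_stdout == expected_stdout_1:
            return 1, None   # matched the first expected blob
        if expected_stdout_2 is not None and actual_stdout == expected_stdout_2:
            return 2, None   # matched the second expected blob
        # Neither matched, or both fields were NULL ("don't care"):
        # report code 0 and ship the actual stdout back for storage.
        return 0, actual_stdout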

There’s no reason that this solution can’t co-exist with the grouping idea proposed in strategy #3 when that is implemented. I think grouping will be more useful for completely different platforms with completely different testing programs (‘ffmpeg’ vs. something else for embedded tests), rather than in cases where the command line is exactly the same but various platform results are subtly different.

The biggest benefit for strategies 2 and 3 should be in keeping the size of the FATE suite small. Deploying one of these solutions should also help in automatically testing many perceptual audio codecs (think MP3, AAC, Vorbis, AC3, DTS, etc.) which will also not be bit exact among platforms.

Here’s one big question I have regarding strategy #2: How many expected stdout fields should the database provide for? Is it enough to have 2 different expected stdout fields? 3? More? Should I just go all out and make it a separate table? Yeah, that would probably be the best solution.
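
If I went the separate-table route, the schema might look something like this sketch (expressed with SQLite purely for brevity; the table and column names are hypothetical and the real schema would surely differ):

    # Sketch of the separate-table idea: any number of acceptable stdout blobs
    # per test spec, instead of a fixed count of expected_stdout columns.
    # Table and column names are hypothetical.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE test_spec (
            id      INTEGER PRIMARY KEY,
            sample  TEXT,
            command TEXT
        );
        CREATE TABLE expected_stdout (
            id           INTEGER PRIMARY KEY,
            test_spec_id INTEGER REFERENCES test_spec(id),
            stdout_blob  TEXT
        );
    """)

    def acceptable_stdouts(test_spec_id):
        """Fetch every stdout blob that counts as a pass for this test spec."""
        rows = conn.execute(
            "SELECT stdout_blob FROM expected_stdout WHERE test_spec_id = ?",
            (test_spec_id,))
        return [row[0] for row in rows]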

I admit, I still don’t completely understand the issues involving the bit inexactness of the decoding processes. Does it vary among processors? Does it depend upon little vs. big endian? Is it dependent upon the C library? Some empirical tests are in order. An impromptu decode of a random MP3 file using FFmpeg yields bit-identical PCM data for x86_32, x86_64, and PPC builds of FFmpeg (x86_32 built with both gcc and icc). I guess this makes sense: bit inexactness for perceptual audio would arise from floating point rounding discrepancies which, per my understanding, are affected by the C library. Since this is all Linux, no problem. If I got FFmpeg compiled on Windows, I suspect there would be discrepancies.

I tested a random MJPEG file (THP file, actually), and the frames are the same on x86_32 and x86_64, but different on PPC. So that’s one example of where I will need to employ one of the 3 strategies outlined above.
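
For anyone curious, those impromptu comparisons boil down to decoding to raw data and hashing the result on each platform. Here is a sketch, assuming an ‘ffmpeg’ binary in the path and using a made-up file name; the s16le raw output format covers the audio case and rawvideo would be the analogue for video:

    # Sketch of the impromptu cross-platform comparison: decode to raw data
    # with ffmpeg, hash it, and diff the hashes between machines.
    # The file name and output format are illustrative assumptions.

    import hashlib
    import subprocess

    def decoded_hash(input_file, output_format="s16le"):
        """MD5 of ffmpeg's raw decoded output (raw PCM here; rawvideo for video)."""
        raw = subprocess.check_output(
            ["ffmpeg", "-i", input_file, "-f", output_format, "-"])
        return hashlib.md5(raw).hexdigest()

    # Run the same thing on x86_32, x86_64, and PPC builds and compare:
    print(decoded_hash("sample.mp3"))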

9 thoughts on “Test Branching”

  1. Reimar

    Floating point differences due to rounding are IMO more likely due to different hardware (e.g. internal 80 bit registers on x86, possibly less on others) and also due to different compilers (are the variables left in the FPU as 80 bit values or are they stored as 32 bit single-precision in-between?).
    If you are lucky even the single-precision float is so much more than you need that you will get no problems due to rounding.
    I guess the MJPEG case is due to different iDCT implementations being used, which one is fastest may depend on the instructions available. It might also be a bug though ;-).

  2. Vitor

    I’d like to point out an extra advantage of solution #1: it can distinguish between the two following patches

    ————————
    float f(float a, float b, float c)
    {
    - return (a+b)*c;
    + return a*c + b*c;
    }

    ————————
    float f(float a, float b, float c)
    {
    - return (a+b)*c;
    + return a+b*c;
    }

    The first one can change the md5sum and the second one surely will. But if you do not check the PSNR, there’s no way to tell that the first one was a “cosmetic” change while the second could be a true regression. Not calculating the PSNR will limit the usefulness of FATE for detecting regressions caused by anything that is expected to change bits (a new IDCT algorithm, for example)…

  3. Multimedia Mike Post author

    Vitor: I don’t really see where you’re going with that example. The change in the first example does not alter the result. The change in the second example definitely does. The md5sum will be altered and the test will fail.

    The thing I’m most worried about in going down the bit exact route is that subtle changes in the decoding process (to achieve more precise results) will require me to update vast quantities of stdout text entries.

  4. Vitor

    Sorry, my post is not very clear.

    Actually, I don’t completely understand solutions #2 and #3. If one test gives a framecrc that is not in the database, what do you do? Do you always suppose it’s a breakage?

    If someone makes a change that gives slightly more precise results for 19 tests but gives horrible clicks and pops to one test (and changes the framecrc of all 20), will FATE see that it introduced clicks and pops?

  5. Multimedia Mike Post author

    Vitor: Yes, if the stdout yielded by a framecrc (or an MD5 hash in the case of testing audio) varies from what is known by the database, then it is declared a breakage.

    To address your other concern, there will be a vetting process, probably involving PSNR, to validate that the samples decode reasonably correctly before getting into the database. This will hopefully catch popping and clicking samples.

  6. Vitor

    Mike: Ok, I understand now. Looks fine to me. The only thing I think is necessary is that the vetting process be automated. I can imagine in the future a change (or a new arch) that changes the md5 of hundreds of tests…

  7. Multimedia Mike Post author

    Yeah, I can envision that same scenario. And believe me, it keeps me up at night. :-)

  8. James

    Mike,

    The MultimediaWiki links in the wc3movie-xan test don’t work correctly. It looks like some sneaky backslashes have appeared in there.

    Otherwise keep up the good work. I’m finding reading about automated testing much more interesting than I would have previously thought possible. :-)

  9. Multimedia Mike Post author

    Thanks for the bug report, James. There are still some bugs in my private web-based admin tools. Unfortunately, since I’m the only user, I don’t have much incentive to fix them rather than work around them. :-)

    I’m glad someone else gets a kick out of this stuff. Fortunately, I’m enjoying it enough that it’s self-motivating. There are all kinds of interesting problems to solve here. Not difficult problems, necessarily, but it’s still a nice break from the rote multimedia hacking.
