Monthly Archives: March 2008

How Many IDs In A Database?

Current snapshot of the FATE database:

And we’re just getting started. This might be construed as either long-term planning or silly paranoia, but I have started to wonder what it would take to overflow the id field of the test_result table. I’m not even sure how large it is. MySQL simply reports the database field as being type “int(11)”. I have read various bits of literature which do not give a definitive answer on just how many bits that is. Worst case, I am assuming 32 bits, signed, with a useful positive range around 2 billion. Suppose I ramp up to around 500 unique tests in the database (hey, with all the individual regression tests yet to be imported, as well as various official conformance suites, that’s actually a fairly conservative estimate) and add 6 more configurations to round out to 20. That means each build/test cycle will generate 500 * 20 = 10,000 test results. If there are 10 cycles on an average day, that means 100,000 test results per day and 3 million per month. That would last the 31-bit range for about 715 months, or nearly 60 years.
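
For the record, here is that back-of-the-envelope math as a quick Python sketch (the numbers are the estimates from above, not measurements):

# How long will a signed 32-bit auto-increment id last at this rate?
MAX_ID = 2**31 - 1         # worst case assumption: signed 32-bit INT

tests_per_cycle   = 500    # estimated unique tests
configurations    = 20     # estimated build/test configurations
cycles_per_day    = 10     # estimated cycles on an average day

results_per_day   = tests_per_cycle * configurations * cycles_per_day
results_per_month = results_per_day * 30

print("results per day:    %d" % results_per_day)                # 100000
print("results per month:  %d" % results_per_month)              # 3000000
print("months to overflow: %d" % (MAX_ID // results_per_month))  # ~715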

Okay, I guess I will put off worrying about the implications for the time being. But I still need to revise the test_result table to be more efficient (i.e., quit storing the stdout field when it’s identical to what the test specification already contains).

100+ FATE Tests

FATE has been public for 2 months and I have just now reached 100 tests. It’s a nice round number. No slowing down now, though. I hope for that number to go exponential, at least up to the point that FATE carefully tests 98% of FFmpeg’s total functionality (the last 2% will be fixing bugs that I am logging as I go).

I have also been seriously looking into turning the Mac Mini into a FATE build/test machine for Mac OS X. I’m just trying to decide if I should rush it and get the configuration onto the farm with the current infrastructure, or use it as an opportunity to revise the architecture with the various efficiency brainstorms plotted on this blog. The refactoring needs to occur before I add too many more tests. For the curious, this is what the FATE script looks like while running in a screen session; it wakes up every 15 minutes and checks for a new revision in Subversion:

[Thu Mar  6 15:08:22 2008]  no change
[Thu Mar  6 15:23:26 2008]  getting new revision = 12356
  [Thu Mar  6 15:23:34 2008] building with gcc svn 132381, built 2008-02-17
  [Thu Mar  6 15:25:04 2008] testing...
  [Thu Mar  6 15:25:41 2008] logging...
  [Thu Mar  6 15:26:17 2008] building with gcc 4.0.4
  [Thu Mar  6 15:29:09 2008] testing...
  [Thu Mar  6 15:29:42 2008] logging...
  [Thu Mar  6 15:30:07 2008] building with gcc 4.1.2
  ...

Notice the time delta between “logging…” and the subsequent “building…”. That delta seems to grow more or less linearly as the number of tests increases. That’s why I’m interested in optimizing that aspect sooner rather than later.
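
For the curious, here is a rough sketch of what that polling loop boils down to; get_latest_revision() and build_and_test() are placeholders, not the actual FATE code:

import time

POLL_INTERVAL = 15 * 60  # seconds between Subversion checks

def get_latest_revision():
    # placeholder: in reality, query the FFmpeg Subversion repository,
    # e.g. by parsing the output of 'svn info'
    raise NotImplementedError

def build_and_test(revision, compiler):
    # placeholder: configure and build FFmpeg with the given compiler,
    # run all the test specs, and log the results back to the server
    raise NotImplementedError

compilers = ["gcc svn", "gcc 4.0.4", "gcc 4.1.2"]
last_revision = None

while True:
    revision = get_latest_revision()
    if revision == last_revision:
        print("[%s]  no change" % time.ctime())
    else:
        print("[%s]  getting new revision = %s" % (time.ctime(), revision))
        for compiler in compilers:
            build_and_test(revision, compiler)
        last_revision = revision
    time.sleep(POLL_INTERVAL)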

Test Branching

One nice thing about H.264 is that its transforms are specified to be bit exact. That makes it straightforward to test. Given a compressed sample and its reference stream, your decoder is expected to produce output that matches the reference stream, bit for bit. Microsoft VC-1 is the same way and I suspect that RealVideo 3, RealVideo 4, and Sorenson Video 3 — all of which were proprietary forerunners of the H.264 lineage — express the same characteristic.

So H.264 lends itself well to automated testing, such as the type that FATE subjects it to. Unfortunately, certain other codecs don’t conform well to this model. You may have heard of a few: video codecs such as MPEG-1, MPEG-2, MPEG-4, H.261, H.263, and anything else that uses the same 8×8 discrete cosine transform. Michael thoroughly and visually explains the problem in a blog post.

Here are 3 strategies for automated testing in the face of such adversity:

  1. Maintain one test spec in the database for a particular sample. Test it by comparing the decoded output against a reference video stream using a custom tool that yields a yea or nay answer indicating whether the output falls within an acceptable threshold (perhaps using FFmpeg’s existing tiny_psnr utility, if it can return such concise information). Pros: No modification needed for the fundamental FATE infrastructure. Cons: Requires a lot of raw test media in the formal FATE suite, which is not good considering my future plans to expand the infrastructure to make it easy for other people to rsync the suite and test it locally. Also, I don’t know how to properly compare the data. If I understand the problem correctly, a keyframe with problems will only have various individual samples off by 1, but the error can compound over a sequence of many frames.
  2. Maintain one test spec in the database for a particular sample, but with multiple possible expected stdout text blobs. This would somehow tie expected stdout blobs to certain configurations. Pros: Less storage required for FATE suite and test can maintain bit exactness. Cons: Requires infrastructure modification.
  3. Maintain multiple test specs in the database, each of which correspond to different possible outputs; extend the infrastructure to include test groups. Particular configurations are assigned to run all the tests in a particular group. This is an idea that I think I will need to implement eventually anyway. For example, I can imagine some alternate test methodology for validating FFmpeg, or more likely, libavcodec and libavformat, on certain embedded environments which won’t be able to run ‘ffmpeg’ directly from any sort of command line. Pros: Same as strategy #2. Plus, implementing an idea whose time will probably come anyway. Cons: Same as strategy #2.

I am sort of leaning towards strategy #2 right now. As a possible implementation, the test_spec table might be extended to have 2 expected_stdout fields that are sent to a FATE build/test machine. The machine tests the actual stdout against both sets of expected stdout text and sends the server a code to indicate how the test turned out. Code 1 or 2 means expected stdout 1 or 2 matched, respectively; code 0 means that neither matched, and the actual stdout will be sent for storage (code 0 also covers the case where both expected stdout fields were NULL for “don’t care”). So this infrastructure revision can occur at the same time as the one discussed in the previous post.
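
A rough sketch of what that client-side check could look like, assuming the two hypothetical expected_stdout fields described above:

def check_stdout(actual, expected_1, expected_2):
    # Returns (code, stdout_to_send). Code 1 or 2 means the actual stdout
    # matched the corresponding expected blob, so nothing needs to be sent
    # back. Code 0 means neither matched (or both fields were NULL for
    # "don't care"), so the actual stdout goes back to the server.
    if expected_1 is not None and actual == expected_1:
        return 1, None
    if expected_2 is not None and actual == expected_2:
        return 2, None
    return 0, actual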

There’s no reason that this solution can’t co-exist with the grouping idea proposed in strategy #3 when that is implemented. I think grouping will be more useful for completely different platforms with completely different testing programs (‘ffmpeg’ vs. something else for embedded tests), rather than in cases where the command line is exactly the same but various platform results are subtly different.

The biggest benefit for strategies 2 and 3 should be in keeping the size of the FATE suite small. Deploying one of these solutions should also help in automatically testing many perceptual audio codecs (think MP3, AAC, Vorbis, AC3, DTS, etc.) which will also not be bit exact among platforms.

Here’s one big question I have regarding strategy #2: How many expected stdout fields should the database provide for? Is it enough to have 2 different expected stdout fields? 3? More? Should I just go all out and make it a separate table? Yeah, that would probably be the best solution.
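
If it does become a separate table, the schema could look something like this sketch, issued through the same Python MySQLdb module the clients already use (the table and column names are hypothetical):

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="fate", passwd="...", db="fate")
cursor = conn.cursor()

# one row per acceptable stdout blob; any number of rows per test spec
cursor.execute("""
    CREATE TABLE expected_stdout (
        id           INT AUTO_INCREMENT PRIMARY KEY,
        test_spec_id INT NOT NULL,
        stdout       MEDIUMTEXT,
        INDEX (test_spec_id)
    )
""")
conn.commit()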

I admit, I still don’t completely understand the issues involving the bit inexactness of the decoding processes. Does it vary among processors? Does it depend on little vs. big endian? Is it dependent upon the C library? Some empirical tests are in order. An impromptu decode of a random MP3 file using FFmpeg yields bit-identical PCM data for x86_32, x86_64, and PPC builds of FFmpeg (x86_32 built with both gcc and icc). I guess this makes sense. Bit inexactness for perceptual audio would arise from floating point rounding discrepancies which, per my understanding, are affected by the C library. Since this is all Linux, no problem. If I got FFmpeg compiled on Windows, I suspect there would be discrepancies.
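
For reference, that impromptu check amounted to something like the following: decode to raw PCM on each platform and compare checksums (the file names are made up):

import hashlib, subprocess

# decode a sample to raw 16-bit little-endian PCM, then compare this
# digest against the same command run on the other platforms/builds
subprocess.call(["ffmpeg", "-i", "sample.mp3", "-f", "s16le", "-y", "sample.raw"])
print(hashlib.md5(open("sample.raw", "rb").read()).hexdigest())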

I tested a random MJPEG file (THP file, actually), and the frames are the same on x86_32 and x86_64, but different on PPC. So that’s one example of where I will need to employ one of the 3 strategies outlined above.

The Best Type Of Compression

The best type of compression is to encode no data at all.


Stdout Flow

I’m a little embarrassed to admit that this didn’t occur to me until just now, 2 months after I first deployed the FATE Server. Each test specification in the database has an expected stdout text blob associated with it. The server sends this to the client, who compares the expected stdout with the actual stdout gathered from running a test. The client then sends the actual stdout text back to the server.

Wait! There’s no reason to send the actual stdout back to the server. At least, not if the test was successful. Logically, that means that (actual stdout) == (expected stdout). Send back a special code to indicate that the stdout matched. The server can decide to clone the expected stdout into the actual stdout column in the database under that condition.

Wait, #2! There’s no reason to store the actual stdout if the test is successful. Logically, it’s the same as the data that’s already sitting in the expected stdout field. Which, BTW, is only in the database once, whereas actual stdout data occurs many times in a different table.
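
On the server side, that could boil down to something like this sketch (the function name and the test_result columns shown here are hypothetical):

def log_result(cursor, test_id, build_id, matched, actual_stdout):
    if matched:
        # the stdout already lives in the test spec, so store only a flag
        cursor.execute(
            "INSERT INTO test_result (test_id, build_id, passed) "
            "VALUES (%s, %s, 1)", (test_id, build_id))
    else:
        # mismatch (or "don't care"): keep the actual stdout for inspection
        cursor.execute(
            "INSERT INTO test_result (test_id, build_id, passed, stdout) "
            "VALUES (%s, %s, 0, %s)", (test_id, build_id, actual_stdout))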

As you can see, I have been considering optimization strategies. After a client is finished running all the tests for a given configuration, it logs the results for all the tests. There are presently only 90 tests and it seems to take about 30 seconds, give or take 10, to log all the results. That’s a measly 3 records per second, which is annoying, especially since I want this suite to embody hundreds upon hundreds of individual tests eventually. This issue is sort of blocking me from really ramping up on the number of test cases.

Right now, the test clients use the direct MySQL protocol through Python and I doubt that it is being compressed over the wire. I hope to revise the infrastructure so that the test results will be serialized, compressed, and sent to a CGI script on the FATE server. The CGI script will decompress, deserialize, and enter the test results from a position much closer to the actual database server. Hopefully, this will improve performance. If nothing else, it will set the stage for running the FATE client on machines that don’t have working Python MySQLdb libraries, or that can’t access the MySQL port directly due to firewalling.
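
A rough sketch of what that client-side handoff might look like; the URL, the pickle serialization, and the payload format are all assumptions at this point:

import cPickle, urllib2, zlib

def post_results(results):
    # serialize the list of per-test result records, compress the blob,
    # and hand it to a CGI script sitting close to the database server
    payload = zlib.compress(cPickle.dumps(results, cPickle.HIGHEST_PROTOCOL))
    request = urllib2.Request("http://fate.example.com/cgi-bin/log-results",
                              data=payload,
                              headers={"Content-Type": "application/octet-stream"})
    urllib2.urlopen(request)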

So that will hopefully address the bandwidth concerns. There is still the issue of disk storage. As discussed previously, raw disk space is really not an issue. I could swallow a gigabyte or 2 per month and still be okay for several years. But it would still be nice for the database to remain a manageable size for the purpose of responsible backups. The idea of not storing actual stdout, rather just a bool to indicate that it checked out, will help to reduce storage requirements. However, I think I should also institute a schedule of “retiring” the stdout/stderr data from old build records and test results.
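
Retiring could be as blunt as periodically NULLing out the blobs on old records, along these lines (the 6-month cutoff and the test_date column are made up):

import MySQLdb

# run from cron every so often: drop the stdout/stderr blobs from old
# test results while keeping the pass/fail status and timestamps
conn = MySQLdb.connect(host="localhost", user="fate", passwd="...", db="fate")
cursor = conn.cursor()
cursor.execute(
    "UPDATE test_result SET stdout = NULL, stderr = NULL "
    "WHERE test_date < DATE_SUB(NOW(), INTERVAL 6 MONTH)")
conn.commit()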

Someone showed up on ffmpeg-devel yesterday with a bug report that ‘ffmpeg -h’ crashes the program on Solaris. It seems quite reasonable to add a test spec for that simple case with a NULL for expected stdout which would indicate “don’t care”. I would be concerned about filling up so much space with the help command (stdout on ‘ffmpeg -h’ is presently about 28K) on each run. But I might not mind so much if I could retire (ruthlessly delete) the data later.