Category Archives: FATE Server

What’s So Hard About 0xA9?

No matter how much I think I know about character encoding, or about working around the issues that arise from it, I always get bitten.

For a long time, one particular FATE configuration has shown 87/127 tests succeeding, even though the total number of test specs has crept up to over 300. I investigated this errant configuration on the client side and concluded that it was, in fact, executing all of the tests and sending all the results over to the server in one neat package. Apparently, the problem was on the server side. Since it was an older Intel C compiler configuration, I didn’t care to investigate much further.

At one point, some bad bit of code was checked into FFmpeg and all of the configurations started showing xy/127 tests succeeding. This made the issue a bit more pressing. Mans discovered that the problem had to do with the svq3 test spec failing. The bad code affecting the SVQ3 test was quickly fixed, so I didn’t worry about it again until yesterday when, once again, FATE’s various configs were only reporting that 127 tests had been run.

Here’s what was happening: FATE only stores the stderr output of a test if the test spec fails; when the test is successful, everything is fine and FATE tosses the output. That makes the failure output a key data point. The sample used for the SVQ3 test outputs the following metadata (among other data) on stderr (seen, for example, in this test result):

    copyright       : ? Vertical Online 2001
    copyright-eng   : ? Vertical Online 2001

Those mystery characters map to the byte 0xA9 which, according to the Unicode tables I can find, is the c-in-a-circle copyright symbol (at least, U+00A9 is). That byte is making the system choke somewhere along the line, which annoys me greatly. When the client-side Python script executes the test and stuffs the stdout and stderr into the SQLite database, the relevant field is supposed to be a blob, a binary large object. The receiving PHP script on the server is also supposed to honor that blob schema.
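For illustration, here is roughly what binding raw bytes as a blob looks like on the Python/SQLite side. The table name and layout here are made up for the example; the real FATE schema is more involved:

    import sqlite3

    conn = sqlite3.connect("fate.sqlite")
    conn.execute("CREATE TABLE IF NOT EXISTS test_result (spec TEXT, stderr BLOB)")

    # raw stderr bytes, 0xA9 and all
    stderr_bytes = b"copyright       : \xa9 Vertical Online 2001\n"

    # wrapping the bytes in sqlite3.Binary tells the driver to bind the value
    # as a BLOB rather than trying to interpret it as text
    conn.execute("INSERT INTO test_result VALUES (?, ?)",
                 ("svq3", sqlite3.Binary(stderr_bytes)))
    conn.commit()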

Mans’ solution is to explicitly encode the stdout/stderr blobs as UTF-8 strings in the client-side Python script. That fixes the problem. But I’m confused as to why this is necessary in the first place. Was the PHP script doing its best to interpret the data inside the blob and falling over? Or was the SQLite engine on the server confused by the 0xA9 character in the blob?
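I don’t know exactly how Mans implemented the conversion, but the general idea amounts to something like this: decode the raw bytes permissively, then re-encode, so that whatever reaches the server is guaranteed to be valid UTF-8 (a stray 0xA9 becomes the replacement character rather than choking anything downstream):

    def to_utf8(raw):
        # decode permissively, then re-encode; invalid bytes become U+FFFD
        return raw.decode("utf-8", "replace").encode("utf-8")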

Also, I suddenly find myself wondering how the A9 search company got its name.

Update: Thanks to MichaelK. for pointing out the problem. While I was properly converting the stdout/stderr from build records to a binary type (necessary because that output is compressed before going into the database), I never did the same for the test result stdout/stderr. Since the test result data was “just strings”, or so I thought, I saw no reason to do so.

Systematic Benchmarking Adjunct to FATE

Pursuant to my rant on the futility of comparing, performance-wise, the output of various compilers, I wholly acknowledge the utility of systematically benchmarking FFmpeg. FATE is not an appropriate mechanism for doing so, at least not in its normal mode of operation. The “normal mode” would have every one of the 60 or so configurations running certain extended test specs during every cycle. Quite a waste.

Hypothesis: By tracking the performance of a single x86_64 configuration, we should be able to catch performance regressions in FFmpeg.

Proposed methodology: Create a new script that watches for SVN commits. For each and every commit (no skipping), check out the code, build it, and run a series of longer tests. Log the results and move on to the next revision.
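In rough Python, the loop I have in mind looks something like the sketch below. The repository URL, sample file, and build steps are all placeholders, and a real script would obviously need configure steps, error handling, and proper logging:

    import re
    import subprocess
    import time

    REPO = "svn://svn.ffmpeg.org/ffmpeg/trunk"     # placeholder URL
    BENCH = ["./ffmpeg", "-i", "big_buck_bunny_1080p_h264.mov",
             "-f", "null", "-"]                    # placeholder decode test

    def head_revision():
        out = subprocess.Popen(["svn", "info", REPO],
                               stdout=subprocess.PIPE).communicate()[0]
        return int(re.search(r"Revision: (\d+)", out.decode("utf-8", "replace")).group(1))

    def benchmark(rev):
        subprocess.check_call(["svn", "update", "-r", str(rev)])
        subprocess.check_call(["make"])
        start = time.time()                        # wall-clock timing for simplicity
        subprocess.call(BENCH)
        return time.time() - start

    last = head_revision()
    while True:
        head = head_revision()
        for rev in range(last + 1, head + 1):      # every commit, no skipping
            print("r%d: %.2f seconds" % (rev, benchmark(rev)))
        last = head
        time.sleep(60)                             # poll for new commits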

What compiler to use? I’m thinking about using gcc 4.2.4 for this. In my (now abandoned) controlled benchmarks, it was the worst performer by a notable margin. My theory is that the low performance might help to accentuate performance regressions. Is this a plausible theory? Two years of testing via FATE haven’t revealed any other major problems with this version.

What kind of samples to test? Thankfully, Big Buck Bunny is available in 4 common formats:

  • MP4/MPEG-4 part 2 video/AC3 audio
  • MP4/H.264 video/AAC audio
  • Ogg/Theora video/Vorbis audio
  • AVI/MS MPEG-4 video/MP3 audio

I have the 1080p versions of all those files, though I’m not sure if it’s necessary to decode all 10 minutes of each. It depends on what kind of hardware I select to run this on.

Further, I may wish to rip an entire audio CD as a single track, encode it with MP3, Vorbis, AAC, WMA, FLAC, and ALAC, and decode each of those.

What other common formats would be useful to track? Note that I only wish to benchmark decoding. My reasoning for this is that decoding should, on the whole, only ever get faster, never slower. Encoding might justifiably get slower as algorithmic trade-offs are made.

I’m torn on the matter of whether to validate the decoding output during the benchmarking test. The case against validation says that computing framecrc’s is going to impact the overall benchmarking process; further, validation is redundant since that’s FATE’s main job. The case for validation says that since this will always be run on the same configuration, there is no need to worry about off-by-1 rounding issues; further, if a validation fails, that data point can be scrapped (which will also happen if a build fails) and will not count towards the overall trend. An errant build could throw off the performance data. Back on the ‘against’ side, that’s exactly what statistical methods like weighted moving averages are supposed to help smooth out.
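For the curious, the smoothing idea amounts to nothing more than this (toy numbers, not real measurements):

    def ewma(samples, alpha=0.3):
        # exponentially weighted moving average: recent runs dominate,
        # and a single outlier gets diluted instead of defining the trend
        avg = samples[0]
        for s in samples[1:]:
            avg = alpha * s + (1 - alpha) * avg
        return avg

    times = [41.2, 40.9, 41.5, 55.0, 41.1, 40.8]   # one errant build in the middle
    print("%.2f" % ewma(times))                     # prints 43.10: nudged by the outlier, nowhere near 55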

I’m hoping that graphing this data for all to see will be made trivial thanks to Google’s Visualization API.

The script would run continuously, waiting for new SVN commits. When it’s not busy with new code, it would work backwards through FFmpeg’s history to backfill performance data.

So, does this whole idea hold water?

If I really want to run this on every single commit, I’m going to have to do a little analysis to determine a reasonable average number of FFmpeg SVN commits per day over the past year and perhaps what the rate of change is (I’m almost certain the rate of commits has been increasing). If anyone would like to take on that task, that would be a useful exercise (‘svn log’, some text manipulation tools, and a spreadsheet should do the trick; you could even put it in a Google Spreadsheet and post a comment with a link to the published document).
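For anyone who wants a head start, here is a quick-and-dirty way to get per-day commit counts. It assumes it is run inside an FFmpeg checkout, and the date range is arbitrary:

    import re
    import subprocess
    from collections import defaultdict

    # parse 'svn log' revision headers, which look like:
    # r20000 | author | 2009-10-01 12:34:56 +0200 (Thu, 01 Oct 2009) | 1 line
    log = subprocess.Popen(["svn", "log", "-r", "{2009-01-01}:HEAD"],
                           stdout=subprocess.PIPE).communicate()[0]
    log = log.decode("utf-8", "replace")

    per_day = defaultdict(int)
    for m in re.finditer(r"^r\d+ \| [^|]+ \| (\d{4}-\d{2}-\d{2})", log, re.M):
        per_day[m.group(1)] += 1

    total = sum(per_day.values())
    print("%d commits over %d days, %.1f per day" %
          (total, len(per_day), total / float(len(per_day))))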

First FATE Tests of the New Year

First off, FFmpeg’s Interplay MVE decoder finally supports 16-bit video mode in addition to the original palette mode. I would post a pretty picture showcasing this (as I often do for new game-related formats) but Kostya has already done the honors.

I added some new FATE tests today based on systems that have recently appeared in FFmpeg:

Process Runner Redux

Pursuant to yesterday’s conundrum of creating a portable process runner in Python for FATE that can be reliably killed when it exceeds time constraints, I settled on a solution. As Raymond Tau reminded us in the ensuing discussion, Python won’t use a shell to launch the process if the caller supplies the command and its arguments as a sequence data structure. I knew this but was intentionally avoiding it. Breaking a command line into a sequence of arguments seems like a simple problem: just split on spaces. However, I eventually hope to test metadata options, which could include arguments such as ‘-title “Hey, is this thing on?”’, where splitting on spaces clearly isn’t the right solution.
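For what it’s worth, Python’s standard shlex module does know how to split a command line while respecting quotes; I note it here for completeness, even though I end up going a different way below (the filenames are just made up for the example):

    import shlex

    # shlex.split() honors quoting, unlike a naive str.split(' ')
    cmd = 'ffmpeg -i input.mov -title "Hey, is this thing on?" out.avi'
    print(shlex.split(cmd))
    # ['ffmpeg', '-i', 'input.mov', '-title', 'Hey, is this thing on?', 'out.avi']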

I got frustrated enough with the problem that I decided to split on spaces anyway. Hey, I control this system from top to bottom, so new rule: no command line argument in a test spec may contain a quoted string with spaces in it. I already enforce the rule that no sample files can have spaces in their filenames, since that causes trouble with remote testing. When I get to the part about testing metadata, said metadata will take the form of ‘-title “HeyIsThisThingOn?”’ (which will then fail to catch myriad bugs related to FFmpeg’s incorrect handling of whitespace in metadata arguments, but this is all about trade-offs).

So the revised Python process runner seems to work correctly on Linux. The hangaround.c program simulates a badly misbehaving program by eating the TERM signal and must be dealt with using the KILL signal. The last line in these examples is a tuple containing return code, stdout, stderr, and CPU time. For Linux:

$ ./upr.py 
['./hangaround', '40']
process ID = 2645
timeout, sending TERM
timeout, really killing
[-9, '', '', 0]

The unmodified code works the same on Mac OS X:

$ ./upr.py
['./hangaround', '40']
process ID = 94866
timeout, sending TERM
timeout, really killing
[-9, '', '', 0]
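For reference, the core of such a runner boils down to something like the following sketch. This is not the actual upr.py, and it skips details like CPU-time accounting and draining the pipes while the child runs:

    import os
    import signal
    import subprocess
    import time

    def run_with_timeout(cmd, timeout, grace=5):
        # cmd is a sequence, so no shell gets involved
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        deadline = time.time() + timeout
        while proc.poll() is None and time.time() < deadline:
            time.sleep(0.25)
        if proc.poll() is None:
            print("timeout, sending TERM")
            os.kill(proc.pid, signal.SIGTERM)
            deadline = time.time() + grace
            while proc.poll() is None and time.time() < deadline:
                time.sleep(0.25)
        if proc.poll() is None:
            print("timeout, really killing")
            os.kill(proc.pid, signal.SIGKILL)
        out, err = proc.communicate()
        return [proc.returncode, out, err]

    # e.g.: run_with_timeout(['./hangaround', '40'], timeout=10)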

Now a bigger test: Running the upr.py script on Linux in order to launch the hangaround process remotely on Mac OS X via SSH:

$ ./upr.py 
['/usr/bin/ssh', 'foster-home', './hangaround', '40']
process ID = 2673
timeout, sending TERM
[143, '', '', 50]

So that’s good… sort of. Monitoring the process on the other end reveals that hangaround is still doing just that, even after SSH goes away. This occurs whether or not hangaround is ignoring the TERM signal. This is still suboptimal.

It would be possible to open a separate SSH session to send a TERM or KILL signal to the original process… except that I wouldn’t know the PID of the remote process. Or could I? I’m open to Unix shell magic tricks on this problem since anything responding to SSH requests is probably going to be acceptably Unix-like. I would rather not go the ‘killall ffmpeg’ route because that could interfere with some multiprocessing ideas I’m working on.

Here’s a brute force brainstorm: When operating in remote-SSH mode, prefix the command with ‘ln -s ffmpeg ffmpeg-<unique-key>’ and then execute the symbolic link instead of the main binary. Then the script should be able to open a separate SSH session and execute ‘killall ffmpeg-<unique-key>’ without interfering with other processes. Outlandish but possibly workable.
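A sketch of how that might look from the Python side follows. The helper names and the key scheme are made up, and whether killall reliably matches the symlink’s name on every target platform is exactly the kind of thing that would need testing:

    import subprocess
    import uuid

    def run_remote(host, binary, args):
        # give the remote binary a unique name so it can be singled out later
        key = uuid.uuid4().hex[:8]
        unique = "%s-%s" % (binary, key)
        remote_cmd = "ln -sf %s %s && ./%s %s" % (binary, unique, unique,
                                                  " ".join(args))
        proc = subprocess.Popen(["ssh", host, remote_cmd])
        return proc, unique

    def kill_remote(host, unique):
        # second SSH session; only the uniquely-named process is targeted
        subprocess.call(["ssh", host, "killall " + unique])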