Science Into Engineering

I modified my distributed RPC test staging utility to implement my imprecise audio testing idea. This is the output under typical conditions:

There was 1 unique stdout blob collected
all successful configurations agreed on this stdout blob:
pass

So, it worked. Yeah, I’m surprised too. That result means that all the configurations (20 total) produce an audio waveform in which no individual PCM sample deviates from the reference wave by more than 1. Since I had to choose some configuration to generate the reference sample, I used Linux / x86_32 / gcc 2.95.3.

BTW, this is the general Python algorithm I am using to compare the waves. It takes a full minute, give or take a second, to compare two 33 MB samples:

I replaced abs() with a branch to check if the diff is < -1 or > 1, but that didn’t improve speed measurably. I think the constant unpacking might have something to do with it. Better solutions welcome. (By comparison, running ‘cmp’ on two identical files of the same size as the test above, living on a network share, takes less than 2 seconds.)
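
The listing itself is not reproduced here. As a rough sketch of the kind of comparison described (not the original code), the following assumes Python 3 and that both waves are raw, headerless, signed 16-bit little-endian PCM. The function names are made up for illustration. The first version unpacks one sample at a time, which is the sort of constant unpacking suspected above; the second unpacks each buffer with a single call.

```python
import struct


def waves_match_slow(ref, test, tolerance=1):
    """Compare two raw s16le buffers, unpacking one sample at a time."""
    if len(ref) != len(test):
        return False
    for offset in range(0, len(ref) - 1, 2):
        (a,) = struct.unpack_from("<h", ref, offset)
        (b,) = struct.unpack_from("<h", test, offset)
        if abs(a - b) > tolerance:
            return False
    return True


def waves_match_bulk(ref, test, tolerance=1):
    """Same check, but each buffer is unpacked with one struct call."""
    if len(ref) != len(test):
        return False
    count = len(ref) // 2  # number of whole 16-bit samples
    ref_samples = struct.unpack("<%dh" % count, ref[: 2 * count])
    test_samples = struct.unpack("<%dh" % count, test[: 2 * count])
    return all(abs(a - b) <= tolerance
               for a, b in zip(ref_samples, test_samples))
```

Both functions return True only when the buffers have the same length and no sample pair differs by more than the tolerance. How much faster the bulk version is in practice would need to be measured.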

For a 10-second sample of a .m4a stereo AAC file (882,000 samples), these are the numbers of PCM samples that deviated by exactly 1 (first number), and by more than 1 (second number). You will notice that no samples deviated by more than 1, which was my hypothesis at the start, and the basis on which I devised this plan:

Mac OS X / PPC / gcc 4.0.1
432691, 0

Linux / x86_32 / icc
238, 0

Linux / x86_32 / gcc 2.95.3
0, 0

Linux / PPC / gcc 4.0.4
Linux / PPC / gcc 4.1.2
Linux / PPC / gcc 4.2.4
Linux / PPC / gcc 4.3.2
Linux / PPC / gcc svn
432701, 0

Linux / x86_64 / gcc 4.0.4
Linux / x86_64 / gcc 4.1.2
Linux / x86_64 / gcc 4.2.4
Linux / x86_64 / gcc 4.3.2
Linux / x86_64 / gcc svn
248, 0

Linux / x86_32 / gcc 3.4.6
Linux / x86_32 / gcc 4.0.4
Linux / x86_32 / gcc 4.1.2
Linux / x86_32 / gcc 4.2.4
Linux / x86_32 / gcc 4.3.2
Linux / x86_32 / gcc svn
237, 0

Mac OS X / x86_64 / gcc 4.0.1
244, 0
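
For illustration, a pair of counts like the ones above could be tallied along these lines. This is a minimal sketch, assuming Python 3 and raw signed 16-bit little-endian PCM input; `deviation_counts` is a hypothetical name, not part of the actual test utility.

```python
import struct


def deviation_counts(ref, test):
    """Return (samples off by exactly 1, samples off by more than 1)."""
    count = min(len(ref), len(test)) // 2  # whole samples present in both
    ref_samples = struct.unpack("<%dh" % count, ref[: 2 * count])
    test_samples = struct.unpack("<%dh" % count, test[: 2 * count])
    off_by_one = 0
    off_by_more = 0
    for a, b in zip(ref_samples, test_samples):
        diff = abs(a - b)
        if diff == 1:
            off_by_one += 1
        elif diff > 1:
            off_by_more += 1
    return off_by_one, off_by_more
```

A configuration passes the +/- 1 test when the second number is 0, regardless of how large the first number is.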

I have thrown RealAudio Cooker and 28.8 samples at this, and both work. I am still testing this against some more audio samples to ensure that this idea holds water.

Improving The Science

Apparently, Aurel thought my proposed imprecise audio testing method for FATE had some merit, even if everyone else thinks I’m crazy. Aurel proposed and prototyped a new PCM encoder called pcm_s16le_trunc. This normalizes a sequence of signed, 16-bit PCM samples according to the formula

sample[n] = (sample[n] + 1) & ~0x03

So it sort of smooths out the rough edges of a PCM wave in the hopes of making it possible to compare to reference waves in a bit-exact manner. Unfortunately, deploying the method on my distributed RPC testing system yields no success. For example, testing a particular AAC file across 18 different FFmpeg configurations results in 7 unique sets of output, which group the same way as without ‘-f pcm_s16le_trunc’, though with differing output.
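
A small Python sketch of the formula suggests one plausible reason the outputs still disagree. The formula sorts samples into buckets of four values, and two decoders that differ by only 1 can still land on opposite sides of a bucket boundary. The function name is made up for illustration, and the sketch ignores the 16-bit wraparound a real encoder would have to handle at the top of the range.

```python
def trunc_sample(sample):
    """Apply sample = (sample + 1) & ~0x03 to one signed PCM value."""
    return (sample + 1) & ~0x03


# -1, 0, 1 and 2 all collapse to 0, so differences inside a bucket vanish.
bucket = [trunc_sample(s) for s in (-1, 0, 1, 2)]

# 2 and 3 differ by only 1 but fall into different buckets (0 and 4),
# so the truncated outputs are still not bit-exact.
boundary = (trunc_sample(2), trunc_sample(3))
```

Whenever one configuration decodes a sample to 2 and another decodes it to 3, the truncated waves differ by 4 at that position instead of by 1.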

I still think it’s an interesting idea. Any other refinements to throw out there?

To reiterate a comment I left on the last post, I do not wish to completely supplant the official validation methods that are already in place for various audio coding standards, most notably the MPEG audio standards. I envision 2 stages for these audio tests. Really, this parallels what I described when I discussed how I enter tests into FATE in the first place. The first stage is to validate that a given sample produces the correct results. For many test specs in FATE, this involves manually consuming the contents of a media sample with my own eyes and ears to judge if it looks and sounds minimally sane. The first stage for these audio tests will be in the same spirit, except that there will be more mathematically rigorous qualifications for getting into FATE. Beyond just “sounding okay”, files will have to meet certain PSNR thresholds. The second stage is to create reference waves that the audio decoders can be continuously tested against.
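
As a sketch of what a first-stage PSNR check might compute, the following uses the usual definition, 10 * log10(peak^2 / MSE). It assumes Python 3, raw signed 16-bit little-endian PCM, and a peak value of 32767; both the function name and the choice of peak are assumptions for illustration, and no particular threshold is implied.

```python
import math
import struct


def psnr_s16(ref, test, peak=32767):
    """PSNR in dB between two raw s16le buffers (inf if identical)."""
    count = min(len(ref), len(test)) // 2
    if count == 0:
        raise ValueError("need at least one whole sample in each buffer")
    ref_samples = struct.unpack("<%dh" % count, ref[: 2 * count])
    test_samples = struct.unpack("<%dh" % count, test[: 2 * count])
    squared_error = sum((a - b) ** 2
                        for a, b in zip(ref_samples, test_samples))
    if squared_error == 0:
        return math.inf
    mse = squared_error / count
    return 10 * math.log10(peak * peak / mse)
```

With this definition, a wave in which every sample is off by exactly 1 has an MSE of 1 and a PSNR of roughly 90.3 dB.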

What thresholds need to be met, and what files should be used? Well, that varies depending on codec:

  • MP2/MP3/AAC: Plenty of official conformance samples available, and I’m pretty sure I have been sitting idly on most of them for many years. I need to do a little research to determine how close the decoder needs to get the final output.
  • Vorbis: Savior of open source multimedia (at least in the audio domain). While the format is exhaustively documented, I don’t see any specifications regarding quality thresholds that encoders and decoders need to meet, nor is there any conformance suite that I know of. This is particularly troubling since there have been numerous revisions to the format over the years and older variations are undoubtedly still “in the wild”. It would be nice to assemble a sample collection that represents various generations of the format.

    In any case, it will be reasonable to generate our own samples for testing this format.

  • RealAudio codecs such as Cooker and 28_8; also AC3, ATRAC3, QDesign, QCELP, assorted others: I’m not especially motivated to find the software that creates these formats for contriving my own encodings so that I know the source material. There are no known conformance suites, and there is no way that we can know intended, reasonable thresholds. I think we’ll just have to take the “sounds good enough; make new reference wave” approach for these.

One last thought: one prospect I appreciate about this reference wave testing idea is that I wouldn’t have to update the test specs in the database if the output has changed significantly; I would just update the reference waves (as long as they pass whatever thresholds we put in place).

Not An Exact Science

A big shortcoming of FATE so far has been its inability to test perceptual audio and video codecs. This is because when FATE runs a test, it compares the output to a known value in its own database, and the output needs to match the known value precisely, i.e., “bit-exact”. The problem with codecs classed as perceptual is that they are not specified to decode in a bit-exact manner. So, for example, decoding the Ogg Vorbis audio file abc.ogg on x86_64 and on PowerPC will produce 2 waves that, though they may sound identical to most listeners, are not precisely the same down to the PCM sample level; minor variations exist (generally, +/- 1).

I have a plan for adapting FATE to handle this. It may seem a little (or a lot) crazy, but hear me out.

At first, I am only thinking about perceptual audio codecs. This will include Vorbis, MP3, AAC, WMA, QDesign, Real-cook, Real-28_8, and a bunch of others I am forgetting at the moment.

The big idea is to store reference decoded waves and then, for each perceptual audio decoding test, decode the file and compare the wave to its reference wave; fail the test if the difference of any of the PCM points is greater than 1.

How to perform the comparison? I have a few ideas:

  • Craft a default Python algorithm that painstakingly unpacks each byte from both waves, iterates along each, and calculates the absolute value of the difference at each sample.
  • Allow for a FATE installation to call out to a more efficient helper program, one preferably written using SIMD instructions that could read 16 bytes at a time from each wave, and perform absolute value calculations in parallel. I’m thinking a parallel subtract, followed by a parallel absolute value, followed by a bitwise AND should reveal if any of the 16 bytes is outside of tolerance.
  • Any other tricks would be appreciated, especially regarding the default algorithm. Are there any special numerical tricks for determining the information I need from 4 bytes in parallel, packed in a 32-bit integer, without SIMD?
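
One cheap trick for the default algorithm, sketched below, is to compare the waves a chunk at a time as raw bytes and only unpack the chunks that are not identical. This is a minimal sketch assuming Python 3 and raw signed 16-bit little-endian PCM; the function name and chunk size are arbitrary choices for illustration, and it does not answer the packed-integer question above. It works on 16-bit values rather than single bytes because a difference of 1 can change both bytes of a sample (0x00FF versus 0x0100).

```python
import struct


def waves_within_tolerance(ref, test, tolerance=1, chunk_bytes=65536):
    """Check two raw s16le buffers, unpacking only the chunks that differ."""
    if chunk_bytes <= 0 or chunk_bytes % 2:
        raise ValueError("chunk_bytes must be a positive even number")
    if len(ref) != len(test) or len(ref) % 2:
        return False
    for start in range(0, len(ref), chunk_bytes):
        ref_chunk = ref[start:start + chunk_bytes]
        test_chunk = test[start:start + chunk_bytes]
        if ref_chunk == test_chunk:
            continue  # byte-identical chunk: nothing to unpack
        count = len(ref_chunk) // 2
        ref_samples = struct.unpack("<%dh" % count, ref_chunk)
        test_samples = struct.unpack("<%dh" % count, test_chunk)
        if any(abs(a - b) > tolerance
               for a, b in zip(ref_samples, test_samples)):
            return False
    return True
```

The shortcut only pays off when long stretches of the two waves are bit-identical; if deviations are spread through most chunks, nearly every chunk still has to be unpacked.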

This has the potential to be big, sample-wise. It occurred to me to use FLAC to mitigate storage problems. My first impulse was to store the reference waves as FLAC files in a FATE installation’s sample suite. They would be decoded as needed during a build/test cycle. Decoding FLAC is reasonably fast, after all. However, the more I think about it, the sillier that part seems. As a compromise, I may store the reference waves as FLAC in the central MPlayerhq.hu FATE suite archive in order to mitigate storage and transfer requirements. It will also be time to create a small, standard syncing script that performs both the samples rsync and decompresses any new FLAC wave references in the archive.
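
Such a syncing script might look roughly like the sketch below. It assumes Python 3, that the `rsync` and `flac` command-line tools are installed, and that each decoded wave is written next to its .flac file with a .wav extension. The function names, the directory layout, and the `remote` argument are all hypothetical; no real archive path is implied.

```python
import subprocess
from pathlib import Path


def flac_files_needing_decode(sample_dir):
    """Return the .flac files under sample_dir that have no .wav beside them."""
    pending = []
    for flac_path in sorted(Path(sample_dir).rglob("*.flac")):
        if not flac_path.with_suffix(".wav").exists():
            pending.append(flac_path)
    return pending


def sync_samples(remote, sample_dir):
    """Rsync the sample suite, then decode any new FLAC reference waves."""
    # 'remote' is whatever rsync source the installation uses (placeholder).
    subprocess.run(["rsync", "-a", remote, str(sample_dir)], check=True)
    for flac_path in flac_files_needing_decode(sample_dir):
        wav_path = flac_path.with_suffix(".wav")
        subprocess.run(
            ["flac", "--decode", "--silent", "-o", str(wav_path),
             str(flac_path)],
            check=True,
        )
```

This version only decodes references that have never been decoded; a reference wave that is replaced upstream would need its stale .wav removed first.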

All of this is highly speculative at this point. I don’t know how much storage these hypothetical reference waves are going to require. And I don’t know how long it’s going to take in practice to perform all the comparisons. And of course, I don’t know if the +/- 1 tolerance idea will hold up, although cursory tests have been positive.

I know it’s a mathematically “impure” solution. But we need something and I wanted to get this possibly workable idea out.

Asking The Right Question

Considering the amount of time and effort I put into developing the entire FATE system, you might be surprised to learn that I would not be at all averse to replacing FATE wholesale with something that worked better. I did research at the outset to see what kind of software systems were out there that would suit our needs and solve all of the problems that I had in mind. But I couldn’t find much useful stuff. To be honest, I wasn’t entirely sure what I was looking for.

In order to find the correct answer, though, it helps immensely to know the right question. Through a series of coincidences, I wound up at the Wikipedia page for continuous integration and realized that this is the category of software that FATE falls into. The Wikipedia page lists many systems that are used along the same lines as FATE.

BuildBot is an interesting one and a system that I think I have seen before. Python-based, good. Example report pages are well-organized, but not as concise as I think they could be (but perhaps it’s configurable). However, I tend to think that there are few continuous integration systems that meet a particular requirement I have, namely that the master server needs to be able to run on PHP since that’s what my web provider offers (Python-CGI, too, as long as I don’t need to talk to a MySQL database).