Improving The Science

Apparently, Aurel thought my proposed imprecise audio testing method for FATE had some merit, even if everyone else thinks I’m crazy. He proposed and prototyped a new PCM encoder called pcm_s16le_trunc, which normalizes a sequence of signed, 16-bit PCM samples according to the formula

sample[n] = (sample[n] + 1) & ~0x03

So it sort of smooths out the rough edges of a PCM wave in the hopes of making it possible to compare decoded output to reference waves in a bit-exact manner. Unfortunately, deploying the method on my distributed RPC testing system yields no success. For example, testing a particular AAC file across 18 different FFmpeg configurations still results in 7 unique sets of output; the configurations group the same way as they do without ‘-f pcm_s16le_trunc’, even though the output itself differs.
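
To make the formula concrete, here is a minimal sketch (my own illustration, not Aurel’s actual patch) of how such a truncation pass might be applied to a buffer of decoded samples; the function name and the saturation handling at the positive limit are my additions:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical helper (not Aurel's actual patch): apply the rounding/masking
     * formula above to every sample so that decoders differing only in low-order
     * rounding produce bit-identical output. */
    static void truncate_s16_samples(int16_t *samples, size_t count)
    {
        for (size_t i = 0; i < count; i++) {
            int v = samples[i] + 1;             /* small bias before masking */
            if (v > INT16_MAX)                  /* avoid wrapping past the positive limit */
                v = INT16_MAX;
            samples[i] = (int16_t)(v & ~0x03);  /* zero the two lowest bits */
        }
    }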

I still think it’s an interesting idea. Any other refinements to throw out there?

To reiterate a comment I left on the last post, I do not wish to completely supplant the official validation methods that are already in place for various audio coding standards, most notably the MPEG audio standards. I envision 2 stages for these audio tests. Really, this parallels what I described when I discussed how I enter tests into FATE in the first place– the first stage is to validate that a given sample produces the correct results. For many test specs in FATE, this involves manually consuming the contents of a media sample with my own eyes and ears to judge whether it looks and sounds minimally sane. The first stage for these audio tests will be in the same spirit, only with more mathematically rigorous qualifications for getting into FATE. Beyond just “sounding okay”, files will have to meet certain PSNR thresholds. The second stage is to create reference waves that the audio decoders can be continuously tested against.
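
As a rough sketch of the kind of stage-one check I have in mind, the PSNR computation against a reference wave boils down to a sample-by-sample comparison like the following (the function name is mine, and the eventual threshold values are still to be determined, not taken from any standard):

    #include <math.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: PSNR in dB between a decoded signal and a reference wave, both
     * signed 16-bit PCM of equal length. Identical signals yield INFINITY. */
    static double pcm_s16_psnr(const int16_t *decoded, const int16_t *reference,
                               size_t count)
    {
        double sum_sq_err = 0.0;
        for (size_t i = 0; i < count; i++) {
            double diff = (double)decoded[i] - (double)reference[i];
            sum_sq_err += diff * diff;
        }
        if (sum_sq_err == 0.0)
            return INFINITY;
        /* Peak amplitude of signed 16-bit PCM is 32767. */
        return 10.0 * log10((32767.0 * 32767.0) / (sum_sq_err / (double)count));
    }

A sample would then only be admitted into FATE if its PSNR against the reference clears whatever threshold we settle on for the codec in question.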

What thresholds need to be met, and what files should be used? Well, that varies depending on codec:

  • MP2/MP3/AAC: Plenty of official conformance samples are available, and I’m pretty sure I have been sitting idly on most of them for many years. I need to do a little research to determine how close the decoder’s final output needs to be to the reference.
  • Vorbis: The savior of open source multimedia (at least in the audio domain). While the format is exhaustively documented, I don’t see any specification of the quality thresholds that encoders and decoders need to meet, nor is there any conformance suite that I know of. This is particularly troubling since there have been numerous revisions to the format over the years and older variations are undoubtedly still “in the wild”. It would be nice to assemble a sample collection that represents the various generations of the format.

    In any case, it will be reasonable to generate our own samples for testing this format.

  • RealAudio codecs such as Cooker and 28_8; also AC3, ATRAC3, QDesign, QCELP, assorted others: I’m not especially motivated to track down the software that creates these formats in order to contrive my own encodings from known source material. There are no known conformance suites, and there is no way for us to know the intended, reasonable thresholds. I think we’ll just have to take the “sounds good enough; make a new reference wave” approach for these.

One last thought: one aspect I appreciate about this reference wave testing idea is that I wouldn’t have to update the test specs in the database when the output changes significantly– I would just update the reference waves (as long as they pass whatever thresholds we put in place).

7 thoughts on “Improving The Science”

  1. Carl Eugen Hoyos

    For AC3, the only thing to be fixed is a difference in output depending on whether SSE optimizations are turned on or not (i.e., whether gcc 2.95.3 is used or not); after that, the regression test can be activated by removing two “#” from tests/regression.sh.

  2. Matti

    Probably it would be wise to create reference waves from “official” binary decoder output, too, for those formats that can be considered to have such.

  3. mat

    Turning SSE on or off for AC3 is not a solution. If it were, the universal solution would be to use fixed point in order to get the same result on all architectures. But that would prevent us from testing the main code path (the FPU code) or comparing against other decoders.

  4. Pengvado

    I think pcm_s16le_trunc is just the wrong idea. You can’t test for approximate equality by any amount of filtering applied to the separate streams; you have to subtract them first and then filter.

  5. mkhodor

    I wouldn’t even bother with 28_8 (G.728) due to its completely braindead design. It uses floating point math for linear prediction, so differences in rounding are fed back into the filter and compounded. Different implementations produce wildly different results, even when fully compliant with the G.728 spec.

    QDesign, ATRAC, and AC3 are transform codecs without prediction, so it should be possible to get within +/- 1.

  6. Multimedia Mike Post author

    According to my tests, it has been possible to get 28_8 within +/- 1 against a reference wave generated with FFmpeg. As long as we’re not trying to test multiple implementations in parallel, we should be okay. I can see how it might be a problem if we try to compare FFmpeg’s output to the official binary decoder, though.
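
    For the curious, the kind of “+/- 1” check being discussed here reduces to measuring the worst-case per-sample deviation between two decodes; a simplified sketch (not the actual FATE comparison code):

        #include <stdint.h>
        #include <stdlib.h>
        #include <stddef.h>

        /* Sketch: largest absolute per-sample difference between two s16 PCM buffers
         * of equal length; a "+/- 1" check is then just max_abs_diff(a, b, n) <= 1. */
        static int max_abs_diff(const int16_t *a, const int16_t *b, size_t count)
        {
            int max_diff = 0;
            for (size_t i = 0; i < count; i++) {
                int diff = abs((int)a[i] - (int)b[i]);
                if (diff > max_diff)
                    max_diff = diff;
            }
            return max_diff;
        }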

Comments are closed.