A big shortcoming of FATE so far has been its inability to test perceptual audio and video codecs. This is because when FATE runs a test, it compares the output to a known value in its own database, and the output needs to match the known value precisely, i.e., “bit-exact”. The problem with codecs classed as perceptual is that they are not specified to decode in a bit-exact manner. So, for example, decoding the Ogg Vorbis audio file abc.ogg on x86_64 and on PowerPC will produce 2 waves that, though they may sound identical to most listeners, are not precisely the same down to the PCM sample level; minor variations exist (generally, +/- 1).
I have a plan for adapting FATE to handle this. It may seem a little (or a lot) crazy, but hear me out.
At first, I am only thinking about perceptual audio codecs. This will include Vorbis, MP3, AAC, WMA, QDesign, Real-cook, Real-28_8, and a bunch of others I am forgetting at the moment.
The big idea is to store reference decoded waves and then, for each perceptual audio decoding test, decode the file and compare the wave to its reference wave; fail the test if the difference of any of the PCM points is greater than 1.
How to perform the comparison? I have a few ideas:
- Craft a default Python algorithm that painstakingly unpacks each sample from both waves, iterates along each, and calculates the absolute difference at each sample.
- Allow for a FATE installation to call out to a more efficient helper program, one preferably written using SIMD instructions that could read 16 bytes at a time from each wave, and perform absolute value calculations in parallel. I’m thinking a parallel subtract, followed by a parallel absolute value, followed by a bitwise AND should reveal if any of the 16 bytes is outside of tolerance.
- Any other tricks would be appreciated, especially regarding the default algorithm. Are there any special numerical tricks for determining the information I need from 4 bytes in parallel, packed in a 32-bit integer, without SIMD?
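The default algorithm from the first idea could be sketched roughly as follows. This is only a minimal illustration, not FATE code: the function name `first_out_of_tolerance` is made up, and it assumes both waves are already in memory as raw signed 16-bit little-endian PCM with no file header.

```python
import struct


def first_out_of_tolerance(ref: bytes, out: bytes, tolerance: int = 1):
    """Return the index of the first sample whose absolute difference
    exceeds `tolerance`, or None if every sample is within tolerance.

    Assumption: both buffers hold raw signed 16-bit little-endian PCM.
    """
    if len(ref) != len(out):
        raise ValueError("waves differ in length")
    if len(ref) % 2:
        raise ValueError("buffer does not contain whole 16-bit samples")
    # iter_unpack yields one-element tuples, one per 16-bit sample.
    pairs = zip(struct.iter_unpack("<h", ref), struct.iter_unpack("<h", out))
    for index, ((a,), (b,)) in enumerate(pairs):
        if abs(a - b) > tolerance:
            return index
    return None
```

Returning the offending sample index rather than a bare pass/fail would make a failed test easier to investigate, at no extra cost.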
This has the potential to be big, sample-wise. It occurred to me to use FLAC to mitigate storage problems. My first impulse was to store the reference waves as FLAC files in a FATE installation’s sample suite. They would be decoded as needed during a build/test cycle. Decoding FLAC is reasonably fast, after all. However, the more I think about it, I think that part is a silly solution. As a compromise, I may store the reference waves as FLAC in the central MPlayerhq.hu FATE suite archive in order to mitigate storage and transfer requirements. It will also be time to create a small, standard syncing script that performs both the samples rsync and decompresses any new FLAC wave references in the archive.
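The decompression half of such a syncing script might look something like the sketch below. Everything here is hypothetical: the function names are invented, "new" is taken to mean "a .flac with no .wav beside it" (so an updated .flac would not be re-decoded), and it assumes the standard `flac` command-line tool is on the PATH. The rsync half is left out because it depends on the archive's actual layout.

```python
import subprocess
from pathlib import Path


def pending_flac_references(root):
    """Return .flac files under `root` that have no decoded .wav beside them."""
    return sorted(
        path
        for path in Path(root).rglob("*.flac")
        if not path.with_suffix(".wav").exists()
    )


def decode_command(flac_path):
    """Build the flac command line that decodes one reference to WAV."""
    wav_path = Path(flac_path).with_suffix(".wav")
    return ["flac", "--decode", "--silent", "-o", str(wav_path), str(flac_path)]


def decompress_new_references(root):
    """Decode every reference wave that has not been decompressed yet."""
    for flac_path in pending_flac_references(root):
        # check=True raises CalledProcessError if flac reports a failure.
        subprocess.run(decode_command(flac_path), check=True)
```

Calling `decompress_new_references()` on the local samples directory right after the rsync step would leave the build/test cycle with plain waves to compare against.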
All of this is highly speculative at this point. I don’t know how much storage these hypothetical reference waves are going to require. And I don’t know how long it’s going to take in practice to perform all the comparisons. And of course, I don’t know if the +/- 1 tolerance idea will hold up, although cursory tests have been positive.
I know it’s a mathematically “impure” solution. But we need something and I wanted to get this possibly workable idea out.
One idea: use -f s16le – then your comparison program doesn’t even have to parse a file format.
Also keep in mind that each PCM sample will be 16 bits (8 bits would probably cut out too much info), so the comparison has to work on 16-bit samples rather than on individual bytes.
And at the file sizes we’re talking about (< 2MB for ~10 secs of stereo) pretty much any language should be fast enough to just read one 16-bit sample (two bytes) from each file, subtract them, and throw an error if the absolute difference is more than 1.
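As a sanity check on the size estimate: assuming 44,100 Hz audio, 10 seconds of 16-bit stereo is 44,100 × 2 channels × 2 bytes × 10 = 1,764,000 bytes, which is indeed under 2 MB. A rough sketch of the file-level comparison follows. The function name and chunk size are made up for illustration, and it assumes headerless s16le input and a platform where Python's `array` type code `"h"` is 2 bytes wide.

```python
import array
import sys

CHUNK_BYTES = 1 << 16  # even, so a chunk never splits a 16-bit sample


def files_within_tolerance(ref_path, out_path, tolerance=1):
    """Return True if two raw s16le files have the same length and no
    pair of samples differs by more than `tolerance`."""
    with open(ref_path, "rb") as ref_file, open(out_path, "rb") as out_file:
        while True:
            ref_chunk = ref_file.read(CHUNK_BYTES)
            out_chunk = out_file.read(CHUNK_BYTES)
            if len(ref_chunk) != len(out_chunk):
                return False  # the files differ in length
            if not ref_chunk:
                return True  # both files ended with no violation
            if len(ref_chunk) % 2:
                return False  # trailing half sample: malformed input
            ref_samples = array.array("h", ref_chunk)
            out_samples = array.array("h", out_chunk)
            if sys.byteorder == "big":
                # The data is little-endian; swap on big-endian hosts.
                ref_samples.byteswap()
                out_samples.byteswap()
            for a, b in zip(ref_samples, out_samples):
                if abs(a - b) > tolerance:
                    return False
```

Reading in chunks keeps memory use flat regardless of sample length, and the early return means a badly broken decode fails fast.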
Err, to clarify, your bottleneck is almost certainly going to be reading the files from disk, so there isn’t much point in using SIMD, etc.
For AAC I would like to see the AAC conformance guidelines used. Rather than being so strict as to enforce your +/-1 error tolerance, they base their conformance threshold on standard deviation.
I think while your idea is possibly novel, if there are existing practices defined for conformance for a codec, these should be adhered to rather than this ad hoc approach to lossy audio.
tiny_psnr can already compare wav files quite well and it can report several metrics. More could be added if needed. The thing is we should generate metrics for several small durations and compare those. Otherwise lots of small +1 differences could add up.
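To illustrate the "several small durations" idea, here is a rough sketch that computes PSNR per window instead of once per file. It is not how tiny_psnr works internally and is not guaranteed to match its numbers. The function name is invented, the inputs are assumed to be sequences of decoded 16-bit sample values, and the peak value of 65535 is an assumption (other conventions exist).

```python
import math

PEAK = 65535  # assumed full-scale range for 16-bit samples


def windowed_psnr(ref, out, window):
    """Return the PSNR in dB for each consecutive run of `window` samples.

    A window with no differences at all is reported as infinity.
    """
    if len(ref) != len(out):
        raise ValueError("waves differ in length")
    results = []
    for start in range(0, len(ref), window):
        a = ref[start:start + window]
        b = out[start:start + window]
        # Sum of squared errors over this window only.
        sse = sum((x - y) ** 2 for x, y in zip(a, b))
        if sse == 0:
            results.append(math.inf)
        else:
            mse = sse / len(a)
            results.append(10 * math.log10(PEAK * PEAK / mse))
    return results
```

Higher values mean the waves are closer. A test could then fail when the minimum over all windows drops below some threshold, which catches a short burst of damage that a whole-file average would dilute.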
I remember for the MPEG conformance part (layer 1/2/3) that they had compressed test files (sine sweeps) and then two levels of conformance:
- Strictly conformant if no sample is off by more than one from the MPEG reference decoder and the PSNR fulfills threshold X.
- Loosely conformant (can’t remember the exact term) if the PSNR fulfills threshold Y.
So I think the +/-1 test alone might not have enough relevance for such a test.
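The two levels described above might be checked along these lines. This is a sketch from memory of the idea, not of the actual MPEG conformance procedure: the function name and the "strict"/"loose"/"fail" labels are made up, the thresholds X and Y are left as parameters because their real values are not given here, and a peak of 65535 for 16-bit samples is assumed.

```python
import math


def conformance_level(ref, out, strict_psnr_db, loose_psnr_db, peak=65535):
    """Classify decoded samples against reference decoder output.

    `strict_psnr_db` and `loose_psnr_db` stand in for thresholds X and Y.
    """
    if len(ref) != len(out) or not ref:
        raise ValueError("waves must be non-empty and equal in length")
    diffs = [a - b for a, b in zip(ref, out)]
    max_diff = max(abs(d) for d in diffs)
    sse = sum(d * d for d in diffs)
    if sse == 0:
        psnr = math.inf  # identical output
    else:
        # peak^2 / MSE, with MSE = sse / number of samples
        psnr = 10 * math.log10(peak * peak * len(diffs) / sse)
    if max_diff <= 1 and psnr >= strict_psnr_db:
        return "strict"
    if psnr >= loose_psnr_db:
        return "loose"
    return "fail"
```

Combining the two criteria means a single wildly wrong sample cannot hide behind a good average, and a flood of +/-1 errors still has to clear the PSNR bar.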
I’m in favor of adding as strict a test as possible (a maximum number of +/- 1 differences, if possible) to catch the introduction of bugs that would cause only a small PSNR change (like trashing just the first sample of a file).
But it would also be nice to add AAC/MPEG test files and have our tests fail if the output is non-conformant (but if we do stricter testing, conformance should follow automatically).
To be clear: I also want 2 levels of testing, at least where formal conformance vectors or original samples are available. The first level is comparing the decoded output to the original waveform to verify that FFmpeg can meet whatever quality thresholds the standard sets forth. The second level is to take that qualified output and make it the reference from which no other decoded waveform is allowed to deviate by more than +/- 1.
About PSNR, I have been working that angle from FATE’s beginning. Unfortunately, I can barely spell PSNR, and I certainly don’t understand how to interpret the results of the tiny_psnr program included with FFmpeg. I need a tool that tells me, yes or no, if an audio wave is within tolerance (given a reference wave). tiny_psnr only returns -1 if there is a file problem or 0 in any other circumstance. Maybe fate-script.py could be expected to parse the textual result, but we come back to the original problem: I don’t know what the result numbers mean, or what’s “good”. People tell me that information about what is “good” is in “that one spec, somewhere”, but can never provide more useful information.
Plus, I know that there is a bunch of MPEG conformance PCM media that is stored in 24-bit PCM. That’s 50% more data than I care about, and I really don’t want to store that in FATE installations (this is going to be big enough as it is).