Pursuant to my rant on the futility of comparing, performance-wise, the output of various compilers, I wholly acknowledge the utility of systematically benchmarking FFmpeg. FATE is not an appropriate mechanism for doing so, at least not in its normal mode of operation. The “normal mode” would have every one of the 60 or so configurations running certain extended test specs during every cycle. Quite a waste.
Hypothesis: By tracking the performance of a single x86_64 configuration, we should be able to catch performance regressions in FFmpeg.
Proposed methodology: Create a new script that watches for SVN commits. For each and every commit (no skipping), check out the code, build it, and run a series of longer tests. Log the results and move on to the next revision.
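To make that concrete, here is a minimal sketch of such a watcher; the repository URL, the checkout path, and the `run_benchmarks()` placeholder are all assumptions for illustration, not the real FATE infrastructure:

```python
# Minimal sketch of the proposed watcher: poll SVN, and for every new
# revision, check out, build, and benchmark. The repository URL, paths,
# and run_benchmarks() are placeholders.

import subprocess
import time

SVN_URL = "svn://svn.ffmpeg.org/ffmpeg/trunk"   # assumed repository URL
CHECKOUT = "/var/tmp/ffmpeg-bench"              # placeholder working copy

def latest_revision():
    """Ask the repository for its newest revision number."""
    info = subprocess.check_output(["svn", "info", SVN_URL]).decode()
    for line in info.splitlines():
        if line.startswith("Revision:"):
            return int(line.split()[1])

def run_benchmarks(rev):
    """Placeholder: run the long decode tests and log timings keyed by rev."""
    pass

def process_revision(rev):
    """Check out, build, and benchmark a single revision."""
    subprocess.check_call(["svn", "checkout", "-r", str(rev), SVN_URL, CHECKOUT])
    subprocess.check_call(["./configure"], cwd=CHECKOUT)
    subprocess.check_call(["make", "-j4"], cwd=CHECKOUT)
    run_benchmarks(rev)

last_done = latest_revision()
while True:
    head = latest_revision()
    for rev in range(last_done + 1, head + 1):   # every commit, no skipping
        process_revision(rev)
    last_done = head
    time.sleep(300)                              # poll every five minutes
```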
What compiler to use? I’m thinking about using gcc 4.2.4 for this. In my (now abandoned) controlled benchmarks, it was the worst performer by a notable margin. My theory is that the low baseline performance might help to accentuate performance regressions. Is that plausible? 2 years of testing via FATE haven’t revealed any other major problems with this version.
What kind of samples to test? Thankfully, Big Buck Bunny is available in 4 common formats:
- MP4/MPEG-4 part 2 video/AC3 audio
- MP4/H.264 video/AAC audio
- Ogg/Theora video/Vorbis audio
- AVI/MS MPEG-4 video/MP3 audio
I have the 1080p versions of all those files, though I’m not sure if it’s necessary to decode all 10 minutes of each. It depends on what kind of hardware I select to run this on.
Further, I may wish to rip an entire audio CD as a single track, encode it with MP3, Vorbis, AAC, WMA, FLAC, and ALAC, and decode each of those.
What other common formats would be useful to track? Note that I only wish to benchmark decoding. My reasoning for this is that decoding should, on the whole, only ever get faster, never slower. Encoding might justifiably get slower as algorithmic trade-offs are made.
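For what it’s worth, the decode-only timing could come from ffmpeg’s own -benchmark option paired with the null muxer, so everything gets decoded but nothing gets written to disk. A sketch (the binary path and sample file name are placeholders):

```python
# Sketch: time the decode of one sample using ffmpeg's -benchmark flag and
# the null muxer (decode everything, write nothing). The binary path and
# sample file name are placeholders.

import re
import subprocess

def bench_decode(ffmpeg, sample):
    """Return the user time (seconds) ffmpeg reports for decoding `sample`."""
    proc = subprocess.run(
        [ffmpeg, "-benchmark", "-i", sample, "-f", "null", "-"],
        capture_output=True, text=True)
    # -benchmark prints a line like "bench: utime=123.456s" to stderr
    match = re.search(r"utime=([\d.]+)s", proc.stderr)
    return float(match.group(1)) if match else None

print(bench_decode("./ffmpeg", "big_buck_bunny_1080p_h264.mov"))
```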
I’m torn on whether to validate the decoding output during the benchmarking test. The case against validation: computing framecrcs will impact the overall benchmarking process, and validation is redundant anyway since that’s FATE’s main job. The case for validation: since this will always run on the same configuration, there is no need to worry about off-by-1 rounding issues; further, if a validation fails, that data point can be scrapped (which will also happen if a build fails) and will not count towards the overall trend, whereas an errant build could otherwise throw off the performance data. Back on the ‘against’ side, that’s exactly what statistical methods like weighted moving averages are supposed to help smooth out.
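If validation does make the cut, it could be as simple as comparing framecrc output against a reference generated once on this same configuration (paths here are placeholders):

```python
# Sketch: validate a decode by comparing framecrc output against a stored
# reference; on a mismatch (or a build failure), the data point is scrapped.

import subprocess

def validate(ffmpeg, sample, reference_path):
    """Return True if the framecrc output matches the stored reference."""
    proc = subprocess.run(
        [ffmpeg, "-i", sample, "-f", "framecrc", "-"],
        capture_output=True, text=True)
    with open(reference_path) as ref:
        return proc.stdout == ref.read()
```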
I’m hoping that graphing this data for all to see will be made trivial thanks to Google’s Visualization API.
The script would run continuously, waiting for new SVN commits. When it’s not busy with new code, it would work backwards through FFmpeg’s history to backfill performance data.
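In code, that scheduling decision might boil down to something this simple (function and argument names are made up for illustration):

```python
# Sketch of the scheduling policy: fresh commits take priority; when the
# repository is quiet, step backwards through history to backfill.

def next_revision(newest_done, oldest_done, repo_head):
    """Pick the next revision to benchmark."""
    if repo_head > newest_done:
        return newest_done + 1    # catch up on new commits first
    return oldest_done - 1        # otherwise backfill older history
```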
So, does this whole idea hold water?
If I really want to run this on every single commit, I’m going to have to do a little analysis to determine a reasonable average number of FFmpeg SVN commits per day over the past year and perhaps what the rate of change is (I’m almost certain the rate of commits has been increasing). If anyone would like to take on that task, that would be a useful exercise (‘svn log’, some text manipulation tools, and a spreadsheet should do the trick; you could even put it in a Google Spreadsheet and post a comment with a link to the published document).
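For anyone taking up that offer, counting commits per day only takes a little parsing of the revision header lines that ‘svn log’ prints; here’s one way to do it (the repository URL and starting date are just examples):

```python
# Sketch: tally FFmpeg SVN commits per day from 'svn log' output. The
# repository URL and date range are examples only.

import re
import subprocess
from collections import Counter

log = subprocess.run(
    ["svn", "log", "-r", "{2009-01-01}:HEAD",
     "svn://svn.ffmpeg.org/ffmpeg/trunk"],
    capture_output=True, text=True).stdout

# Header lines look like:
# r12345 | author | 2009-05-01 12:34:56 +0200 (Fri, 01 May 2009) | 2 lines
dates = re.findall(r"^r\d+ \| [^|]+ \| (\d{4}-\d{2}-\d{2})", log, re.MULTILINE)

per_day = Counter(dates)
for day in sorted(per_day):
    print(day, per_day[day])
```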