Systematic Benchmarking Adjunct to FATE | Breaking Eggs And Making Omelettes

Pursuant to my rant on the futility of comparing, performance-wise, the output of various compilers, I wholly acknowledge the utility of systematically benchmarking FFmpeg. FATE is not an appropriate mechanism for doing so, at least not in its normal mode of operation. The “normal mode” would have each and every configuration (60 or so) running certain extended test specs during every cycle. Quite a waste.

Hypothesis: By tracking the performance of a single x86_64 configuration, we should be able to catch performance regressions in FFmpeg.

Proposed methodology: Create a new script that watches for SVN commits. For each and every commit (no skipping), check out the code, build it, and run a series of longer tests. Log the results and move on to the next revision.

What compiler to use? I’m thinking about using gcc 4.2.4 for this. In my (now abandoned) controlled benchmarks, it was the worst performer by a notable margin. I’m thinking that the low performance might help to accentuate performance regressions. Is this a plausible theory? 2 years of testing via FATE haven’t revealed any other major problems with this version.

What kind of samples to test? Thankfully, Big Buck Bunny is available in 4 common formats:

MP4/MPEG-4 part 2 video/AC3 audio
MP4/H.264 video/AAC audio
Ogg/Theora video/Vorbis audio
AVI/MS MPEG-4 video/MP3 audio

I have the 1080p versions of all those files, though I’m not sure if it’s necessary to decode all 10 minutes of each. It depends on what kind of hardware I select to run this on.

Further, I may wish to rip an entire audio CD as a single track, encode it with MP3, Vorbis, AAC, WMA, FLAC, and ALAC, and decode each of those.

What other common formats would be useful to track? Note that I only wish to benchmark decoding. My reasoning for this is that decoding should, on the whole, only ever get faster, never slower. Encoding might justifiably get slower as algorithmic trade-offs are made.

I’m torn on the matter of whether to validate the decoding output during the benchmarking test. The case against validation says that computing framecrc’s is going to impact the overall benchmarking process; further, validation is redundant since that’s FATE’s main job. The case for validation says that since this will always be run on the same configuration, there is no need to worry about off-by-1 rounding issues; further, if a validation fails, that data point can be scrapped (which will also happen if a build fails) and will not count towards the overall trend. An errant build could throw off the performance data. Back on the ‘against’ side, that’s exactly what statistical methods like weighted moving averages are supposed to help smooth out.
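For illustration, here is a minimal Python sketch of the kind of smoothing mentioned above: a linearly weighted moving average over per-revision timings, where more recent revisions carry more weight. The function name, the window size, and the sample timings are all made up for the example; this is one possible reading of "weighted moving average", not a settled design.

```python
def weighted_moving_average(values, window=5):
    """Smooth a series with linearly increasing weights (newest sample weighs most).

    For the first few points, where fewer than `window` samples exist,
    the average is taken over whatever samples are available.
    """
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        weights = range(1, len(chunk) + 1)  # 1 for the oldest, len(chunk) for the newest
        total = sum(w * v for w, v in zip(weights, chunk))
        smoothed.append(total / sum(weights))
    return smoothed


if __name__ == "__main__":
    # Hypothetical decode times in seconds; the 14.9 stands in for one errant build.
    timings = [10.1, 10.0, 10.2, 14.9, 10.1, 10.0]
    for raw, avg in zip(timings, weighted_moving_average(timings, window=3)):
        print(f"raw={raw:5.1f}  smoothed={avg:5.2f}")
```

A single bad data point still nudges the smoothed curve, just by less, which is the trade-off against simply scrapping data points that fail validation.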

I’m hoping that graphing this idea for all to see will be made trivial thanks to Google’s Visualization API.

The script would run continuously, waiting for new SVN commits. When it’s not busy with new code, it would work backwards through FFmpeg’s history to backfill performance data.
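The ordering described above (new commits first, backfill when idle) could be sketched roughly as follows in Python. Everything here is hypothetical: the function name, the idea of remembering the revision at which the script first started, and the assumption that every integer revision number is a candidate worth testing.

```python
def next_revision(head, start, tested, oldest=1):
    """Pick the next revision to benchmark, or None if nothing is left.

    head   -- newest revision currently in the repository
    start  -- revision that was newest when the script first ran (assumed bookkeeping)
    tested -- set of revisions already benchmarked
    oldest -- earliest revision worth backfilling
    """
    # New commits take priority and are handled in order, with no skipping.
    for rev in range(start + 1, head + 1):
        if rev not in tested:
            return rev
    # Otherwise walk backwards through history to backfill older data.
    for rev in range(start, oldest - 1, -1):
        if rev not in tested:
            return rev
    return None


if __name__ == "__main__":
    tested = set()
    head = start = 100          # made-up revision numbers
    for _ in range(3):
        rev = next_revision(head, start, tested)
        print("benchmarking", rev)
        tested.add(rev)
    head = 102                  # two new commits arrive while backfilling
    print("next up:", next_revision(head, start, tested))
```

Keeping the choice of revision in a pure function like this separates the scheduling policy from the checkout, build, and timing steps, so the policy can be changed without touching the rest.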

So, does this whole idea hold water?

If I really want to run this on every single commit, I’m going to have to do a little analysis to determine a reasonable average number of FFmpeg SVN commits per day over the past year and perhaps what the rate of change is (I’m almost certain the rate of commits has been increasing). If anyone would like to take on that task, that would be a useful exercise (‘svn log’, some text manipulation tools, and a spreadsheet should do the trick; you could even put it in a Google Spreadsheet and post a comment with a link to the published document).
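For anyone who would rather script the exercise than use a spreadsheet, here is a rough Python sketch. It assumes the text of ‘svn log’ has already been saved to a string, and that header lines follow the usual "rNNN | author | date ... | N lines" layout; the function names and the sample log entries are invented for the example.

```python
import calendar
import re
from collections import Counter

# Matches svn log header lines such as:
# r3 | alice | 2010-01-26 09:46:00 +0000 (Tue, 26 Jan 2010) | 1 line
HEADER = re.compile(r"^r\d+ \| [^|]* \| (\d{4})-(\d{2})-\d{2} ")


def commits_per_month(log_text):
    """Count commits per 'YYYY-MM' from the text output of svn log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = HEADER.match(line)
        if match:
            counts[f"{match.group(1)}-{match.group(2)}"] += 1
    return dict(counts)


def average_per_day(counts):
    """Average commits per calendar day over the months present in counts."""
    total_days = 0
    for key in counts:
        year, month = map(int, key.split("-"))
        total_days += calendar.monthrange(year, month)[1]
    return sum(counts.values()) / total_days if total_days else 0.0


if __name__ == "__main__":
    # Invented sample entries, only to show the expected input shape.
    sample = "\n".join([
        "-" * 72,
        "r3 | alice | 2010-01-26 09:46:00 +0000 (Tue, 26 Jan 2010) | 1 line",
        "",
        "fix something",
        "-" * 72,
        "r2 | bob | 2010-01-25 12:00:00 +0000 (Mon, 25 Jan 2010) | 2 lines",
        "-" * 72,
        "r1 | alice | 2009-12-31 23:00:00 +0000 (Thu, 31 Dec 2009) | 1 line",
    ])
    counts = commits_per_month(sample)
    print(counts, average_per_day(counts))
```

Note that a month still in progress is divided by its full length here, so the most recent month's daily rate will be understated.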

16 thoughts on “Systematic Benchmarking Adjunct to FATE”

nine January 26, 2010 at 9:46 am

Regarding the validation of output, a moving average is more designed to remove noise in ‘periodic’ data sources (eg, if every seventh test failed). You probably just want to delete data points more than 2 standard deviations from the mean or something.
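A minimal Python sketch of that two-standard-deviation filter, assuming the timings for one test are kept as a plain list of seconds (the function name, the cutoff parameter, and the sample values are illustrative only):

```python
import statistics


def drop_outliers(samples, max_sigma=2.0):
    """Return the samples lying within max_sigma standard deviations of the mean."""
    if len(samples) < 2:
        return list(samples)
    mean = statistics.mean(samples)
    sigma = statistics.pstdev(samples)  # population standard deviation
    if sigma == 0:
        return list(samples)            # all values identical, nothing to drop
    return [s for s in samples if abs(s - mean) <= max_sigma * sigma]


if __name__ == "__main__":
    # Hypothetical decode times in seconds; 25.0 stands in for an errant build.
    timings = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 25.0]
    print(drop_outliers(timings))
```

One caveat: a genuine, lasting regression also starts out looking like an outlier, so a filter like this probably belongs on repeated runs of the same revision rather than across revisions.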

Regarding performance testing, maybe you could run daily/weekly performance runs on all your hardware via the existing FATE architecture? You wouldn’t get per-commit granularity, but it would catch any arch-specific regressions.

Multimedia Mike (Post author) January 26, 2010 at 9:53 am

“Standard deviation” is the part where my eyes glaze over. I think I will be better off making sure that the raw data is available via various formats and APIs and let the stats geeks have their way with it.

nine January 26, 2010 at 9:56 am

Here’s the raw commit data. Someone else can prettify it. For the record, the “grep 20” gets rid of extraneous log lines the initial egrep didn’t filter out.

svn log | egrep '^r.*line' | awk '{print $5}' | cut -d- -f1-2 | grep 20 | uniq -c
289 2010-01
174 2009-12
163 2009-11
67 2009-10
138 2009-09
142 2009-08
54 2009-07
75 2009-06
92 2009-05
127 2009-04
346 2009-03
358 2009-02
187 2009-01
162 2008-12
182 2008-11
194 2008-10
171 2008-09
122 2008-08
213 2008-07
230 2008-06
320 2008-05
314 2008-04
192 2008-03
170 2008-02
380 2008-01
348 2007-12
302 2007-11
226 2007-10
366 2007-09
338 2007-08
273 2007-07
259 2007-06
253 2007-05
323 2007-04
482 2007-03
287 2007-02
290 2007-01
400 2006-12
836 2006-11
545 2006-10
399 2006-09
334 2006-08
403 2006-07
313 2006-06
2 2006-05
1 2006-06
195 2006-05
347 2006-04
309 2006-03
179 2006-02
245 2006-01
211 2005-12
185 2005-11
259 2005-10
294 2005-09
173 2005-08
285 2005-07
271 2005-06
296 2005-05
277 2005-04
172 2005-03
228 2005-02
339 2005-01
220 2004-12
239 2004-11
314 2004-10
298 2004-09
292 2004-08
190 2004-07
216 2004-06
160 2004-05
271 2004-04
83 2004-03
104 2004-02
197 2004-01
161 2003-12
206 2003-11
373 2003-10
233 2003-09
222 2003-08
158 2003-07
132 2003-06
193 2003-05
257 2003-04
257 2003-03
325 2003-02
507 2003-01
361 2002-12
316 2002-11
436 2002-10
361 2002-09
347 2002-08
244 2002-07
371 2002-06
328 2002-05
494 2002-04
527 2002-03
449 2002-02
525 2002-01
702 2001-12
627 2001-11
567 2001-10
221 2001-09
376 2001-08
177 2001-07
332 2001-06
243 2001-05
410 2001-04
243 2001-03
21 2001-02

Multimedia Mike (Post author) January 26, 2010 at 10:16 am

Neat. Thanks for doing that, nine. 289 commits in a month means an average of fewer than 10 per day. That gives me something to plan for.

nine January 26, 2010 at 10:41 am

Just realised creating a graph lets me avoid doing other work. Voila: http://img706.imageshack.us/img706/1438/mplayerprojectcommitsby.png

nine January 26, 2010 at 10:48 am

oops, those are mplayer numbers. Brain meltdown? I need a drink…

Here are ffmpeg’s numbers in graph form. Do you want the raw numbers? You can get them yourself with the above command. http://img6.imageshack.us/img6/3644/ffmpegprojectcommitsbym.png

Multimedia Mike (Post author) January 26, 2010 at 11:29 am

I’ve never seen a graph where the timeline decreases.

Anyway, those numbers do change the way I need to think about this. Thanks.

Carl Eugen Hoyos January 26, 2010 at 12:58 pm

I think VC1 should also get tested…

Carl Eugen

Multimedia Mike (Post author) January 26, 2010 at 1:09 pm

Yeah, I was thinking about VC1 as well. I just need to find a suitably high-def sample.

Multimedia Mike (Post author) January 26, 2010 at 2:39 pm

Ah, Microsoft HD Showcase; solves that problem.

Ben January 26, 2010 at 7:10 pm

The BBB files aren’t representative enough of modern encoding techniques, except the “AVI/MS MPEG-4 video/MP3 audio” one. The “Ogg/Theora video/Vorbis audio” was encoded with a pre-Thusnelda release which doesn’t use all the features of the format such as variable quantization (not sure if it matters much – probably not).

The “MP4/MPEG-4 part 2 video/AC3 audio” which is, despite the labeling, an AVI file, doesn’t use 8x8MV or B-frames (not to mention qpel or GMC but those aren’t used much in the wild either – B-frames are really important though). This one was encoded with ffmpeg’s default settings (= no features used, crap quality).

But the worst is the “MP4/H.264 video/AAC audio” file, encoded by the notoriously bad Apple encoder. It’s a Main Profile file (implying no 8×8 transform) which uses CAVLC (as opposed to CABAC) as the entropy encoder. It uses just one reference frame, no weighted prediction of any kind (and in fact a fixed PBPBPBP frame type pattern) and the deblocking filter is disabled for a ton of frames (as soon as the mean quantizer approaches the low 20s it’s turned off completely).

I don’t think you want to optimize for these kinds of files. If you can’t find anything else, you’re better off encoding your own.

Multimedia Mike (Post author) January 26, 2010 at 7:26 pm

@Ben: I was afraid of that. But I suppose if I want to do this, I would do well to get this part right. Thanks for the tips.

Mans January 27, 2010 at 2:18 pm

Regardless of the encoder settings, Big Buck Bunny is a bad video for benchmarking. As is typical for animations, it has fairly low detail and largely static backgrounds with motion restricted to a few small areas. This provides a rather atypical workload for the motion compensation.

Fruit January 28, 2010 at 5:41 am

Well, x264 has nice presets now, so you could make a representative file using preset medium or preset veryslow. I don’t know if you want to test CAVLC too…

Alternatively, a sample of some illegitimate video, which is what “in the wild” means… or a piece of a bluray stream, since that is the official use of the codec…

Fruit January 28, 2010 at 6:31 am

Oh, one other thing. The effect of code changes on performance varies between CPUs: K8, for example, has generally slow SSE2, Pentium M too iirc, and K7/K8/K10 have big L1 caches while Intel has smaller ones (so if Intel gains from code compacting, AMD can actually slow down, see here in x264: http://x264dev.multimedia.cx/?p=201).

Basically, a speed regression doesn’t have to show up on all CPUs.

P.S. It’s already late, with the recent ffh264 changes underway atm :D

Steven Robertson January 28, 2010 at 10:58 am

Why not do more exhaustive testing on every nth revision, for decently large n, and use an automated binary search to track down performance aberrations per-codec (and possibly repeat tests enough to attain statistical certainty, only when tracking down a regression)? Assuming you’ve got the requisite hardware, this should let you do broader testing across platforms without tying things up for too long.
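As a rough sketch of the automated binary search part, in Python: given two sampled revisions where the older one is fast and the newer one is slow, narrow down the first slow revision in between. The function name, the threshold, and the fake timing function are hypothetical, and the search assumes that once the regression lands, every later revision in the range stays slow.

```python
def first_slow_revision(revisions, measure, threshold):
    """Find the first revision whose measured time exceeds threshold.

    revisions -- sorted oldest to newest; the first is known fast, the last known slow
    measure   -- callable taking a revision and returning a time in seconds
    threshold -- time above which a revision counts as regressed
    """
    lo, hi = 0, len(revisions) - 1      # invariant: lo is fast, hi is slow
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if measure(revisions[mid]) > threshold:
            hi = mid                    # regression is at mid or earlier
        else:
            lo = mid                    # regression is after mid
    return revisions[hi]


if __name__ == "__main__":
    # Fake timings: pretend a regression landed in revision 137.
    def fake_measure(rev):
        return 12.0 if rev >= 137 else 10.0

    culprit = first_slow_revision(list(range(100, 151)), fake_measure, threshold=11.0)
    print("first slow revision:", culprit)
```

In practice `measure` would build and benchmark the revision, possibly several times, which is where the "repeat tests enough to attain statistical certainty" part would fit.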
