Category Archives: Programming

FFmpeg and Multiple Build Threads

I got bored today and decided to empirically determine how much FFmpeg compilation time can be improved by using multiple build threads, i.e., ‘make -jN’ where N > 1. I also wanted to see if the old rule of “number of CPUs plus 1” makes any worthwhile difference. The thinking behind that latter rule is that there should always be one more build job queued up ready to be placed on the CPU if one of the current build jobs has to access the disk. I think I first learned this from reading the Gentoo manuals. I didn’t find that it made a significant improvement. But then, Gentoo is for ricers.

FFmpeg being built with multiple threads on a 2x Core 2 Duo

FFmpeg being built with multiple threads on a Core 2 Duo

FFmpeg being built with multiple threads on a dual-core hyperthreaded Atom

I think the most interesting thing to observe about these graphs is the CPU time (the amount of time the build jobs are actually spending on the combined CPUs). The number is roughly steady for the Core 2 CPUs regardless of number of jobs while the Hyperthreaded Atom CPU sees a marked increase in total CPU time. Is that an artifact of Hyperthreading? Or maybe I just didn’t put together a stable testing methodology.

State of the Art Compiler Optimization

Felix von Leitner delivered a talk at the 2009 Linux Kongress about the state of the art in compiler optimization (link to PDF slides). Presentation slides by themselves are not a good way to understand a talk and it would be better to learn if video for the actual talk is posted somewhere. Compiler optimization (or lack thereof) is fairly important to FFmpeg developers.

The talk analyzes how LLVM, icc, MSVC, Sun C, and gcc generate fast code in this day and age. One basic theme I gathered is that coders should forgo clever C optimizations as they tend to be counterproductive. I wish I could believe that, but there was that recent episode where I optimized FFmpeg’s Theora decoder by removing structure dereferences. I’m sure that other performance-minded multimedia hackers will have other nits to pick with the broad generalizations in the presentation. I call your attention to the fighting words (which I have taken out of context since it’s such a fun quote) on slide 41: “Note: gcc is smarter than the video codec programmer on all platforms.” Further, slides 53-55 specifically call out madplay for inline ASM that allegedly didn’t improve efficiency vs. what the compiler could achieve with the raw C code.

On the whole, the findings are probably quite accurate for the kind of C code that most people need to write (e.g., “No need to write a >> 2 when you mean a/4!”).

Speaking of compilers, FATE now covers Intel’s 11.1 series C compiler for both 32- and 64-bit icc. I have also updated the stale snapshots of the gcc-svn for my machines (I still need to write a tool to do that for me automatically and continuously).

Optimizing Away Arrows

Google released the third version of their year-old Chrome browser this past week. This reminded me that they incorporate FFmpeg into the software (and thanks to the devs for making various fixes available to us). Chrome uses FFmpeg for decoding HTML5/video tag-type video and accompanying audio. This always makes me wonder, why would they use FFmpeg’s Theora decoder? It sucks. I should know; I wrote it.

Last year, Reimar discovered that the VP3/Theora decoder spent the vast majority of its time decoding the coefficient stream. He proposed a fix that made it faster. I got a chance to check out the decoder tonight and profile it with OProfile and FFmpeg’s own internal timer facilities. It turns out that the function named unpack_vlcs() is still responsible for 44-50% of the decoding time, depending on machine and sample file. This is mildly disconcerting considering the significant amount of effort I put forth to even make it that fast (it took a lot of VLC magic).

So a function in a multimedia program is slow? Well, throw assembly language and SIMD instructions at the problem! Right? It’s not that simple with entropy decoders.

Reimar had a good idea in his patch and I took it to its logical conclusion: Optimize away the arrows, i.e., structure dereferences. The function insists on repeatedly grabbing items out of arrays from a context structure. Thus, create local pointers to the same array and save a bunch of dereferences through each of the innumerable iterations.

Results were positive– both OProfile and the TSC-based internal counter showed notable improvements.

Ideas for further improvements: Multithreading is all the rage for video decoders these days. Unfortunately, entropy decoding is always a serial proposition. However, VP3/Theora is in a unique position to take advantage of another multithreading opportunity: It could call reverse_dc_prediction() in a separate thread after all the DC coefficients are decoded. Finally, an upside to the algorithm’s unorthodox bitstream format! According to my OProfile reports, reverse_dc_prediction() consistently takes around 6-7% of the decode time. So it would probably be of benefit to remove that from the primary thread which would be busy with the AC coefficients.

Taking advantage of multiple threads would likely help with the render_slice() function. One thing at a time, though. Wish me luck with presenting the de-dereferencing patch to the list.

OpenCL On The Horizon

Mac OS X 10.6, a.k.a. Snow Leopard, is slated for release at the end of this week. One of the most interesting features I have read about is OpenCL support, a parallelization framework.

So how about it? What kind of possibilities does this hold for something like FFmpeg? The pedagogical example in the Wikipedia article demonstrates partitioning a fast Fourier transform so that it can be handled as separate work units, possibly by separate CPUs. I doubt that it would make a (positive) difference to, e.g., split up all of the inverse transforms during a video frame decode.

I really can’t judge the spec by the one example. Perhaps I should, at the very least, read the overview slides available here.

Sometimes I think that it doesn’t help my development as a programmer and computer scientist that I view every single technological development that comes down the pike through the fairly narrow lens of multimedia hacking.

Breaking Eggs And Making Omelettes

Topics On Multimedia Technology and Reverse Engineering

Category Archives: Programming

FFmpeg and Multiple Build Threads

State of the Art Compiler Optimization

Optimizing Away Arrows

OpenCL On The Horizon