Breaking Eggs And Making Omelettes

Topics On Multimedia Technology and Reverse Engineering


Compiler Performance Profiling With FFmpeg

March 2nd, 2009 by Multimedia Mike

A Slashdot story titled “High Performance Linux Kernel Project” discusses an effort called LinuxDNA to make the Linux kernel compilable with Intel’s C compiler (icc), in the hope of producing a higher-performing Linux kernel. The ensuing comments showcased a lot of back and forth about whether icc actually offers any performance gains over recent (or even ancient) gcc versions (“ICC really no longer has the performance lead that it once did over gcc.”). I was also curious about the claim that “it’s well known that gcc 2.95.3 generates much better code on a lot of platforms”; is that why we continually test 2.95.3 via FATE?

Yet another comment stated that, “We tried ICC on our simulator. The result: 8% slower than GCC. On intel chips.” That’s when I realized that I’m in a position to offer some controlled testing using a CPU-intensive application: FFmpeg. At any given time, I have access to the latest builds compiled for 20 different configurations. This includes a copy built with icc 10.1.017.

So I ran some tests. Executive summary: icc finishes neck and neck with gcc 4.1.2 (a tiny bit ahead of gcc in my test), while both put most of the rest of the compilers in the test to shame, especially the latest gcc compilers. I have a chart to back up my claims, so there:

Followup: Be sure to see the results of this same exercise run without any manual ASM/SIMD optimizations.

[Chart: Compiler performance when decoding MPEG-4 video and MP3 audio with FFmpeg]

Small aside: I hope you appreciate that chart. You wouldn’t believe how long it took me to coerce OpenOffice.org to create it, nor how grotesquely volatile OOo 3 is on Mac OS X. In the end, the program didn’t play ball and I had to use Mac’s screenshot feature to capture the goods for publishing.

Methodology: I took a 104-minute movie that has been encoded with ISO MPEG-4 part 2 (a.k.a. DivX/XviD) video and MP3 audio and fed it through the following command:

$ time ffmpeg -i file.avi -f framecrc - > /dev/null

I used the ‘user’ output from the time prefix (out of the real, user, and sys times), which counts the approximate seconds that the process spent on the CPU in user mode. This should exclude time spent waiting on I/O and, really, probably just counts the number of 10ms time slices that the OS allocates to the process. I ran the test once for each compiler configuration, then ran through the configurations a second time and graphed the lower of the two times for each configuration.
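For anyone who wants to repeat this kind of measurement, here is a minimal Python sketch of the same idea: run a command, read the user-mode CPU time charged to it, and keep the lowest figure over a couple of runs. This is not the script used for the results above. The helper names `user_cpu_seconds` and `best_of` are made up for illustration, and the approach assumes a Unix-like system, where `os.times()` reports CPU time for finished child processes.

```python
import os
import subprocess
import sys


def user_cpu_seconds(cmd):
    """Run cmd to completion and return the user-mode CPU seconds it used.

    Relies on os.times().children_user, which accumulates the user time of
    child processes that have been waited for (Unix-like systems only).
    """
    before = os.times().children_user
    # Discard output, much like redirecting to /dev/null on the command line.
    subprocess.run(cmd, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=True)
    after = os.times().children_user
    return after - before


def best_of(cmd, runs=2):
    """Return the minimum user CPU time over `runs` executions of cmd."""
    return min(user_cpu_seconds(cmd) for _ in range(runs))


if __name__ == "__main__":
    # Example: python best_of.py ffmpeg -i file.avi -f framecrc -
    if len(sys.argv) > 1:
        print(f"{best_of(sys.argv[1:]):.2f}s user CPU")
```

Taking the minimum rather than the mean is a deliberate choice here: background activity on the machine can only make a run slower, so the lowest figure is the one least polluted by noise. To compare compilers, one would call `best_of` once per ffmpeg binary, with each binary built by a different compiler.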

One day, I will have graphing working in FATE so that we can obtain continuous and historical performance data that will help us analyze trends, both in FFmpeg and in the compilers that build it.

Another comment from the Slashdot thread asserted that “it is simply healthy for the kernel to be compilable across more compilers,” to which another commenter replied, “Prove it.” Again, I think I’m in a position to help here. While it may be more common for a test to break on all PowerPC configurations due to endian considerations, or for the build to break on the icc or gcc 2.95.3 configurations for reasons related to C99 arcana, there have been a few instances where FATE tests have inexplicably broken on very specific configurations. The latest example of this is when a recent code change in FFmpeg randomly caused the wc3movie-xan test spec to fail, but only on the Linux / x86_32 / gcc 4.2.4 configuration. Huh? Well, thanks go to Vitor, who promptly went to work with valgrind and found that the subsystem was doing some bad things in the first place, in a way that finally manifested on just one configuration. (Incidentally, I’m pretty sure that the WC3 playback system was the first bit of code I ever contributed to FFmpeg.)

Posted in FATE Server, Programming

7 Responses

  1. Kostya Says:

    When I fix RV3/4 B-frames issues you’ll have to update FATE tests for them. So you have probably less than six months to prepare for that.

    Oh, and here’s the obligatory reminder: you still have Xan4 to finish.

  2. dconrad Says:

    It might be interesting to disable all of ffmpeg’s asm – most of the execution time in ffmpeg is spent in functions where all the compiler can optimize is the prologue and epilogue.

  3. Multimedia Mike Says:

    @dconrad: Good brainstorm. It will take a little while to set that up (build with all the compilers) but it should yield some fascinating results.

  4. Reimar Says:

    You forgot the “I can produce really fast but wrong code, too” quote. While ICC 10 32-bit seems to work well enough, ICC 11 64-bit seems to completely miscompile the H.264 decoder – and it is hard to find the issue since ICC does not have many options to set optimizations. Basically the only thing you can do is disable optimizations completely, but then it even leaves “if (0) …” code in, so FFmpeg won’t even compile…

  5. Falk Says:

    This seems to be a pretty serious performance regression from gcc 4.1.2 to 4.3.2. Could you perhaps try with e.g. oprofile whether it is attributable to a single or a few functions? It might be fixable. Also interesting would be to try a gcc 4.4 snapshot.

  6. Carl Eugen Hoyos Says:

    If you add -parallel to CFLAGS, icc will beat gcc more clearly.

    @Reimar: icc 10.1 64bit should still beat gcc…

  7. Bernd Says:

    In your benchmark I see that gcc 4.2 and above are a lot slower. What CPU do you have, and is the memory dual channel or not? It’s known that the option -funswitch-loops (used in -O3) generates very big code, which can cause many level-1 cache misses on older CPUs.

    Maybe you can run the ffmpeg test with these options in gcc 4 to see if that’s faster?

    -O2 -finline-functions -fgcse-after-reload -fpredictive-common