Executive Summary: Showcased by FFmpeg, Intel’s C Compiler beats gcc’s C compiler. Handily. Decisively. I stop just short of brutal dismemberment metaphors because that just seems so tasteless, and because I know there must be options to explore in order to improve gcc’s numbers.
Pursuant to my last post, where I found results all over the map when comparing FFmpeg’s performance when built with different compilers, with Intel’s C compiler (icc) barely edging out gcc 4.1.2, David Conrad recommended that I try building FFmpeg with all ASM and manual SIMD optimizations disabled. That way, the compilers would have a chance to really shine in optimizing plain C code for a computationally intensive (not to mention commonplace) task. And so I repeated the same test, only I configured the builds with these options:
./configure --disable-yasm --disable-mmx \
    --disable-mmx2 --disable-sse --disable-ssse3
I also built static binaries with no swscale, if that makes any difference. After each build, I manually audited the resulting binary using the command:
objdump -d ffmpeg_g | grep movq
This method is predicated on the observation that x86 SIMD code blocks nearly always involve at least a movq (move quadword) instruction.
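A plain grep for movq is a blunt instrument: on 64-bit builds, objdump's AT&T syntax can also print movq for ordinary integer stores such as `movq $0x0,(%rax)`. As a rough sketch of a stricter check, the Python below counts only instructions whose operands mention an MMX or SSE register. The function name `count_simd_mnemonics` is my own invention, and the parsing assumes the usual tab-separated layout of `objdump -d` output.

```python
import re
from collections import Counter

# An MMX (%mm0..%mm7) or SSE (%xmm0..%xmm15) register operand in AT&T syntax.
SIMD_REG = re.compile(r"%x?mm\d+")


def count_simd_mnemonics(disassembly):
    """Count instructions that touch an MMX/SSE register, grouped by mnemonic.

    Assumes the layout printed by `objdump -d`:
    address, tab, raw bytes, tab, mnemonic and operands.
    """
    counts = Counter()
    for line in disassembly.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # symbol labels, section headers, byte-only lines
        instruction = fields[2].strip()
        if not instruction or not SIMD_REG.search(instruction):
            continue
        mnemonic = instruction.split()[0]
        counts[mnemonic] += 1
    return counts
```

To use it, save the disassembly first (for example `objdump -d ffmpeg_g > ffmpeg_g.dis`, where the output file name is arbitrary), read that file as text, and pass the contents to the function. An empty result suggests the build really is free of MMX/SSE code.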
Then I did 2 runs back to back with each build. The results are thus:
It’s interesting to note that icc’s build tested positive for movq instructions; they appear to be generated by the compiler, not present due to FFmpeg code. If the compiler was smart enough to build a binary that uses SIMD where appropriate, I count that as fair game for this exercise. Note that I didn’t specify a CPU type to icc. Meanwhile, the optimization level for the gcc builds is cranked up to -O3 (same with icc).
I’m eager to hear how gcc’s numbers might be improved in this case (especially for the latest gcc versions). For reference, every one of these gcc compiler versions was built from source by me. Did I neglect to configure with some --turbo option? Also, fairness dictates that I field suggestions about how to coax icc into building an FFmpeg binary that further embarrasses gcc (and, by extension, free software) in this matter.
Hey, wanna hear something really creepy? Just as I was finishing this post, an apparently automated email arrived from Intel, asking for my feedback on icc.
You’re running that on an Intel CPU, aren’t you? Try running it on AMD instead. And by the way, -O3 sometimes produces worse code than -O2 (on 64-bit systems, for instance), and without a proper -march gcc wouldn’t generate any SIMD by itself.
I have an old Perl script that tells you how many SIMD instructions, and of which type, are in a binary, but I’d have to find out who wrote it and, if needed, extend it, because it’s quite old.
For gcc, definitely try adding -march=native (where available).
I am not sure which CPU icc generates code for by default, but reading the command-line options I see no flags for MMX or SSE1. If it _always_ requires these, it would have an unfair advantage vs. gcc, which I think produces code for the 486.
And yes, icc does have a -fast option which does what -march=native does and a bit more.
Concerning -O3 being slow: IIRC gcc has had serious issues with inlining (mostly inlining too much). I thought that had improved somewhere along the 4.x series, though.
I do hope you check that the ICC binary produces reasonable output though? As said before, it’s easy to be fast if you just (incorrectly) leave out half the code…
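One cheap way to run that sanity check is to decode the same input with each build into a raw output file and compare digests. The sketch below is only an illustration: the helper names and the file names in the usage note are hypothetical, and it assumes both outputs have already been written to disk.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def outputs_match(path_a, path_b):
    """True if the two files are byte-for-byte identical."""
    return file_digest(path_a) == file_digest(path_b)
```

For example, `outputs_match("out_gcc.yuv", "out_icc.yuv")` would compare the decoded video from two builds. A mismatch is not automatically a miscompile, since floating-point code can legitimately round differently from one compiler to the next, but it is a signal that the outputs deserve a closer look.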
What about the results?
How do the different ffmpeg binaries compare to each other recoding some video and audio?
This is much more important for an end user like me.
I’d also be interested to see how the code produced by LLVM does.
@Bobby: Is FFmpeg known to compile under LLVM?
@Multimedia Mike: I don’t know. That’s something else I’m curious to see. I’ll probably give it a try in the next couple days if somebody doesn’t beat me to it.
ffmpeg doesn’t compile under llvm-gcc with inline asm on, but should work with it off.
gcc should be able to do a little autovectorization with appropriate -march -mtune, but not as many movqs.
And can you try gcc svn?
ffmpeg llvm benchmarks: http://laurovenancio.wordpress.com/2007/08/07/llvm-perf-tests/
With an Athlon64 X2 3800:
Configure options common to all configurations: --disable-amd3dnow --disable-amd3dnowext --disable-mmx --disable-mmx2 --disable-sse --disable-ssse3 --disable-yasm
Input file was divx5 video (652×356), mp3 audio, 95 minutes
ICC 10.1 20080801:
– options: --cc=icc --extra-cflags=-O3
– best time: 7m57.379s
– options: -O3 -march=athlon64
– best time: 8m17.021s
LLVM-GCC 2.4 (2.5 came out recently):
– options: -O3 -march=athlon64
– best time: 8m30.107s
And just for comparison, with the ffmpeg I had installed, which included all the hand-written asm, it only takes 6m13.722s
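Timings in the `time`-style format above are easier to compare once converted to seconds. Here is a small sketch that does the conversion and reports how much slower one run is than another. The helper names `to_seconds` and `percent_slower` are made up for this illustration.

```python
import re

# Matches strings like "7m57.379s" or "42.5s" as printed by the shell's `time`.
TIME_RE = re.compile(r"^(?:(\d+)m)?(\d+(?:\.\d+)?)s$")


def to_seconds(text):
    """Convert a '<minutes>m<seconds>s' string to seconds as a float."""
    match = TIME_RE.match(text.strip())
    if match is None:
        raise ValueError("unrecognized time format: %r" % text)
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def percent_slower(candidate, baseline):
    """How much longer `candidate` took than `baseline`, in percent."""
    base = to_seconds(baseline)
    return 100.0 * (to_seconds(candidate) - base) / base
```

Calling `percent_slower("8m17.021s", "7m57.379s")` puts the second entry about 4% behind the ICC build.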
I don’t have the reference at hand but I remember at least older Intel C Compiler versions used to have a killswitch for AMD.
Basically it produced much bigger code, branching depending on the runtime features found on the CPU (SSE, MMX, and so on), checked with cpuid. When cpuid reported a vendor different from GenuineIntel, it branched to “no extension available”.
Fun isn’t it?
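To make the difference concrete, here is a toy Python model of the two dispatch strategies. It only mirrors the behavior as described in the comment above and is not Intel's actual code: the function names, the feature names, and the path labels are all invented for illustration. The vendor strings "GenuineIntel" and "AuthenticAMD" are the ones cpuid reports for Intel and AMD parts.

```python
# Code paths from most to least capable (hypothetical labels).
PREFERRED_PATHS = ("sse2", "sse", "mmx")


def dispatch_by_features(features):
    """Pick the best code path from the CPU's feature flags alone."""
    for name in PREFERRED_PATHS:
        if name in features:
            return name
    return "baseline"


def dispatch_with_vendor_check(vendor, features):
    """Same, except any non-Intel vendor is sent to the baseline path."""
    if vendor != "GenuineIntel":
        return "baseline"  # feature flags are never consulted
    return dispatch_by_features(features)
```

With the same feature set of `{"mmx", "sse", "sse2"}`, the vendor-checked version returns "sse2" for "GenuineIntel" but "baseline" for "AuthenticAMD", while the feature-only version returns "sse2" for both.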
@Flameeyes: From the discussions I’ve read, it all depends on whose side of the story you believe. Apparently, Intel said it was to protect against running magical, possibly volatile code on non-blessed chips.
So there, you anti-Intel zealot. :-)
Weeeell when a hardware manufacturer comes to me with a compiler that “magically” runs code faster on his chips, but doesn’t on the ones from a rival, it really sounds like mafia speaking ;)
But I agree that GCC needs improvement, especially in the heuristics. I’m afraid I’m not geek enough to know about compilers to the point of hacking at them though :(
Just for the record: Until recently, I used to run icc compiled code (as long as that was possible with FFmpeg) only on AMD cpus and the code was always faster than with gcc.
icc defaults to something like --cpu=pentium3, so unfortunately this test was not exactly fair. -march=native would turn the advantage; IMO, --cpu=core2 would be fair.