State of the Art Compiler Optimization

Felix von Leitner delivered a talk at the 2009 Linux Kongress about the state of the art in compiler optimization (link to PDF slides). Presentation slides by themselves are not a good way to understand a talk and it would be better to learn if video for the actual talk is posted somewhere. Compiler optimization (or lack thereof) is fairly important to FFmpeg developers.

The talk analyzes how LLVM, icc, MSVC, Sun C, and gcc generate fast code in this day and age. One basic theme I gathered is that coders should forgo clever C optimizations as they tend to be counterproductive. I wish I could believe that, but there was that recent episode where I optimized FFmpeg’s Theora decoder by removing structure dereferences. I’m sure that other performance-minded multimedia hackers will have other nits to pick with the broad generalizations in the presentation. I call your attention to the fighting words (which I have taken out of context since it’s such a fun quote) on slide 41: “Note: gcc is smarter than the video codec programmer on all platforms.” Further, slides 53-55 specifically call out madplay for inline ASM that allegedly didn’t improve efficiency vs. what the compiler could achieve with the raw C code.

On the whole, the findings are probably quite accurate for the kind of C code that most people need to write (e.g., “No need to write a >> 2 when you mean a/4!”).

Speaking of compilers, FATE now covers Intel’s 11.1 series C compiler for both 32- and 64-bit icc. I have also updated the stale snapshots of the gcc-svn for my machines (I still need to write a tool to do that for me automatically and continuously).

7 thoughts on “State of the Art Compiler Optimization

  1. dconrad

    I wonder which video codec did a byte splat that way, I always thought that x*0x01010101 was more obvious (and was surprised that gcc turned it into 2x of add+shift on ARM, which is faster.)

    Also he claims that the cmov variant for abs() is worse than (sign ^ x) – sign, that might have been true on the P4 but not since.

    For the mad example, the key phrase is “# gcc 4.4 also does this like this, but only on x64 :-(” The inline asm is actually better than what most gccs generate (especially 3.x back when MAD was still being developed), even if it’s not optimal.

    Given some of the asm ffmpeg has for trivial compiler idiocies (64-bit multiplies and NEG_SSR32 for x>>(32-y) come to mind), it’ll be a long time before compilers are good enough to not worry about asm and other micro-opts ;)

  2. Reimar

    32×32 -> 64 bit multiplies or anything involving sizes larger than the native size is and probably always will be a mess.
    And WTF is this “No need to write a >> 2 when you mean a/4!”?
    The compiler just _can’t_ do that optimization unless it can prove that a will be >= 0 always, because otherwise those are different operations!
    In practice that means only if a is unsigned. And since there are often reasons to use int (e.g. “reserve” some values for special meanings) that means you have the choice between writing “(unsigned)a / 4” or “a >> 2”.
    I know which one I consider more messy.

  3. Reimar

    Btw, I think he forgot one of the most important issues with macros: multiple-evaluation, which can cause really ugly bugs.
    The MSVC example where it just adds memset is going to cause lots of fun in case someone either adds a DLL that defines a different memset or sets a breakpoint on memset during debugging and fails to see how his code could match the binary…
    About the vecorization: “impressive: the vectorized code checks and fixes the alignment first”… Yes, I found it rather impressive how totally that can kill performance (when there are only very few loop iterations and thus no vectorization is possible most of the time). This kind of stuff just _so_ needs some way to give the compiler the proper hints. And by the time you have the code to do that compiler-independent you often could have written it in asm for all major architectures.
    And the strength reduction can usually be done more efficiently if you know you have at least one element and a pointer argument instead of an array at a fixed location:
    ptr += len;
    len = -len;
    do {
    sumo += ptr[len];
    } while(++len);
    Should only need an add, inc and jmp – though some of the newer gcc versions seem to mess that up and transform it into a less efficient variant (ok, that’s a reason against hand-optimizing code, too: the fancier compiler will make any code, no matter how cleverly written, slow anyway).
    Also the splat example is an example of the compiler being really stupid, the variant with the shifts is definitely an obvious approach, it ideally should recognize and optimize it.
    In particular if you think of FFmpegs MKTAG: according to that if performance matters you would better write c*0x01010101 instead of MKTAG(c, c, c, c) (not that I think you ever need it, but that definitely actually is an example that you _can’t_ trust the compiler to figure it out).
    Beyond the obvious “don’t overdo optimization, and certainly not without testing, and occasionally retesting with current compilers” it seems to me the main point is “never ever trust the compiler. Nothing is too simple for it to mess up even when it handles some incredibly complex stuff perfectly”

  4. Jonathan Wilson

    The problem I have with GCC is that (due to its architecture) it cant be as tuned for x86 as MSVC or ICC is or as tuned for SPARC as Sun C is.

  5. Mans

    I’d like to point at my type-punning hacks for floats and the crazy speedups they gave on Cortex-A8.

  6. Mike

    That mad slide is plain foolish. Doing c fixed point muls on anything but a 64 bit machine is *slow*. Just because he ran it on a 64 bit machine doesn’t mean everyone does. And people with 64 bit x86 machines aren’t the target audience for codecs designed to run on machines without FPUs. Machines without FPUs are the target and those are rarely 64 bit . . .

  7. Mans

    Many 32-bit machines do have 32×32->64 multiplies. ARM does, as does MIPS. PPC32 can do it efficiently too, even if it takes more than one instruction. The problem is that gcc is too stupid to use those instructions.

Comments are closed.