I’ve had my nose deep in the multimedia mess for so long that I forget that people outside the field might not understand some of the ideas that those of us “skilled in the art” believe are intuitive. One such concept is that of manually programming in assembly language (colloquially referred to as ASM).
“But no one codes in ASM these days,” they exclaim. “Compilers are smart enough to do all the optimization you will ever need.” First, I would like to state that I place very little faith in compilers to always do the right thing, particularly during the optimization phase, but that’s neither here nor there. Allow me to explain why manual ASM programming is often done during multimedia programming.
First, it’s true: very few people write actual, usable applications purely in ASM. Where ASM comes in handy is in optimization, specifically in invoking a series of special instructions in ways that compilers generally aren’t able to leverage on their own.
That’s the short answer about why people still manually code some program components in ASM these days. The rest of this post will be devoted to explaining exactly how certain special instructions are useful in multimedia applications.
A lot of modern consumer-level CPUs (32-bit x86, x86_64, PPC) have special categories of instructions that are designed to perform the same operation on lots of data elements, all at the same time. Remember when Intel rolled out the Pentium MMX instructions? That’s what those were good for: performing lots of the same operation in parallel. Since then, there have been various upgrades to the concept, from Intel and from various other manufacturers (and some branching out, such as special floating point operations as well as cache control instructions). But the main idea that is useful in multimedia programming remains the same and it’s called “single instruction, multiple data” or SIMD; you execute one instruction and it operates on multiple data elements in parallel.
SIMD instructions are invaluable in multimedia programming because processing multimedia typically involves performing the same operation on every data element in a huge buffer. If you can do the same operation on 2 data elements, or 4, or 8, or 16 at a time, rather than just one at a time, that’s great.
Here’s a more tangible example: You have a large array of bytes. You have to add the quantity 5 to each byte. With the traditional load-do_math-store sequence, this looks like:
    load next data byte into reg1
    reg1 = reg1 + 5
    store reg1 back to memory
    [repeat for each byte]
Even if you have a 32-bit CPU, you can’t effectively load 4 data bytes at a time and add 0x05050505 because if any individual sum exceeds 255, the carry will spill into the next most significant byte. However, Intel’s MMX provides eight 64-bit registers (really just the same as the FP regs, but that’s trivia). Depending on the SIMD instruction issued, a register may be treated as 8 8-bit elements, 4 16-bit elements, or 2 32-bit elements. Check out the parallel approach for solving the previous problem:
    init mm1 (mmx register) as 0x0505050505050505
    load next 8 bytes into mm2
    mm2 = mm2 + mm1, treat data as 8 unsigned bytes
    store 8 bytes back to memory
    [repeat for each block of 8 bytes]
Powerful, huh? You’re not performing one addition at a time; rather, there are 8 happening in parallel. That’s something that a compiler generally can’t pick up on and optimize for you behind the scenes (though I’ve heard legends that Intel’s compiler can perform such feats, which I would like to see). But there’s more:
- Intel’s SSE2 and the PowerPC’s AltiVec SIMD facilities provide 128-bit registers. Thus, this same process can operate on 16 bytes in parallel.
- The data doesn’t have to be treated as unsigned bytes; you can also ask it to be treated as signed bytes, or signed or unsigned words or doublewords. It all depends on the instruction issued.
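For the curious, here is a rough C sketch of the two ideas above: the carry spill you get from a plain 32-bit add, and the lane-wise add that SIMD gives you. It uses the SSE2 compiler intrinsics from emmintrin.h rather than hand-written ASM, purely to keep the example short. The function names are my own invention for illustration, and the `__SSE2__` check assumes a GCC- or Clang-style compiler; on anything else the code quietly falls back to the one-byte-at-a-time loop.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics; assumes a GCC/Clang-style x86 build */
#endif

/* Naive attempt: treat 4 packed bytes as one 32-bit integer.
 * A carry out of any byte leaks into its more significant neighbour. */
uint32_t add5_packed_naive(uint32_t four_bytes)
{
    return four_bytes + 0x05050505u;
}

/* Add 5 to every byte of buf; each byte wraps modulo 256 on its own.
 * (Hypothetical helper name, not from any library.) */
void add5_wrapping(uint8_t *buf, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i fives = _mm_set1_epi8(5);  /* sixteen copies of 0x05 */
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        v = _mm_add_epi8(v, fives);          /* 16 independent byte adds */
        _mm_storeu_si128((__m128i *)(buf + i), v);
    }
#endif
    /* Scalar tail, and the whole job when SSE2 is unavailable. */
    for (; i < n; i++)
        buf[i] = (uint8_t)(buf[i] + 5);
}
```

Feeding 0x000000FF to add5_packed_naive() gives 0x05050604 instead of the 0x05050504 you would want from four independent byte additions: the low byte overflowed and bumped its neighbour.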
There’s also saturation. This goes by many names such as clamping and clipping. In the above example, if we want to make sure that none of the bytes go above 255 (and wrap around to 0 as a result), we must perform this tedious process for each byte:
    next_byte = *buffer;
    if (next_byte + 5 > 255)
        next_byte = 255;
    else
        next_byte += 5;
    *buffer++ = next_byte;
Not so with SIMD instructions which typically have saturation modes. When you invoke the saturation variant of the add instruction, it automatically makes sure that none of the data elements go above the limit (255 for an unsigned byte or 127 for a signed byte).
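As a hedged illustration, the saturating variant might look like this in C with SSE2 intrinsics (again, not hand-written ASM, and not taken from any particular program). The function name and the `amount` parameter are made up for this sketch; `_mm_adds_epu8` is the intrinsic for the unsigned saturating byte add. Without `__SSE2__` defined, only the scalar loop runs.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics; assumes a GCC/Clang-style x86 build */
#endif

/* Add `amount` to every byte of buf, clamping each result at 255.
 * (Hypothetical helper name, for illustration only.) */
void add_const_saturating(uint8_t *buf, size_t n, uint8_t amount)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i add = _mm_set1_epi8((char)amount);
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        v = _mm_adds_epu8(v, add);  /* unsigned saturating add, 16 bytes at once */
        _mm_storeu_si128((__m128i *)(buf + i), v);
    }
#endif
    /* Scalar tail: the same compare-and-clamp as the tedious version. */
    for (; i < n; i++) {
        unsigned sum = (unsigned)buf[i] + amount;
        buf[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}
```

Note that the clamping costs nothing extra in the SIMD loop; it is the same single instruction per 16 bytes as the wrapping add, just a different one.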
What would be the practical application of adding a constant value to each byte in a buffer? This is one of the earliest examples I found related to SIMD programming. The example pertained to brightening the intensity information of an image. This is particularly pertinent to me since I remember working briefly with an image manipulation program that did not take saturation into account when brightening an image. It simply wrapped the values right around to 0 and kept adding. As demonstrated, using hand-coded SIMD ASM in this case would have yielded both correct and fast results.
What about applications in multimedia video codecs? The parallel saturated addition happens to come in handy in such situations. There are many mainstream video codecs which operate on 8×8 blocks of samples. Further, they have to decode one block and then add that block to a block which occurs in the previous frame. This turns out to be a fantastic use for parallel addition with saturation.
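To make that a bit more concrete, here is a minimal sketch of the block-add step, under some assumptions of my own: the decoded block holds signed 16-bit differences, the previous-frame block holds unsigned bytes, and rows of the frame are `stride` bytes apart. The function and parameter names are hypothetical and not lifted from any real codec. Each row of 8 samples is widened to 16 bits, added with signed saturation, and then packed back down to bytes, which clamps the results to 0..255.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics; assumes a GCC/Clang-style x86 build */
#endif

/* dst = clamp(prev + diff, 0, 255) over an 8x8 block.
 * prev/dst rows are `stride` bytes apart; diff is 64 contiguous int16 values.
 * (Hypothetical layout and names, chosen for this sketch.) */
void add_block8x8(uint8_t *dst, const uint8_t *prev, ptrdiff_t stride,
                  const int16_t *diff)
{
    for (int y = 0; y < 8; y++) {
#if defined(__SSE2__)
        const __m128i zero = _mm_setzero_si128();
        /* Load one row: 8 bytes from the previous frame, 8 signed words of diff. */
        __m128i p = _mm_loadl_epi64((const __m128i *)(prev + y * stride));
        __m128i d = _mm_loadu_si128((const __m128i *)(diff + y * 8));
        p = _mm_unpacklo_epi8(p, zero);  /* widen bytes to 16-bit lanes */
        p = _mm_adds_epi16(p, d);        /* 8 signed saturating adds */
        p = _mm_packus_epi16(p, p);      /* narrow to bytes, clamping to 0..255 */
        _mm_storel_epi64((__m128i *)(dst + y * stride), p);
#else
        /* Scalar fallback: one sample at a time. */
        for (int x = 0; x < 8; x++) {
            int v = prev[y * stride + x] + diff[y * 8 + x];
            if (v < 0) v = 0;
            if (v > 255) v = 255;
            dst[y * stride + x] = (uint8_t)v;
        }
#endif
    }
}
```

A whole row of the block is reconstructed with a handful of instructions and no per-sample branching, which is why this kind of routine is such a popular target for hand optimization.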
These are but a few simple applications for hand-coded ASM code. I may write more on the matter, if I am so inspired.
I feel almost ashamed to say this but this was really enlightening for me. Yes I have always heard the legends that hand coding assembly does wonders for speed, but for someone who has never done any assembly programming it was interesting to see where and how hand coding assembly makes things faster. Once again thanks for some interesting posts.
Thanks for validating what I laid out in the intro of the post, that there are people who might not be entirely up to speed on these concepts. Believe it or not, I felt a little silly writing the post because it felt like I may as well be writing “2 + 2 = 4”. That’s the kind of bubble I live in thanks to multimedia hacking. :-)
BTW, someone pointed out to me privately that I should take the opportunity of this post to dispel another myth of ASM coding: That hand-tooled ASM will always be faster than machine-optimized C. NOT true. For most general-purpose programming, just let the compiler worry about it. It does a good enough job and will probably optimize more than you would care to do by hand. Also, be wary of diminishing returns; you might be able to eke a little more performance out of a hand-optimized ASM routine, but it is probably not worth the effort vs. new features.
SSE2 and AltiVec both have C APIs too, of course. AltiVec’s is very nice and recommended over ASM; SSE2’s is pretty ugly.
I kind of wish ffmpeg didn’t use it so much. The CABAC decoder especially is somewhat weird and messes with the OS X profilers (not to mention that even icc refuses to compile it with frame pointer saving). But the vastly-improved gcc 4.3 isn’t going to be done anytime soon, and I don’t get to code for x86-64, so I guess we’re stuck with it.
And another kind of ASM usage: I once wrote an EXE encryption system (not very strong or advanced) and used assembler there both for optimization and for dynamic code generation (my program generated an .ASM file which was fed to NASM, and the resulting object was injected into the EXE). Luckily (for me) it was not complete.