Category Archives: Programming

ASM: What It Can Do For You

I’ve had my nose deep in the multimedia mess for so long that I forget that people outside the field might not understand some of the ideas that those of us “skilled in the art” believe are intuitive. One such concept is that of manually programming in assembly language (colloquially referred to as ASM).

“But no one codes in ASM these days,” they exclaim. “Compilers are smart enough to do all the optimization you will ever need.” First, I would like to state that I place very little faith in compilers to always do the right thing, particularly during the optimization phase, but that’s neither here nor there. Allow me to explain why manual ASM programming is often done during multimedia programming.

First, it’s true: very few people write actual, usable applications purely in ASM. Where ASM comes in handy is in optimization in the form of invoking series of special instructions in ways that compilers wouldn’t be able to leverage.

That’s the short answer about why people still manually code some program components in ASM these days. The rest of this post will be devoted to explaining exactly how certain special instructions are useful in multimedia applications.

A lot of modern consumer-level CPUs (x86_32, x86_64, PPC) have special categories of instructions that are designed to perform the same operation on lots of data bits, all at the same time. Remember when Intel rolled out the Pentium MMX instructions? That’s what those were good for– performing lots of the same operation in parallel. Since then, there have been various upgrades to the concept, and from various other manufacturers (and some branching out, such as special floating point operations as well as cache control instructions). But the main idea that is useful in multimedia programming remains the same and it’s called “single instruction, multiple data”, or SIMD: you execute one instruction and it operates on multiple data elements in parallel.

SIMD instructions are invaluable in multimedia programming because processing multimedia typically involves performing the same operation on every data element in a huge buffer. If you can do the same operation on 2 data elements, or 4, or 8, or 16 at a time, rather than just one at a time, that’s great.

Here’s a more tangible example: You have a large array of bytes. You have to add the quantity 5 to each byte. With the traditional load-do_math-store sequence, this looks like:

  load next data byte into reg1
  reg1 = reg1 + 5
  store reg1 back to memory
  [repeat for each byte]

Even if you have a 32-bit CPU, you can’t simply load 4 data bytes at a time and add 0x05050505: if any individual sum exceeds 255, the carry will spill into the next most significant byte. However, Intel’s MMX extension provides eight 64-bit registers (really just aliases of the x87 floating-point registers, but that’s trivia). Depending on the SIMD instruction issued, a register may be treated as 8 8-bit elements, 4 16-bit elements, or 2 32-bit elements. Check out the parallel approach for solving the previous problem:

  init mm1 (mmx register) as 0x0505050505050505
  load next 8 bytes into mm2
  mm2 = mm2 + mm1, treat data as 8 unsigned bytes
  store 8 bytes back to memory
  [repeat for each block of 8 bytes]

Powerful, huh? You’re not performing one addition at a time; rather, there are 8 happening in parallel. That’s something that a compiler generally can’t pick up on and optimize for you behind the scenes (though I’ve heard legends that Intel’s compiler can perform such feats, which I would like to see). But there’s more:

  • Intel’s SSE2 and the PowerPC’s AltiVec SIMD facilities provide 128-bit registers. Thus, this same process can operate on 16 bytes in parallel.
  • The data doesn’t have to be treated as unsigned bytes; you can also ask it to be treated as signed bytes, or signed or unsigned words or doublewords. It all depends on the instruction issued.

There’s also saturation. This goes by many names such as clamping and clipping. In the above example, if we want to make sure that none of the bytes go above 255 (and wrap around to 0 as a result), we must perform this tedious process for each byte:

  unsigned int next_byte = *buffer;  /* wider than a byte, so the sum can exceed 255 */
  if (next_byte + 5 > 255)
    next_byte = 255;
  else
    next_byte += 5;
  *buffer++ = next_byte;

Not so with SIMD instructions which typically have saturation modes. When you invoke the saturation variant of the add instruction, it automatically makes sure that none of the data elements go above the limit (255 for an unsigned byte or 127 for a signed byte).

What would be the practical application of adding a constant value to each byte in a buffer? This is one of the earliest examples I found related to SIMD programming. The example pertained to brightening the intensity information of an image. This is particularly pertinent to me since I remember working briefly with an image manipulation program that did not take saturation into account when brightening an image. It simply wrapped the values right around to 0 and kept adding. As demonstrated, using hand-coded SIMD ASM in this case would have yielded both correct and fast results.

What about applications in multimedia video codecs? Parallel saturated addition happens to come in handy in such situations. Many mainstream video codecs operate on 8×8 blocks of samples. Further, they have to decode one block of differences and then add it to a corresponding block from the previous frame. This turns out to be a fantastic use for parallel addition with saturation.

These are but a few simple applications for hand-coded ASM code. I may write more on the matter, if I am so inspired.

Regarding The Literature

I journeyed to the bookstore today in search of an O’Reilly pocket-sized Python reference, in an effort to tip the balance toward the positive in my newfound love/hate relationship with the Python programming language. I found what I was after, and on prominent display. I had not perused the computer book aisle in quite some time so I took a moment to look around. Every current and trendy computer language and development fad was well-represented, including a few of which I was previously unaware. That’s when it dawned on me how hard it is to find a simple book on the C programming language in these sections.

Maybe C just isn’t good for selling books.


Stack of books

This episode reminded me of a difference I observed long ago in the computer book selections at bookstores vs. academic libraries vs. public libraries. A bookstore will stock thick, vastly expensive tomes covering whatever the latest hot computing fad or language happens to be (wait for the fad to blow over and in 6 months the book will be less than $10 on the clearance table). Rewind 10 years to 1996, when Java was taking off in a big way. I think the local Barnes & Noble shop had an entire section devoted to the language. And I seem to remember that every one of the books was essentially the same: a few chapters discussing the basics of the language, with the remaining 4/5 of the book devoted to a verbatim reprint of the official Java language and API reference that was freely available online.

An academic library, such as the one found at your local technical university, will stock a few of the fad books about specialized skills but will feature far more texts on fundamental and advanced computer science theory (think “general theory of fishing” vs. “learn bass fishing in 3 days!”). A community public library, in my experience, will have a decent mix of both types of books.

Variable Declaration Guidelines

Back in 2000, I came across this Advogato article about proper coding guidelines for the coming wave of 64-bit machines. The most interesting part, I thought, was comment #2 (“C is portable, if you let it be”), which offers some very sane guidelines for declaring variable types so that the compiler can do its job effectively. This is why I usually just declare int’s for numbers rather than uint32_t’s everywhere. There is often no reason to try to force particular types.

Don’t think that you’re saving space by declaring a uint8_t instead of an int– chances are that you aren’t. I’ve disassembled enough C code compiled into 32-bit x86 machine code to know that a compiler will usually allocate 32 bits for that 8-bit variable. In fact, here is a small piece of code to drive the point home:

stack.c:

Compile with: gcc -Wall stack.c -o stack
Disassemble with: objdump -d -Mintel stack
Key parts:

080483a0 <main>:
 80483a0:   55                  push   ebp
 80483a1:   89 e5               mov    ebp,esp
 80483a3:   83 ec 08            sub    esp,0x8
 80483a6:   83 e4 f0            and    esp,0xfffffff0
 80483a9:   b8 00 00 00 00      mov    eax,0x0
 80483ae:   29 c4               sub    esp,eax
 80483b0:   e8 07 ff ff ff      call   80482bc <random@plt>
 80483b5:   88 45 ff            mov    BYTE PTR [ebp-1],al
 80483b8:   66 0f be 45 ff      movsx  ax,BYTE PTR [ebp-1]
 80483bd:   40                  inc    eax
 80483be:   66 89 45 fc         mov    WORD PTR [ebp-4],ax

Notice that, despite strictly needing only 3 bytes of local variable storage, 8 bytes were allocated from the stack. 32-bit machines like the i386 really, really like dealing with 32-bit quantities.

Evaluating Alternate Build Systems

Even though I am on record as expressing devotion to the Autotools suite, I am not averse to evaluating alternatives. Mostly, I’m interested in a competent build system that can take care of the difficult and tedious parts of a build, such as dependencies and configuration. I acknowledge that the Autotools embody a fair amount of complexity and arcana. The top two contenders plausibly competing for Autotools’ title appear to be SCons and CMake.


Components

A good baseline for evaluating the capabilities of an alternative is to find a limitation of your current solution and then figure out if the alternative can do that AND everything that the current solution can do. For example, on one of my software projects, I really appreciate that the current Autotools-based solution can:

  • automatically keep track of dependencies
  • manage multiple build targets
  • create multiple build configurations in separate directories, working from the same source tree

But now I need some very fine tweaking of certain build settings, such as being able to statically link a particular version of libstdc++ into a binary. I don’t know if any of the common build systems support this without some very serious hacking.

Here is a blog post from someone who has struggled with the very same issues and was able to solve the problem with a hand-crafted Makefile: G Plus Plus Minus One. I have managed to achieve the correct results from the command line. But trying to hack Makefile.am to do the same always ends up with a roundabout veto by the Autotools (i.e., the tools fall back on their preferred method of linking).
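For the record, the command-line version of the trick looks something like this (a sketch only; the object file names and the path to the static archive are hypothetical and vary by GCC version): instead of letting g++ pull in the shared libstdc++ on its own, name the specific static archive explicitly in the link step.

```shell
# Link C++ objects through the gcc driver (which does not add -lstdc++
# automatically) and pass the static archive of the exact libstdc++
# version desired. Paths and file names are hypothetical.
gcc -o myapp main.o helper.o \
    /usr/lib/gcc/i486-linux-gnu/3.4.6/libstdc++.a -lm
```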

Of course, it would be really sweet if I could modify my existing autotools setup to do what I need. I am still diligently researching this possibility. I certainly do not wish to re-tool the whole build system into a hand-crafted, manually maintained Makefile.