Monthly Archives: February 2010

Designing a Codec For Parallelized Entropy Decoding

When I was banging my head hard on FFmpeg’s VP3/Theora VLC decoder, I kept wondering if there was a better way to do it. Libtheora demonstrates that there obviously must be, since it decodes so much faster than FFmpeg.

In the course of brainstorming better VLC approaches, I wondered about parallelizing the decoding operation somehow. This sounds awfully crazy, I know, and naturally, it’s intractable to directly parallelize VLC decoding since you can’t know where item (n+1) begins in the bitstream until you have decoded item (n). I thought it might be plausible to make one pass through the bitstream, decoding all of the tokens and ancillary information into fixed-length arrays, and then make a second pass to sort out the information in some kind of parallelized manner. Due to the somewhat context-adaptive nature of Theora’s VLCs, I don’t know if I can make a go of that.

Why can’t VLC decoding be done in parallel, either through SIMD or thread-level parallelization? I think the answer is clearly, “because the codecs weren’t designed for it”. Here’s the pitch: Create a codec bitstream format where the frame header contains indices pointing into the bitstream. For example, Sorenson Video 1 encodes the Y, U, and V planes in order and could be effectively decoded in parallel if only the frame header encoded offsets to the start of the U and V planes. Similarly, if a Theora frame header encoded offsets to certain coefficient groups in the header, that could facilitate multithreaded coefficient VLC decoding. Some other modifications would need to occur in the encoder, e.g., guaranteeing that no end-of-block runs cross coefficient group boundaries as they often do now.
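The Sorenson Video 1 example can be sketched in C. The header layout and names below are invented for illustration; no real SVQ1 bitstream carries these fields. The point is that once the header tells you where each plane starts, all three (start, length) extents are known up front, so three threads can decode Y, U, and V concurrently instead of serially discovering the boundaries:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical frame header extension: absolute byte offsets to the
 * start of the U and V plane data within the coded frame. */
typedef struct {
    uint32_t u_offset;
    uint32_t v_offset;
} FrameHeader;

/* Compute each plane's (start, length) extent from the header alone.
 * Each extent can then be handed to its own decoding thread. */
static void plane_extents(const FrameHeader *h, uint32_t frame_size,
                          uint32_t start[3], uint32_t len[3])
{
    start[0] = 0;            len[0] = h->u_offset;               /* Y */
    start[1] = h->u_offset;  len[1] = h->v_offset - h->u_offset; /* U */
    start[2] = h->v_offset;  len[2] = frame_size - h->v_offset;  /* V */
}
```

The cost is a few extra bytes of header per frame and the loss of the encoder's freedom to emit the planes as one seamless stream; the payoff is that the decode becomes embarrassingly parallel at the plane level.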

Taking this a step farther, think about arithmetic/range coding. Decoding a bitstream encoded with such an algorithm notoriously requires a multiplication per value decoded. How about designing the bitstream in such a way that 2 or 4 multiplications can be done in parallel through any number of SIMD instruction sets?
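As a rough sketch of what that could look like: if the encoder interleaved four independent range-coded streams, the per-symbol multiplication could be performed four lanes at a time. The function below is plain C (the names and the four-lane layout are my invention, not any existing codec's design); the loop body maps directly onto the packed integer multiplies available in SSE, AltiVec, or NEON:

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4

/* One decode step across four hypothetical interleaved streams:
 * scale each lane's current range by its symbol's cumulative
 * frequency -- the multiplication the text refers to. */
static void ranged_mul4(const uint32_t range[LANES],
                        const uint32_t cum_freq[LANES],
                        uint32_t total_bits, uint32_t out[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = (uint32_t)(((uint64_t)range[i] * cum_freq[i]) >> total_bits);
}
```

The hard part, of course, is everything around the multiply: each lane renormalizes at a different rate, so the streams have to be truly independent (separate models, separate byte ranges in the bitstream) for the lanes not to stall on each other.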

Do these ideas have any plausibility? I have this weird feeling that if they do, they’re probably already patented. If they aren’t, well, maybe this post can serve as prior art.

30-hour Do-nothing Build

I have a habit of prepending ‘time’ to all of my ‘make’ commands in order to keep a rough estimate of how long build jobs take.

Adhering to this custom, I performed a ‘make’ command on a project that didn’t actually require any rebuilding. So how does the following happen?

$ time make -j5

[...]

real    1770m35.893s
user    0m12.408s
sys     0m11.692s

Answer: The machine (a virtual machine, actually) had just been started, had a grossly out-of-sync clock, and must have synced to the time server during the narrow window in which the build was occurring:

make[2]: Warning: File `...' has modification time 1.8e+04 s in the future
make[2]: warning:  Clock skew detected.  Your build may be incomplete.

Security Memory

I dug up this old security alert. It’s very dear to me in that I’m directly responsible for the security problem outlined. Whenever I feel like my work doesn’t matter, I just have to remind myself that I have written code that has become widespread enough to warrant security notices. Many programmers likely go their whole careers without making that kind of impact. (That kind of positive spin might be similar to not knowing or caring about the difference between positive and negative attention.)

For the curious, I wrote an AIFF demuxer (among many others) for the xine project. For some reason, I allocated a fixed-size buffer of 100 bytes on the stack and proceeded to read a number of bytes from user input, where the number of bytes was also determined by that same user input. Big no-no, and I really don’t know what I was thinking; hardcoded, arbitrary constants (char buffer[100]) aren’t usually my style. After that was found, I audited the rest of my demuxers for similar mistakes and found none. It may seem like this would only be a problem if a user directly loaded a malicious file into xine. However, since AIFF has a MIME type, and because there was a Mozilla plugin version of xine, it would have been possible to send a malicious AIFF through a web page.
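The bug pattern can be reconstructed in miniature. This is a hypothetical sketch, not the actual xine code; the function names are invented. The vulnerable shape copies an attacker-controlled count into a fixed stack buffer, and the fix is to bound (or reject) the count before copying:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUFFER_SIZE 100

/* Vulnerable shape (illustrative only -- do not use):
 *
 *   static void read_chunk_bad(const unsigned char *src, size_t size) {
 *       char buffer[BUFFER_SIZE];
 *       memcpy(buffer, src, size);   // overflows when size > 100
 *   }
 */

/* Fixed shape: clamp the attacker-controlled size before copying.
 * (A real demuxer might instead reject the file outright.) */
static size_t read_chunk(char buffer[BUFFER_SIZE],
                         const unsigned char *src, size_t size)
{
    if (size > BUFFER_SIZE)
        size = BUFFER_SIZE;
    memcpy(buffer, src, size);
    return size;
}
```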

The reason I was reflecting on this was due to a major security problem I found in FATE recently as I was investigating another problem. It has to do with the data logging script that receives FFmpeg build and test information from FATE clients. I’ll let my commit message to my private git repository tell the tale:

    Get rid of mind-boggling security hazard that actually prints out the
    user's actual hash key when passed an invalid hash. This was obviously
    added for debugging purposes and was only triggered if a user had access
    to insert data for a particular configuration.

If an attacker knew a valid username, the system would cheerfully reveal the corresponding hash key whenever the HMAC check failed. Armed with that key, an attacker could have polluted the FATE database with loads of bad data. Not a huge deal in the grand scheme of things, but given that this is the only attack the system is trying to guard against, a total failure in context.
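The fix amounts to one rule: on a verification failure, return a generic error and nothing else. A minimal sketch (hypothetical names, and in C rather than whatever the actual FATE logging script is written in):

```c
#include <assert.h>
#include <string.h>

/* Compare the server-computed HMAC against the client's submission.
 * On mismatch, return only a generic failure message -- never echo
 * the stored key back to the client the way the buggy debug path did. */
static const char *verify_submission(const char *computed_hmac,
                                     const char *submitted_hmac)
{
    if (strcmp(computed_hmac, submitted_hmac) != 0)
        return "invalid hash";
    return "ok";
}
```

(In a hardened version the comparison would also be constant-time to avoid leaking the key byte by byte through timing, but the leak described above was far more direct than that.)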

Honestly, sometimes I can’t believe people let me anywhere near a programming environment.

One last — and fascinating — note about that AIFF exploit: It was the result of an infamous university course (perhaps this one?) given by D. J. Bernstein in which students were required to find 10 security holes in open source software programs during the term. Reportedly, all of the students failed the class since none actually found 10 holes. I don’t know if the class was ever held again.