When I was banging my head against FFmpeg’s VP3/Theora VLC decoder, I kept wondering if there’s a better way to do it. Libtheora demonstrates that there obviously must be, since it decodes so much faster than FFmpeg.
In the course of brainstorming better VLC approaches, I wondered about the concept of parallelizing the decoding operation somehow. This sounds awfully crazy, I know, and naturally, it’s intractable to directly parallelize VLC decoding since you can’t know where item (n+1) begins in the bitstream until you have decoded item (n). I thought it might be plausible to make a pass through the bitstream decoding all of the tokens and ancillary information into fixed-length arrays and then make a second pass in order to sort out the information in some kind of parallelized manner. Due to the somewhat context-adaptive nature of Theora’s VLCs, I don’t know if I can make a go of that.
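To make that concrete, here is a toy sketch in C (nothing from FFmpeg or libtheora; it uses a trivial unary code as a stand-in for Theora’s real VLC tables). The first pass is unavoidably serial, but once the tokens land in a fixed-length array, the second pass no longer depends on bit positions and could, in principle, be split across threads:

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_TOKENS 64

typedef struct { const uint8_t *buf; size_t bit; } BitReader;

static int get_bit(BitReader *br)
{
    int b = (br->buf[br->bit >> 3] >> (7 - (br->bit & 7))) & 1;
    br->bit++;
    return b;
}

/* pass 1: inherently serial, since token n+1's position in the bitstream
 * depends on how many bits token n consumed */
static int unpack_tokens(BitReader *br, size_t bit_count, int *tokens)
{
    int n = 0;
    while (br->bit < bit_count && n < MAX_TOKENS) {
        int v = 0;
        while (br->bit < bit_count && get_bit(br))  /* unary: count the 1s */
            v++;
        tokens[n++] = v;
    }
    return n;
}

int main(void)
{
    const uint8_t bits[] = { 0xB6 };   /* 1011 0110 -> unary tokens 1, 2, 2 */
    BitReader br = { bits, 0 };
    int tokens[MAX_TOKENS];
    int n = unpack_tokens(&br, 8, tokens);

    /* pass 2: the tokens now sit in a fixed-length, randomly accessible
     * array, so this loop (or whatever real coefficient reconstruction
     * replaces it) could be carved up among threads */
    for (int i = 0; i < n; i++)
        printf("token %d = %d\n", i, tokens[i]);
    return 0;
}
```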
Why can’t VLC decoding be done in parallel, either through SIMD or thread-level parallelization? I think the answer is clearly, “because the codecs weren’t designed for it”. Here’s the pitch: Create a codec bitstream format where the frame header contains indices pointing into the bitstream. For example, Sorenson Video 1 encodes the Y, U, and V planes in order and could be effectively decoded in parallel if only the frame header encoded offsets to the start of the U and V planes. Similarly, if a Theora frame header encoded offsets to certain coefficient groups in the header, that could facilitate multithreaded coefficient VLC decoding. Some other modifications would need to occur in the encoder, e.g., guaranteeing that no end-of-block runs cross coefficient group boundaries as they often do now.
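Purely to illustrate the pitch, a hypothetical header layout might look like the sketch below; none of the fields correspond to any real SVQ1 or Theora syntax, and decode_plane() is a dummy that just consumes bytes. The point is only that knowing where U and V start lets three bit readers run independently:

```c
#include <stdint.h>
#include <stddef.h>

/* hypothetical header: offsets are relative to the end of the header,
 * and the Y plane starts immediately after it, so it needs no offset */
typedef struct {
    uint32_t u_offset;   /* byte offset of the coded U plane */
    uint32_t v_offset;   /* byte offset of the coded V plane */
} ParallelFrameHeader;

/* stand-in for a real plane decoder; each call only touches its own
 * slice of the coded frame, so the three calls below are independent
 * and could each be handed to a worker thread */
static int decode_plane(const uint8_t *data, size_t size)
{
    int sum = 0;
    for (size_t i = 0; i < size; i++)
        sum += data[i];
    return sum;
}

void decode_frame(const uint8_t *frame, size_t frame_size)
{
    const ParallelFrameHeader *hdr = (const ParallelFrameHeader *)frame;
    const uint8_t *payload = frame + sizeof(*hdr);
    size_t payload_size    = frame_size - sizeof(*hdr);

    /* with the offsets known up front, these are three independent jobs */
    decode_plane(payload,                 hdr->u_offset);                  /* Y */
    decode_plane(payload + hdr->u_offset, hdr->v_offset - hdr->u_offset);  /* U */
    decode_plane(payload + hdr->v_offset, payload_size  - hdr->v_offset);  /* V */
}
```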
Taking this a step further, think about arithmetic/range coding. Decoding a bitstream encoded with such an algorithm notoriously requires a multiplication per value decoded. How about designing the bitstream in such a way that 2 or 4 multiplications can be done in parallel through any number of SIMD instruction sets?
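As a sketch of what that could look like, here are four completely independent binary range decoders (LZMA-style arithmetic, chosen purely for illustration; initialization and probability adaptation are omitted) stepping in lockstep. The per-lane multiply and compare map naturally onto one vector multiply; the byte-wise renormalization still has to pull from four separate bitstreams, which is where the real awkwardness lives:

```c
#include <stdint.h>

#define LANES 4

/* one independent LZMA-style binary range decoder per lane;
 * initialization of range/code from the bitstream is omitted */
typedef struct {
    uint32_t range, code;
    const uint8_t *in;          /* this lane's private byte stream */
} RangeLane;

/* decode one binary decision in each of the four lanes;
 * prob[] holds 11-bit probabilities, i.e. values in [0, 2048) */
void decode_bits(RangeLane *lane, const uint16_t *prob, int *bit)
{
    /* vectorizable part: one multiply and one compare per lane */
    for (int i = 0; i < LANES; i++) {
        uint32_t bound = (lane[i].range >> 11) * prob[i];
        if (lane[i].code < bound) {
            bit[i] = 0;
            lane[i].range = bound;
        } else {
            bit[i] = 1;
            lane[i].code  -= bound;
            lane[i].range -= bound;
        }
    }
    /* scalar part: each lane renormalizes from its own byte stream */
    for (int i = 0; i < LANES; i++) {
        while (lane[i].range < (1u << 24)) {
            lane[i].range <<= 8;
            lane[i].code = (lane[i].code << 8) | *lane[i].in++;
        }
    }
}
```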
Do these ideas have any plausibility? I have this weird feeling that if they do, they’re probably already patented. If they aren’t, well, maybe this post can serve as prior art.
The two-pass decoding of VLCs into a linear array, then sorting out the information later, is basically the optimization I’ve been talking about :) But it actually hurts parallelism: unless the sorting of tokens into coefficients is done immediately before the IDCT, there isn’t really any advantage over the current scheme. So the IDCT has to proceed strictly in Hilbert order.
As for storing offsets, several codecs already do that, more or less. Dirac does almost exactly what you describe: each arithmetically coded segment is prefixed with its length, so each segment could be decoded in parallel. MPEG codecs have slices that can (except for deblocking with h.264 slices) be decoded entirely in parallel.
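The decoder side of such a scheme is cheap; something like the following sketch (field sizes are made up, nothing here is actual Dirac syntax) does one pass over the length prefixes and ends up with a list of segments that can be decoded as independent jobs:

```c
#include <stdint.h>
#include <stddef.h>

static uint32_t read_u32(const uint8_t *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] << 8  | p[3];
}

/* walk the frame once, recording where every coded segment starts
 * and how long it is */
int split_segments(const uint8_t *frame, size_t size,
                   const uint8_t *start[], size_t len[], int max_seg)
{
    size_t pos = 0;
    int n = 0;
    while (pos + 4 <= size && n < max_seg) {
        size_t seg_len = read_u32(frame + pos);
        if (seg_len > size - pos - 4)
            break;                       /* truncated stream */
        start[n] = frame + pos + 4;
        len[n]   = seg_len;
        n++;
        pos += 4 + seg_len;
    }
    /* each (start[i], len[i]) pair is now an independent decode job
     * that can be handed to its own thread */
    return n;
}
```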
The main drawback is that parallelism in entropy coding always hurts compression efficiency, and unless you need low latency, parallelism at the frame level scales better.
Arithmetic-decoding 8 decisions at once with SIMD is easy. The hard parts are (a) getting 8 bitstreams into one vector register, and (b) arranging the codec such that those decisions are mutually independent.
(a) is scatter/gather. Lacking hardware support for that, I suppose you could do the bitstream reading in scalar: somewhat less than 1 of those 8 bitstream readers on average needs to be updated per vector of decisions, but there’s overhead in finding out which ones.
(b) if those are what would be consecutive decisions in a conventional arithcoder, then you’ve ruled out variable-length binarization, and context modeling on the previous bits in the same token. If they’re different types of tokens in the same macroblock, then you will get some parallelization on average, but it’d be quite variable. And if they’re in completely different calling contexts, then you have a custom implementation of cooperative multithreading, plus cache thrashing as you switch between 8 macroblocks after decoding each bit.
One situation where (b) *is* possible, with merely a reorganization of bitstreams and cluttering of code but no change in which conditional probabilities are being exploited: h264’s 4×4 DCT residual coder puts each NNZ flag and each EOB flag in a separate bin, and the pattern of which decisions you’re going to need to parse is pretty simple to plan ahead, so you could do several of them at once. Of course, those EOBs are there to tell which decisions not to parse, but you can retroactively mask off some of the updates after finding out where you should have stopped. (Sign flags too, except that those shouldn’t have been sent through the arithcoder in the first place, since they’re not actually being compressed. h264 8×8 DCT shares some bins across multiple decisions, but you could change that without compromising compression.)

Once you’ve got the NNZs, DCT coefficient magnitude coding is still very sequential within a block, but you could transpose the problem and decode the magnitudes from multiple DCT blocks in the same MB at once… which loses context updating but keeps conditionalization. Then you can compensate for the decoder’s reduced context updating by encoding in 2 passes, whereby you pick an optimal set of initial context values for each GOP.
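Roughly, the transposed loop order looks like the sketch below; decode_one_magnitude() is just a hypothetical stand-in for a per-block entropy decoder (imagine one independent arithmetic decoder per block), and the point is only that with this loop order the per-block decode steps sit next to each other and are mutually independent:

```c
#include <stdint.h>

#define BLOCKS 4     /* 4x4 residual blocks handled together, for illustration */
#define COEFFS 16

/* whatever per-block decoder state would be needed */
typedef struct { int dummy; } LaneDecoder;

/* hypothetical stand-in: a real implementation would read magnitude bits
 * from this block's own arithmetic decoder */
static int16_t decode_one_magnitude(LaneDecoder *d, int pos)
{
    (void)d; (void)pos;
    return 0;
}

/* transposed order: decode coefficient position i of every block before
 * moving on to position i+1, instead of finishing one block at a time */
void decode_magnitudes_transposed(LaneDecoder lane[BLOCKS],
                                  const int num_coeffs[BLOCKS],
                                  int16_t coeff[BLOCKS][COEFFS])
{
    /* num_coeffs[b] = how many magnitudes block b carries
     * (a simplification of the real significance map) */
    for (int i = 0; i < COEFFS; i++)          /* coefficient position */
        for (int b = 0; b < BLOCKS; b++)      /* block within the macroblock */
            if (i < num_coeffs[b])            /* skip lanes already past their EOB */
                coeff[b][i] = decode_one_magnitude(&lane[b], i);
}
```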
You can see it gets complicated, but it’s not a crazy idea.