I found some notes I wrote nearly 4 years ago about how to encode a VP3 (and, by extension, Theora) intraframe. Where does the time go? This was back when I was hard at work on the native VP3 decoder for FFmpeg. An intraframe is referred to as a golden frame in the VP3 coding scheme. The notion of a golden frame differs slightly from that of a traditional intraframe: the decoder retains one golden frame until the next golden frame arrives, and any interframe between golden frames A and B may refer back to golden frame A in addition to the interframe just prior.
In case anyone should ever want to build a VP3 encoder from scratch, it would be prudent to start with just a golden frame encoder. This is the rough process, as I perceive it, for encoding the golden frame:
- Pick a number, any number, between 0..63. Call it the quality_index and use it to initialize the golden frame Y- and C-plane quantization matrices.
- Break up the Y, U, and V planes (oh yeah, convert the image to YUV 4:2:0 if it’s not already; also important: encode the image from bottom to top, not top to bottom) into 8×8 blocks of samples, but call them fragments, not blocks.
- Shove each fragment through the On2 discrete cosine transform. Do not use the DCT commonly used in JPEG/MPEG/H.26x codecs as that is not precisely the same as the one that On2 likes to use.
- Quantize the coefficients.
- Reduce the DC coefficients via the VP3 prediction method.
- Zig the coefficients. Zag them, too. Use the inverse zigzag order that you would for MPEG et al. (i.e., MPEG de-zigzag order = VP3 zigzag order).
- Write the bitstream header (3 bytes for a keyframe).
- Encode all of the coefficients into the bitstream.
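The quantize / DC-predict / zigzag steps above can be sketched roughly as follows. To be clear about the assumptions: the flat quantizer and the previous-fragment DC predictor are placeholder simplifications (real VP3 derives its quantizers from quality_index and predicts DC from several neighboring fragments), and the zigzag table is the standard MPEG one, inverted per the note in the list above.

```python
# Sketch of the quantize / DC-predict / zigzag steps for one 8x8
# fragment of transform coefficients.  Simplified: flat quantizer and
# previous-fragment DC prediction stand in for the real quality_index-
# derived matrices and multi-neighbor predictor.

MPEG_ZIGZAG = [  # scan position i -> raster index (standard MPEG order)
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63,
]

# Per the note above, VP3's zigzag order is MPEG's de-zigzag order,
# i.e. the inverse permutation of the table above.
VP3_ZIGZAG = [0] * 64
for _pos, _raster in enumerate(MPEG_ZIGZAG):
    VP3_ZIGZAG[_raster] = _pos

def prepare_fragment(dct_coeffs, dc_pred, q=16):
    """dct_coeffs: 64 transform coefficients in raster order.
    Returns (zigzagged quantized coefficients, the DC value the next
    fragment should predict from)."""
    quantized = [c // q for c in dct_coeffs]
    reconstructed_dc = quantized[0]
    quantized[0] -= dc_pred            # code only the DC residual
    scan = [0] * 64
    for raster in range(64):
        scan[VP3_ZIGZAG[raster]] = quantized[raster]
    return scan, reconstructed_dc
```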
Sounds easy enough, I know. However, that last step about encoding coefficients is grotesquely understated. VP3 doesn’t encode stuff fragment by fragment. First, it encodes all of the DC coefficients, followed by all of the 1st AC coefficients for all fragments, then the 2nd AC coefficients for all fragments, on up to the 63rd AC coefficients. It doesn’t follow a nice, neat, left->right fragment pattern, oh goodness, no. It follows a Hilbert pattern along individual superblocks in each plane. A superblock in the VP3 vernacular is 32×32 samples, or 4×4 fragments.
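To make the traversal concrete, here is a sketch of the coefficient-major loop driven by a generic Hilbert curve over one 4×4-fragment superblock. A caveat on the assumptions: hilbert_d2xy is the textbook curve construction, so the actual start corner and orientation of VP3's curve (and its bottom-up walk through the frame) would need to be checked against the spec.

```python
def hilbert_d2xy(n, d):
    """Map distance d along a Hilbert curve to (x, y) on an n x n grid
    (n a power of two).  Textbook construction; VP3's exact curve
    orientation may differ."""
    x = y = 0
    s = 1
    while s < n:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x        # rotate the quadrant
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

def coded_fragment_order(superblock):
    """superblock: 4x4 grid (list of rows) of fragments; returns the
    fragments in Hilbert order."""
    return [superblock[y][x]
            for x, y in (hilbert_d2xy(4, d) for d in range(16))]

def coefficient_major(fragments):
    """All DC coefficients first, then all 1st ACs, up to the 63rd."""
    for coeff_index in range(64):
        for frag in fragments:
            yield coeff_index, frag[coeff_index]
```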
Encoding the coefficient bitstream is a matter of encoding the VLC representing one of 32 codes, followed by optional data depending on the code. The 32 codes specify whether a coefficient is coded by itself in the bitstream after the VLC, or whether the coefficient is one of the more common small values like +/- 1 or 2, or whether the next AC coefficients for the next n blocks should be set to zero, or whether the current block being visited is at EOB (no more non-zero coefficients), etc. The possibilities are many; 32, in fact, as stated before.
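A toy tokenizer conveys the flavor of this scheme for a single fragment. To be clear: the token names and split points below are made up for illustration; the real 32-entry alphabet (and which values ride along after each VLC) is defined by the VP3/Theora specs, and the real encoder also carries runs across blocks.

```python
# Toy tokenizer for one fragment's 64 zigzagged coefficients,
# illustrating the kinds of codes described above.  Token names and
# thresholds are invented for this sketch, not taken from the spec.

def tokenize(coeffs):
    tokens = []
    run = 0
    # index of the last non-zero coefficient, or -1 if none
    last = max((i for i, c in enumerate(coeffs) if c), default=-1)
    for c in coeffs[:last + 1]:
        if c == 0:
            run += 1               # accumulate a zero run
            continue
        if run:
            tokens.append(("ZERO_RUN", run))
            run = 0
        if c in (-2, -1, 1, 2):
            tokens.append(("SMALL_VALUE", c))   # value implied by the VLC
        else:
            tokens.append(("LITERAL", c))       # value follows the VLC
    tokens.append(("EOB",))        # no more non-zero coefficients
    return tokens
```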
However, there are 32 VLC tables hardcoded in the VP3 coding scheme, 16 each for DC and AC coefficients. Theora makes these 32 tables user-definable. Which table to use? It seems to me that the encoder needs to make one pass through the code stream in order to gather statistics about which table provides the optimal encoding for each group. 4 bits specifying the DC VLC table and 4 more bits specifying the AC VLC table are encoded in front of the main coefficient bitstream.
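One way to realize that two-pass idea: tokenize everything first, then cost each candidate table against the token histogram and keep the cheapest. The tables below are stand-ins; a real encoder would use the code lengths of the 16 DC (or AC) Huffman tables from the spec.

```python
# Sketch of the table-selection pass: given a token stream and a set of
# candidate VLC tables (expressed as token -> code length in bits),
# pick the table that yields the fewest total bits.  The example tables
# in the test are invented, not the spec's.

from collections import Counter

def pick_table(tokens, tables):
    """tables: list of {token: code_length_in_bits} dicts.
    Returns the index of the cheapest table."""
    hist = Counter(tokens)
    costs = [sum(hist[t] * lengths[t] for t in hist) for lengths in tables]
    return min(range(len(tables)), key=costs.__getitem__)
```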
Why is the VP3 coding method this complicated? I have long suspected that the designers inserted all of these algorithmic oddities in a courageous effort to side-step patent claims. Just wait until you see what is involved in encoding an interframe.
See Also:
- Followup post about interframe encoding
- vp3-format.txt: My own textual description of VP3, never 100% completed.
- Theora_I_spec.pdf: The formal Theora I specification. More complete.
I once meddled with a VP3 decoder and must say that you can do the VP3 IDCT by performing a standard IDCT and dividing the coefficients by 4 (I can't remember whether before or after). So the DCT just differs by coefficient scaling.
For the most part, yes. In fact, with my first-pass VP3 decoder for FFmpeg, I just used the typical IDCT facilities and a few other math tricks recommended by the guru (there was also the matter of shifting the coeffs into the right range, I believe). However, there was still the matter of DCT drift: little off-by-1 errors that could accumulate over frames. We initially imported the VP3 IDCT stuff to try to fix some other, more serious errors (which turned out to be caused by incorrect quantizer selection).
Yep, VP3 just standardized on one particular integer approximation of the IDCT. That’s not so different from having to use XviD’s IDCT to decode XviD encodes with maximum quality…
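The scaling claim in these comments is really just linearity: dividing every coefficient by 4 before an IDCT is the same as dividing every output sample by 4 after it. A naive floating-point 1-D DCT-III (a reference transform, not On2's integer approximation) is enough to check that numerically:

```python
import math

def idct_1d(coeffs):
    """Naive 1-D DCT-III ("the" IDCT), floating point.  A reference
    transform for the linearity check, not On2's integer version."""
    n = len(coeffs)
    return [
        (2.0 / n) * sum(
            (0.5 if k == 0 else 1.0) * coeffs[k]
            * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for k in range(n))
        for i in range(n)
    ]

coeffs = [104.0, -3.5, 12.0, 0.0, 7.25, -1.0, 0.5, 2.0]
pre  = idct_1d([c / 4 for c in coeffs])   # scale before the transform
post = [s / 4 for s in idct_1d(coeffs)]   # scale after the transform
assert all(abs(a - b) < 1e-9 for a, b in zip(pre, post))
```

Note that this only shows the two scalings are equivalent in exact arithmetic; with an integer IDCT, where the division happens relative to the rounding steps is exactly where the off-by-1 drift mentioned above creeps in.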
I also agree that many features of VP3 seem to be different just for the sake of being different. I can, however, explain a technical reason behind the ordering of DCT coefficients in the bitstream:
The most obvious benefit of sorting coefficients by frequency is error resilience. If a frame is truncated or damaged partway through, you can decode the undamaged part to get a lower resolution version of the whole frame, as opposed to MPEG*, which leaves you with no data at all in the bottom half of the image.
The less obvious reason is compression gains. I know of only one codec that supports both ordering methods, and that codec is JPEG. Baseline JPEG stores one block at a time, while progressive JPEG uses frequency order. (not exactly like VP3, though. JPEG uses fewer than 64 passes.) jpegtran can losslessly convert one to the other for easy comparison. The nominal purpose of progressive JPEG is to allow you to see a low resolution image while downloading the rest (hence the name). But it turns out that it also tends to compress smaller than baseline for a given quality. There’s more redundancy between two coefficients of the same frequency in neighboring blocks than there is between two coefficients of neighboring frequency in the same block.
I can assure you that compressibility has nothing whatsoever to do with the coefficient ordering. There are only two ways that Theora takes advantage of the fact that coefficients in the same frequency band are correlated. The first is that they’re coded with the same Huffman codebook. But it could just as easily switch from codebook to codebook as it codes a single block. The second is that it codes “EOB runs”, long runs of end-of-block flags starting at the same frequency, using a single token. But you could just as easily maintain a counter between blocks. In fact, Theora already has to maintain counters for the other direction: long zero runs within a block. The order in which coefficients are stored in the bitstream is almost completely independent of the method used to encode them. This holds true for other coding schemes as well (e.g., arithmetic coding could use the coefficients from neighboring blocks for context just as easily as other coefficients in the same block).