I found some notes I wrote nearly 4 years ago about how to encode a VP3 (and, by extension, Theora) intraframe. Where does the time go? This was back when I was hard at work on the native VP3 decoder for FFmpeg. An intraframe is referred to as a golden frame in the VP3 coding scheme. The notion of a golden frame differs slightly from that of a traditional intraframe in that the decoder retains one golden frame until the next golden frame and any interframe between golden frames A and B may refer back to golden frame A in addition to the interframe just prior.
In case anyone were to want to build a VP3 encoder from scratch, it would be prudent to start with just a golden frame encoder. This is the rough process that I perceive for encoding the golden frame:
- Pick a number, any number, between 0..63. Call it the quality_index and initialize the golden frame Y- and C-plane quantization matrices with them.
- Break up the Y, U, and V planes (oh yeah, convert the image to YUV 4:2:0 if it’s not already; also important: encode the image from bottom to top, not top to bottom) into 8×8 blocks of samples, but call them fragments, not blocks.
- Shove each fragment through the On2 discrete cosine transform. Do not use the DCT commonly used in JPEG/MPEG/H.26x codecs as that is not precisely the same as the one that On2 likes to use.
- Quantize the coefficients.
- Reduce the DC coefficients via the VP3 prediction method.
- Zig the coefficients. Zag them, too. Use the inverse zigzag order that you would for MPEG et al. (i.e., MPEG de-zigzag order = VP3 zigzag order).
- Write the bitstream header (3 bytes for a keyframe).
- Encode all of the coefficients into the bitstream.
Sounds easy enough, I know. However, that last step about encoding coefficients is grotesquely understated. VP3 doesn’t encode stuff fragment by fragment. First, it encodes all of the DC coefficients, followed by all of the 1st AC coefficients for all fragments, then the 2nd AC coefficients for all fragments, on up to the 63rd AC coefficients. It doesn’t follow a nice, neat, left->right fragment pattern, oh goodness, no. It follows a Hilbert pattern along individual superblocks in each plane. A superblock in the VP3 vernacular is 32×32 samples, or 4×4 fragments.
Encoding the coefficient bitstream is a matter of encoding the VLC representing one of 32 codes, followed by optional data depending on the code. The 32 codes specify whether a coefficient is coded by itself in the bitstream after the VLC, or if the coefficient ought to be one of the more common small coefficients like +/- 1 or 2, or if the next AC coefficients for the next n blocks should be set to zero, or if the current block being visited is EOB (no more non-zero coefficients)… etc. The possibilities are many– 32, in fact, as stated before.
However, there are 32 VLC tables hardcoded in the VP3 coding scheme, 16 each for DC and AC coefficients. Theora makes these 32 tables user-definable. Which table to use? It seems to me that the encoder needs to make one pass through the code stream in order to gather statistics about which table provides the optimal encoding for each. 4 bits specifying which DC VLC table and 4 more bits specifying AC VLC table are encoded in front of the main coefficient bitstream.
Why is the VP3 coding method this complicated? I have long suspected that the designers inserted all of these algorithmic oddities in a courageous effort to side-step patent claims. Just wait until you see what is involved in encoding an interframe.
See Also:
- Followup post about interframe encoding
- vp3-format.txt: My own textual description of VP3, never 100% completed.
- Theora_I_spec.pdf: The formal Theora I specification. More complete.