Monthly Archives: April 2007

The Legend Of Hilbert

I’ve been wanting to learn how to use a basic vector drawing program for some time now for the purpose of illustrating certain codec concepts more concretely. Sure, this will be for the benefit of others who are curious about the craft. But mostly, I do it for me because, well… me like pictures.

Behold, my first vector drawing, constructed using OpenOffice’s Draw program:


VP3/Theora Superblock Traversal Pattern

When I was first reverse engineering the VP3 format in order to write an English-language description and implement a new decoder for FFmpeg, I figured out the curious pattern that the codec uses to traverse the 4×4 grid of fragments (8×8 blocks of samples) within a VP3 superblock. I posted to the theora-dev mailing list asking if the pattern struck anyone as familiar. Personally, the pattern reminded me of playing the original NES The Legend of Zelda title, sort of like a pattern for traversing rooms in a dungeon. In fact, early iterations of my decoder used the identifier zelda[].

However, someone on the list identified it as resembling a Hilbert curve, discovered by some famous math dude. One of the codec’s designers chimed in on the list and stated that he had never even heard of Hilbert and that the traversal pattern was chosen to meet certain criteria. Any resemblance to the Hilbert curve was to be considered strictly coincidental.

Looking back on that old mailing list traffic, and taking a good look at what a Hilbert curve actually looks like, I may have made a mistake in using the term “Hilbert pattern” to describe the traversal sequence pictured above. It’s a little late now to change it back to “Zelda pattern”– Google demonstrates that the first term sort of caught on for VP3/Theora-related matters.
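For anyone who wants to check the resemblance themselves, here is a quick Python sketch (mine, not anything from the codec) that generates a second-order Hilbert curve over a 4×4 grid using the standard d2xy index-to-coordinate conversion found in most Hilbert curve references. The visit order it prints can be lined up against the superblock traversal pattern pictured above, keeping in mind that the exact orientation depends on axis conventions (VP3 works through the image bottom-to-top):

```python
# Sketch: generate the order-2 Hilbert curve over a 4x4 grid and print
# the visit order, using the standard d2xy index-to-coordinate conversion.

def d2xy(n, d):
    """Convert distance d along the Hilbert curve into (x, y) on an n x n grid."""
    x = y = 0
    s = 1
    while s < n:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y   # flip the quadrant
            x, y = y, x                       # rotate the quadrant
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

order = [d2xy(4, d) for d in range(16)]
print(order)
```

The curve starts at (0, 0), snakes through all 16 cells exactly once, and ends at (3, 0), which is the same dungeon-room quality that made me think of Zelda in the first place.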

ATRAC3 Decoder

Ever so quietly, a new open source ATRAC3 decoder implementation has been slipped into FFmpeg. This decoder handles atrc data inside of RealMedia files or in WAV files.

Thanks to Benjamin Larsson and Maxim Poliakovski for their diligent work on this, as well as the guru for his tireless reviewing efforts and uncompromising code quality standards.

RealAudio samples here and WAV samples here.

About The VP3 Interframe Encoding

I should do a followup to the VP3 golden frame encoding brainstorm while this stuff is still fresh on the brain. Let’s talk about a possible approach for encoding VP3 (and again, by extension, Theora) interframes. Along the way, I’ll discuss the parts that (I hope) can be handled by FFmpeg’s internal facilities.

A VP3 golden frame only encodes a header followed by a coefficient bitstream. An interframe contains a header, several segments describing which superblocks, macroblocks, and fragments in the frame are coded and how, a segment for motion vector data, and finally, the coefficient bitstream. Note that the interframe is concerned with the notion of a macroblock — 2×2 Y fragments + 1 U fragment + 1 V fragment, the same as the traditional JPEG/MPEG concept — whereas the golden frame does not care about macroblocks. This is because motion vectors operate on a per-macroblock basis.

Rough outline: for each macroblock in the interframe, hand the macroblock over to FFmpeg’s libavcodec facility to work its motion estimation magic. I may be making a huge assumption here, but I’m hoping that I can pass lavc a macroblock along with 1 or 2 reference frames (previous frame and golden frame) and ask it to use its selected ME algorithm to search on a half-pel grid and find the best coding mode. The options are:

  • The macroblock is unchanged from the previous frame, or from the golden frame.
  • It uses a fragment offset with a motion vector based on the previous frame or the golden frame.
  • It references a fragment from the previous frame but using one of the last 2 motion vectors.
  • In the most complex case, it uses 4 separate motion vectors, one for each Y block, while all 4 are averaged together for the 2 C planes.
  • And if nothing else will do, it can be declared that the macroblock needs to be intracoded, just like in a golden frame.

One more thing, though: not all of the fragments in the macroblock have to be coded. The encoder can decide that a fragment is similar enough to the same position in the previous frame to warrant leaving it alone. But if a fragment is coded, it must go along with the same coding mode as the other coded fragments in the same macroblock.
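To keep the mode bookkeeping straight, it may help to see the whole mode set laid out in one place. This listing uses the identifiers from FFmpeg’s VP3 decoder (the names are the decoder’s own, not necessarily On2’s, and the ordering shown is my recollection of the default numbering):

```python
# The 8 VP3/Theora macroblock coding modes, using the identifiers from
# FFmpeg's vp3.c.  Names and the default ordering are the decoder's own
# convention, listed here from memory.
CODING_MODES = [
    "MODE_INTER_NO_MV",       # copy from previous frame, zero motion vector
    "MODE_INTRA",             # intracode the macroblock, like a golden frame
    "MODE_INTER_PLUS_MV",     # previous frame + a new motion vector
    "MODE_INTER_LAST_MV",     # previous frame, reusing the last motion vector
    "MODE_INTER_PRIOR_LAST",  # previous frame, reusing the second-to-last vector
    "MODE_USING_GOLDEN",      # copy from golden frame, zero motion vector
    "MODE_GOLDEN_MV",         # golden frame + a new motion vector
    "MODE_INTER_FOURMV",      # 4 Y vectors; chroma uses their average
]

for value, name in enumerate(CODING_MODES):
    print(value, name)
```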

Of course, VP3, like many other codecs, does not require exact matches for motion estimation. Instead, find the best possible block and code the residual difference. Through this process, the encoder will be tracking motion vectors and coding modes for each macroblock. For the 6 constituent fragments of the macroblock, if coded, perform the transform on either the raw samples or the residual, then the zigzagging and DC reduction processes as outlined in the golden frame method. Then…

it’s time to pack it all up into a bitstream.

First, write out the frame header (it’s only a single byte this time). Then, pack information about the coding status of each superblock in the frame. A superblock can either be fully coded (each fragment changed), partially coded (some fragments changed), or not coded at all (the entire superblock is copied from the previous frame). Start by packing a bit for each superblock indicating whether it is partially coded. Any superblock that is not partially coded must, by process of elimination, be either fully coded or not coded at all, so next pack a bit for each remaining superblock indicating which of those two it is. Then, if any of the superblocks are partially coded, pack information about which fragments inside each such superblock are coded (remember the Hilbert pattern for superblock traversal).
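That classification step can be sketched in a few lines, assuming the encoder has a per-fragment coded flag for each superblock (all of the identifiers below are invented for illustration). As I recall, the real bitstream run-length encodes these bit lists rather than storing them raw, a detail this sketch skips:

```python
# Sketch: classify each superblock as fully / partially / not coded from
# its fragments' coded flags, then build the bit lists described above.
# All identifiers here are invented for illustration.

def classify(superblocks):
    """superblocks: list of lists of booleans, one flag per fragment."""
    partial_flags = []    # pass 1: one bit per superblock, "is it partially coded?"
    full_flags = []       # pass 2: for non-partial superblocks, "fully coded?"
    fragment_flags = []   # for partial superblocks only: per-fragment coded bits
    for frags in superblocks:
        coded = sum(frags)
        if 0 < coded < len(frags):
            partial_flags.append(1)
            fragment_flags.extend(int(f) for f in frags)  # in traversal order
        else:
            partial_flags.append(0)
            full_flags.append(1 if coded == len(frags) else 0)
    return partial_flags, full_flags, fragment_flags

# tiny example: one fully coded, one partial, one uncoded superblock
sbs = [[True] * 16, [True] * 8 + [False] * 8, [False] * 16]
print(classify(sbs))
```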

Next up is the macroblock coding mode information. Similar to the process for finding the optimal Huffman tables for VLC coding, some statistics must be gathered for macroblock coding modes because there are a number of different “alphabets” (as the VP3 scheme calls them) which arrange the coding modes in different orders within a list. The modes at the front of the list take fewer bits to code than the modes at the end of the list. Alternatively, if there is a more or less even distribution, the encoder can specify that each coding mode should be encoded with 3 bits (8 possible modes).
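To make the trade-off concrete, here is a sketch of the decision between a rank-ordered alphabet and the flat 3-bit fallback. The per-rank code lengths below are my assumption of roughly how the alphabet VLC behaves (cheap codes at the front of the list, expensive ones at the back), not the exact VP3 tables:

```python
# Sketch: given a histogram of coding modes for the frame, compare the
# bit cost of a rank-ordered alphabet against the flat 3-bits-per-mode
# fallback.  The per-rank code lengths are an assumption.

RANK_BITS = [1, 2, 3, 4, 5, 6, 7, 7]   # assumed VLC lengths per list position

def alphabet_cost(histogram, alphabet):
    """histogram: mode -> count; alphabet: modes ordered front-of-list first."""
    return sum(histogram.get(mode, 0) * RANK_BITS[rank]
               for rank, mode in enumerate(alphabet))

def best_scheme(histogram):
    # an optimal custom alphabet simply sorts modes by descending frequency
    custom = sorted(histogram, key=histogram.get, reverse=True)
    custom_cost = alphabet_cost(histogram, custom)
    flat_cost = 3 * sum(histogram.values())
    return ("custom", custom_cost) if custom_cost < flat_cost else ("flat", flat_cost)

# a skewed distribution: the alphabet wins easily
print(best_scheme({"INTER_NO_MV": 90, "INTER_PLUS_MV": 8, "INTRA": 2}))
# an even distribution: the flat 3-bit scheme wins
print(best_scheme({mode: 10 for mode in range(8)}))
```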

Then there are the motion vectors. Nothing too fancy here; this is probably the most straightforward segment of the bitstream encoding. Just march along the macroblocks and if the coding modes demand any motion vectors (new motion vectors, not referring to the motion vectors used for previous blocks), encode those with the variable bit scheme that VP3 uses for motion vectors.

Finally, there is the coefficient data. Pack it up the same as would be done for a golden frame (stated with the same deceptive simplicity as in the previous post on the matter).


To Encode A VP3 Golden Frame

I found some notes I wrote nearly 4 years ago about how to encode a VP3 (and, by extension, Theora) intraframe. Where does the time go? This was back when I was hard at work on the native VP3 decoder for FFmpeg. An intraframe is referred to as a golden frame in the VP3 coding scheme. The notion of a golden frame differs slightly from that of a traditional intraframe in that the decoder retains one golden frame until the next golden frame and any interframe between golden frames A and B may refer back to golden frame A in addition to the interframe just prior.

In case anyone were to want to build a VP3 encoder from scratch, it would be prudent to start with just a golden frame encoder. This is the rough process that I perceive for encoding the golden frame:

  • Pick a number, any number, between 0..63. Call it the quality_index and use it to initialize the golden frame Y- and C-plane quantization matrices.
  • Break up the Y, U, and V planes (oh yeah, convert the image to YUV 4:2:0 if it’s not already; also important: encode the image from bottom to top, not top to bottom) into 8×8 blocks of samples, but call them fragments, not blocks.
  • Shove each fragment through the On2 discrete cosine transform. Do not use the DCT commonly used in JPEG/MPEG/H.26x codecs as that is not precisely the same as the one that On2 likes to use.
  • Quantize the coefficients.
  • Reduce the DC coefficients via the VP3 prediction method.
  • Zig the coefficients. Zag them, too. Use the inverse zigzag order that you would for MPEG et al. (i.e., MPEG de-zigzag order = VP3 zigzag order).
  • Write the bitstream header (3 bytes for a keyframe).
  • Encode all of the coefficients into the bitstream.
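The zig/zag step above can be pinned down concretely. This sketch takes the standard JPEG/MPEG zigzag scan and inverts it; per the note in the list, that inverted (de-zigzag) order is what VP3 uses as its zigzag order. The table is the usual one; the inversion is the point of the example:

```python
# Sketch: derive the VP3 zigzag order by inverting the standard
# JPEG/MPEG zigzag scan ("MPEG de-zigzag order = VP3 zigzag order").

# standard 8x8 zigzag: position i of the scan holds raster index ZIGZAG[i]
ZIGZAG = [
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63,
]

def invert(perm):
    """Return the inverse permutation: where each raster index lands in the scan."""
    inv = [0] * len(perm)
    for scan_pos, raster_idx in enumerate(perm):
        inv[raster_idx] = scan_pos
    return inv

DEZIGZAG = invert(ZIGZAG)
# sanity check: inverting twice gets the original scan back
assert invert(DEZIGZAG) == ZIGZAG
print(DEZIGZAG[:8])
```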

Sounds easy enough, I know. However, that last step about encoding coefficients is grotesquely understated. VP3 doesn’t encode stuff fragment by fragment. First, it encodes all of the DC coefficients, followed by all of the 1st AC coefficients for all fragments, then the 2nd AC coefficients for all fragments, on up to the 63rd AC coefficients. It doesn’t follow a nice, neat, left->right fragment pattern, oh goodness, no. It follows a Hilbert pattern along individual superblocks in each plane. A superblock in the VP3 vernacular is 32×32 samples, or 4×4 fragments.
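In other words, the outer loop runs over the coefficient index, not over the fragments. A sketch of that ordering, with the Hilbert traversal abstracted away into a precomputed fragment visiting order (names invented):

```python
# Sketch: coefficient-major ordering.  Instead of emitting each fragment's
# 64 coefficients together, emit coefficient 0 (DC) for every fragment in
# traversal order, then AC coefficient 1 for every fragment, and so on up
# to the 63rd AC coefficient.  `fragments` holds per-fragment coefficient
# arrays, already arranged in Hilbert/superblock traversal order.

def coefficient_major(fragments):
    for coeff_index in range(64):
        for frag in fragments:
            yield coeff_index, frag[coeff_index]

# toy example with 3 "fragments" of 64 coefficients each
fragments = [[f * 100 + c for c in range(64)] for f in range(3)]
stream = list(coefficient_major(fragments))
print(stream[:6])   # all three DC coefficients come out first
```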

Encoding the coefficient bitstream is a matter of encoding the VLC representing one of 32 codes, followed by optional data depending on the code. The 32 codes specify whether a coefficient is coded by itself in the bitstream after the VLC, or if the coefficient ought to be one of the more common small coefficients like +/- 1 or 2, or if the next AC coefficients for the next n blocks should be set to zero, or if the current block being visited is EOB (no more non-zero coefficients)… etc. The possibilities are many– 32, in fact, as stated before.

However, there are 32 VLC tables hardcoded in the VP3 coding scheme, 16 each for DC and AC coefficients. (Theora makes these 32 tables user-definable.) Which tables to use? It seems to me that the encoder needs to make one pass through the code stream in order to gather statistics about which table provides the optimal encoding for each coefficient group. 4 bits specifying the DC VLC table and 4 more bits specifying the AC VLC table are encoded in front of the main coefficient bitstream.
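That one-pass statistics gathering could look something like this sketch, with the candidate tables reduced to simple token-to-code-length maps. The table contents here are invented placeholders, not the real VP3 tables; only the selection logic is the point:

```python
# Sketch: pick the VLC table that encodes a token histogram in the
# fewest bits.  Tables are modeled as token -> code-length maps; the
# contents below are invented, not the real hardcoded VP3 tables.

def table_cost(histogram, table):
    return sum(count * table[token] for token, count in histogram.items())

def pick_table(histogram, tables):
    costs = [table_cost(histogram, t) for t in tables]
    best = min(range(len(tables)), key=costs.__getitem__)
    return best, costs[best]

tables = [
    {"EOB": 2, "ZERO_RUN": 3, "SMALL_VAL": 3, "LARGE_VAL": 5},
    {"EOB": 1, "ZERO_RUN": 4, "SMALL_VAL": 4, "LARGE_VAL": 6},
]
histogram = {"EOB": 50, "ZERO_RUN": 10, "SMALL_VAL": 30, "LARGE_VAL": 5}
print(pick_table(histogram, tables))
```

The winning table index is what would get written in those 4 bits ahead of the coefficient bitstream.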

Why is the VP3 coding method this complicated? I have long suspected that the designers inserted all of these algorithmic oddities in a courageous effort to side-step patent claims. Just wait until you see what is involved in encoding an interframe.
