
DCT PR

Some people think that multimedia compression is basically all discrete cosine transform (DCT) and little else.

2 years ago at LinuxTag, I gave a fairly general presentation regarding FFmpeg and open source multimedia hacking (I just noticed that the main page still uses a photo of me and my presentation). I theorized that one problem our little community has when it comes to attracting new multimedia hacking talent is that the field seems scary and mathematically unapproachable. I have this perception that this is what might happen when a curious individual wants to get into multimedia hacking:

I wonder how multimedia compression works?

Well, I’ve heard that everyone uses something called MPEG for multimedia compression.

Further, I have heard something about how MPEG is based around the discrete cosine transform (DCT).

Let’s look up what the DCT is, exactly…


[Image: the discrete cosine transform written out on a chalkboard; clever photo cribbed from a blog actually entitled Discrete Cosine]
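(For reference, the usual 8×8 two-dimensional DCT-II, in the JPEG/MPEG formulation, reads something like:

    F(u,v) = 1/4 · C(u) · C(v) · Σ_{x=0..7} Σ_{y=0..7} f(x,y) · cos[(2x+1)uπ/16] · cos[(2y+1)vπ/16]

where C(k) = 1/√2 for k = 0 and 1 otherwise.)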

At which point the prospective contributor screams and runs away from the possibility of ever being productive in the field.

Now, the original talk discussed how that need not be the case, because DCT is really a minor part of multimedia technology overall; how there are lots and lots of diverse problems in the field yet to solve; and how there is room for lots of different types of contributors.

The notion of DCT’s paramount importance in the grand scheme of multimedia compression persists to this day. While reading the HTML5 spec development mailing list, Silvia Pfeiffer expressed this same line of thinking vis-à-vis Theora:

Even if there is no vendor right now who produces an ASIC for Theora, the components of the Theora codec are not fundamentally different to the components of other DCT based codecs. Therefore, AISCs [sic] that were built for other DCT based codecs may well be adaptable by the ASIC vendor to support Theora.

This prompted me to recall something I read in the MPEG FAQ a long time ago:

MPEG is a DCT based scheme?

The DCT and Huffman algorithms receive the most press coverage (e.g. “MPEG is a DCT based scheme with Huffman coding”), but are in fact less significant when compared to the variety of coding modes signaled to the decoder as context-dependent side information. The MPEG-1 and MPEG-2 IDCT has the same definition as H.261, H.263, JPEG.

A few questions later, the FAQ describes no less than 18 different considerations that help compress video data in MPEG; only the first one deals with transforms. Theora is much the same way. When I wrote the document about Theora’s foundation codec, VP3, I started by listing off all of the coding methods involved: DCT, quantization, run length encoding, zigzag reordering, predictive coding, motion compensation, Huffman entropy coding, and variable length run length Booleans. Theora adds a few more concepts (such as encoding the large amount of stream-specific configuration data).
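Each of those pieces is individually pretty tame, too. To give a flavor, here is roughly what one of them, the zigzag reordering step, boils down to in C: map 64 coefficients from the order in which they were coded back to their positions in an 8×8 block. (The table below is the familiar JPEG/MPEG-style scan pattern; VP3/Theora’s actual scan order and surrounding details differ, so treat this as a sketch of the idea rather than actual Theora code.)

    /* Illustration only: un-zigzag 64 coefficients from scan order into an
     * 8x8 block, using the classic JPEG/MPEG zigzag pattern.  VP3/Theora's
     * real scan order and data types differ. */
    static const unsigned char zigzag[64] = {
         0,  1,  8, 16,  9,  2,  3, 10,
        17, 24, 32, 25, 18, 11,  4,  5,
        12, 19, 26, 33, 40, 48, 41, 34,
        27, 20, 13,  6,  7, 14, 21, 28,
        35, 42, 49, 56, 57, 50, 43, 36,
        29, 22, 15, 23, 30, 37, 44, 51,
        58, 59, 52, 45, 38, 31, 39, 46,
        53, 60, 61, 54, 47, 55, 62, 63
    };

    static void unzigzag_block(const short coeffs[64], short block[64])
    {
        for (int i = 0; i < 64; i++)
            block[zigzag[i]] = coeffs[i];
    }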

I used to have the same idea, though: I was one of the first people to download On2’s VpVision package (the release of their VP3 code) and try to understand the algorithm. I remember focusing on the DCT and trying to find DCT-related code, assuming that it was central to the codec. I was surprised and confused to find that a vast amount of logic was devoted to simply reversing DC coefficient prediction. At the end of a huge amount of frame reconstruction code was a small, humble call to an IDCT function.
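For anyone wondering what reversing DC coefficient prediction even means: the encoder doesn’t transmit each block’s DC (average) coefficient directly; it transmits the difference from a prediction formed out of already-decoded neighboring blocks, and the decoder has to rebuild the real values. Here is a drastically simplified sketch of the idea; VP3’s real scheme weights up to four neighbors and special-cases frame edges and reference frame types, which is exactly where all that frame reconstruction code went.

    /* Heavily simplified illustration of reversing DC prediction: predict
     * each block's DC coefficient from its left and top neighbors (when
     * available) and add the transmitted residual.  VP3's actual predictor
     * is far more elaborate. */
    static void reverse_dc_prediction(const short *dc_residual, short *dc,
                                      int blocks_wide, int blocks_high)
    {
        for (int y = 0; y < blocks_high; y++) {
            for (int x = 0; x < blocks_wide; x++) {
                int i    = y * blocks_wide + x;
                int left = (x > 0) ? dc[i - 1]           : 0;
                int top  = (y > 0) ? dc[i - blocks_wide] : 0;
                int pred;

                if (x > 0 && y > 0) pred = (left + top) / 2;
                else if (x > 0)     pred = left;
                else if (y > 0)     pred = top;
                else                pred = 0;

                dc[i] = dc_residual[i] + pred;
            }
        }
    }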

What I would like to get across here is that Theora is rather different than most video codecs, in just about every way you can name (no, wait: the base quantization matrix for golden frames is the same as the quantization matrix found in JPEG). As for the idea that most DCT-based codecs are all fundamentally the same, ironically, you can’t even count on that with Theora– its DCT is different than the one found in MPEG-1/2/4, H.263, and JPEG (which all use the same DCT). This was likely done in On2’s valiant quest to make everything about the codec just different enough from every other popular codec, which runs quite contrary to the hope that ASIC vendors should be able to simply re-use a bunch of stuff from other codecs.

Sun OMS Has A Spec

A little over a year ago, Sun was making rumblings about a brand new video codec that they were hoping to design from the ground up using known-good (read: patent-unencumbered) coding algorithms. This was to be called the Open Media Stack (OMS). I wrote a post about it, made an obligatory MultimediaWiki page about it, and then promptly forgot all about it.

Today, by way of a blog post by Opera’s Bruce Lawson describing why HTML5’s <video> tag is, well, stalled (to put it charitably), I learned that Sun’s OMS team has published at least 2 specs, the latest one being posted a few weeks ago on June 9, 2009. As he notes, the proposed Oracle buyout of Sun puts the OMS project’s status in limbo.

The spec page links to forum threads where interested parties can discuss issues in the spec. There aren’t many comments, but the ones that exist seem to revolve around the codec’s artificial resolution limitations. For my part, I wonder how to encapsulate it into a container format for transport. The format specifies a sequence header that is 96 bits (12 bytes) in length, though there are provisions for (currently unused) extended data as well as free-form user data. The sequence header would be categorized as extradata in an AVI file or in a MOV file’s video stsd atom. Successive frames should be straightforward to store in a container format since the coding method only seems to employ intra- and inter-frames. Each frame begins with a header which specifies a 37-bit timestamp in reference to a 90 kHz clock. This allows for a recording that’s just over 1 week in length. It’s also one of those highly redundant items if this format were to be stuffed into a more generalized container format.

Anyway, the main video algorithm uses arithmetic and Golomb coding for its entropy coding; 8×8-pixel macroblocks which can either be coded whole or subdivided into four 4×4 sub-blocks; a planar YUV 4:2:0 colorspace; bit-exact 2×2, 4×4, and 8×8 transforms and quantization procedures; spatial prediction using the left, top-left, top, and top-right blocks; and precisely-specified 1/4-pel motion compensation. All in all, it appears relatively simple; the 0.91 spec (annexes and all) weighs in at a mere 96 pages.
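Golomb coding, for anyone who hasn’t run into it, is a way of coding small non-negative integers without storing a code table. Here is a sketch of the unsigned Exp-Golomb variant that H.264 uses for the same job: count leading zero bits, then read that many more bits. I’m not claiming the OMS spec mandates exactly this variant (it may well specify a different flavor), but it conveys the idea.

    #include <stddef.h>

    /* Minimal bit reader plus unsigned Exp-Golomb decoding.  Illustration
     * only: no bounds checking, and not necessarily the exact variant in
     * the OMS spec. */
    struct bitreader {
        const unsigned char *buf;
        size_t bitpos;
    };

    static unsigned get_bit(struct bitreader *br)
    {
        unsigned bit = (br->buf[br->bitpos >> 3] >> (7 - (br->bitpos & 7))) & 1;
        br->bitpos++;
        return bit;
    }

    static unsigned get_ue_golomb(struct bitreader *br)
    {
        int leading_zeros = 0;
        while (!get_bit(br))
            leading_zeros++;

        unsigned value = 1;
        for (int i = 0; i < leading_zeros; i++)
            value = (value << 1) | get_bit(br);

        return value - 1;  /* "1" -> 0, "010" -> 1, "011" -> 2, "00100" -> 3 */
    }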

Naturally, there are no reference implementations or samples available. This got me to wondering about how one goes about creating a reference implementation of a codec (or a production implementation, for that matter). The encoder needs to come first so that it can generate samples. Only then can a decoder be written. Ideally, the initial encoder and decoder should be 2 totally different programs, written by 2 different people, though hopefully working closely together (speedy communication helps). There is wisdom in the FFmpeg community about not basing the primary encoder and decoder on the same base library after we reverse engineered one of the Real video codecs and found a fairly obvious bug that occurred in both sides of the codec.

I think I know one way to ensure that the encoder and decoder don’t share any pieces– write them in different computer languages.

I’m still wondering what kind of application would need to record video continuously for up to a week. How about a closed-circuit security camera? With a terabyte drive, it could store video for a week assuming a bitrate of 1.5 Mbits/sec. That’s roughly the same bitrate as the original MPEG-1 standard. If this coding method compresses more efficiently than MPEG-1, this might be a plausible application.
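For the record, a week at that bitrate only comes to around 113 GB, so a terabyte drive would have plenty of headroom. A quick back-of-the-envelope check:

    /* How much disk does a week of 1.5 Mbit/s video need?
     * (Terabyte taken as 10^12 bytes.) */
    #include <stdio.h>

    int main(void)
    {
        const double bitrate = 1.5e6;           /* bits per second */
        const double seconds = 7 * 24 * 3600.0; /* one week */
        const double bytes   = bitrate * seconds / 8.0;

        printf("One week at 1.5 Mbit/s: %.0f GB\n", bytes / 1e9); /* ~113 GB */
        printf("Weeks per terabyte:     %.1f\n", 1e12 / bytes);   /* ~8.8 */
        return 0;
    }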

Left ARM vs. Right ARM

Måns went and got one of those Sheeva Plugs— the wall-wart form factor device that is a self-contained ARM-based computer. Of course it’s already in service doing FATE duty. This is an ARMv5TE CPU, in contrast to the ARMv7 series on the Beagle Board. This is why there are 2 blocks of ARM results on the FATE page.

In other FATE news, I activated 10 new tests tonight: v210, for the V210 10-bit YUV format; and 9 more fidelity range extension H.264 conformance vectors.

New H.264 Tests In FATE

I have just activated 18 new test specs for FATE. One is a test for one variant of the newly supported DPX format. The other 17 are for various samples in the fidelity range extension suite of test vectors, an extension of H.264 that FFmpeg has supported for some time. It should be noted that more samples from this suite should be forthcoming as soon as I finish downloading the whole thing (something I thought I had done a long time ago).