Some people think that multimedia compression is basically all discrete cosine transform (DCT) and little else.
2 years ago at LinuxTag, I gave a fairly general presentation regarding FFmpeg and open source multimedia hacking (I just noticed that the main page still uses a photo of me and my presentation). I theorized that one problem our little community has when it comes to attracting new multimedia hacking talent is that the field seems scary and mathematically unapproachable. I have this perception that this is what might happen when a curious individual wants to get into multimedia hacking:
I wonder how multimedia compression works?
Well, I’ve heard that everyone uses something called MPEG for multimedia compression.
Further, I have heard something about how MPEG is based around the discrete cosine transform (DCT).
Let’s look up what the DCT is, exactly…
Clever photo cribbed from a blog actually entitled Discrete Cosine
At which point the prospective contributor screams and runs away from the possibility of ever being productive in the field.
Now, the original talk discussed how that need not be the case, because DCT is really a minor part of multimedia technology overall; how there are lots and lots of diverse problems in the field yet to solve; and how there is room for lots of different types of contributors.
The notion of DCT’s paramount importance in the grand scheme of multimedia compression persists to this day. While reading the HTML5 spec development mailing list, Sylvia Pfeiffer expressed this same line of thinking vis-à-vis Theora:
Even if there is no vendor right now who produces an ASIC for Theora, the components of the Theora codec are not fundamentally different to the components of other DCT based codecs. Therefore, AISCs [sic] that were built for other DCT based codecs may well be adaptable by the ASIC vendor to support Theora.
This prompted me to recall something I read in the MPEG FAQ a long time ago:
MPEG is a DCT based scheme?
The DCT and Huffman algorithms receive the most press coverage (e.g. “MPEG is a DCT based scheme with Huffman coding”), but are in fact less significant when compared to the variety of coding modes signaled to the decoder as context-dependent side information. The MPEG-1 and MPEG-2 IDCT has the same definition as H.261, H.263, JPEG.
A few questions later, the FAQ describes no less than 18 different considerations that help compress video data in MPEG; only the first one deals with transforms. Theora is much the same way. When I wrote the document about Theora’s foundation codec, VP3, I started by listing off all of the coding methods involved: DCT, quantization, run length encoding, zigzag reordering, predictive coding, motion compensation, Huffman entropy coding, and variable length run length Booleans. Theora adds a few more concepts (such as encoding the large amount of stream-specific configuration data).
I used to have the same idea, though: I was one of the first people to download On2’s VpVision package (the release of their VP3 code) and try to understand the algorithm. I remember focusing on the DCT and trying to find DCT-related code, assuming that it was central to the codec. I was surprised and confused to find that a vast amount of logic was devoted to simply reversing DC coefficient prediction. At the end of a huge amount of frame reconstruction code was a small, humble call to an IDCT function.
What I would like to get across here is that Theora is rather different than most video codecs, in just about every way you can name (no, wait: the base quantization matrix for golden frames is the same as the quantization matrix found in JPEG). As for the idea that most DCT-based codecs are all fundamentally the same, ironically, you can’t even count on that with Theora– its DCT is different than the one found in MPEG-1/2/4, H.263, and JPEG (which all use the same DCT). This was likely done in On2’s valiant quest to make everything about the codec just different enough from every other popular codec, which runs quite contrary to the hope that ASIC vendors should be able to simply re-use a bunch of stuff used from other codecs.
H.263 and wmv3-based codecs and can optionally do non-rounded hpel, so that’s another two functions that could be shared (3 counting the fullpel position.) The center position for Theora is the average of only two of the four surrounding pixels however, so that’s another function completely unique to Theora.
BTW, http://blog.humboldt.co.uk/2006/02/embedded-theora-video.html has more reasons Theora sucks to implement in DSPs (and ASICs I guess.)
Thanks for that link. I’ve been trying to put together a more concrete post about Theora’s shortcomings.
You’ve hit the nail on the head regarding newbies being scared off by the focus on advanced math, i’ve never been able to get involved due to my lack of math abilities.
And Mike’s point was that there isn’t a focus on advanced mathematics, at least, not at the point of implementation of all these bits of codecs.
The truth is that a lot of the more advanced areas that require implementation have already received a lot of attention. As noted, the DCT is definitely not everything in DCT-based codecs and it has been coded and optimised heavily wherever you may find it in code in the wild.
That’s not to say codec specifications don’t contain quite a lot of mathematics, they often do. But I don’t think you need masses of mathematical background to cover what you need to know to work on a codec.
Now, probably the most important thing to note is that the field of multimedia compression, possibly surprisingly, is not all about codecs. Yes, they’re important because they do the decoding and encoding, but there is so much more work to do in the surrounding software that any contributions are sorely needed from anyone willing and interested to persist to improve the state of code, documentation, and so on.
So, if you’re interested in some project, hang around its community for a while, learn how it works, try to find ways in which you can contribute and learn as you go. If you ask intelligent questions, you should get intelligent answers. It’s been a learning process for all of us involved but if you stick at it, the knowledge builds up over time like with anything else.
@Boris: Robert has it right. There is so much more to this field than DCT. We already have a group of smart people who understand the really advanced stuff like DCT and wavelet mathematics.
I’ll let you in on a little secret: I’m not among them! That’s right — I don’t even understand the DCT, at least not as well as you might expect an experienced multimedia hacker to. And frankly, it doesn’t matter because there are so many other problems to solve.
One of the most important things you can do is simply observe and see what needs to improve, figure out which problems no one is tackling, and commit to working on those. That’s how FATE got rolling. And building FATE didn’t require any knowledge of DCT. Aside from the fact that certain ones introduce bit inexact-ness among platforms. :-)
And the primary reason you don’t need to understand (I)DCT is that you just call dsputil.idct or dsputil.idct_add and don’t care a bit about what it does.
Also, with fixed size like 4×4 or 8×8 I’m quite sure you could explain the DCT with little to any maths (e.g. there are those series of pictures that show what kind of image each AC coefficient on its own produces, I don’t think you need mathematical skill to look at and get some ideas from those pictures).
And anyway, IMO mathematics can’t really explain to you why DCT/IDCT is useful for compression (though I admit I wouldn’t know which field to associate that kind of research with either, maybe general computer science? But it is also about the human eye (biology) etc. pp. In the end that’s probably part of the huge not-quite-scientific field of trial-and-error :-P).
Another DCT-based codec, Bink Video, is famous not for its DCT transform but rather for a method of coding coefficients for it (oh, those Huffman trees!).
I somewhat lost my DCT-scare with looking at such a picture
http://www.cs.sfu.ca/CC/365/mark/material/notes/Chap4/Chap4.2/DCT_basis.gif and thinking that the DCT just finds out for me how much a block in question is made up of those pictures. That also visualizes that most blocks look more like things in the top-left corner and less like in the right or at the bottom (the high-frequency stuff).
If you want “hard” in the realm of multimedia hacking, try reverse engineering a codec derived from MP3 audio used by a game where there is no publicly available encoder and where the only publicly available decoder is inside the game itself (which is nigh on impossible to reverse engineer due to copy protection, boatloads of other code you have to wade through/ignore plus a boatload of MMX/SSE/etc and FPU code)
@Jonathan: ‘If you want “hard†in the realm of multimedia hacking, try reverse engineering’; you really could have cut it off right there. :-) Anything pertaining to binary RE is plenty hard for most people. We have been doing it for so long that we sometimes lose track of that fact.