I wanted to see if I could write an extremely basic VP8 encoder. It turned out to be one of the hardest endeavors I have ever attempted (and arguably one of the least successful).
I started with the Big Buck Bunny title image:
And this is the best encoding that this experiment could yield:
Squint hard enough and you can totally make out the logo. Pretty silly effort, I know. It should also be noted that the resultant .webm file holding that single 400×225 image was 191324 bytes. When FFmpeg decoded it to a PNG, it was only 187200 bytes.
Remember my post about a naive SVQ1 encoder? Long story short, I set out to do the same thing with VP8. (I wanted to do the same thing with VP3/Theora for years. But take a good look at what it would entail to create even the most basic bitstream. As involved as VP8 may be, its bitstream is absolutely trivial compared to VP3/Theora.)
With the naive SVQ1 encoder, the goal was to create a minimally compliant SVQ1 encoded bitstream. For this exercise, I similarly hypothesized what it would take to create the most basic, syntactically correct VP8 bitstream with the least amount of effort. These are the overall steps I came up with:
- Create a basic bitstream header that disables any extra features (no modification of default tables)
- Use a static quantizer
- Use intra 16×16 coding for each macroblock
- Use vertical prediction for the 16×16 intra coding
For coding each macroblock:
- Subtract vertical predictor from each row
- Perform forward transform on each 4×4 sub block
- Perform forward WHT on luma plane DCT coefficients
- Pack the coefficients into the bitstream via the Boolean encoder
It all sounds so simple. But, like I said in the SVQ1 post, it’s all very much like carefully bootstrapping a program to run on a particular CPU, and the VP8 decoder serves as the CPU. I’m confident that I have the bitstream encoding correct because, at the very least, the decoder agrees precisely with the encoder about the numbers represented by those 0s and 1s.
Compromises were made for the sake of getting some vaguely recognizable image encoded in a minimally valid manner. One big stumbling block is that I couldn’t seem to encode an end of block (EOB) condition correctly. I then realized that it’s perfectly valid to just encode a lot of zero coefficients rather than signaling EOB. An encoding travesty, I know, and likely one reason that the resulting filesize is so huge.
More drama occurred when I hit my first block that had all zeros. There were complications in that situation that I couldn’t seem to avoid. So I forced the first AC coefficient to be 1 in that case. Hey, the decoder liked it.
As for the generally weird look of the decoded image, I’m thinking that could either be: A) an artifact of forcing 16×16 vertical prediction or; or B) a mistake in the way that I transformed and predicted stuff before sending it to the decoder. The smart money is on a combination of both A and B.
Then again, as the SVQ1 experiment demonstrated, I shouldn’t expect extraordinary visual quality when setting the bar this low (i.e., just getting some bag of bits that doesn’t make the decoder barf).