My toy VP8 encoder outputs a lot of textual data to illustrate exactly what it’s doing. For those who may not be exactly clear on how this or related algorithms operate, this may prove illuminating.
Let’s look at subblock 0 of macroblock 0 of a luma plane:
    subblock 0 (original)
      92  91  89  86
      91  90  88  86
      89  89  89  88
      89  87  88  93
Since it’s in the top-left corner of the image to be encoded, the phantom samples above and to the left are implicitly 128 for the purpose of intra prediction (in the VP8 algorithm).
    subblock 0 (original)
          128 128 128 128
     128   92  91  89  86
     128   91  90  88  86
     128   89  89  89  88
     128   89  87  88  93
Using the 4×4 DC prediction mode means averaging the 4 top predictors and 4 left predictors. So, the predictor is 128. Subtract this from each element of the subblock:
    subblock 0, predictor removed
     -36 -37 -39 -42
     -37 -38 -40 -42
     -39 -39 -39 -40
     -39 -41 -40 -35
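The prediction and subtraction steps can be sketched in a few lines of Python. This is only an illustration: the helper name `dc_predictor` is made up here and is not taken from the encoder. The rounding follows the "+4, then divide by 8" form that appears later in this post.

```python
def dc_predictor(top, left):
    """Rounded average of the 4 top and 4 left neighbors (hypothetical helper)."""
    return (sum(top) + sum(left) + 4) >> 3

original = [
    [92, 91, 89, 86],
    [91, 90, 88, 86],
    [89, 89, 89, 88],
    [89, 87, 88, 93],
]

# Top-left corner of the frame: every out-of-frame neighbor is 128.
top = [128] * 4
left = [128] * 4

dc = dc_predictor(top, left)
residual = [[sample - dc for sample in row] for row in original]

print("predictor:", dc)
for row in residual:
    print(row)
```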
Next, run the subblock through the forward transform:
    subblock 0, transformed
     -312    7    1    0
        1   12   -5    2
        2   -3    3   -1
        1    0   -2    1
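The post does not show the transform itself. The sketch below is modeled on the integer 4×4 forward DCT used by the libvpx VP8 encoder (the constants 2217 and 5352 and the rounding offsets come from there). Whether the toy encoder uses this exact routine is an assumption, and `fdct4x4` is an illustrative name, though the routine does reproduce the coefficients listed for this subblock.

```python
def fdct4x4(block):
    """Integer 4x4 forward DCT, libvpx style: row pass, then column pass."""
    tmp = []
    for r in block:
        a1 = (r[0] + r[3]) * 8
        b1 = (r[1] + r[2]) * 8
        c1 = (r[1] - r[2]) * 8
        d1 = (r[0] - r[3]) * 8
        tmp.append([
            a1 + b1,
            (c1 * 2217 + d1 * 5352 + 14500) >> 12,
            a1 - b1,
            (d1 * 2217 - c1 * 5352 + 7500) >> 12,
        ])

    out = [[0] * 4 for _ in range(4)]
    for i in range(4):
        a1 = tmp[0][i] + tmp[3][i]
        b1 = tmp[1][i] + tmp[2][i]
        c1 = tmp[1][i] - tmp[2][i]
        d1 = tmp[0][i] - tmp[3][i]
        out[0][i] = (a1 + b1 + 7) >> 4
        out[1][i] = ((c1 * 2217 + d1 * 5352 + 12000) >> 16) + (1 if d1 != 0 else 0)
        out[2][i] = (a1 - b1 + 7) >> 4
        out[3][i] = (d1 * 2217 - c1 * 5352 + 51000) >> 16
    return out

# Residual of subblock 0 after removing the predictor of 128.
residual = [
    [-36, -37, -39, -42],
    [-37, -38, -40, -42],
    [-39, -39, -39, -40],
    [-39, -41, -40, -35],
]

transformed = fdct4x4(residual)
for row in transformed:
    print(row)
```

Python's `>>` on negative integers rounds toward negative infinity, which matches the arithmetic right shift the C code relies on.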
Quantize each element by integer division, truncating toward zero (so -5 / 4 becomes -1); the DC (first element) and AC (remaining elements) quantizers are both 4:
    subblock 0, quantized
     -78   1   0   0
       0   3  -1   0
       0   0   0   0
       0   0   0   0
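A minimal sketch of this quantization step, assuming the coefficients are held as a flat list in raster order with the DC term first. The function name is illustrative. The division is done on the magnitude so that negative values truncate toward zero, which is what the numbers above imply (Python's `//` alone would round -5 / 4 down to -2).

```python
def quantize(coeffs, dc_q, ac_q):
    """Divide each coefficient by its quantizer, truncating toward zero."""
    levels = []
    for i, c in enumerate(coeffs):
        q = dc_q if i == 0 else ac_q
        magnitude = abs(c) // q
        levels.append(magnitude if c >= 0 else -magnitude)
    return levels

transformed = [-312, 7, 1, 0,
               1, 12, -5, 2,
               2, -3, 3, -1,
               1, 0, -2, 1]

quantized = quantize(transformed, dc_q=4, ac_q=4)
print(quantized)
```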
The above block contains the coefficients that are actually transmitted (zigzagged and entropy-encoded) through the bitstream and decoded on the other end.
The decoding process looks something like this. After the same coefficients are decoded and rearranged, they are dequantized (multiplied by the original quantizers):
    subblock 0, dequantized
     -312    4    0    0
        0   12   -4    0
        0    0    0    0
        0    0    0    0
Note that these coefficients are not exactly the same as the original, pre-quantized coefficients. This is a large part of where the “lossy” in “lossy video compression” comes from.
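To make the loss visible, here is a small sketch that dequantizes the transmitted levels and compares them with the pre-quantized coefficients. The helper name and the flat raster-order lists are illustrative choices, not the encoder's own structures.

```python
def dequantize(levels, dc_q, ac_q):
    """Multiply each level by its quantizer (DC first, AC for the rest)."""
    return [lv * (dc_q if i == 0 else ac_q) for i, lv in enumerate(levels)]

transformed = [-312, 7, 1, 0, 1, 12, -5, 2, 2, -3, 3, -1, 1, 0, -2, 1]
quantized = [-78, 1, 0, 0, 0, 3, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

dequantized = dequantize(quantized, dc_q=4, ac_q=4)

# Per-coefficient error introduced by quantization.
error = [t - d for t, d in zip(transformed, dequantized)]

print(dequantized)
print(error)
```

With a quantizer of 4, every error term has magnitude at most 3, and most of the small high-frequency coefficients are dropped entirely.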
Next, the decoder generates a base predictor subblock. In this case, it’s all 128 (DC prediction for top-left subblock):
    subblock 0, predictor
     128 128 128 128
     128 128 128 128
     128 128 128 128
     128 128 128 128
Finally, the dequantized coefficients are shoved through the inverse transform and added to the base predictor block:
    subblock 0, reconstructed
      91  91  89  85
      90  90  89  87
      89  88  89  90
      88  88  89  92
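The reconstruction step can be sketched as follows. The inverse transform is modeled on the reference inverse DCT in the VP8 bitstream guide (RFC 6386), including its constants 20091 and 35468. The function names, the 2-D list layout, and the clamp to 0..255 are choices made for this illustration rather than a copy of the toy encoder.

```python
COSPI8SQRT2MINUS1 = 20091
SINPI8SQRT2 = 35468

def _butterfly(i0, i1, i2, i3):
    """One 4-point inverse transform stage (no final rounding)."""
    a1 = i0 + i2
    b1 = i0 - i2
    c1 = ((i1 * SINPI8SQRT2) >> 16) - (i3 + ((i3 * COSPI8SQRT2MINUS1) >> 16))
    d1 = (i1 + ((i1 * COSPI8SQRT2MINUS1) >> 16)) + ((i3 * SINPI8SQRT2) >> 16)
    return a1 + d1, b1 + c1, b1 - c1, a1 - d1

def idct4x4(coeffs):
    """Vertical pass, then horizontal pass with (x + 4) >> 3 rounding."""
    tmp = [[0] * 4 for _ in range(4)]
    for c in range(4):
        column = _butterfly(coeffs[0][c], coeffs[1][c], coeffs[2][c], coeffs[3][c])
        for r in range(4):
            tmp[r][c] = column[r]
    return [[(v + 4) >> 3 for v in _butterfly(*tmp[r])] for r in range(4)]

def reconstruct(predictor, residual):
    """Add the residual to the predictor and clamp to the 8-bit sample range."""
    return [[max(0, min(255, p + d)) for p, d in zip(prow, drow)]
            for prow, drow in zip(predictor, residual)]

dequantized = [
    [-312, 4, 0, 0],
    [0, 12, -4, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
predictor = [[128] * 4 for _ in range(4)]

reconstructed = reconstruct(predictor, idct4x4(dequantized))
for row in reconstructed:
    print(row)
```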
Again, not exactly the same as the original block, but an incredible facsimile thereof.
Note that this decoding-after-encoding demonstration is not merely pedagogical: the encoder has to decode the subblock because the encoding of successive subblocks may depend on it. The encoder can’t rely on the original representation of the subblock because the decoder won’t have that; it will only have the reconstructed block.
For example, here’s the next subblock:
    subblock 1 (original)
      84  84  87  90
      85  85  86  93
      86  83  83  89
      91  85  84  87
Let’s assume DC prediction once more. The 4 top predictors are still all 128 since this subblock lies along the top row. However, the 4 left predictors are the right edge of the subblock reconstructed in the previous example:
    subblock 1 (original)
          128 128 128 128
      85   84  84  87  90
      87   85  85  86  93
      90   86  83  83  89
      92   91  85  84  87
The DC predictor is computed as (128 + 128 + 128 + 128 + 85 + 87 + 90 + 92 + 4) / 8 = 108
(the extra +4 is for rounding considerations). (Note that in this case, using the original subblock’s right edge would also have resulted in 108, but that’s beside the point.)
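The same calculation as a short sketch, taking the left predictors from the right-hand column of the reconstructed subblock 0 and, for comparison, from the original subblock 0. The helper name `dc_predictor` is illustrative.

```python
def dc_predictor(top, left):
    """Rounded average of the 4 top and 4 left neighbors."""
    return (sum(top) + sum(left) + 4) >> 3

reconstructed_0 = [
    [91, 91, 89, 85],
    [90, 90, 89, 87],
    [89, 88, 89, 90],
    [88, 88, 89, 92],
]
original_0 = [
    [92, 91, 89, 86],
    [91, 90, 88, 86],
    [89, 89, 89, 88],
    [89, 87, 88, 93],
]

top = [128] * 4                                    # still along the top row of the frame
left_recon = [row[3] for row in reconstructed_0]   # what the decoder will actually have
left_orig = [row[3] for row in original_0]         # what the decoder will never see

dc_from_recon = dc_predictor(top, left_recon)
dc_from_orig = dc_predictor(top, left_orig)
print(left_recon, dc_from_recon)
print(left_orig, dc_from_orig)
```

The two results happen to agree here, but only the reconstructed edge is guaranteed to match what the decoder computes.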
Continuing through the same process as in subblock 0:
    subblock 1, predictor removed
     -24 -24 -21 -18
     -23 -23 -22 -15
     -22 -25 -25 -19
     -17 -23 -24 -21

    subblock 1, transformed
     -173   -9   14   -1
        2  -11   -4    0
        1    6   -2    3
       -5    1    0    1

    subblock 1, quantized
     -43  -2   3   0
       0  -2  -1   0
       0   1   0   0
      -1   0   0   0

    subblock 1, dequantized
     -172   -8   12    0
        0   -8   -4    0
        0    4    0    0
       -4    0    0    0

    subblock 1, predictor
     108 108 108 108
     108 108 108 108
     108 108 108 108
     108 108 108 108

    subblock 1, reconstructed
      84  84  87  89
      86  85  87  91
      86  83  84  89
      90  85  84  88
I hope this concrete example (straight from a working codec) clarifies this part of the VP8 process.
Does VP8 really use the average of both blocks for DC prediction? Other codecs chose one specific block and take the DC prediction of that…
I mean, the way you do it, DC prediction would work horribly badly if, e.g., the video started with an all-black first 16 rows.
Nice post. Reimar: There are several prediction modes, including ones that ignore either the top or left ‘phantom samples’ (is that the official name?).
@Tim: Phantom samples is my name for it. :-) My encoder maintains a macroblock data structure named phantom_mb that manages the out of frame stuff.
@Reimar: Yes, VP8 always averages in the phantom samples, as opposed to H.264 which will omit the phantom samples.
Not completely sure about H.264, but for MPEG-4 it’s not really that it omits them; rather, it runs what you could call an extremely basic edge-detection filter on the 3 surrounding DC values and then picks the one that is most likely on the same side as the current MB.
In the case of the border pixels this indeed ends up discarding those outside the frame.
And the algorithm is to pick the one out of left and top that differs more from the top-left one.
Which for any sharp mostly horizontal or mostly vertical lines certainly gives a much better predictor than averaging (of course there are also cases where it does much worse).