VQ Case Study: Sorenson Video 1

Sorenson Video 1 (SVQ1) makes me sentimental. It had a lot to do with why I started multimedia hacking. Strange that it all seems so simple now.

SVQ1 is a stark contrast to our last subject, Cinepak. SVQ1 does not store its codebooks in the encoded video bitstream. Rather, the codebooks are a hardwired characteristic of the coding scheme. That’s actually a really good thing considering that the algorithm is a hierarchical multistage vector quantizer.

“Whoa! Just when I was getting over my aversion to the words ‘vector’ and ‘quantizer’ strung together, they throw in ‘hierarchical’ and ‘multistage’.” Don’t worry, it’s also pretty simple even though it sounds complicated. “Hierarchical”, in this context, simply means that a block of pixels is broken down into smaller blocks of pixels. In SVQ1, the block size starts at 16×16. The encoder may decide to encode a block at that size, or slice it in half into two 16×8 blocks and code each of those individually. The encoder may further subdivide the blocks to obtain 8×8, 8×4, 4×4, and 4×2 sizes.
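
That halving pattern can be sketched in a few lines of Python. This is just the sequence of block dimensions, not the actual bitstream syntax; the `split_block` helper is my own illustrative name:

```python
def split_block(w, h):
    """Halve an SVQ1 block: a square block splits into two stacked
    halves; a wide block splits into two side-by-side halves."""
    if w == h:
        return (w, h // 2)   # e.g. 16x16 -> two 16x8 blocks
    return (w // 2, h)       # e.g. 16x8 -> two 8x8 blocks

# Walk the hierarchy from the largest block down to the smallest.
sizes = [(16, 16)]
while sizes[-1] != (4, 2):
    sizes.append(split_block(*sizes[-1]))
print(sizes)  # [(16, 16), (16, 8), (8, 8), (8, 4), (4, 4), (4, 2)]
```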

The “multistage” word refers to the fact that SVQ1 has several layers, or stages, of codebooks that can be combined in order to reconstruct a given vector. In fact, for each block size from 4×2 up through 8×8, there are 6 stages of codebooks, and each codebook contains 16 vectors.
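
The reconstruction itself is just a sum. Here is a toy sketch for the 4×2 block size (8 samples per vector); the random numbers are stand-ins, since the real SVQ1 codebooks are fixed tables inside the codec:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in tables: 6 stages x 16 entries x 8 samples for a 4x2 block.
# Real SVQ1 codebooks are hardwired constants; these random values
# only illustrate the shape of the data.
codebooks = rng.integers(-8, 8, size=(6, 16, 8))

def reconstruct(stage_indices):
    """A coded vector is the sum of one chosen entry from each of
    the first len(stage_indices) codebook stages (0 to 6 stages)."""
    vec = np.zeros(8, dtype=np.int64)
    for stage, idx in enumerate(stage_indices):
        vec += codebooks[stage, idx]
    return vec

# Picking entry 3 from stage 0 and entry 7 from stage 1:
vec = reconstruct([3, 7])
```

Using zero stages is also legal, which amounts to coding the block as flat (or, with mean removal, as just its mean).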

SVQ1 has one trick up its sleeve that I have never seen used anywhere else so far: Vector mean removal. Again, this may sound fancy, but all it means is to average all the numbers in a vector and subtract the average from each number. If all of the numbers in the vector are relatively similar, the mean-removed vector will have rather small components and thus be more efficient to code.
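
In code, mean removal is a one-liner split into two parts. A minimal sketch (`remove_mean` is my own name for it, not anything from the codec):

```python
def remove_mean(vector):
    """Split a vector into its (rounded) mean and the mean-removed
    residual; the small residual is what gets matched to codebooks."""
    mean = round(sum(vector) / len(vector))
    return mean, [s - mean for s in vector]

# Eight similar samples collapse to a mean plus tiny differences.
mean, residual = remove_mean([100, 102, 101, 99, 100, 101, 98, 99])
print(mean, residual)  # 100 [0, 2, 1, -1, 0, 1, -2, -1]
```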

Remember how a major problem in VQ deals with matching vectors from a source image to the best possible vector from a generated codebook? This problem is exacerbated in SVQ1 because the encoder must evaluate the layers of codebooks in combination with the hierarchy. For each vector of samples, the encoder must remove the mean, encode the mean separately, and approximate the mean-removed vector as the sum of between 0 and 6 vectors drawn from the different stages of codebooks; or it may decide to split the block in half and repeat the process; or it may settle on some combination of hierarchical splitting and codebook sequences. Then it moves on to the next vector.
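
One naive way to picture that search is a greedy pass over the stages. This is a sketch of the general idea only, under my own assumptions; it ignores bit costs, mean removal, and the split decision, and it is not the logic from any real SVQ1 encoder:

```python
import numpy as np

def greedy_stage_search(target, codebooks, max_stages=6):
    """Greedy multistage match: at each stage pick the entry that
    most reduces the remaining error, and stop as soon as a stage no
    longer helps.  A real encoder would also weigh the bit cost of
    each stage against the option of splitting the block instead."""
    residual = target.astype(np.int64)
    chosen = []
    for stage in range(max_stages):
        errs = ((codebooks[stage] - residual) ** 2).sum(axis=1)
        best = int(errs.argmin())
        if errs[best] >= (residual ** 2).sum():
            break  # no entry in this stage reduces the error
        chosen.append(best)
        residual = residual - codebooks[stage, best]
    return chosen, residual
```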

SVQ1 interframes can also take advantage of motion vectors; the residual between a block from the previous frame and the current frame is encoded with the same hierarchical multistage method used for intraframes.
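
Forming that residual is simple array arithmetic. A rough sketch (my own helper name, and it skips the bounds checking and half-pel details a real decoder needs):

```python
import numpy as np

def interframe_residual(prev, cur, x, y, mv, size=16):
    """Difference between the current block and the block the motion
    vector points at in the previous frame; this residual then goes
    through the same hierarchical multistage coding as an intrablock."""
    dx, dy = mv
    pred = prev[y + dy : y + dy + size, x + dx : x + dx + size]
    return cur[y : y + size, x : x + size].astype(np.int64) - pred

# If the current block is an exact copy of a shifted block from the
# previous frame, the residual is all zeros and codes very cheaply.
prev = np.arange(32 * 32, dtype=np.int64).reshape(32, 32)
cur = prev.copy()
cur[0:16, 0:16] = prev[1:17, 2:18]
print((interframe_residual(prev, cur, 0, 0, (2, 1)) == 0).all())  # True
```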

It all sounds like a lot of trouble to encode, but the reward was that relatively slow machines from the 1998-1999 timeframe could easily play back video at resolutions like 480×212 on CPUs that would still struggle with MPEG-type data at a similar resolution, and the files were small enough to distribute over the dialup-dominated internet.

Where do the codebooks come from? As mentioned, the codebooks are a hardcoded component of the SVQ1 coding scheme. I strongly suspect (or would like to think) that they were constructed during a long and arduous design phase that involved countless pieces of sample media drawn from numerous diverse genres of material. Shipping the codebooks in the bitstream would also have incurred huge overhead in the final encoded file. So out of the two classic VQ problems (finding optimal codebooks and matching source material to the codebooks), that’s at least one less problem for the encoder to worry about. I, for one, am thankful for that, since SVQ1 is, to date, the only encoder I have written for the FFmpeg project.

SVQ1 has a few weaknesses, to be sure (or “trade-offs”, to be neutral about it). I never appreciated the fact that the various planes are stored separately: all the Y data is encoded in the stream, followed by all the U data, and then the V data. This means the entire frame must be decoded before any of it can be processed (e.g., converted to RGB) for display. Owing to its large vector size (16×16) and its native colorspace (YUV 4:1:0, where each 4×4 block of luma shares one U and one V sample), I can understand why it would be implausible to chop the data into macroblock-type units. However, it might have been nice for the bitstream header to encode relative offsets to the start of the U and V data segments; a decoder could have used those to optimize things like slice dispatch.
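
To see why 4:1:0 makes the chroma planes so lopsided relative to luma, a back-of-the-envelope calculation (sample counts only, not the actual bitstream layout):

```python
def plane_sample_counts(width, height):
    """YUV 4:1:0 sample counts: each 4x4 block of luma shares a
    single U and a single V sample, so each chroma plane holds
    1/16 as many samples as the luma plane."""
    y = width * height
    chroma = (width // 4) * (height // 4)
    return y, chroma, chroma

# For the 480x212 example resolution mentioned above:
print(plane_sample_counts(480, 212))  # (101760, 6360, 6360)
```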

Another problem (that I perceive, and I’m pretty sure the guru agrees with me on this point) is that SVQ1 uses a breadth-first algorithm for traversing the vector hierarchy rather than a depth-first algorithm. It’s a technical nitpick, sure, and I can’t remember all the details, but I strongly recall that it was a headache to code.

Through it all, SVQ1 is still one of my favorite codecs.

Further Reading: