Simple YUV Coding Formats
by Mike Melanson (mike at multimedia.cx)
v1.1: December 3, 2004


=======================================================================
NOTE: The information in this document is now maintained in Wiki format
at:
  http://wiki.multimedia.cx/index.php?title=ATI_VCR1
  http://wiki.multimedia.cx/index.php?title=Cirrus_Logic_AccuPak
  http://wiki.multimedia.cx/index.php?title=Creative_YUV
  http://wiki.multimedia.cx/index.php?title=Video_XL
=======================================================================


  Copyright (c) 2004 Mike Melanson
  Permission is granted to copy, distribute and/or modify this document
  under the terms of the GNU Free Documentation License, Version 1.2
  or any later version published by the Free Software Foundation;
  with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
  A copy of the license is included in the section entitled "GNU
  Free Documentation License".


Contents
--------
 * Introduction
 * ATI VCR1
 * Cirrus Logic AccuPak (CLJR)
 * Creative YUV (CYUV)
 * Miro/Pinnacle Video XL (VIXL/PIXL)
 * References
 * ChangeLog
 * GNU Free Documentation License


Introduction
------------
There are many ways to code, store and transport YUV video data. This file 
documents various simple methods used for coding such data.

A working knowledge of YUV colorspaces is assumed in this document. For
more information about YUV basics, see the references at the end of this
document.

About terminology: YUV is the same as YCbCr for the purposes of this 
document. Y represents luminance values. U = Cb represents blue 
chrominance values. V = Cr represents red chrominance values. This 
document will also sometimes refer to the U and V samples collectively as 
C samples.


ATI VCR1
--------
The ATI VCR1 codec, identified by the fourcc VCR1, uses differential 
coding to pack Y samples. C samples are left alone. VCR1 is based on a YUV 
4:1:0 colorspace. This means that, in each block of 4x4 pixels, every 
pixel has its own Y sample while the entire block shares one U sample and 
one V sample.

The format of a VCR1-encoded video chunk is as follows:

  bytes 0-31    16 16-bit, signed, little-endian deltas used in this frame
  bytes 32..    encoded YUV data

The deltas appear to be 16 bits wide, although the extra width is of
little consequence since the Y samples to which they are applied are
only 8-bit numbers.
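
As a minimal sketch of the header layout above, the delta table could be
read like this. The function name vcr1_read_deltas is hypothetical, and
the code assumes the caller has already checked that the chunk holds at
least 32 bytes.

```c
#include <stdint.h>

/* Hypothetical helper: read the 16 signed, little-endian 16-bit deltas
 * that open a VCR1 chunk. Assumes chunk points to at least 32 bytes. */
static void vcr1_read_deltas(const uint8_t *chunk, int16_t deltas[16])
{
    for (int i = 0; i < 16; i++) {
        /* Low byte comes first, then the high byte. */
        unsigned raw = chunk[2 * i] | ((unsigned)chunk[2 * i + 1] << 8);
        /* Sign-extend from 16 bits without relying on
         * implementation-defined narrowing conversions. */
        deltas[i] = (int16_t)((int)(raw ^ 0x8000u) - 0x8000);
    }
}
```

The encoded YUV data would then start at chunk + 32.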

The YUV data is coded after the initial deltas. The data is coded as:

  luminance/chrominance line
  luminance line
  luminance line
  luminance line
  [...]

Every fourth line, starting with line 0, contains both luminance (Y) and 
chrominance (C) data. The other lines only contain Y data. 

Each Y/C line begins with 4 offsets to be used when decoding the Y data 
for the next 4 lines:

  byte 0    offset for this line's Y data
  byte 1    offset for second line's Y data
  byte 2    offset for third line's Y data
  byte 3    offset for fourth line's Y data
  bytes 4.. Y/C data

For the remainder of the data on a Y/C line, these 6 pieces of data:

  Y0 Y1 Y2 Y3 U V

are encoded within groups of 4 bytes of the bytestream. Y0..Y3 are the
next 4 Y samples in the line while U and V are the C samples for the 4 Y
samples as well as the 4 Y samples on each of the next 3 lines (since
this is a YUV 4:1:0 colorspace). The 4 bytes in the group have the
following meaning:

   byte0     byte1    byte2    byte3
  Y3i Y2i      V     Y1i Y0i     U

Bytes 1 and 3 correspond to the V and U samples, respectively. Bytes 0
and 2 break down into 4 4-bit nibbles which do not actually represent
the Y samples. Instead, they index into the delta table from the start
of the frame. The indexed signed delta is applied to this line's Y
offset. For example,

  Y0 = offset + delta_table [ byte2 & 0x0F ]
  Y3 = offset + delta_table [ byte0 >> 4 ]

For the other lines that only contain Y data, each group of 4 bytes 
decodes to 8 Y samples in a similar manner as on the Y/C lines:

   byte0    byte1    byte2    byte3
  Y5i Y4i  Y7i Y6i  Y1i Y0i  Y3i Y2i
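
The two nibble layouts above can be sketched as a pair of unpacking
helpers. The function names are hypothetical. The helpers only recover
the delta table indices and the C samples; each Y sample is then derived
from the line's offset and the indexed delta as described above.

```c
#include <stdint.h>

/* Hypothetical helper: split a 4-byte group from a Y/C line into its
 * four delta table indices (Y0i..Y3i) plus the U and V samples. */
static void vcr1_unpack_yc_group(const uint8_t g[4], uint8_t idx[4],
                                 uint8_t *u, uint8_t *v)
{
    idx[0] = g[2] & 0x0F;   /* Y0i: low nibble of byte 2  */
    idx[1] = g[2] >> 4;     /* Y1i: high nibble of byte 2 */
    idx[2] = g[0] & 0x0F;   /* Y2i: low nibble of byte 0  */
    idx[3] = g[0] >> 4;     /* Y3i: high nibble of byte 0 */
    *v = g[1];
    *u = g[3];
}

/* Hypothetical helper: split a 4-byte group from a Y-only line into
 * its eight delta table indices (Y0i..Y7i). */
static void vcr1_unpack_y_group(const uint8_t g[4], uint8_t idx[8])
{
    idx[0] = g[2] & 0x0F;  idx[1] = g[2] >> 4;
    idx[2] = g[3] & 0x0F;  idx[3] = g[3] >> 4;
    idx[4] = g[0] & 0x0F;  idx[5] = g[0] >> 4;
    idx[6] = g[1] & 0x0F;  idx[7] = g[1] >> 4;
}
```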


Cirrus Logic AccuPak (CLJR)
---------------------------
The Cirrus Logic AccuPak codec, identified by the fourcc CLJR, packs 4 Y 
samples and 2 C samples into 32 bits by representing each Y sample with 5 
bits and each C sample with 6 bits. It is essentially a scaled-down method 
of coding YUV 4:1:1, where each group of 4 pixels on a line has one 
luminance sample per pixel but shares a single pair of C samples.

Each set of 32 bits represents 4 pixels on a line:

  p0 p1 p2 p3

For each set of 32 bits, read left -> right:

  p3.Y = next 5 bits
  p2.Y = next 5 bits
  p1.Y = next 5 bits
  p0.Y = next 5 bits
  Cb/U = next 6 bits
  Cr/V = next 6 bits
             -------
             32 bits

Thus, the first 5 bits represent the Y sample for the last pixel in the 
group of 4 pixels.
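
A minimal sketch of the unpacking follows. It assumes the 32 bits have
already been gathered into one integer with the first bit read in the
most significant position; this document does not specify how the bytes
on disk map to that integer. The struct and function names are
hypothetical.

```c
#include <stdint.h>

/* Reduced-precision samples for one group of 4 pixels. */
struct cljr_group {
    uint8_t y[4];   /* 5-bit luminance for p0..p3 */
    uint8_t u;      /* 6-bit Cb/U */
    uint8_t v;      /* 6-bit Cr/V */
};

/* Assumption: the first bit read sits in bit 31 of word. */
static struct cljr_group cljr_unpack(uint32_t word)
{
    struct cljr_group g;
    g.y[3] = (word >> 27) & 0x1F;   /* first 5 bits: last pixel */
    g.y[2] = (word >> 22) & 0x1F;
    g.y[1] = (word >> 17) & 0x1F;
    g.y[0] = (word >> 12) & 0x1F;
    g.u    = (word >>  6) & 0x3F;
    g.v    =  word        & 0x3F;
    return g;
}
```

To obtain 8-bit output, a decoder would presumably shift each Y value
left by 3 bits and each C value left by 2 bits.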


Creative YUV (CYUV)
-------------------
Creative YUV, identified by the fourcc CYUV, uses differential coding to 
effectively compress each Y, U, and V sample to 4 bits with some overhead 
at the start of each line. The codec operates on a YUV 4:1:1 colorspace 
which means that each group of 4 pixels on a line has 1 Y sample per 
pixel, but only 1 of each C sample for the entire group.

A chunk of CYUV-encoded data is laid out as:

  bytes 0-15    signed Y predictor byte values
  bytes 16-31   signed U predictor byte values
  bytes 32-47   signed V predictor byte values
  bytes 48..    lines of CYUV-encoded data
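
As a sketch of the chunk layout above, the three predictor tables could
be loaded as follows. The struct and function names are hypothetical,
and the code assumes the chunk holds at least 48 bytes.

```c
#include <stdint.h>

/* The three tables of 16 signed predictors that open a CYUV chunk. */
struct cyuv_tables {
    int8_t y[16];
    int8_t u[16];
    int8_t v[16];
};

/* Reinterpret a raw byte as a signed 8-bit value. */
static int8_t cyuv_s8(uint8_t b)
{
    return (int8_t)((int)(b ^ 0x80) - 0x80);
}

/* Hypothetical helper. Assumes chunk points to at least 48 bytes. */
static void cyuv_read_tables(const uint8_t *chunk, struct cyuv_tables *t)
{
    for (int i = 0; i < 16; i++) {
        t->y[i] = cyuv_s8(chunk[i]);        /* bytes 0-15  */
        t->u[i] = cyuv_s8(chunk[16 + i]);   /* bytes 16-31 */
        t->v[i] = cyuv_s8(chunk[32 + i]);   /* bytes 32-47 */
    }
}
```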

The format of each line is as follows:

  byte 0
    bits 7-4  initial U sample and predictor for line
    bits 3-0  initial Y sample and predictor for line
  byte 1
    bits 7-4  initial V sample and predictor for line
    bits 3-0  next Y predictor index
  byte 2
    bits 7-4  next Y predictor index
    bits 3-0  next Y predictor index
  bytes 3..   remaining predictor indices for line

The first 3 bytes contain the setup information for the line. Each initial 
sample (Y, U, and V) actually represents the top 4 bits of the initial 
8-bit sample. The initial sample also serves as the initial predictor. For 
each of the 3 Y predictor indices, use the 4-bit value to index into the 
table of 16 Y predictors, encoded at the start of the frame. Apply each 
predictor to the previous Y value.

At this point, the first group of 4 pixels has been decoded. Each group 
of 4 pixels remaining on the line is coded in 3 bytes:

  byte 0
    bits 7-4  next U predictor index
    bits 3-0  next Y predictor index
  byte 1
    bits 7-4  next V predictor index
    bits 3-0  next Y predictor index
  byte 2
    bits 7-4  next Y predictor index
    bits 3-0  next Y predictor index

For each predictor index, use the 4 bits to index into the appropriate 
predictor table and apply the predictor to the previous sample of the same 
type (Y, U, or V) and output the sample.
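
The per-group decoding described above can be sketched as one helper
that handles both the line setup bytes and the later groups. All names
are hypothetical. Two details are assumptions, since this document does
not state them: byte 2's low nibble is taken before its high nibble, and
sums wrap to 8 bits.

```c
#include <stdint.h>

/* Previous sample of each type, carried along the line. */
struct cyuv_state {
    uint8_t y, u, v;
};

/* Hypothetical helper: decode one 3-byte group into 4 Y samples plus
 * one U and one V sample. Pass first != 0 for the line's setup bytes.
 * Assumptions: low nibble of byte 2 is used before the high nibble,
 * and each sum is truncated to 8 bits. */
static void cyuv_decode_group(const uint8_t b[3], int first,
                              const int8_t y_tab[16],
                              const int8_t u_tab[16],
                              const int8_t v_tab[16],
                              struct cyuv_state *s,
                              uint8_t y_out[4],
                              uint8_t *u_out, uint8_t *v_out)
{
    if (first) {
        /* The nibbles are the top 4 bits of the initial samples. */
        s->u = b[0] & 0xF0;
        s->y = (uint8_t)((b[0] & 0x0F) << 4);
        s->v = b[1] & 0xF0;
    } else {
        s->u = (uint8_t)(s->u + u_tab[b[0] >> 4]);
        s->y = (uint8_t)(s->y + y_tab[b[0] & 0x0F]);
        s->v = (uint8_t)(s->v + v_tab[b[1] >> 4]);
    }
    y_out[0] = s->y;
    *u_out = s->u;
    *v_out = s->v;

    /* The remaining three Y samples are always predicted. */
    s->y = (uint8_t)(s->y + y_tab[b[1] & 0x0F]);
    y_out[1] = s->y;
    s->y = (uint8_t)(s->y + y_tab[b[2] & 0x0F]);
    y_out[2] = s->y;
    s->y = (uint8_t)(s->y + y_tab[b[2] >> 4]);
    y_out[3] = s->y;
}
```

A line decoder would call this once with first set for the setup bytes,
then once per remaining group of 4 pixels with first cleared, reusing
the same state.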


Miro/Pinnacle Video XL (VIXL/PIXL)
----------------------------------
The Miro Video XL codec, identified by the fourcc VIXL, uses
differential coding on a reduced-precision YUV 4:1:1 colorspace image.
Each Y, U, or V component is only 7 bits (where 8 is more typical). Each
group of 32 bits in the bitstream represents 6 5-bit delta table indices
(with 2 unused bits). There is one index for each of the next 4 Y
samples on the line and one index for each of the color samples.

The Pinnacle Video XL codec, identified by the fourcc PIXL, is
apparently the same algorithm as the Miro codec except that the frames
are 8 bytes longer. However, the same decoding process applies.

For each block of 4 pixels on a line, fetch the next 32 bits as a
little-endian number and then swap the 16-bit words to achieve the
correct bit orientation for decoding. To illustrate more clearly, this is
the arrangement of the next 4 8-bit bytes (A, B, C, and D) on disk:

  aaaaaaaa bbbbbbbb cccccccc dddddddd

Load the 4 bytes into a program variable so that the bytes are in this
order:

  dddddddd cccccccc bbbbbbbb aaaaaaaa

Then, swap the upper and lower 16-bit words to achieve this order:

 31                                 0
  bbbbbbbb aaaaaaaa dddddddd cccccccc
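
The load and swap just described can be sketched as a small helper. The
name xl_load_dword is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: load 4 bytes as a little-endian number, then
 * swap the upper and lower 16-bit words. */
static uint32_t xl_load_dword(const uint8_t *p)
{
    uint32_t le = (uint32_t)p[0]
                | ((uint32_t)p[1] << 8)
                | ((uint32_t)p[2] << 16)
                | ((uint32_t)p[3] << 24);   /* dddddddd ... aaaaaaaa */
    return (le >> 16) | (le << 16);         /* bbbbbbbb aaaaaaaa dddddddd cccccccc */
}
```

For the on-disk bytes AA BB CC DD, this yields 0xBBAADDCC.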

Further, the 32-bit blocks are stored in reverse order. So, for example,
if an image is 16 pixels wide, it would have 4 pixel groups per line.
Each pixel group would be represented by a 32-bit doubleword, swapped
and mangled as described previously. The doublewords would be stored in
the bytestream as:

  D3 D2 D1 D0
  
D0 represents the first 4 pixels on the line and D3 represents the final
4 pixels on the line. Thus, to decode a line from left to right, a
decoder must jump forward to the end of that line's data in the
bytestream and work backwards through it, then jump forward again in 
the bytestream when decoding the next line.

The 32 bits of the doubleword represent the following values:

  bit 31:     unused
  bits 30-26: V delta index
  bits 25-21: U delta index
  bits 20-16: Y3 delta index
  bit 15:     unused
  bits 14-10: Y2 delta index
  bits 9-5:   Y1 delta index
  bits 4-0:   Y0 delta index
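
The bit layout above can be sketched as a field extraction on the
already swapped doubleword. The struct and function names are
hypothetical.

```c
#include <stdint.h>

/* The six 5-bit delta table indices held in one doubleword. */
struct xl_indices {
    uint8_t y[4];   /* Y0..Y3 delta indices */
    uint8_t u;      /* U delta index */
    uint8_t v;      /* V delta index */
};

/* Hypothetical helper: dword must already be loaded and word-swapped. */
static struct xl_indices xl_unpack(uint32_t dword)
{
    struct xl_indices ix;
    ix.y[0] =  dword        & 0x1F;   /* bits 4-0   */
    ix.y[1] = (dword >>  5) & 0x1F;   /* bits 9-5   */
    ix.y[2] = (dword >> 10) & 0x1F;   /* bits 14-10 */
    ix.y[3] = (dword >> 16) & 0x1F;   /* bits 20-16, skipping bit 15 */
    ix.u    = (dword >> 21) & 0x1F;   /* bits 25-21 */
    ix.v    = (dword >> 26) & 0x1F;   /* bits 30-26, bit 31 unused */
    return ix;
}
```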

Each delta index value is used to index into this table and the
referenced value is added to the previous element on the same plane,
either Y, U, or V:

const int xl_delta_table[32] = {
   0,   1,   2,   3,   4,   5,   6,   7,
   8,   9,  12,  15,  20,  25,  34,  46,
  64,  82,  94, 103, 108, 113, 116, 119,
 120, 121, 122, 123, 124, 125, 126, 127
};

Remember that the YUV components only have 7 bits of precision, so sums
wrap modulo 128. Thus, the second half of the table values all count as
negative values; for example, adding 127 has the same effect as
subtracting 1.

At the beginning of a line, the Y0, U, and V delta indices actually
represent the top 5 bits of the absolute 7-bit component value.

The final, concise decoding algorithm operates as follows:

  foreach line in image
    foreach 32-bit doubleword, working from right -> left in bytestream
      load doubleword as little-endian number, swap 16-bit words
      if this is the first pixel group in line
        next Y value = (Y0 delta index) << 2
        next U value = (U delta index) << 2
        next V value = (V delta index) << 2
      else
        next Y value = last Y value + xl_delta_table[Y0 delta index]
        next U value = last U value + xl_delta_table[U delta index]
        next V value = last V value + xl_delta_table[V delta index]
      next Y value = last Y value + xl_delta_table[Y1 delta index]
      next Y value = last Y value + xl_delta_table[Y2 delta index]
      next Y value = last Y value + xl_delta_table[Y3 delta index]

Since the components only have 7 bits of meaningful precision, it will
likely be necessary to shift each of the components left by one more bit
to achieve 8 bits of output precision.
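
As a minimal sketch of the algorithm above, one line could be decoded as
follows. The function name and parameters are hypothetical. The sketch
assumes the width is a multiple of 4, that src points at the first byte
of the line's data in the bytestream, and that sums are kept to 7 bits
by masking. Outputs are shifted left by one bit to give 8-bit samples.

```c
#include <stdint.h>

static const int xl_delta_table[32] = {
      0,   1,   2,   3,   4,   5,   6,   7,
      8,   9,  12,  15,  20,  25,  34,  46,
     64,  82,  94, 103, 108, 113, 116, 119,
    120, 121, 122, 123, 124, 125, 126, 127
};

/* Hypothetical helper: decode one line of width pixels.
 * y_out receives width samples; u_out and v_out receive width / 4. */
static void xl_decode_line(const uint8_t *src, int width,
                           uint8_t *y_out, uint8_t *u_out, uint8_t *v_out)
{
    int groups = width / 4;
    unsigned y = 0, u = 0, v = 0;

    for (int g = 0; g < groups; g++) {
        /* Doublewords are stored in reverse order within the line. */
        const uint8_t *p = src + 4 * (groups - 1 - g);
        uint32_t le = (uint32_t)p[0] | ((uint32_t)p[1] << 8)
                    | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
        uint32_t d = (le >> 16) | (le << 16);   /* swap 16-bit words */

        unsigned iy0 = d & 0x1F;
        unsigned iu  = (d >> 21) & 0x1F;
        unsigned iv  = (d >> 26) & 0x1F;

        if (g == 0) {
            /* Indices hold the top 5 bits of the absolute values. */
            y = iy0 << 2;
            u = iu << 2;
            v = iv << 2;
        } else {
            y = (y + xl_delta_table[iy0]) & 0x7F;
            u = (u + xl_delta_table[iu]) & 0x7F;
            v = (v + xl_delta_table[iv]) & 0x7F;
        }
        y_out[4 * g] = (uint8_t)(y << 1);

        y = (y + xl_delta_table[(d >> 5) & 0x1F]) & 0x7F;
        y_out[4 * g + 1] = (uint8_t)(y << 1);
        y = (y + xl_delta_table[(d >> 10) & 0x1F]) & 0x7F;
        y_out[4 * g + 2] = (uint8_t)(y << 1);
        y = (y + xl_delta_table[(d >> 16) & 0x1F]) & 0x7F;
        y_out[4 * g + 3] = (uint8_t)(y << 1);

        u_out[g] = (uint8_t)(u << 1);
        v_out[g] = (uint8_t)(v << 1);
    }
}
```

Under these assumptions each line occupies width bytes, so a frame
decoder would advance src by width after every call.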


References
----------
Multimedia Technology Basics (with an introduction to YUV)
http://www.multimedia.cx/mmbasics.txt

ffmpeg project
http://ffmpeg.sourceforge.net/

Creative YUV Format
http://www.csse.monash.edu.au/~timf/videocodec/cyuv.txt


ChangeLog
---------
v1.1: December 3, 2004
- Miro/Pinnacle Video XL (VIXL/PIXL)

v1.0: September 23, 2004
- initial release


GNU Free Documentation License
------------------------------
see http://www.gnu.org/licenses/fdl.html