Id Software's .RoQ Video File Format
                   ------------------------------------

                        Dr. Tim Ferguson, 2001.

Some time ago, Id Software began a popular series of first person
perspective games called Quake.  The third release in the series
significantly improved on the video coding technique used by its
predecessor.  This video coder has also been used in several earlier
games as well as the recent Return to Castle Wolfenstein.

In the RoQ video file format, audio samples are DPCM coded and the
video frames are coded using motion blocks and vector quantisation.
What follows is a brief description of the RoQ video file format.

Note: All information on the RoQ file format has been obtained without
   decompiling the Quake III Arena game or the RoQ encoder.  Information
   was obtained by giving known input audio samples and video frames to
   the RoQ encoder and analysing the resulting text output from the
   encoder and the RoQ file it generated.

All multi-byte values used in the RoQ file format are stored least
significant byte (LSB) first (ie: little-endian, or Intel order).  It is
interesting to note that the structure and some coding methods used in RoQ
are not too different from those of the Cinepak (CVID) video codec present
in AVI files.

The start of an RoQ file begins with the following 8 bytes:
   0x84 0x10 0xff 0xff 0xff 0xff 0x1e 0x00
What these bytes mean, I am not exactly sure.  As far as I can tell, they
can always be used to identify an RoQ video stream.  These eight bytes
may follow the chunk syntax described below (with ID = 0x1084,
Size = 0xffffffff = -1, and Argument = 0x001e).

Following the eight byte header, an RoQ file is made up of multiple chunks.
Each chunk contains the following header at its start:

           words                 Field Name                    Type
         +---------------+
 0 - 1   |               |       Chunk ID                      Unsigned
         +---------------+
 2 - 3   |               |       Chunk Size                    Unsigned
         +-             -+
 4 - 5   |               |
         +---------------+
 6 - 7   |               |       Chunk Argument                Unsigned
         +---------------+

Chunk ID - Identification of the type of data coded in this chunk.
     0x1001 - Video information (RoQ_INFO)
     0x1002 - Quad codebook definition (RoQ_QUAD_CODEBOOK)
     0x1011 - Quad vector quantised video frame (RoQ_QUAD_VQ)
     0x1020 - Mono sound samples (RoQ_SOUND_MONO)
     0x1021 - Stereo sound samples (RoQ_SOUND_STEREO)
Chunk Size - The number of bytes which follow this header (ie: not including
   the header) that contain data related to this chunk.
Chunk Argument - A word argument for this chunk.  The definition of the value
   depends on the type of chunk.
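
The header layout above can be read with a few lines of Python.  This is
only a minimal sketch: the function name parse_chunk_header is my own and
is not taken from any RoQ tool.  It is applied here to the eight byte file
signature, which appears to follow the same syntax.

```python
import struct

def parse_chunk_header(data, offset=0):
    """Return (chunk_id, chunk_size, chunk_arg) from the 8 header bytes at offset."""
    # "<HIH" = little-endian: 16-bit ID, 32-bit size, 16-bit argument.
    return struct.unpack_from("<HIH", data, offset)

# The eight bytes found at the start of an RoQ file.
signature = bytes([0x84, 0x10, 0xFF, 0xFF, 0xFF, 0xFF, 0x1E, 0x00])
chunk_id, chunk_size, chunk_arg = parse_chunk_header(signature)
print(hex(chunk_id), hex(chunk_size), hex(chunk_arg))
```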

A typical RoQ file with audio and video contains the following chunks:

     +-----------------------+
     | RoQ_SOUND_STEREO/MONO |
     +-----------------------+
     | RoQ_INFO              |
     +-----------------------+
     | RoQ_QUAD_CODEBOOK     |
     +-----------------------+
     | RoQ_QUAD_VQ (frame 1) |
     +-----------------------+
     | RoQ_SOUND_STEREO/MONO |
     +-----------------------+
     | RoQ_QUAD_CODEBOOK     |
     +-----------------------+
     | RoQ_QUAD_VQ (frame 2) |
     +-----------------------+
     |    .      .      .    |
          .      .      .     
     |    .      .      .    |
     +-----------------------+
     | RoQ_SOUND_STEREO/MONO |
     +-----------------------+
     | RoQ_QUAD_CODEBOOK     |
     +-----------------------+
     | RoQ_QUAD_VQ (frame n) |
     +-----------------------+

An RoQ file containing video alone will look the same minus the
RoQ_SOUND chunks.  Each of these chunks is described in more detail
in the following sections.
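
Walking the chunk sequence shown above might look like the following
sketch.  The names iter_chunks and CHUNK_NAMES are hypothetical, and the
demo stream is hand-built for illustration rather than taken from a real
RoQ file.

```python
import struct

CHUNK_NAMES = {
    0x1001: "RoQ_INFO",
    0x1002: "RoQ_QUAD_CODEBOOK",
    0x1011: "RoQ_QUAD_VQ",
    0x1020: "RoQ_SOUND_MONO",
    0x1021: "RoQ_SOUND_STEREO",
}

def iter_chunks(data):
    """Yield (chunk_id, chunk_arg, payload) for each chunk after the signature."""
    offset = 8  # skip the eight byte file signature
    while offset + 8 <= len(data):
        chunk_id, size, arg = struct.unpack_from("<HIH", data, offset)
        offset += 8
        yield chunk_id, arg, data[offset:offset + size]
        offset += size  # chunk size does not include the header itself

# Hand-built demo stream: signature, an RoQ_INFO chunk, a 2-byte mono chunk.
demo = bytes([0x84, 0x10, 0xFF, 0xFF, 0xFF, 0xFF, 0x1E, 0x00])
demo += struct.pack("<HIH", 0x1001, 8, 0) + struct.pack("<HHHH", 320, 240, 8, 4)
demo += struct.pack("<HIH", 0x1020, 2, 0) + bytes([1, 2])

names = [CHUNK_NAMES.get(cid, hex(cid)) for cid, _, _ in iter_chunks(demo)]
print(names)
```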


 --------------------
  0x1001 - RoQ_INFO
 --------------------

The RoQ_INFO chunk contains pixel width and height information for the
video sequence.  The chunk argument is not used here and is always zero.
The format of this chunk is as follows:

           words                 Field Name                    Type
         +---------------+
 0 - 1   |               |       Video Width                   Unsigned
         +---------------+
 2 - 3   |               |       Video Height                  Unsigned
         +---------------+
 4 - 5   |               |       Unused Word 1                 Unsigned
         +---------------+
 6 - 7   |               |       Unused Word 2                 Unsigned
         +---------------+

Video Width - The pixel width of each video frame.
Video Height - The pixel height of each video frame.
Unused Word 1 - Always 8.  (Probably the block dimension?)
Unused Word 2 - Always 4.  (Probably the sub-block dimension?)
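
As a small illustration, the four fields of this chunk can be unpacked as
below.  The function name parse_info and the dictionary keys are my own
choices, and the example payload is made up.

```python
import struct

def parse_info(payload):
    """Unpack the RoQ_INFO payload: four little-endian 16-bit values."""
    width, height, word1, word2 = struct.unpack_from("<HHHH", payload, 0)
    return {"width": width, "height": height,
            "unused1": word1, "unused2": word2}

# Made-up payload describing a 320x240 video.
info = parse_info(struct.pack("<HHHH", 320, 240, 8, 4))
print(info["width"], info["height"])
```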


 -----------------------------
  0x1002 - RoQ_QUAD_CODEBOOK
 -----------------------------

This chunk defines two vector codebooks which are used to encode video
blocks in the RoQ_QUAD_VQ chunk.  The first part of the chunk defines a
2x2 pixel vector codebook.  The second part of the chunk defines a 4x4
pixel vector codebook by using four indexes into the 2x2 pixel vector
codebook.  The chunk argument specifies the number of vector cells which
make up each codebook.  The upper byte of the chunk argument specifies the
number of 2x2 pixel vector cells, and the lower byte of the chunk argument
specifies the number of 4x4 pixel vector cells.  A value of zero for the
2x2 pixel codebook size indicates there are 256 entries in this codebook.
If there are enough bytes left in the chunk, then a value of zero for the
4x4 pixel codebook size also indicates there are 256 entries.
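
The codebook sizes can be derived from the chunk argument roughly as
follows.  This is a sketch under one assumption of mine: "enough bytes
left" is read literally as room for 256 four-byte cells after the 2x2
codebook.  Other decoders may apply a looser test.  The function name
codebook_counts is hypothetical.

```python
def codebook_counts(chunk_arg, chunk_size):
    """Return (number of 2x2 cells, number of 4x4 cells) for a codebook chunk."""
    n2x2 = chunk_arg >> 8          # upper byte
    if n2x2 == 0:
        n2x2 = 256                 # zero means 256 entries
    n4x4 = chunk_arg & 0xFF        # lower byte
    # Assumption: zero means 256 only if 256 * 4 bytes remain after the
    # 2x2 codebook (six bytes per 2x2 cell).
    if n4x4 == 0 and chunk_size - n2x2 * 6 >= 256 * 4:
        n4x4 = 256
    return n2x2, n4x4

print(codebook_counts(0x0000, 256 * 6 + 256 * 4))
print(codebook_counts(0x1008, 16 * 6 + 8 * 4))
```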

Video frames in an RoQ file are coded using the YCbCr 4:2:0 colour space.
That is, the RGB components are transformed to YCbCr components and each
of the Cb and Cr components is subsampled to a quarter of its size.
Conversion from RoQ's YCbCr back to RGB can be achieved with the following
standard matrix multiplication:

     | R |   | 1.00000   0.00000   1.40200 | | Y  |
     | G | = | 1.00000  -0.34414  -0.71414 | | Cb |
     | B |   | 1.00000   1.77200   0.00000 | | Cr |
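
The matrix above can be applied per pixel as in the sketch below.  Two
things here are assumptions of mine and are not stated in this document:
that Cb and Cr are stored as unsigned bytes centred on 128 (as in JPEG),
and that results are rounded and clamped to the 0 - 255 range.  The
function name ycbcr_to_rgb is hypothetical.

```python
def ycbcr_to_rgb(y, cb, cr):
    """Convert one YCbCr pixel to an (R, G, B) tuple of 8-bit values."""
    # Assumption: chrominance bytes carry an offset of 128.
    cb -= 128
    cr -= 128
    r = y + 1.40200 * cr
    g = y - 0.34414 * cb - 0.71414 * cr
    b = y + 1.77200 * cb
    # Assumption: round and clamp each component to 0..255.
    return tuple(max(0, min(255, int(round(c)))) for c in (r, g, b))

print(ycbcr_to_rgb(128, 128, 128))  # neutral chrominance gives a grey pixel
```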

A 2x2 pixel vector codebook cell consists of six bytes.  The first four
bytes are the luminance or Y values for each of the four pixels.  The
remaining two bytes are the subsampled chrominance values, or Cb and Cr
respectively:

     +----+----+  +----+  +----+
     | Y0 | Y1 |  | Cb |  | Cr |
     +----+----+  +----+  +----+
     | Y2 | Y3 |
     +----+----+

Therefore the 2x2 pixel vector codebook is represented at the start of this
chunk as follows:

           +----+----+----+----+----+----+
   cell 0: | Y0 | Y1 | Y2 | Y3 | Cb | Cr |  bytes 0 - 5
           +----+----+----+----+----+----+
   cell 1: | Y0 | Y1 | Y2 | Y3 | Cb | Cr |  bytes 6 - 11
           +----+----+----+----+----+----+
   cell 2: | Y0 | Y1 | Y2 | Y3 | Cb | Cr |  bytes 12 - 17
           +----+----+----+----+----+----+
           |    .    .    .    .    .    |
                .    .    .    .    .
           |    .    .    .    .    .    |
           +----+----+----+----+----+----+
   cell n: | Y0 | Y1 | Y2 | Y3 | Cb | Cr |  bytes 6*n - 6*n+5
           +----+----+----+----+----+----+

where n is the number of 2x2 codebook entries defined in the chunk argument
minus one.

Immediately following the 2x2 pixel codebook is the 4x4 pixel codebook.
The 4x4 pixel codebook contains four bytes for each codebook vector cell.
Each byte is the index of a cell in the 2x2 pixel codebook; we label the
four indexed cells V1, V2, V3, and V4.  Therefore, using the four 2x2 pixel
blocks, a single 4x4 pixel block is defined as follows:

     +------+------+------+------+   +------+------+   +------+------+
     | V1Y0 | V1Y1 | V2Y0 | V2Y1 |   | V1Cb | V2Cb |   | V1Cr | V2Cr |
     +------+------+------+------+   +------+------+   +------+------+
     | V1Y2 | V1Y3 | V2Y2 | V2Y3 |   | V3Cb | V4Cb |   | V3Cr | V4Cr |
     +------+------+------+------+   +------+------+   +------+------+
     | V3Y0 | V3Y1 | V4Y0 | V4Y1 |
     +------+------+------+------+
     | V3Y2 | V3Y3 | V4Y2 | V4Y3 |
     +------+------+------+------+
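
The luminance arrangement in the diagram above can be written out as a
short sketch.  The function name assemble_4x4_luma and the tuple layout
(Y0, Y1, Y2, Y3, Cb, Cr) for a 2x2 cell are my own conventions for
illustration.

```python
def assemble_4x4_luma(v1, v2, v3, v4):
    """Build the 4x4 luminance block from four 2x2 cells.

    Each cell is a sequence (Y0, Y1, Y2, Y3, Cb, Cr).
    """
    return [
        [v1[0], v1[1], v2[0], v2[1]],
        [v1[2], v1[3], v2[2], v2[3]],
        [v3[0], v3[1], v4[0], v4[1]],
        [v3[2], v3[3], v4[2], v4[3]],
    ]

# Made-up cells with distinct luminance values so the layout is visible.
block = assemble_4x4_luma((1, 2, 3, 4, 0, 0), (5, 6, 7, 8, 0, 0),
                          (9, 10, 11, 12, 0, 0), (13, 14, 15, 16, 0, 0))
for row in block:
    print(row)
```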

The 4x4 pixel codebook is represented as the second part of this chunk as
follows:

           +----+----+----+----+
   cell 0: | V1 | V2 | V3 | V4 |  bytes 0 - 3
           +----+----+----+----+
   cell 1: | V1 | V2 | V3 | V4 |  bytes 4 - 7
           +----+----+----+----+
   cell 2: | V1 | V2 | V3 | V4 |  bytes 8 - 11
           +----+----+----+----+
           |    .    .    .    |
                .    .    .
           |    .    .    .    |
           +----+----+----+----+
   cell n: | V1 | V2 | V3 | V4 |  bytes 4*n - 4*n+3
           +----+----+----+----+

where n is the number of 4x4 codebook entries defined in the chunk argument
minus one.
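
Splitting the chunk payload into the two codebooks could be sketched as
follows.  The function name parse_codebooks is hypothetical, the entry
counts are taken as already derived from the chunk argument, and the
payload is made up for illustration.

```python
def parse_codebooks(payload, n2x2, n4x4):
    """Split a RoQ_QUAD_CODEBOOK payload into 2x2 and 4x4 cell lists."""
    # 2x2 cells: six bytes each (Y0, Y1, Y2, Y3, Cb, Cr).
    cells_2x2 = [tuple(payload[i * 6:i * 6 + 6]) for i in range(n2x2)]
    # 4x4 cells follow immediately: four 2x2 indexes each (V1, V2, V3, V4).
    base = n2x2 * 6
    cells_4x4 = [tuple(payload[base + i * 4:base + i * 4 + 4])
                 for i in range(n4x4)]
    return cells_2x2, cells_4x4

# Made-up payload: two 2x2 cells followed by one 4x4 cell.
payload = bytes(range(12)) + bytes([0, 1, 1, 0])
cells_2x2, cells_4x4 = parse_codebooks(payload, 2, 1)
print(cells_2x2, cells_4x4)
```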


 -----------------------
  0x1011 - RoQ_QUAD_VQ
 -----------------------

The RoQ video format implements a form of quadtree style vector quantisation.
At the top level, 16x16 pixel macro blocks are encoded.  These macro blocks
are further divided into 8x8 pixel blocks, and these blocks may either be
coded or further divided into 4x4 pixel sub blocks:

      +-----+-----++-----------+
      |     |     ||           |
      |     |     ||           |
      |     |     ||           |
      +-----+--+--++           |
      |     |  |  ||           |
      |     +--+--+|           |
      |     |  |  ||           |
      +=====+==+==++===========+
      |           ||           |
      |           ||           |
      |           ||           |
      |           ||           |
      |           ||           |
      |           ||           |
      |           ||           |
      +-----------++-----------+

All blocks are coded in left to right, top to bottom order (top-left,
top-right, bottom-left and bottom-right).  An 8x8 pixel block may be
coded by simply skipping it, using a codebook vector defined in the
RoQ_QUAD_CODEBOOK, or coded using a motion block.  When using a motion
block, a motion vector specifies the location of the block to be used.
The RoQ_QUAD_VQ chunk argument specifies the mean of the x and y
motion vectors, which we refer to as Mx and My respectively.  The upper
byte of the chunk argument is Mx, and the lower byte is My.

The RoQ video display is managed using the old video game/graphics technique
of double buffering.  In double buffering, two video buffers are used such
that while one is being displayed, the other is being updated or rendered
with new information.  When the rendering is complete, the two buffers are
swapped.  That is, the newly rendered buffer becomes the display buffer,
and the old display buffer now becomes the buffer to be rendered to.  From
my experiments, the frame rate for video appears to be 30 frames per second.
That is, the buffers should swap 30 times per second.

A coded frame consists of a series of coding type words, each followed by
its coding arguments.  The coding arguments combine with the coding type
to specify how the block is to be represented:


      7 6 5 4 3 2 1 0        Field Name                    Type
     +---------------+
  0  |               |       Coding Type Word              Unsigned
     +-             -+
  1  |               |
     +---------------+
  2  |       .       |       Coding Arguments              Unsigned
             .
  n  |       .       |
     +---------------+
 n+1 |               |       Coding Type Word              Unsigned
     +-             -+
 n+2 |               |
     +---------------+
 n+3 |       .       |       Coding Arguments              Unsigned
             .
     |       .       |

Each 16-bit coding type word is constructed with eight sets of 2-bit
coding types.  These 2-bit values enable eight 8x8 or 4x4 pixel blocks
to be coded.  A 2-bit value enables an 8x8 pixel block to be coded in
one of four ways:

   0 - (MOT) Skip over the block and leave the video data defined by the
     previous encoding of this video buffer unchanged.  This type does
     not use any coding arguments.

   1 - (FCC) Code using a block taken from the alternate video buffer
     (the display buffer) at a specified x and y offset.  This coding
     type includes a one byte coding argument.  The upper four bits
     specify the x deviation from the mean motion (Mx) and the lower
     four bits the y deviation from the mean motion (My).  Therefore,
     a block at location X,Y is copied from the alternate buffer at
     location Dx, Dy where:
            Dx = X + 8 - (argument >> 4) - Mx
            Dy = Y + 8 - (argument & 0x0F) - My

   2 - (SLD) Code using quad vector quantisation.  This coding type
     includes a one byte coding argument to specify which of the 4x4 quad
     codebook vectors defined in the previous RoQ_QUAD_CODEBOOK is to be
     used to represent the 8x8 pixel block.  The 4x4 pixel block is simply
     upsampled (doubled in size) to 8x8 pixels.

   3 - (CCC) The 8x8 pixel block is divided into four 4x4 pixel quadrants.
     That is, quadtree partition.  The coding of each of these sub blocks
     then proceeds immediately in the left-to-right, top-to-bottom order.
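
Extracting the eight 2-bit coding types from a coding type word might be
sketched as below.  Note the assumption: this document does not say in
which order the eight types are packed, so the sketch assumes the first
block's type sits in the two most significant bits.  The word itself is
read from the file as a little-endian 16-bit value.  The function name
unpack_coding_types is hypothetical.

```python
MOT, FCC, SLD, CCC = 0, 1, 2, 3

def unpack_coding_types(word):
    """Return the eight 2-bit coding types held in a 16-bit coding type word."""
    # Assumption: types are ordered from the most significant bit pair down.
    return [(word >> shift) & 0x3 for shift in range(14, -2, -2)]

types = unpack_coding_types(0x1B00)  # bit pairs: 00 01 10 11 00 00 00 00
print(types)
```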

Coding of the 4x4 pixel sub blocks which result from the split using a
CCC coding type proceeds in a similar fashion to the 8x8 pixel blocks.
As before, the MOT blocks are skipped, the FCC blocks are coded using a
motion block from the alternate video buffer and the SLD blocks are coded
using a single quad codebook vector (in this case upsampling is not required).
The difference is in the coding of the CCC coding type.
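
The source location for an FCC motion block, as given by the Dx and Dy
formulas for coding type 1, can be computed as in this sketch.  One
assumption is mine: Mx and My are treated as signed 8-bit values, since a
mean motion can presumably be negative, although this document only gives
their byte positions in the chunk argument.  The function name fcc_source
is hypothetical.

```python
def to_signed8(value):
    """Interpret an 8-bit value as two's complement."""
    return value - 256 if value >= 128 else value

def fcc_source(x, y, argument, chunk_arg):
    """Return (Dx, Dy), the location in the alternate buffer to copy from."""
    # Assumption: Mx (upper byte) and My (lower byte) are signed.
    mx = to_signed8(chunk_arg >> 8)
    my = to_signed8(chunk_arg & 0xFF)
    dx = x + 8 - (argument >> 4) - mx
    dy = y + 8 - (argument & 0x0F) - my
    return dx, dy

print(fcc_source(16, 32, 0x88, 0x0000))  # argument 0x88 means no deviation
```

An argument of 0x88 with a zero mean motion copies the block from the same
location in the alternate buffer, since both nibbles cancel the + 8 term.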

The CCC coding type for a 4x4 pixel block may be considered as
another sub-division in the quadtree representation of the macro block
down to four 2x2 pixel blocks.  Each of these 2x2 pixel blocks is then
vector quantised.  Therefore, in the 4x4 pixel CCC coding type, four one
byte coding arguments are included.  Each of the four arguments specifies
a 2x2 pixel codebook vector defined by the previous RoQ_QUAD_CODEBOOK to
represent one quarter of the 4x4 pixel block.  This may also be thought of
as a quad vector being represented explicitly by the coding argument
rather than the codebook.


 --------------------------
  0x1020 - RoQ_SOUND_MONO
 --------------------------

Audio in an RoQ file uses 16 bits per sample at a 22050Hz sample rate.
These audio samples are coded into the RoQ chunk using DPCM.  Each sample
is reconstructed by adding a sample prediction (derived from past
information) to a sample prediction error which is read from the RoQ file:

   sample[n] = prediction[n] + prediction error[n]

In RoQ files (as in most DPCM coded files), the sample prediction is simply
the previously coded sample:

   prediction[n] = sample[n-1]

The initial prediction (prediction[0]) is given by the chunk argument,
which holds the initial 16-bit sample prediction.

Each byte in this chunk, v[n], represents the signed square root of the
prediction error for the corresponding sample.  The prediction error is
recovered from this value using the following:

   if(v[n] < 128) prediction error[n] = v[n] * v[n]
   else prediction error[n] = -((v[n] - 128) * (v[n] - 128))
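
Putting the prediction and prediction error rules together gives a decoder
along the lines of the sketch below.  Two points are assumptions of mine
and not stated in this document: the chunk argument is interpreted as a
signed 16-bit value, and each decoded sample is clamped to the signed
16-bit range.  The function names are hypothetical.

```python
def prediction_error(v):
    """Recover the prediction error from one coded byte."""
    if v < 128:
        return v * v
    return -((v - 128) * (v - 128))

def decode_mono(payload, chunk_arg):
    """Decode a RoQ_SOUND_MONO payload into a list of samples."""
    # Assumption: the initial prediction is a signed 16-bit value.
    prediction = chunk_arg - 0x10000 if chunk_arg >= 0x8000 else chunk_arg
    samples = []
    for v in payload:
        sample = prediction + prediction_error(v)
        # Assumption: clamp to the signed 16-bit sample range.
        sample = max(-32768, min(32767, sample))
        samples.append(sample)
        prediction = sample  # prediction[n] = sample[n-1]
    return samples

print(decode_mono(bytes([3, 130, 0]), 100))
```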


 ----------------------------
  0x1021 - RoQ_SOUND_STEREO
 ----------------------------

Stereo audio is encoded into an RoQ file in much the same way as mono audio
is encoded in the RoQ_SOUND_MONO chunk.  In this case, the chunk argument
contains the initial sample predictions for both stereo channels.
The upper 8 bits of the chunk argument contain the upper 8 bits of the
left channel prediction, and the lower 8 bits of the chunk argument contain
the upper 8 bits of the right channel prediction.

Decoding of this chunk is conducted in the same manner as RoQ_SOUND_MONO
except the left and right sample prediction errors are interleaved.  That
is, the first byte is the left channel sample prediction error, the second
byte is the right channel sample prediction error, and so forth.
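
A stereo decoder following this description might be sketched as below.
The assumptions are mine: the lower 8 bits of each initial prediction are
taken to be zero, predictions are signed 16-bit values, and samples are
clamped to the signed 16-bit range.  The function names are hypothetical.

```python
def prediction_error(v):
    """Recover the prediction error from one coded byte."""
    return v * v if v < 128 else -((v - 128) * (v - 128))

def to_signed16(value):
    """Interpret a 16-bit value as two's complement."""
    return value - 0x10000 if value >= 0x8000 else value

def decode_stereo(payload, chunk_arg):
    """Decode a RoQ_SOUND_STEREO payload into (left, right) sample lists."""
    # Assumption: the missing lower 8 bits of each prediction are zero.
    pred = [to_signed16(chunk_arg & 0xFF00),          # left channel
            to_signed16((chunk_arg & 0x00FF) << 8)]   # right channel
    channels = ([], [])
    for i, v in enumerate(payload):
        ch = i % 2  # bytes alternate: left, right, left, right, ...
        sample = max(-32768, min(32767, pred[ch] + prediction_error(v)))
        channels[ch].append(sample)
        pred[ch] = sample
    return channels

left, right = decode_stereo(bytes([1, 2, 129, 3]), 0x0100)
print(left, right)
```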


---------------------------------------------------------------------
This document was written by Dr. Tim Ferguson, 2001.

For more details on this and other codecs, including source code, visit:
    http://www.csse.monash.edu.au/~timf/videocodec.html 

To contact me, email: timf@csse.monash.edu.au
---------------------------------------------------------------------