Id Software's .RoQ Video File Format
------------------------------------
Dr. Tim Ferguson, 2001.

Some time ago, Id Software started a popular series of first person
perspective games called Quake. The third release of the Quake series
uses a significantly improved video coding technique compared to its
predecessor. This video coder has also been used in several other
earlier games as well as the recent Return to Castle Wolfenstein. In
the RoQ video file format, audio samples are DPCM coded and the video
frames are coded using motion blocks and vector quantisation. What
follows is a brief description of the RoQ video file format.

Note: All information on the RoQ file format has been obtained without
decompiling the Quake III Arena game or the RoQ encoder. Information
was obtained by giving known input audio samples and video frames to
the RoQ encoder and analysing the resulting text output from the
encoder and the RoQ file it generated.

All multi-byte values used in the RoQ file format are stored least
significant byte (LSB) first (ie: Intel order). It is interesting to
note that the structure and some coding methods used in RoQ are not too
different from the Cinepak (CVID) video codec present in AVI files.

An RoQ file begins with the following 8 bytes:

    0x84 0x10 0xff 0xff 0xff 0xff 0x1e 0x00

What these bytes mean, I am not exactly sure. As far as I can tell,
they can always be used to identify an RoQ video stream. These eight
bytes may follow the chunk syntax described below (with ID = 0x1084,
Size = 0xffffffff = -1, and Argument = 0x001e).

Following the eight byte header, an RoQ file is made up of multiple
chunks.
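
As a small illustration of the identification step, the eight signature
bytes listed above could be checked as sketched below. This is only a
sketch in Python, not code from any Id Software tool; the names
ROQ_SIGNATURE and looks_like_roq are hypothetical and were chosen for
this example.

```python
# The eight bytes that, as far as this document can tell, start every RoQ file.
ROQ_SIGNATURE = bytes([0x84, 0x10, 0xFF, 0xFF, 0xFF, 0xFF, 0x1E, 0x00])


def looks_like_roq(data: bytes) -> bool:
    """Return True if `data` starts with the 8-byte RoQ signature.

    Hypothetical helper: it only compares the leading bytes and does
    not validate any of the chunks that follow.
    """
    return data[:8] == ROQ_SIGNATURE
```

A caller would typically read the first eight bytes of a file opened in
binary mode and pass them to looks_like_roq before parsing any chunks.
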
Each chunk contains the following header at its start:

     bytes                       Field Name        Type
             +---------------+
     0 - 1   |               |   Chunk ID          Unsigned
             +---------------+
     2 - 3   |               |   Chunk Size        Unsigned
             +-             -+
     4 - 5   |               |
             +---------------+
     6 - 7   |               |   Chunk Argument    Unsigned
             +---------------+

  Chunk ID       - Identification of the type of data coded in this
                   chunk.
                     0x1001 - Video information (RoQ_INFO)
                     0x1002 - Quad codebook definition
                              (RoQ_QUAD_CODEBOOK)
                     0x1011 - Quad vector quantised video frame
                              (RoQ_QUAD_VQ)
                     0x1020 - Mono sound samples (RoQ_SOUND_MONO)
                     0x1021 - Stereo sound samples (RoQ_SOUND_STEREO)

  Chunk Size     - The number of bytes of chunk data which follow this
                   header (ie: not including the header itself).

  Chunk Argument - A word argument for this chunk. The definition of
                   the value depends on the type of chunk.

A typical RoQ file with audio and video contains the following chunks:

    +-----------------------+
    | RoQ_SOUND_STEREO/MONO |
    +-----------------------+
    | RoQ_INFO              |
    +-----------------------+
    | RoQ_QUAD_CODEBOOK     |
    +-----------------------+
    | RoQ_QUAD_VQ (frame 1) |
    +-----------------------+
    | RoQ_SOUND_STEREO/MONO |
    +-----------------------+
    | RoQ_QUAD_CODEBOOK     |
    +-----------------------+
    | RoQ_QUAD_VQ (frame 2) |
    +-----------------------+
    |     .     .     .     |
    |     .     .     .     |
    |     .     .     .     |
    +-----------------------+
    | RoQ_SOUND_STEREO/MONO |
    +-----------------------+
    | RoQ_QUAD_CODEBOOK     |
    +-----------------------+
    | RoQ_QUAD_VQ (frame n) |
    +-----------------------+

An RoQ file containing video alone will look the same minus the
RoQ_SOUND chunks. Each of these chunks is described in more detail in
the following sections.

--------------------
 0x1001 - RoQ_INFO
--------------------

The RoQ_INFO chunk contains pixel width and height information for the
video sequence. The chunk argument is not used here and is always zero.
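
Before looking at the RoQ_INFO payload, here is a minimal sketch of how
the generic 8-byte chunk header described earlier might be unpacked in
Python. It assumes only the layout given in this document (16-bit ID,
32-bit size, 16-bit argument, all LSB first). The names
parse_chunk_header and CHUNK_NAMES are hypothetical, and the example
bytes are a made-up RoQ_INFO header with an 8-byte payload and a zero
argument.

```python
import struct

# Chunk IDs listed in this document.
CHUNK_NAMES = {
    0x1001: "RoQ_INFO",
    0x1002: "RoQ_QUAD_CODEBOOK",
    0x1011: "RoQ_QUAD_VQ",
    0x1020: "RoQ_SOUND_MONO",
    0x1021: "RoQ_SOUND_STEREO",
}


def parse_chunk_header(header: bytes):
    """Unpack an 8-byte chunk header into (chunk_id, size, argument)."""
    if len(header) != 8:
        raise ValueError("a chunk header is exactly 8 bytes")
    # '<' selects little-endian with no padding;
    # H is a 16-bit unsigned value, I is a 32-bit unsigned value.
    return struct.unpack("<HIH", header)


# Made-up example: an RoQ_INFO header announcing 8 payload bytes.
chunk_id, size, argument = parse_chunk_header(
    bytes([0x01, 0x10, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00])
)
print(CHUNK_NAMES.get(chunk_id, "unknown"), size, argument)
```

The `size` value tells a reader how many payload bytes to consume (or
skip, for an unrecognised ID) before the next header begins.
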
The format of this chunk is as follows:

     bytes                       Field Name        Type
             +---------------+
     0 - 1   |               |   Video Width       Unsigned
             +---------------+
     2 - 3   |               |   Video Height      Unsigned
             +---------------+
     4 - 5   |               |   Unused Word 1     Unsigned
             +---------------+
     6 - 7   |               |   Unused Word 2     Unsigned
             +---------------+

  Video Width   - The pixel width of each video frame.
  Video Height  - The pixel height of each video frame.
  Unused Word 1 - Always 8. (Probably the block dimension?)
  Unused Word 2 - Always 4. (Probably the sub-block dimension?)

-----------------------------
 0x1002 - RoQ_QUAD_CODEBOOK
-----------------------------

This chunk defines two vector codebooks which are used to encode video
blocks in the RoQ_QUAD_VQ chunk. The first part of the chunk defines a
2x2 pixel vector codebook. The second part of the chunk defines a 4x4
pixel vector codebook by using four indexes into the 2x2 pixel vector
codebook.

The chunk argument specifies the number of vector cells which make up
each codebook. The upper byte of the chunk argument specifies the
number of 2x2 pixel vector cells, and the lower byte of the chunk
argument specifies the number of 4x4 pixel vector cells. A value of
zero for the 2x2 pixel codebook size indicates there are 256 entries in
this codebook. If there are enough bytes left in the chunk, then a
value of zero for the 4x4 pixel codebook size also indicates there are
256 entries.

Video frames in an RoQ file are coded using the YCbCr 4:2:0 colour
space. That is, the RGB components are transformed to YCbCr components
and each of the Cb and Cr components is subsampled to a quarter of its
size. Conversion from RoQ's YCbCr back to RGB can be achieved with the
following standard matrix multiplication:

    | R |   | 1.00000  0.00000  1.40200 | | Y  |
    | G | = | 1.00000 -0.34414 -0.71414 | | Cb |
    | B |   | 1.00000  1.77200  0.00000 | | Cr |

A 2x2 pixel vector codebook cell consists of six bytes. The first four
bytes are the luminance or Y values for each of the four pixels.
The remaining two bytes are the subsampled chrominance values, or Cb
and Cr respectively:

    +----+----+   +----+   +----+
    | Y0 | Y1 |   | Cb |   | Cr |
    +----+----+   +----+   +----+
    | Y2 | Y3 |
    +----+----+

Therefore the 2x2 pixel vector codebook is represented at the start of
this chunk as follows:

            +----+----+----+----+----+----+
    cell 0: | Y0 | Y1 | Y2 | Y3 | Cb | Cr |   bytes 0 - 5
            +----+----+----+----+----+----+
    cell 1: | Y0 | Y1 | Y2 | Y3 | Cb | Cr |   bytes 6 - 11
            +----+----+----+----+----+----+
    cell 2: | Y0 | Y1 | Y2 | Y3 | Cb | Cr |   bytes 12 - 17
            +----+----+----+----+----+----+
            |    .    .    .    .    .    |
            |    .    .    .    .    .    |
            |    .    .    .    .    .    |
            +----+----+----+----+----+----+
    cell n: | Y0 | Y1 | Y2 | Y3 | Cb | Cr |   bytes 6*n - 6*n+5
            +----+----+----+----+----+----+

where n is the number of 2x2 codebook entries defined in the chunk
argument minus one.

Immediately following the 2x2 pixel codebook is the 4x4 pixel codebook.
The 4x4 pixel codebook contains four bytes for each codebook vector
cell. Each byte represents a vector offset into the 2x2 pixel codebook
which we label V1, V2, V3, and V4. Therefore, using the four 2x2 pixel
blocks, a single 4x4 pixel block is defined as follows:

    +------+------+------+------+   +------+------+   +------+------+
    | V1Y0 | V1Y1 | V2Y0 | V2Y1 |   | V1Cb | V2Cb |   | V1Cr | V2Cr |
    +------+------+------+------+   +------+------+   +------+------+
    | V1Y2 | V1Y3 | V2Y2 | V2Y3 |   | V3Cb | V4Cb |   | V3Cr | V4Cr |
    +------+------+------+------+   +------+------+   +------+------+
    | V3Y0 | V3Y1 | V4Y0 | V4Y1 |
    +------+------+------+------+
    | V3Y2 | V3Y3 | V4Y2 | V4Y3 |
    +------+------+------+------+

The 4x4 pixel codebook is represented as the second part of this chunk
as follows:

            +----+----+----+----+
    cell 0: | V1 | V2 | V3 | V4 |   bytes 0 - 3
            +----+----+----+----+
    cell 1: | V1 | V2 | V3 | V4 |   bytes 4 - 7
            +----+----+----+----+
    cell 2: | V1 | V2 | V3 | V4 |   bytes 8 - 11
            +----+----+----+----+
            |    .    .    .    |
            |    .    .    .    |
            |    .    .    .    |
            +----+----+----+----+
    cell n: | V1 | V2 | V3 | V4 |   bytes 4*n - 4*n+3
            +----+----+----+----+

where n is the number of 4x4 codebook entries defined in the chunk
argument minus one.

-----------------------
 0x1011 - RoQ_QUAD_VQ
-----------------------

The RoQ video format implements a form of quadtree style vector
quantisation. At the top level, 16x16 pixel macro blocks are encoded.
These macro blocks are further divided into 8x8 pixel blocks, and these
blocks may either be coded or further divided into 4x4 pixel sub
blocks:

    +-----+-----++-----------+
    |     |     ||           |
    |     |     ||           |
    |     |     ||           |
    +-----+--+--++           |
    |     |  |  ||           |
    |     +--+--+|           |
    |     |  |  ||           |
    +=====+==+==++===========+
    |           ||           |
    |           ||           |
    |           ||           |
    |           ||           |
    |           ||           |
    |           ||           |
    |           ||           |
    +-----------++-----------+

All blocks are coded in left to right, top to bottom order (top-left,
top-right, bottom-left and bottom-right). An 8x8 pixel block may be
coded by simply skipping it, using a codebook vector defined in the
RoQ_QUAD_CODEBOOK, or coded using a motion block. When using a motion
block, a motion vector specifies the location of the block to be used.
The RoQ_QUAD_VQ chunk argument specifies the mean of the x and y motion
vectors, which we refer to as Mx and My respectively. The upper byte of
the chunk argument is Mx, and the lower byte is My.

The RoQ video display is managed using the old video game/graphics
technique of double buffering. In double buffering, two video buffers
are used such that while one is being displayed, the other is being
updated or rendered with new information. When the rendering is
complete, the two buffers are swapped. That is, the newly rendered
buffer becomes the display buffer, and the old display buffer now
becomes the buffer to be rendered to. From my experiments, the frame
rate for video appears to be 30 frames per second. That is, the buffers
should swap 30 times per second.

A coded frame consists of a series of coding type words, followed by
the coding arguments.
The coding arguments combine with the coding type to specify how the
block is to be represented:

          7 6 5 4 3 2 1 0       Field Name          Type
         +---------------+
    0    |               |      Coding Type Word    Unsigned
         +-             -+
    1    |               |
         +---------------+
    2    |       .       |      Coding Arguments    Unsigned
    .    |       .       |
    n    |       .       |
         +---------------+
   n+1   |               |      Coding Type Word    Unsigned
         +-             -+
   n+2   |               |
         +---------------+
   n+3   |       .       |      Coding Arguments    Unsigned
    .    |       .       |

Each 16-bit coding type word is constructed with eight sets of 2-bit
coding types. These 2-bit values enable eight 8x8 or 4x4 pixel blocks
to be coded. A 2-bit value enables an 8x8 pixel block to be coded in
one of four ways:

  0 - (MOT) Skip over the block and leave the video data defined by the
      previous encoding of this video buffer unchanged. This type does
      not use any coding arguments.

  1 - (FCC) Code using a block taken from the alternate video buffer
      (the display buffer) at a specified x and y offset. This coding
      type includes a one byte coding argument. The upper four bits
      specify the x deviation from the mean motion (Mx) and the lower
      four bits the y deviation from the mean motion (My). Therefore,
      a block at location X,Y is copied from the alternate buffer at
      location Dx,Dy where:

          Dx = X + 8 - (argument >> 4) - Mx
          Dy = Y + 8 - (argument & 0x0F) - My

  2 - (SLD) Code using quad vector quantisation. This coding type
      includes a one byte coding argument to specify which of the 4x4
      quad codebook vectors defined in the previous RoQ_QUAD_CODEBOOK
      is to be used to represent the 8x8 pixel block. The 4x4 pixel
      block is simply upsampled (doubled in size) to 8x8 pixels.

  3 - (CCC) The 8x8 pixel block is divided into four 4x4 pixel
      quadrants. That is, quadtree partition. The coding of each of
      these sub blocks then proceeds immediately in the left-to-right,
      top-to-bottom order.

Coding of the 4x4 pixel sub blocks which result from the split using a
CCC coding type proceeds in a similar fashion to the 8x8 pixel blocks.
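
To make the coding type word and the FCC offset arithmetic concrete,
here is a short Python sketch. The function names unpack_coding_types
and fcc_source are hypothetical. The motion formula is the one given
for the FCC type above. The order in which the eight 2-bit types are
taken from the word (most significant pair first) is an assumption of
this sketch, since the text does not state it, and Mx and My are simply
passed in as already extracted values.

```python
def unpack_coding_types(word: int):
    """Split a 16-bit coding type word into eight 2-bit coding types.

    ASSUMPTION: the most significant bit pair is used first. The text
    above does not specify the order.
    """
    return [(word >> shift) & 0x3 for shift in range(14, -2, -2)]


def fcc_source(x: int, y: int, argument: int, mx: int, my: int):
    """Return (Dx, Dy), the source location for an FCC motion block.

    x, y     : location of the block being coded
    argument : the one byte FCC coding argument
    mx, my   : mean motion taken from the RoQ_QUAD_VQ chunk argument
    """
    dx = x + 8 - (argument >> 4) - mx
    dy = y + 8 - (argument & 0x0F) - my
    return dx, dy


# 0xE400 is binary 11 10 01 00 00 00 00 00: CCC, SLD, FCC, then MOT.
print(unpack_coding_types(0xE400))
# An argument of 0x88 with zero mean motion points at the same location.
print(fcc_source(16, 16, 0x88, 0, 0))
```

With an argument of 0x88 both deviations equal 8, so they cancel the
constant 8 in the formula and the block is copied from its own position
in the alternate buffer.
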
As before, the MOT blocks are skipped, the FCC blocks are coded using a
motion block from the alternate video buffer and the SLD blocks are
coded using a single 4x4 quad codebook vector (in this case upsampling
is not required). The difference is in the coding of the CCC coding
type.

The CCC coding type for a 4x4 pixel block may be considered as another
sub-division in the quadtree representation of the macro block down to
four 2x2 pixel blocks. Each of these 2x2 pixel blocks is then vector
quantised. Therefore, in the 4x4 pixel CCC coding type, four one byte
coding arguments are included. Each of the four arguments specifies a
2x2 pixel codebook vector defined by the previous RoQ_QUAD_CODEBOOK to
represent one quarter of the 4x4 pixel block. This may also be thought
of as a quad vector being represented explicitly by the coding
arguments rather than the codebook.

--------------------------
 0x1020 - RoQ_SOUND_MONO
--------------------------

Audio in an RoQ file uses 16 bits per sample and a 22050Hz sample rate.
These audio samples are coded into the RoQ chunk using DPCM. Each
sample is coded using a sample prediction (derived from past
information), added to a sample prediction error which is read from the
RoQ file:

    sample[n] = prediction[n] + prediction error[n]

In RoQ files (as in most DPCM coded files), the sample prediction is
simply the previously coded sample:

    prediction[n] = sample[n-1]

The initial prediction (prediction[0]) is defined by the chunk's
argument, where the chunk argument is the initial 16-bit sample
prediction. Each byte in this chunk, v[n], represents the square-root
of the prediction error for the corresponding sample.
This value is decoded using the following:

    if(v[n] < 128)
        prediction error[n] = v[n] * v[n]
    else
        prediction error[n] = -((v[n] - 128) * (v[n] - 128))

----------------------------
 0x1021 - RoQ_SOUND_STEREO
----------------------------

Stereo audio is encoded into an RoQ file in much the same way as mono
audio is encoded in the RoQ_SOUND_MONO chunk. In this case, the chunk
argument contains two initial sample predictions, one for each of the
stereo channels. The upper 8 bits of the chunk argument contain the
upper 8 bits of the left channel prediction, and the lower 8 bits of
the chunk argument contain the upper 8 bits of the right channel
prediction.

Decoding of this chunk is conducted in the same manner as
RoQ_SOUND_MONO except the left and right sample prediction errors are
interleaved. That is, the first byte is the left channel sample
prediction error, the second byte is the right channel sample
prediction error, and so forth.

---------------------------------------------------------------------
This document was written by Dr. Tim Ferguson, 2001.
For more details on this and other codecs, including source code,
visit: http://www.csse.monash.edu.au/~timf/videocodec.html

To contact me, email: timf@csse.monash.edu.au
---------------------------------------------------------------------
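
Returning to the DPCM scheme from the RoQ_SOUND_MONO and
RoQ_SOUND_STEREO sections, the decoding rules can be sketched in Python
as shown below. This is an illustration under stated assumptions, not
the decoder used by any game. The function names are hypothetical. The
sketch assumes the 16-bit predictions are two's complement signed
values, and it does not clamp or wrap results to 16 bits, because the
text does not say how out-of-range values are handled.

```python
def dpcm_error(v: int) -> int:
    """Map one coded byte v (0..255) to a prediction error."""
    if v < 128:
        return v * v
    return -((v - 128) * (v - 128))


def to_signed16(value: int) -> int:
    """ASSUMPTION: predictions are 16-bit two's complement values."""
    return value - 0x10000 if value & 0x8000 else value


def decode_mono(argument: int, payload: bytes):
    """Decode an RoQ_SOUND_MONO payload into a list of samples."""
    sample = to_signed16(argument)        # prediction[0]
    samples = []
    for v in payload:
        sample += dpcm_error(v)           # sample[n] = sample[n-1] + error[n]
        samples.append(sample)
    return samples


def decode_stereo(argument: int, payload: bytes):
    """Decode an RoQ_SOUND_STEREO payload into (left, right) pairs."""
    left = to_signed16(argument & 0xFF00)            # upper 8 bits of left
    right = to_signed16((argument & 0x00FF) << 8)    # upper 8 bits of right
    pairs = []
    # Bytes are interleaved left, right; a trailing odd byte is ignored.
    for i in range(0, len(payload) - 1, 2):
        left += dpcm_error(payload[i])
        right += dpcm_error(payload[i + 1])
        pairs.append((left, right))
    return pairs


print(decode_mono(100, bytes([3, 130, 0])))
print(decode_stereo(0x0102, bytes([1, 129])))
```

In the mono example the coded bytes 3, 130 and 0 give errors of 9, -4
and 0, so a starting prediction of 100 yields 109, 105 and 105.
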