Multimedia Technology Basics: FourCCs, AVI Codecs, ASF Codecs, WAV Codecs,
MOV Codecs, RM Codecs, YUV Codecs, RGB Codecs, Lossy and Lossless Codecs,
and More

by Mike Melanson (mike at multimedia.cx)
v1.1: September 25, 2005

Copyright (c) 2003-2005 Mike Melanson

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2 or any
later version published by the Free Software Foundation; with no
Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy
of the license is included in the section entitled "GNU Free
Documentation License".

Contents
--------
* Introduction
* Codecs
* FourCCs
* Multimedia Files
* What Application Can "Play" This File?
* Interleaving
* RGB and YUV Colorspaces
* References
* Acknowledgements
* Changelog
* GNU Free Documentation License

Introduction
------------
This document is intended as a very brief overview of assorted technical
topics that will help a developer begin to understand computer multimedia
technology. There are many other references for the verbose theory
underlying some of the presented concepts, particularly YUV colorspaces;
this document is long on technical explanation and short on abstract
concepts.

I run a technical multimedia website. Occasionally, I browse my ISP's web
server logs, which record the search engine queries that brought visitors
to my site, because I am curious to know whether people are actually
finding what they are looking for. I often see records indicating that
visitors searched for "asf codec" or "mov codec" or "yuv codec". With any
luck, the search engines will index this document and point those
visitors to more useful information.

Codecs
------
"Codec" is an abbreviation of COder/DECoder. Briefly, it refers to any
algorithm that codes data into another form and later decodes the coded
data in order to recover the original data (more or less). In the context
of multimedia technology, this means taking raw audio or video data,
which tends to be enormous, and sending it through a coder algorithm to
compress it to a considerably smaller size. The compressed data is then
stored on disk, transmitted over a network, etc., until it is time to
play it back. At that point, the compressed data is sent through the
decoder portion of the codec, which reconstructs the audio or video data
for playback.

Actually, a large majority of multimedia codecs do not reconstruct the
original audio or video data exactly upon decompression. These codecs
fall into the category "lossy". Codecs that do reconstruct the original
data exactly upon decompression are categorized as "lossless".

Why would it be acceptable to lose information during encoding? Many
multimedia codecs throw away subtle pieces of information which,
according to empirical research, have little impact on human perception.
As a very simple example, 2 adjacent pixels might be so close in color
that the coder declares them to be the same color and codes them together
as "2 x color1" instead of "1 x color1, 1 x color2". The decoded data
will not be exactly the same as the original data, but the goal is to
reconstruct a picture that is "good enough".

FourCCs
-------
"FourCC" is short for "four-character code". FourCCs are very commonly
seen in multimedia files, both to identify audio and video codecs and to
mark structural boundaries within the file. A FourCC is generally
comprised of 4 characters in the printable ASCII range which, when
examined as a hex dump, form a human-readable, four-character string. For
example:

  08 77 73 74 62 6C 00 00 00 7F 73 74 73 64 00 00  .wstbl....stsd..
  00 00 00 00 00 01 00 00 00 6F 53 56 51 33 00 00  .........oSVQ3..

This is taken from an Apple QuickTime file. One FourCC ('stsd') is
plainly visible, and two more ('stbl' and 'SVQ3') are not immediately
discernible since the 'w' and 'o' bytes preceding them also happen to be
printable ASCII characters.

Since a FourCC is made up of 4 ASCII bytes and each byte is 8 bits, a
FourCC is 32 bits long. This works well with modern 32-bit CPUs. As seen
in the above example, 'SVQ3' can also be represented as 0x53565133 in
big-endian hexadecimal notation, or 0x33515653 in little-endian hex
notation. Such knowledge alleviates the need for memcmp() and strncmp()
when scanning for FourCCs.

It is important to note that FourCCs do not necessarily need to contain 4
valid alphanumeric ASCII characters. For example, there are a variety of
FourCCs in the QuickTime format which are well outside that range.
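To illustrate the integer-comparison trick, here is a minimal C sketch.
The MKTAG macro is a hypothetical name invented for this example (several
multimedia projects define something similar), and the sketch assumes a
little-endian host, where the first byte read from the file lands in the
low-order bits of the integer:

  /* Minimal sketch: matching a FourCC as a 32-bit integer instead of
   * calling memcmp()/strncmp(). MKTAG is a hypothetical macro invented
   * for this example; the comparison assumes a little-endian host. */

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Pack 4 characters into a 32-bit integer, first character in the
   * low-order bits (the layout a little-endian CPU sees when it loads
   * 4 file bytes as one word). */
  #define MKTAG(a, b, c, d) \
      ((uint32_t)(a) | ((uint32_t)(b) << 8) | \
       ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

  int main(void)
  {
      /* 4 bytes exactly as they appear in the file */
      const unsigned char buf[4] = { 'S', 'V', 'Q', '3' };
      uint32_t tag;

      memcpy(&tag, buf, 4);  /* reinterpret the file bytes as a word */

      if (tag == MKTAG('S', 'V', 'Q', '3'))
          printf("found 'SVQ3' (0x%08X)\n", (unsigned int)tag);

      return 0;
  }

On a little-endian machine, tag holds 0x33515653, the little-endian
representation mentioned above, and a single integer compare replaces a
4-byte string compare.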
Multimedia Files
----------------
Many multimedia files that carry both audio and video bear extensions
such as .avi (Microsoft AVI files), .asf (a.k.a. .wmv and .wma,
collectively known as Microsoft ASF files), .mov (Apple QuickTime files),
and .rm (RealMedia files). Confusion often arises when one wonders what
application can, for example, "play .mov files". That is a very difficult
question to answer, and here is why: all of the formats mentioned above
are also referred to as multimedia container formats. All they do is pack
chunks of audio and video data together, interleaved, along with some
instructions that tell a playback application how the data is to be
decoded and presented to the user. This is the typical layout of many
multimedia file formats:

  file header
    title, creator, other meta-info
    video header
      video codec FourCC
      width, height, colorspace, playback framerate
    audio header
      audio codec FourCC
      bits/sample, playback frequency, channel count
  file data
    encoded audio chunk #0
    encoded video chunk #0
    encoded audio chunk #1
    encoded video chunk #1
    encoded audio chunk #2
    encoded video chunk #2
    encoded audio chunk #3
    encoded video chunk #3
    ..
    ..

The audio and video chunks can be encoded with any number of audio or
video codecs, the FourCCs of which are specified in the file header. See
The Almost Definitive FOURCC Definition List in the References section
for more information on the jungle of FourCCs out there and where they
commonly appear.

What Application Can "Play" This File?
--------------------------------------
Here comes the big question. You have some random Apple QuickTime file.
Perhaps you are running some non-Microsoft, non-Apple operating system
and there is no official Apple QuickTime application available. Is there
a program that can "play" the QT file? Since a QuickTime file can contain
many different types of audio and video data, it is not enough to be able
to simply parse the QuickTime container format; the audio and video codec
formats must be supported as well. This is why there is no simple answer
to whether or not a particular multimedia application can "play" a given
type of multimedia container file. A player application needs to be able
to parse the container format and also decode the audio and video codec
formats carried inside.

Interleaving
------------
Interleaving is the practice of storing alternating audio and video
chunks in the data section of a multimedia file:

  encoded audio chunk #0
  encoded video chunk #0
  encoded audio chunk #1
  encoded video chunk #1
  encoded audio chunk #2
  encoded video chunk #2
  ..
  ..
  encoded audio chunk #n
  encoded video chunk #n

Why is this done? Why not just place all of the video data in the file,
followed by all of the audio data? For example:

  encoded video chunk #0
  encoded video chunk #1
  encoded video chunk #2
  ..
  ..
  encoded video chunk #n
  encoded audio chunk #0
  encoded audio chunk #1
  encoded audio chunk #2
  ..
  ..
  encoded audio chunk #n

Conceptually, this appears to be a valid solution. In practice, however,
it falls over. Assuming the audio and video streams are part of the same
file on the same disk (almost always the case), the disk read head would
have to leap constantly between two different positions in the file: one
for the current video chunk and one for the current audio chunk. When the
chunks are interleaved, the read head does not need to seek at all; it
can read all of the data off in a contiguous fashion.

RGB and YUV Colorspaces
-----------------------
There are two general families of colorspaces for video: RGB and YUV. If
you have any experience with computer graphics at all, you have probably
been exposed to the red-green-blue (RGB) colorspace. More specifically,
you have probably seen packed RGB colorspaces. A packed colorspace has
all of the elements interleaved. For example, a packed RGB24 colorspace,
with 8 bits for each R, G, and B element, is laid out in memory as:

  R G B R G B R G B ...

Sometimes the opposite ordering is required. This would be expressed as
BGR24:

  B G R B G R B G R ...

24 bits is an awkward quantity for many CPUs; 32 bits is far more
convenient. Therefore, packed 32-bit RGB formats are often used for video
output in the interest of speed. When this is done, a fourth component,
usually labeled 'A', is added:

  ARGB: A R G B A R G B A R G B
  BGRA: B G R A B G R A B G R A

Sometimes the 'A' component actually represents an alpha transparency
value, used for blending RGB images together. For video playback, it is
generally ignored.

There are also many variations of 15- and 16-bit packed RGB formats. For
example, an RGB15 format may pack 5 bits for each component into the
lower 15 bits of a 2-byte word and leave the top bit for some other use:

   byte 0    byte 1
  Xrrrrrgg  gggbbbbb

Of course, whether those 2 bytes are stored in memory high or low byte
first depends on the application. BGR15 may also be seen. RGB16 formats
typically allocate the extra bit to green:

   byte 0    byte 1
  rrrrrggg  gggbbbbb
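To make that bit layout concrete, here is a minimal C sketch that packs
8-bit R, G, and B samples into the RGB16 (5-6-5) word shown above. The
pack_rgb565() function name is invented for this example:

  /* Minimal sketch: pack 8-bit R, G, and B samples into an RGB16
   * (5-6-5) word. pack_rgb565() is a hypothetical name invented for
   * this example. The low bits of each component are simply truncated,
   * which is why 16-bit RGB is a lossy representation of RGB24 data. */

  #include <stdint.h>
  #include <stdio.h>

  static uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
  {
      return (uint16_t)(((r >> 3) << 11) |  /* top 5 bits of red   */
                        ((g >> 2) <<  5) |  /* top 6 bits of green */
                         (b >> 3));         /* top 5 bits of blue  */
  }

  int main(void)
  {
      /* orange: R = 255, G = 128, B = 0 */
      uint16_t pixel = pack_rgb565(255, 128, 0);
      printf("RGB16 word: 0x%04X\n", pixel);  /* prints 0xFC00 */
      return 0;
  }

Whether the resulting word is then written to memory high or low byte
first is, as noted above, up to the application.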
Many older video codecs rely on packed RGB colorspaces since that is what
the hardware of the day could display natively. Certain modern codecs
still use RGB colorspaces if the source material is conducive, i.e., if
it is non-photorealistic or just plain simple. However, many modern video
codecs rely on a YUV colorspace.

'YUV' is a frustrating acronym since it is so difficult to guess what the
letters could possibly stand for. The colorspace was originally known as
YCbCr, with the 'b' and 'r' characters written as subscripts. This is how
the components break down:

  Y = luminance, or intensity
  U = Cb = blue chrominance value
  V = Cr = red chrominance value

Where is green represented? Green can be derived from the Y, U, and V
values. See the references for more information on converting YUV to RGB
and back.

Note that with RGB colorspaces, every single pixel has its own R, G, and
B samples. The same is not true with YUV colorspaces. YUV builds on the
empirical evidence that the human eye is more sensitive to variations in
the intensity of a pixel than to variations in its color. Thus, every
pixel in a YUV image has an associated Y sample, but groups of pixels
share U and V samples.

For example, examine the YUY2 colorspace, a.k.a. YUV 4:2:2 or just
YUV422. This is a packed YUV colorspace, which means that the Y, U, and V
samples are interleaved. The YUV data is laid out in memory as follows
(each sample is one byte):

  Y0 U Y1 V  Y0 U Y1 V  Y0 U Y1 V

Each group of 4 bytes represents 2 pixels. The first pixel is represented
by (Y0, U, V) and the second by (Y1, U, V). So each pixel gets its own Y
sample but has to share the U and V samples with its neighbor.

Perhaps the most common YUV format is I420, a.k.a. YUV 4:2:0 or just
YUV420. This is the format used in JPEG, MPEG, and many other modern
video codecs. The most notable difference between this colorspace and any
other discussed up to this point is that it is a planar format, not a
packed format. This means that when the data is stored in memory, all of
the Y data comes first, then all of the U data, then all of the V data.
In I420 data, pixels are grouped in 2x2 blocks:

  p0 p1
  p2 p3

Within each 2x2 block, each pixel is represented by its own Y sample, but
all 4 pixels share one U and one V sample:

  Y0 Y1   U  V
  Y2 Y3

As a highly contrived example, consider an I420 image that is 6x2 pixels:

  p0 p1 p2 p3 p4  p5
  p6 p7 p8 p9 p10 p11

The image will be broken up into 3 2x2 blocks for the purpose of
representing it as I420:

  p0  p1  | Y0  Y1    U0 V0
  p6  p7  | Y6  Y7

  p2  p3  | Y2  Y3    U1 V1
  p8  p9  | Y8  Y9

  p4  p5  | Y4  Y5    U2 V2
  p10 p11 | Y10 Y11

The planes of data will be stored in memory as:

  Y0 Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10 Y11 U0 U1 U2 V0 V1 V2

Another common planar format is YV12. This is precisely the same as I420
except that the U and V data planes are stored in the opposite order.

Another planar YUV format occasionally seen is YUV9, a.k.a. YUV 4:1:0 or
just YUV410. This is equivalent to I420 except that the image is broken
into 4x4 pixel blocks: each pixel gets its own Y sample while each block
shares one U and one V sample among all 16 pixels. Of course, there is
also non-subsampled planar YUV available, YUV 4:4:4, in which every pixel
is represented by its own Y, U, and V samples.

Notice that YUY2, I420, and YUV9 are all valid FourCCs. Where do the
numbers in these FourCCs come from? I strongly suspect they relate to how
many bits or bytes are required to store a single pixel, on average. For
YUY2 data, 4 bytes represent 2 pixels, so 2 bytes on average are required
to represent 1 pixel. Each 2x2 block of I420 data takes:

  4 + 1 + 1 = 6 bytes * 8 bits/byte = 48 bits / 4 pixels = 12 bits/pixel

And each 4x4 block of YUV9 data takes:

  (16 + 1 + 1) bytes * 8 bits/byte = 144 bits / 16 pixels = 9 bits/pixel
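The same arithmetic can be expressed in code. Here is a minimal C sketch
that computes the I420 plane sizes and average bits per pixel for a given
image, assuming even width and height; all of the names are invented for
this example:

  /* Minimal sketch: compute I420 plane sizes and average bits/pixel.
   * Assumes width and height are even. All names are hypothetical,
   * invented for this example. */

  #include <stdio.h>

  int main(void)
  {
      int width  = 6;  /* the contrived 6x2 image above */
      int height = 2;

      int y_size = width * height;             /* 1 Y sample per pixel */
      int u_size = (width / 2) * (height / 2); /* 1 U per 2x2 block    */
      int v_size = u_size;                     /* 1 V per 2x2 block    */

      /* The planes sit back to back in memory: Y, then U, then V.
       * YV12 would simply swap the U and V plane offsets. */
      int total = y_size + u_size + v_size;

      printf("Y: %d bytes, U: %d bytes, V: %d bytes\n",
             y_size, u_size, v_size);
      printf("average: %.1f bits/pixel\n",
             (total * 8.0) / (width * height));  /* prints 12.0 */

      return 0;
  }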
Note that it is conceptually possible for RGB data to be stored in a
planar manner rather than packed. In practice, this is rarely done.

References
----------
The Almost Definitive FOURCC Definition List
  http://www.fourcc.org/

RGB/YUV Pixel Conversion
  http://www.fourcc.org/fccyvrgb.htm

Acknowledgements
----------------
Torben Nielsen (torben at Hawaii.Edu) for corrections.
Diego Biurrun (diego at biurrun.de) for cosmetic English composition
fixes.

Changelog
---------
v1.1: September 25, 2005
  - replaced YV12 with I420 (correct FourCC) and noted what YV12 really
    means
  - English composition fixes

v1.0: June 14, 2003
  - initial release

GNU Free Documentation License
------------------------------
See http://www.gnu.org/licenses/fdl.html