Tag Archives: samples

Processing The Unknowns

This is the general process I have been using for working through the unknown video codec samples (but not always in this order):

  • Starting with the FourCC (which is usually how the samples are sorted thanks to my download method), look up the codec in the MultimediaWiki to see if something is already known
  • Check the mphq archive to see if similar examples are already cataloged in the V-codecs directory
  • Check the FourCC list to see if they have any knowledge about the codec
  • Consult Google
  • Study the raw bytes of the file to see if there are any obvious free-form userdata strings in the header that would give away information
  • Run ‘ffmpeg -i <sample> -an -f image2 -vcodec copy %05d.frm’ on the sample to break up the frames into individual files
  • Observe the sizes of the frames. If they are all exactly the same, do some math based on the frame size and the resolution of the video file and try to guess the format. Make other educated guesses from the frame sizes: frames of roughly the same size may indicate an intra-coded codec (i.e., all keyframes); a codec where the first frame is enormous and is followed by many extremely small frames may, combined with other intelligence, indicate a screen capture codec (my current hypothesis for Microsoft Camcorder Video)
  • Upload samples to mphq and file them appropriately. Preferred strategy for samples: try to catalog at least 2 samples for a format, but no more than 5; keep each under 5 megabytes if possible; if there is a choice, grab samples from different sources rather than multiple samples from one server (which were likely created with the same version of the same software using the same parameters); create a readme.txt file that lists the original URLs for the files
  • Create a new MultimediaWiki page for the format; create a FourCC redirect page so that the video FourCC is automatically categorized
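
The frame-size step above can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the function names, the short table of raw pixel layouts, and the numeric thresholds (0.25 and 10x) are all made up for illustration, and the output is a hint to investigate further, not a verdict.

```python
import os
import statistics

# Bits per pixel for a few common uncompressed layouts.
# Illustrative only; real files may add row padding or headers.
RAW_FORMATS = {
    "8-bit palettized or grayscale": 8,
    "YUV 4:2:0 planar": 12,
    "16-bit RGB or YUV 4:2:2": 16,
    "RGB24": 24,
    "RGB32": 32,
}


def frame_sizes(directory):
    """Return the byte sizes of the .frm files in a directory, in name order."""
    names = sorted(n for n in os.listdir(directory) if n.endswith(".frm"))
    return [os.path.getsize(os.path.join(directory, n)) for n in names]


def guess_raw_formats(frame_size, width, height):
    """Return the raw layouts whose exact size equals frame_size."""
    pixels = width * height
    return [name for name, bpp in RAW_FORMATS.items()
            if pixels * bpp == frame_size * 8]


def classify_frame_sizes(sizes, width, height):
    """Turn a list of frame sizes into a rough, hedged hypothesis."""
    if not sizes:
        return "no frames"
    if len(set(sizes)) == 1:
        matches = guess_raw_formats(sizes[0], width, height)
        if matches:
            return "constant size, possibly raw: " + ", ".join(matches)
        return "constant size, possibly a fixed-rate format"
    # Relative spread of the sizes; the 0.25 cutoff is an arbitrary assumption.
    spread = statistics.pstdev(sizes) / statistics.mean(sizes)
    if spread < 0.25:
        return "roughly equal sizes, possibly intra-only"
    # The 10x ratio is likewise an arbitrary assumption.
    if sizes[0] > 10 * statistics.median(sizes[1:]):
        return "huge first frame then small frames, possibly screen capture"
    return "inconclusive"
```

For example, `classify_frame_sizes(frame_sizes("."), 320, 240)` would examine the .frm files in the current directory, with the resolution read off the ffmpeg output by hand.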

Also, compn demonstrates that it’s important to try forcing the video data through several common codecs, most notably ISO MPEG-4 part 2 (a.k.a. DIVX/XVID) and JPEG.
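
A cheap pre-check before forcing data through a decoder is to look for well-known byte signatures at the start of an extracted frame. The sketch below is not a substitute for actually trying the decoders, and it is not compn's method; the function name and the 64-byte search window are assumptions for illustration. It relies only on two standard facts: a JPEG stream begins with the SOI marker FF D8, and MPEG-4 part 2 streams contain start codes such as 00 00 01 B0 (visual object sequence) and 00 00 01 B6 (VOP).

```python
def sniff_known_bitstream(data, window=64):
    """Guess a codec family from leading bytes of a frame, or return None."""
    # JPEG: every stream starts with the Start Of Image marker.
    if data[:2] == b"\xff\xd8":
        return "JPEG (SOI marker at offset 0)"
    # MPEG-4 part 2: look for a start code near the beginning of the frame.
    head = data[:window]
    for code, label in ((b"\x00\x00\x01\xb0", "visual object sequence"),
                        (b"\x00\x00\x01\xb6", "VOP")):
        if code in head:
            return "MPEG-4 part 2 (" + label + " start code)"
    return None
```

A hit only suggests which decoder to try first; a miss proves nothing, since a codec may wrap or lightly obfuscate an otherwise standard bitstream.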

I would like to hear other basic strategies for analyzing unknown formats.

Designing A Download Strategy

The uncommon video codecs list mentioned in the last post is amazing. Here are some FourCCs I have never heard of before: 3ivd, abyr, acdv, aura, brco, bt20, bw10, cfcc, cfhd, digi, dpsh, dslv, es07, fire, g2m3, gain, geox, imm4, inmc, mohd, mplo, qivg, suvf, ty0n, xith, xplo, and zdsv. Several have been found to be variations of other codecs. And there are some that were only rumored to exist, such as aflc as a codec for storing FLIC data in an AVI container, and azpr as an alternate FourCC for rpza. We now have samples. The existence of many of these FourCCs has, in fact, been cataloged on FourCC.org. But I was always reluctant to document the FourCCs in the MultimediaWiki unless I could find either samples or a binary codec.

But how to obtain all of these samples?

Do you ever download files from the internet? Of course you do. Do you ever download a bunch of files at a time? Maybe. But have you ever had to download a few thousand files?

I have some experience to guide me in this.