Processing The Unknowns

This is the general process I have been using for working through the unknown video codec samples (but not always in this order):

  • Starting with the FourCC (which is usually how the samples are sorted thanks to my download method), look up the codec in the MultimediaWiki to see if anything is already known
  • Check the mphq archive to see if similar examples are already cataloged in the V-codecs directory
  • Check the FourCC list to see if they have any knowledge about the codec
  • Consult Google
  • Study the raw bytes of the file to see if there are any obvious free-form userdata strings in the header that would give away information
  • Run ‘ffmpeg -i <sample> -an -f image2 -vcodec copy %05d.frm’ on the sample to break up the frames into individual files
  • Observe the sizes of the individual frames. If they are all the same size, do some math based on the frame size and the resolution of the video file and try to guess the format. Make other educated guesses from the size pattern: frames that are all roughly the same size may indicate an intra-coded (i.e., all-keyframe) codec; an enormous first frame followed by a lot of extremely small frames, combined with other intelligence, may indicate a screen capture codec, which is my current hypothesis for Microsoft Camcorder Video
  • Upload samples to mphq and file them appropriately. Preferred strategy for samples: try to catalog at least 2 samples for a format, but no more than 5; make each one less than 5 megabytes if possible; if there is a choice, grab samples from different sources rather than multiple samples from one server (which were likely created with the same version of the same software using the same parameters); create a readme.txt file that lists the original URLs for the files
  • Create a new MultimediaWiki page for the format; create a FourCC redirect page so that the video FourCC is automatically categorized
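
The frame-size step above is easy to automate. Here is a minimal Python sketch, assuming the frames were already split into `.frm` files by the ffmpeg command in the list. The function names, the table of raw layouts, and the thresholds (20x for the "enormous first frame" test, 25% spread for "roughly the same size") are my own illustrative guesses, not established rules.

```python
import statistics
from pathlib import Path

# Illustrative table of common uncompressed layouts (bits per pixel).
# This is an assumption for the sketch, not an exhaustive list.
RAW_BPP = {
    "8-bit paletted or grayscale": 8,
    "YUV 4:2:0 planar": 12,
    "16-bit RGB or YUV 4:2:2": 16,
    "RGB24": 24,
    "RGB32": 32,
}


def frame_sizes(directory):
    """Return the byte size of every .frm file in a directory, in name order."""
    return [p.stat().st_size for p in sorted(Path(directory).glob("*.frm"))]


def guess_raw_layouts(frame_size, width, height):
    """Return the raw layouts whose expected frame size matches exactly."""
    pixels = width * height
    return [name for name, bpp in RAW_BPP.items()
            if pixels * bpp == frame_size * 8]


def classify_sizes(sizes, width, height):
    """Make a rough, heuristic guess about the codec from frame sizes."""
    if not sizes:
        return "no frames"
    if len(set(sizes)) == 1:
        matches = guess_raw_layouts(sizes[0], width, height)
        if matches:
            return "constant size, possibly raw: " + ", ".join(matches)
        return "constant size, possibly fixed-rate compression"
    # Arbitrary threshold: first frame at least 20x the median of the rest.
    if sizes[0] > 20 * statistics.median(sizes[1:]):
        return "huge first frame, tiny followers: possible screen capture codec"
    # Arbitrary threshold: standard deviation under 25% of the mean.
    if statistics.pstdev(sizes) / statistics.mean(sizes) < 0.25:
        return "similar sizes: possibly intra-coded (all keyframes)"
    return "mixed sizes: possibly inter-coded with periodic keyframes"
```

For example, `classify_sizes(frame_sizes("frames"), 320, 240)` would give a first guess for a 320x240 sample whose frames were split into a hypothetical `frames` directory. The result is only a hint to be combined with the other steps.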

Also, compn demonstrates that it’s important to try forcing the video data through several common codecs, most notably ISO MPEG-4 part 2 (a.k.a. DIVX/XVID) and JPEG.
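
Before forcing the data through a codec, a quick look at the first bytes of a frame can suggest which one to try first. The sketch below is a rough hint only: it checks for the JPEG start-of-image marker (FF D8) and for a few MPEG-4 part 2 start codes (00 00 01 followed by B0, B6, or 20 through 2F). The function name `sniff_frame` and the returned labels are made up for this example, and other MPEG-family formats share the 00 00 01 prefix, so a match is not proof.

```python
def sniff_frame(data: bytes) -> str:
    """Guess a codec family from the first bytes of a frame (heuristic)."""
    # JPEG data begins with the SOI marker FF D8.
    if data[:2] == b"\xff\xd8":
        return "jpeg candidate"
    # MPEG-style start code prefix: 00 00 01 followed by a code byte.
    if len(data) >= 4 and data[:3] == b"\x00\x00\x01":
        code = data[3]
        # B0 = visual object sequence, B6 = VOP, 20-2F = video object layer.
        if code in (0xB0, 0xB6) or 0x20 <= code <= 0x2F:
            return "mpeg-4 part 2 candidate"
        return "other start-code based stream"
    return "unknown"
```

Running `sniff_frame(open("00001.frm", "rb").read())` on the first split frame would be a typical use; the decisive test is still whether a decoder actually produces a sensible picture.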

I would like to hear other basic strategies for analyzing unknown formats.

3 thoughts on “Processing The Unknowns”

  1. Multimedia Mike Post author

    Thanks for the tip. I’m playing with it right now. So far, I haven’t seen it provide any data that I couldn’t find elsewhere. But it might streamline the overall process.

  2. Kostya

    JFIF data is usually recognisable at once. As for H.26x variants, including MPEG-4, either try forcing the codec or take an apprenticeship with the guru and you’ll be able to recognise them by first bytes too :)

    Also frames starting with 0x78 byte (‘x’) may be compressed with deflate (that reminds me of the first video codec I’ve REd).
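
The 0x78 hint can be checked mechanically. A minimal Python sketch, assuming the frame payload is a plain zlib stream starting at byte 0 (a real codec may well put its own header in front): the two-byte zlib header must name the deflate method, use a window of at most 32K, and be a multiple of 31 when read as a big-endian 16-bit value. The helper names are hypothetical.

```python
import zlib


def looks_like_zlib(data: bytes) -> bool:
    """Check whether the first two bytes form a valid zlib header."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    return (
        (cmf & 0x0F) == 8                    # compression method 8 = deflate
        and (cmf >> 4) <= 7                  # window size up to 32K
        and ((cmf << 8) | flg) % 31 == 0     # header check value
    )


def try_inflate(data: bytes):
    """Return the decompressed bytes, or None if this is not zlib data."""
    if not looks_like_zlib(data):
        return None
    try:
        # decompressobj tolerates trailing bytes after the zlib stream.
        return zlib.decompressobj().decompress(data)
    except zlib.error:
        return None
```

If `try_inflate` returns data, comparing its length against width times height times a plausible bytes-per-pixel value is a reasonable next step.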
