How To Process Unknown Multimedia Samples | Breaking Eggs And Making Omelettes

This is the general process I have been using for working through the unknown video codec samples (but not always in this order):

Starting with the FourCC (which is usually how the samples are sorted thanks to my download method), look up the codec in the MultimediaWiki to see if anything is already known
Check the mphq archive to see if similar examples are already cataloged in the V-codecs directory
Check the FourCC list to see if they have any knowledge about the codec
Consult Google
Study the raw bytes of the file to see if there are any obvious free-form userdata strings in the header that would give away information
Run ‘ffmpeg -i <sample> -an -f image2 -vcodec copy %05d.frm’ on the sample to break up the frames into individual files
Observe the sizes of the individual frames: if they are all exactly the same, do some math based on the frame size and the resolution of the video file and try to guess the format; make other educated guesses based on frame sizes (all frames being roughly the same size may indicate an intra-coded codec, i.e., all keyframes; a first frame that is enormous followed by a lot of extremely small frames, combined with other intelligence, may indicate a screen capture codec, which is my current hypothesis for Microsoft Camcorder Video)
Upload samples to mphq and file them appropriately; preferred strategy for samples: try to catalog at least 2 samples for a format, but no more than 5; keep each one under 5 megabytes if possible; if there is a choice, grab samples from different sources rather than multiple samples from one server (which were likely created with the same version of the same software using the same parameters); create a readme.txt file that lists the original URLs for the files
Create a new MultimediaWiki page for the format; create a FourCC redirect page so that the video FourCC is automatically categorized
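
The frame-size step in the list above is easy to automate. What follows is only a minimal Python sketch, not a tool I actually use: it assumes the .frm files from the ffmpeg step sit in one directory and that the width and height are already known from the container. The table of raw layouts and the two thresholds (0.25 and 10) are arbitrary, illustrative choices, and every function name here is made up for the example.

```python
import statistics
from pathlib import Path

# Bytes per pixel for a few common uncompressed layouts.
# Illustrative only; this table is an assumption, not an exhaustive list.
RAW_LAYOUTS = {
    "RGB32": 4.0,
    "RGB24": 3.0,
    "16 bpp (RGB565, YUY2, UYVY)": 2.0,
    "YUV 4:2:0 planar (12 bpp)": 1.5,
    "YUV 4:1:0 planar (9 bpp)": 1.125,
    "8-bit palettized or grayscale": 1.0,
}


def load_frame_sizes(directory, pattern="*.frm"):
    """Sizes of the dumped frames, in file-name (i.e. frame) order."""
    return [p.stat().st_size for p in sorted(Path(directory).glob(pattern))]


def guess_raw_layouts(frame_size, width, height):
    """Names of the raw layouts whose frame size matches exactly."""
    pixels = width * height
    return [name for name, bpp in RAW_LAYOUTS.items() if pixels * bpp == frame_size]


def describe_frame_sizes(sizes, width, height):
    """Apply the rough frame-size heuristics; thresholds are guesses."""
    if not sizes:
        return "no frames"
    if len(set(sizes)) == 1:
        layouts = guess_raw_layouts(sizes[0], width, height)
        if layouts:
            return "constant size, matches raw layout: " + ", ".join(layouts)
        bits = sizes[0] * 8 / (width * height)
        return f"constant size, {bits:.2f} bits per pixel (fixed-rate coding?)"
    # Relative spread of the sizes: small means "roughly the same size".
    spread = statistics.pstdev(sizes) / statistics.mean(sizes)
    if spread < 0.25:
        return "roughly equal sizes: possibly intra-coded"
    if sizes[0] > 10 * statistics.median(sizes[1:]):
        return "huge first frame, tiny followers: possibly screen capture"
    return "no obvious pattern"
```

Calling describe_frame_sizes(load_frame_sizes("."), 320, 240) in the directory holding the dumped frames gives a one-line guess. It is a starting point for the "other intelligence" mentioned above, not a verdict.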

Also, compn demonstrates that it’s important to try forcing the video data through several common codecs, most notably ISO MPEG-4 part 2 (a.k.a. DIVX/XVID) and JPEG.
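
One way to try this in bulk is to generate an ffmpeg command line per candidate decoder. The Python sketch below only builds and prints the commands. The candidate list and the function name are assumptions made for illustration, and it uses the current `-c:v` spelling of the `-vcodec` option; placed before `-i`, that option selects the decoder for the input.

```python
import shlex

# Decoder names as ffmpeg spells them. The choice of candidates is an
# assumption for illustration; extend it as needed.
CANDIDATE_DECODERS = ["mpeg4", "mjpeg", "h263", "h264"]


def forced_decode_command(sample, decoder):
    """Build an ffmpeg argv that decodes `sample` with `decoder`, discarding output."""
    return [
        "ffmpeg", "-v", "error",   # only print errors
        "-c:v", decoder,           # before -i, so it forces the input decoder
        "-i", sample,
        "-an",                     # ignore audio
        "-f", "null", "-",         # decode, but write nothing
    ]


for decoder in CANDIDATE_DECODERS:
    print(shlex.join(forced_decode_command("sample.avi", decoder)))
```

A run that finishes with few or no decoder errors is a hint rather than proof; the next step would be to dump a few decoded frames to images and look at them.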

I would like to hear other basic strategies for analyzing unknown formats.

3 thoughts on “Processing The Unknowns”

Steve December 12, 2008 at 8:17 pm

Mike, you could try MediaInfo. I’m guessing they are using the FourCC.org database, because they seem to know about some truly out-there formats. http://mediainfo.sourceforge.net/

For any filetype it knows, it can provide an insane amount of information.
Multimedia Mike (Post author) December 12, 2008 at 9:58 pm

Thanks for the tip. I’m playing with it right now. So far, I haven’t seen it provide any data that I couldn’t find elsewhere. But it might streamline the overall process.
Kostya December 13, 2008 at 12:26 am

JFIF data is usually recognisable at once. As for H.26x variants including MPEG-4, either try forcing the codec or take an apprenticeship with the guru and you’ll be able to recognise them by their first bytes too :)

Also frames starting with 0x78 byte (‘x’) may be compressed with deflate (that reminds me of the first video codec I’ve REd).
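
The first-byte checks mentioned in this comment can be sketched in Python roughly as follows. This is an illustration under assumptions, not anyone's actual tool: the function name is made up, and the checks are limited to the JPEG/JFIF markers, the 00 00 01 start-code prefix used by the MPEG family and by H.264 byte streams, and the two-byte zlib header (deflate data wrapped by zlib commonly begins with 0x78).

```python
import zlib


def sniff_frame(data):
    """Return a list of hints based on the first bytes of one frame."""
    hints = []
    # JPEG data starts with the SOI marker FF D8; JFIF files carry the
    # ASCII tag "JFIF" inside the APP0 segment that follows it.
    if data[:2] == b"\xff\xd8":
        hints.append("JPEG SOI marker")
        if data[6:10] == b"JFIF":
            hints.append("JFIF header")
    # MPEG-1/2/4 and H.264 byte streams use 00 00 01 start-code prefixes.
    if data[:3] == b"\x00\x00\x01" or data[:4] == b"\x00\x00\x00\x01":
        hints.append("MPEG-style start code")
    # A zlib header is two bytes that, read as a big-endian number, are a
    # multiple of 31; 0x78 is the usual first byte.
    if len(data) >= 2 and data[0] == 0x78 and ((data[0] << 8) | data[1]) % 31 == 0:
        hints.append("zlib header")
        try:
            zlib.decompressobj().decompress(data)
            hints.append("inflates without error")
        except zlib.error:
            pass
    return hints
```

Running it over each dumped frame, e.g. sniff_frame(open("00001.frm", "rb").read()), gives a quick first impression. An empty list only means none of these particular signatures matched.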
