This is the general process I have been using for working through the unknown video codec samples (but not always in this order):
- Starting with the FourCC (which is usually how the samples are sorted thanks to my download method), look up the codec in the MultimediaWiki to see if something is already known
- Check the mphq archive to see if similar examples are already cataloged in the V-codecs directory
- Check the FourCC list to see if they have any knowledge about the codec
- Consult Google
- Study the raw bytes of the file to see if there are any obvious free-form userdata strings in the header that would give away information
- Run 'ffmpeg -i <sample> -an -f image2 -vcodec copy %05d.frm' on the sample to break up the frames into individual files
- Observe the sizes of the frames: if they are all the same, do some math based on the frame size and the resolution of the video file and try to guess the format; make other educated guesses based on frame sizes (frames that are all roughly the same size may indicate an intra-coded codec, i.e., all keyframes; a codec where the first frame is enormous and is followed by a lot of extremely small frames may, combined with other intelligence, indicate a screen capture codec, which is my current hypothesis for Microsoft Camcorder Video)
- Upload samples to mphq and file appropriately; preferred strategy for samples: try to catalog at least 2 samples for a format, but no more than 5; make them each less than 5 megabytes if possible; if there is a choice, try to grab samples from different sources rather than grabbing multiple samples from one server (which were likely created with the same version of the same software using the same parameters); create a readme.txt file that lists the original URLs for the files
- Create a new MultimediaWiki page for the format; create a FourCC redirect page so that the video FourCC is automatically categorized
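The frame-size guesswork in the list above can be roughed out in a few lines of Python. This is only a sketch of the idea, not a tool I actually use: the function names, the table of raw pixel layouts, and the thresholds (25% spread for "roughly the same size", a 10x ratio for "enormous first frame") are all assumptions picked for illustration.

```python
import glob
import os
from statistics import mean

# Assumed table of common uncompressed layouts, keyed by bits per pixel.
RAW_FORMATS = {
    8: "8-bit palettized or grayscale",
    12: "planar YUV 4:2:0",
    16: "packed YUV 4:2:2 or 16-bit RGB",
    24: "24-bit RGB",
    32: "32-bit RGB(A)",
}


def read_frame_sizes(pattern="*.frm"):
    """Return the byte size of each dumped frame file, in filename order."""
    return [os.path.getsize(path) for path in sorted(glob.glob(pattern))]


def guess_from_frame_sizes(sizes, width, height):
    """Return a rough, human-readable guess based on per-frame byte counts."""
    if not sizes:
        return "no frames"

    # Every frame identical in size: do the math against the resolution.
    if len(set(sizes)) == 1:
        bits_per_pixel = sizes[0] * 8 / (width * height)
        name = RAW_FORMATS.get(bits_per_pixel)
        if name:
            return f"constant size, {bits_per_pixel:g} bpp: possibly raw {name}"
        return f"constant size, {bits_per_pixel:.2f} bpp: possibly fixed-rate coding"

    # All frames within an (arbitrary) 25% of the average size.
    average = mean(sizes)
    if all(abs(size - average) <= 0.25 * average for size in sizes):
        return "roughly equal sizes: possibly intra-only (all keyframes)"

    # First frame at least an (arbitrary) 10x larger than the rest on average.
    rest = sizes[1:]
    if rest and sizes[0] >= 10 * mean(rest):
        return "huge first frame, tiny followers: possibly screen capture"

    return "no obvious pattern"
```

After dumping the frames with the ffmpeg command above, something like guess_from_frame_sizes(read_frame_sizes(), 320, 240) in the same directory gives a first hint. The result is a hypothesis to check against other evidence, nothing more.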
Also, compn demonstrates that it’s important to try forcing the video data through several common codecs, most notably ISO MPEG-4 part 2 (a.k.a. DIVX/XVID) and JPEG.
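Forcing the data through a decoder is the real test. As a cheaper first pass, the start of a dumped frame can be checked for well-known markers. The sketch below is my own illustration rather than compn's method: the function name and the 64-byte search window are assumptions, and a match only suggests which decoder to try first. It relies on the JPEG start-of-image marker (FF D8) and the MPEG-4 part 2 start codes (00 00 01 followed by B0 for a visual object sequence, B6 for a VOP, or 20 through 2F for a video object layer).

```python
def sniff_frame(data: bytes) -> str:
    """Return a hint about the payload based on leading marker bytes."""
    # JPEG data begins with the start-of-image marker FF D8.
    if data[:2] == b"\xff\xd8":
        return "JPEG (SOI marker)"

    # Search an assumed 64-byte window for MPEG-style start codes.
    head = data[:64]
    index = head.find(b"\x00\x00\x01")
    while index != -1 and index + 3 < len(head):
        code = head[index + 3]
        if code == 0xB0:
            return "MPEG-4 part 2 (visual object sequence start code)"
        if code == 0xB6:
            return "MPEG-4 part 2 (VOP start code)"
        if 0x20 <= code <= 0x2F:
            return "MPEG-4 part 2 (video object layer start code)"
        index = head.find(b"\x00\x00\x01", index + 1)

    return "unknown"
```

Running sniff_frame(open("00001.frm", "rb").read()) on the first dumped frame is usually enough. An "unknown" result rules nothing out, since a codec may wrap familiar data behind its own header.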
I would like to hear other basic strategies for analyzing unknown formats.