Category Archives: General

Processing The Unknowns

This is the general process I have been using for working through the unknown video codec samples (but not always in this order):

  • Starting with the FourCC (which is usually how the samples are sorted, thanks to my download method), look up the codec in the MultimediaWiki to see if something is already known
  • Check the mphq archive to see if similar examples are already cataloged in the V-codecs directory
  • Check the FourCC list to see if it has any knowledge about the codec
  • Consult Google
  • Study the raw bytes of the file to see if there are any obvious free-form userdata strings in the header that would give away information
  • Run 'ffmpeg -i <sample> -an -f image2 -vcodec copy %05d.frm' on the sample to break the frames up into individual files
  • Observe the sizes of the individual frames. If they are all the same, do some math based on the frame size and the resolution of the video file and try to guess the format. Make other educated guesses based on frame sizes: frames that are all roughly the same size may indicate an intra-coded codec (i.e., all keyframes), while an enormous first frame followed by a lot of extremely small frames, combined with other intelligence, may indicate a screen capture codec (my current hypothesis for Microsoft Camcorder Video)
  • Upload samples to mphq and file appropriately; preferred strategy for samples: try to catalog at least 2 samples for a format, but no more than 5; make them each less than 5 megabytes if possible; if there is a choice, try to grab samples from different sources rather than grabbing multiple samples from one server (which were likely created with the same version of the same software using the same parameters); create readme.txt file that lists the original URLs for the files
  • Create a new MultimediaWiki page for the format; create a FourCC redirect page so that the video FourCC is automatically categorized
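
The frame-size heuristics above can be sketched in Python. This assumes the frames were already split out with the ffmpeg command from the checklist; the 10% tolerance and 10x first-frame ratio are arbitrary guesses of mine, not established thresholds:

```python
# Hypothetical sketch of the frame-size guessing step; assumes the sample
# was already split into %05d.frm files with:
#   ffmpeg -i <sample> -an -f image2 -vcodec copy %05d.frm
import os

def classify_frame_sizes(sizes, tolerance=0.1):
    """Guess at codec characteristics from a list of per-frame byte sizes."""
    if not sizes:
        return "no frames"
    avg = sum(sizes) / len(sizes)
    # All frames within ~10% of the average: possibly intra-only (all keyframes)
    if all(abs(s - avg) <= tolerance * avg for s in sizes):
        return "possibly intra-coded (all keyframes)"
    # Huge first frame followed by tiny delta frames: screen-capture pattern
    rest = sizes[1:]
    if rest and sizes[0] > 10 * (sum(rest) / len(rest)):
        return "possibly screen capture (big keyframe, small deltas)"
    return "inconclusive"

def frame_sizes_from_dir(directory):
    """Collect the sizes of the .frm files that ffmpeg produced."""
    names = sorted(n for n in os.listdir(directory) if n.endswith(".frm"))
    return [os.path.getsize(os.path.join(directory, n)) for n in names]
```

A verdict of "inconclusive" just means the next steps in the checklist (header strings, Google, the wikis) have to carry the weight.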

Also, compn demonstrates that it’s important to try forcing the video data through several common codecs, most notably ISO MPEG-4 part 2 (a.k.a. DIVX/XVID) and JPEG.
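
That trick could be automated along these lines; note that the exact invocation (placing the decoder option before -i to force the input decoder) is an assumption about the ffmpeg build in use, so verify it against your own build first:

```python
# Hypothetical automation of the "force common decoders" trick; the flag
# placement (-vcodec before -i) is an assumption about the ffmpeg build.
import subprocess

# The most notable candidates per the text: ISO MPEG-4 part 2 and JPEG
CANDIDATE_DECODERS = ["mpeg4", "mjpeg"]

def build_force_cmd(sample, decoder):
    """Build an ffmpeg command that forces a specific input decoder and
    discards the decoded output (we only care whether decoding succeeds)."""
    return ["ffmpeg", "-vcodec", decoder, "-i", sample, "-an", "-f", "null", "-"]

def try_decoders(sample, decoders=CANDIDATE_DECODERS):
    """Return the candidate decoders that process the sample without error."""
    winners = []
    for dec in decoders:
        result = subprocess.run(build_force_cmd(sample, dec),
                                capture_output=True)
        if result.returncode == 0:
            winners.append(dec)
    return winners
```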

I would like to hear other basic strategies for analyzing unknown formats.

We Don’t Care; We Don’t Have To

There is an old Saturday Night Live parody commercial from the later days of the U.S. phone company monopoly featuring Lily Tomlin as a phone company representative:


Lily Tomlin in the Phone Company parody SNL commercial
“So, the next time you complain about your phone service, why don’t you try using two Dixie cups with a string? We don’t care. We don’t have to. We’re the Phone Company.”

The reason I bring this up is because I participate in the FFmpeg project. FFmpeg is in a unique place among open source projects. A common complaint against the open source paradigm is that there is too much duplicated effort among competing projects that all basically do the same thing while never matching, much less surpassing, their proprietary counterparts; yet there is nothing else in the entire software world like FFmpeg. Indeed, FFmpeg has a monopoly on do-everything multimedia manipulation programs.

Some people are distraught by this.

swfdec author Benjamin Otte has a blog post lamenting the problems of developing directly with FFmpeg. This finally prompted me to use my sucky research method against FFmpeg. The sucky research method works like this: Google for “XYZ sucks”, where XYZ is some software program, consumer product, or company in order to gauge the level of negativity against XYZ or perhaps to just commiserate with other chumps in the same boat as you. I most recently used this method to find other chumps as frustrated as me with both PHP and WordPress.

I discovered surprisingly few sites dedicated to hating FFmpeg. These stood out: FFMpeg strikes (again) and ffmpeg sucks. One comment even pointed out that there are no ffmpegsucks.tld domains registered yet, so I take that as a positive sign (hurry and register yours today!).

Most of the complaints center on the fact that there is still no central release authority or process for FFmpeg. My usual response to this is that the leadership of FFmpeg is committed to making releases eventually (this may seem non-committal but many people are still under the impression that the leadership is actively opposed to releases). It’s just that doing so takes work, planning and — get ready for it — testing. Honestly, why do you think I have been working on FATE? I want it to serve as a baseline to build confidence that the code, you know, actually works before we make any releases.

I’m not mad, though. It’s all right. I mean, seriously, what are people going to do about the situation? Refuse to use FFmpeg? Maybe fork the codebase? Heh, I dare you. FFmpeg is only as capable as the talent developing it. Better yet, is someone going to start a competing project from scratch to supplant FFmpeg? Seriously, get a grip and calm down before you hurt yourself, then we’ll talk about what we can all do together to improve FFmpeg and work toward a release schedule.

Unfortunately, we just received a few thousand files that crash FFmpeg. That might push back the release schedule a bit. You want a reliable and secure multimedia backend library, I trust?


Designing A Download Strategy

The uncommon video codecs list mentioned in the last post is amazing. Here are some FourCCs I have never heard of before: 3ivd, abyr, acdv, aura, brco, bt20, bw10, cfcc, cfhd, digi, dpsh, dslv, es07, fire, g2m3, gain, geox, imm4, inmc, mohd, mplo, qivg, suvf, ty0n, xith, xplo, and zdsv. There are several that have been found to be variations of other codecs. And there are some that were only rumored to exist, such as aflc as a codec for storing FLIC data in an AVI container, and azpr as an alternate FourCC for rpza. We now have samples. The existence of many of these FourCCs has, in fact, been cataloged on FourCC.org. But I was always reluctant to document the FourCCs in the MultimediaWiki unless I could find either samples or a binary codec.

But how to obtain all of these samples?

Do you ever download files from the internet? Of course you do. Do you ever download a bunch of files at a time? Maybe. But have you ever had to download a few thousand files?

I have some experience to guide me in this.
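
One possible downloader sketch in Python, assuming a plain text file with one URL per line; the url_list.txt convention and the "samples" destination directory are hypothetical names, not the actual setup:

```python
# A minimal bulk-download sketch; the one-URL-per-line list file and the
# "samples" destination directory are hypothetical.
import os
import urllib.parse
import urllib.request

def local_name(url, dest_dir="samples"):
    """Derive a local filename from a URL's path component."""
    name = os.path.basename(urllib.parse.urlparse(url).path) or "unnamed"
    return os.path.join(dest_dir, name)

def fetch_all(list_file, dest_dir="samples"):
    """Download every URL in list_file, skipping files already on disk."""
    os.makedirs(dest_dir, exist_ok=True)
    with open(list_file) as f:
        for url in (line.strip() for line in f):
            if not url:
                continue
            target = local_name(url, dest_dir)
            if os.path.exists(target):   # crude resume support
                continue
            try:
                urllib.request.urlretrieve(url, target)
            except OSError as err:
                print("failed:", url, err)
```

A real run over a few thousand URLs would also need to handle colliding basenames (two different URLs ending in movie.avi would overwrite each other here).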

I (Heart) Picsearch And Python

I don’t know much about Picsearch. I don’t know what differentiates them from Google’s image search. And I certainly don’t know what they’re doing scouring the internet for video. But I know what I like, and I like the fact that Picsearch has submitted back to the FFmpeg development team 3 gargantuan lists of URLs:

  1. A list of 5100+ URLs linking to videos that crash FFmpeg
  2. A list of 3200 URLs linking to videos that have relatively uncommon video codecs
  3. A list of 1600+ URLs linking to videos that have relatively uncommon audio codecs


Picsearch logo

That first list is a quality engineer’s dream come true. I was able to download a little more than 4400 of the crasher URLs. The list was collected sometime last year and the good news is that FFmpeg has fixed enough problems that over half of the alleged crashers do not crash. There are still a lot of problems but I think most of them will cluster around a small set of bugs, particularly concerning the RealMedia demuxer.

I am currently downloading the uncommon video and audio format files. Given my interests, if processing the crashers is akin to having to eat my vegetables, processing a few thousand files with heretofore unknown codecs is like dessert!

So far, the challenge has been to download and process the huge number of samples efficiently. The “download and manually test” protocol usually followed when a problem sample is reported does not really scale in this situation. Invariably, I first try some half-hearted shell-based solutions. But… who really likes shell programming?

So I moved swiftly on to custom Python scripts for downloading and testing these files. Once I tighten up the scripts a little more and successfully process as many samples as I can, I will share them here, if only so I have a place where I can easily refer to the scripts again should I need them in the future (scripts are easily misplaced on my systems).