I (Heart) Picsearch And Python

I don’t know much about Picsearch. I don’t know what differentiates them from Google’s image search. And I certainly don’t know what they’re doing scouring the internet for video. But I know what I like, and I like the fact that Picsearch has submitted back to the FFmpeg development team 3 gargantuan lists of URLs:

  1. A list of 5100+ URLs linking to videos that crash FFmpeg
  2. A list of 3200 URLs linking to videos that have relatively uncommon video codecs
  3. A list of 1600+ URLs linking to videos that have relatively uncommon audio codecs


That first list is a quality engineer’s dream come true. I was able to download a little more than 4400 of the crasher URLs. The list was collected sometime last year and the good news is that FFmpeg has fixed enough problems that over half of the alleged crashers do not crash. There are still a lot of problems but I think most of them will cluster around a small set of bugs, particularly concerning the RealMedia demuxer.

I am currently downloading the uncommon video and audio format files. Given my interests, if processing the crashers is akin to having to eat my vegetables, processing a few thousand files with heretofore unknown codecs is like dessert!

So far, the challenge has been to download and process the huge number of samples efficiently. The usual “download and manually test” protocol followed when a problem sample is reported does not really scale in this situation. Invariably, I first try some half-hearted shell-based solutions. But… who really likes shell programming?

So I moved swiftly on to custom Python scripts for downloading and testing these files. Once I tighten up the scripts a little more and successfully process as many samples as I can, I will share them here, if only so I have a place where I can easily refer to the scripts again should I need them in the future (scripts are easily misplaced on my systems).
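In the meantime, here is a minimal sketch of the general shape such a download-and-test script might take. It is not the actual script. The function names, the hashed filename scheme, the 60-second timeout and the "samples" directory are all assumptions made for illustration. It also assumes a POSIX system, where Python reports a process killed by a signal (a segfault, for instance) as a negative return code.

```python
import hashlib
import subprocess
import urllib.parse
import urllib.request
from pathlib import Path


def local_name(url: str) -> str:
    """Derive a stable local filename from a URL (hypothetical naming scheme)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16]
    suffix = Path(urllib.parse.urlsplit(url).path).suffix
    return digest + suffix


def classify_returncode(returncode: int) -> str:
    """Map a subprocess return code to a coarse result label.

    Assumption: on POSIX, a negative code means the process died from a
    signal, which is treated here as a crash.
    """
    if returncode == 0:
        return "ok"
    if returncode < 0:
        return "crash"
    return "error"


def check_sample(path, ffmpeg="ffmpeg", timeout=60):
    """Decode one file to the null muxer and report how FFmpeg exited."""
    cmd = [ffmpeg, "-v", "error", "-i", str(path), "-f", "null", "-"]
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return classify_returncode(proc.returncode)


def process_urls(urls, workdir="samples"):
    """Download each URL once, test it, and return a {url: label} mapping."""
    out = Path(workdir)
    out.mkdir(exist_ok=True)
    results = {}
    for url in urls:
        target = out / local_name(url)
        try:
            if not target.exists():
                urllib.request.urlretrieve(url, target)
        except (OSError, ValueError):
            # Dead links are expected in a list collected long ago.
            results[url] = "download-failed"
            continue
        results[url] = check_sample(target)
    return results
```

Calling process_urls() on a list of URLs read from one of the text files would give a dictionary that can be tallied by label. Hashing the URL for the filename sidesteps collisions between the many samples that share a name like video.rm.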

10 thoughts on “I (Heart) Picsearch And Python”

  1. Anonymous

    I took a look at the uncommon audio codecs list. The most frequent codecs are:

    0x1100736d – What is this?
    0x7a21 – AMR (What happened to this SOC project?)
    wmav1 – should work
    truespeech – should work
    QDMC – undiscovered, no documentation
    real_288 – should work
    0x0006 – a-law, should work
    pcm_s24le – should work
    sawb – AMR
    mace3 – should work
    imc – Intel Music Coder, should work
    pcm_s32be – should work
    adpcm_swf – should work
    0x0163 – WMA lossless, undiscovered
    drms – No way
    fl64 – floating point PCM, should work
    0x0402 – Ligos Indeo Audio, undiscovered
    mp1 – MPEG layer 1, should work
    fl32 – floating point PCM, should work
    Qclq – QCELP, stuff to test
    adpcm_ct – Creative ADPCM decoder, should work
    0x00ff – AAC, should work
    pcm_s32le – should work
    0x7a22 – AMR
    0x0003 – floating point PCM, should work

    So most of the above look like good things to test.

  2. Multimedia Mike Post author

    True, uncommon != unsupported in FFmpeg. And there has already been a lot of movement on the development mailing list to map some of the video FourCCs to known codecs (apparently, there are even more aliases for MPEG-4 part 2 video).

  3. Vitor

    > I took a look at the uncommon audio codecs list. The most frequent codecs are:

    > 0x1100736d – What is this?

    This is adpcm_ima_wav, works fine with recent SVN.

  4. compn

    added a few of those codecs to mplayer and ffmpeg
    still a few left to figure out.

    i wonder if that vorbis/theora in .mov file plays on anything :)

  5. Multimedia Mike Post author

    Thanks for your work on this, compn. I have been watching your commits to files like riff.c. I will be taking those into account as I methodically add samples to the MultimediaWiki (slow process, just take it a few at a time, each evening).

  6. Multimedia Mike Post author

    QDMC is well known as an earlier incarnation of the widely-used QDM2 codec. It has been shown through reverse engineering to be similar but not strictly compatible with its successor. It is great that we have some more samples, the lack of which has traditionally impeded RE efforts.
