Tag Archives: Python

I (Heart) Picsearch And Python

I don’t know much about Picsearch. I don’t know what differentiates them from Google’s image search. And I certainly don’t know what they’re doing scouring the internet for video. But I know what I like, and I like the fact that Picsearch has submitted back to the FFmpeg development team 3 gargantuan lists of URLs:

  1. A list of 5100+ URLs linking to videos that crash FFmpeg
  2. A list of 3200 URLs linking to videos that have relatively uncommon video codecs
  3. A list of 1600+ URLs linking to videos that have relatively uncommon audio codecs


Picsearch logo

That first list is a quality engineer’s dream come true. I was able to download a little more than 4400 of the crasher URLs. The list was collected sometime last year and the good news is that FFmpeg has fixed enough problems that over half of the alleged crashers do not crash. There are still a lot of problems but I think most of them will cluster around a small set of bugs, particularly concerning the RealMedia demuxer.

I am currently downloading the uncommon video and audio format files. Given my interests, if processing the crashers is akin the having to eat my vegetables, processing a few thousand files with heretofore unknown codecs is like dessert!

So far, the challenge here has been to both download and process the huge amount of samples efficiently. The usual “download and manually test” protocol usually followed when a problem sample is reported does not really scale in this situation. Invariably, I first try some half-hearted shell-based solutions. But… who really likes shell programming?

So I moved swiftly on to custom Python scripts for downloading and testing these files. Once I tighten up the scripts a little more and successfully process as many samples as I can, I will share them here, if only so I have a place where I can easily refer to the scripts again should I need them in the future (scripts are easily misplaced on my systems).

Naive x86 Re-targeter

Here’s the complete, do-it-yourself instructions and code for that re-targeting experiment. First, the files:

The re-targeter wants to process code like that found in function.txt. This is the disassembly format output by my favorite Win32 PE disassembler, Sang Cho’s Disassembler. I knew in advance that the function expects 4 parameters, and that fact is hardcoded in the re-targeter along with the file name. The testbench.c file contains the opcodes for the original function and allows the programmer to switch between the original opcodes and the re-targeted code for verification.

To run this experiment:

  • download all 5 files into one (Unix, x86) directory
  • ./unnamed-re-project.py > bitreader.c
  • gcc *.c -o testbench

The testbench.c program simulates the data structure that the re-targeted bitreading function expects, along with a bitstream that looks like 0xA5, 0x5A repeated. Running the program should result in:

 12 bits = A55
  4 bits = A

If compiling on x86_32, you can switch the “#if 0” to “#if 1” in order to test the original opcodes.

I took a stab at making the re-targeter portable to a big endian platform, as brainstormed last night. However, I soon realized what my hastily scrawled, year-old note about that task’s difficulty must have warned about– the outlined approach works for source arguments, but is not as straightforward for destination arguments.

I don’t have immediate access to an x86_64/Linux environment, but I would like to know if the re-targeted code compiles on that platform. The entire point of a re-targeter is to run code on a different platform, though I suppose alternate operating systems on the same CPU architecture is another interpretation.

So, of course the re-targeter is super-naive in its current form. It only implements enough instructions to handle that one function in function.txt. It does not handle branching at all. In order to do so, each of the instruction emitters would also need to output the code that adjusts the appropriate flags after each arithmetic instruction. Then, a branch would map to a simple goto with the correct address label.

Related Posts

Implementing The Re-targeter

It was nearly a year ago that I tried my hand at writing a re-targeter — a program that can take machine opcodes and automatically translate them into a portable C program, which certainly sounds simple and intuitive enough. I was really quite busy last year about this time and I don’t remember how I found time for the re-targeter experiment in the first place. But it looks like I had time to write up some notes that I never fleshed out and published. It was hard enough just to locate the old source code. I was completely surprised to find that I had actually managed to write the re-targeter in Python; I had no idea I knew so much of that language (which, granted, isn’t much).

Here are some of the problems I encountered when I took a stab at writing a re-targeter; let’s see if I can remember the specifics a year later:

Continue reading