I was recently processing a huge corpus of data. It went like this: For each file in a large set, run
'cmdline-tool <file>', capture the output and log results to a database, including whether the tool crashed. I wrote it in Python. I have done this exact type of the thing enough times in Python that I’m starting to notice a pattern.
Every time I start writing such a program, I always begin with using Python’s commands module because it’s the easiest thing to do. Then I always have to abandon the module when I remember the hard way that whatever ‘cmdline-tool’ is, it might run errant and try to execute forever. That’s when I import (rather, copy over) my process runner from FATE, the one that is able to kill a process after it has been running too long. I have used this module enough times that I wonder if I should spin it off into a new Python module.
Or maybe I’m going about this the wrong way. Perhaps when the data set reaches a certain size, I’m really supposed to throw it on some kind of distributed cluster rather than task it to a Python script (a multithreaded one, to be sure, but one that runs on a single machine). Running the job on a distributed architecture wouldn’t obviate the need for such early termination. But hopefully, such architectures already have that functionality built in. It’s something to research in the new year.
I guess there are also process limits, enforced by the shell. I don’t think I have ever gotten those to work correctly, though.
When I announced that I had transitioned my VP8 encoder’s status from “toy” to “working”, Jim L. lamented the loss of humorous posts about oddly encoded images output from my encoder. Not so! There are still plenty of features that I have yet to implement, each of which carries the possibility of bizarre images.
For example, I dusted off my work-in-progress intra 4×4 encoding, fixed a few of the more obvious bugs, and told the encoder to encode the first block in 4×4 mode and the rest in the usual, working, debugged 16×16 mode. The results of the first pass surprised me:
The reason this surprised me was that I intuitively expected one of 2 outcomes:
- Perfect image right away since everything is correct (very unlikely but not outside the realm of possibility)
- Total garbage with, at most, the first macroblock looking somewhat legible; this would be due to having some of the first macroblock correct but completely desynchronizing the bitstream for the purpose of decoding the rest of the coefficients.
I absolutely did not expect the first macroblock to look messed up but for the rest of the picture to look fine. For fun, I reversed the logic and encoded the first block as 16×16 and the rest with the experimental 4×4 mode:
If you examine carefully, you will see that the color planes are correct (though faint). There just isn’t much going on in the luma plane. This made sense when I noticed the encoder was encoding a blank (undefined, actually) set of luma coefficients for 4×4 mode macroblocks due to a bug. This helps to rationalize the first image as well– the first macroblock was encoding nonsense for the first macroblock which messed up the macroblocks which immediately surrounded it. Eventually, macroblock decoding got back on track when the prediction modes weren’t relying on the errantly decoded macroblocks.
After I fixed that bug, I let the 4×4 mode rip through the whole image. That’s when I got what I am terming the “dark and gritty reboot of Big Buck Bunny”:
Fortunately, this also turned out to be traceable to a pretty obvious code bug.
One day, this VP8 encoder might do the right thing while implementing all of the algorithm’s features. In the meantime, it’s at least entertaining to watch it make mistakes.