Category Archives: Programming

Adventures in Unicode

Tangential to multimedia hacking is proper metadata handling. Recently, I have developed an interest in processing a large corpus of multimedia files that are likely to contain metadata strings falling outside the lower ASCII set. This is significant because the lower ASCII set intersects perfectly with my own programming comfort zone. Indeed, all of my programming life, I have insisted on covering my ears and loudly asserting “LA LA LA LA LA! ALL TEXT EVERYWHERE IS ASCII!” I suspect I’m not alone in this.

Thus, I took this as an opportunity to conquer my longstanding fear of Unicode. I developed a self-learning course consisting of a series of exercises which add up to this diagram:



Part 1: Understanding Text Encoding
Python (2.x, which is what I am using here) has regular byte strings by default and then it has Unicode strings. The latter are prefixed by the letter ‘u’. This is what ‘ö’ looks like encoded in each type.

>>> 'ö', u'ö'
('\xc3\xb6', u'\xf6')

A large part of my frustration with Unicode comes from Python yelling at me about UnicodeDecodeErrors and an inability to handle the number 0xc3 for some reason. This usually comes when I’m trying to wrap my head around an unrelated problem and don’t care to get sidetracked by text encoding issues. However, when I studied the above output, I finally understood where the 0xc3 comes from. I just didn’t understand what the encoding represents exactly.
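
To make the 0xc3 concrete, here is a minimal sketch in Python 3 (an assumption of this example; the session above is Python 2, where a plain string literal holds raw bytes). It shows the same character as a code point, as UTF-8 bytes, and as Latin-1 bytes, and then shows why an ASCII decode of the UTF-8 bytes fails. The helper name `decodes_as_ascii` is made up for illustration.

```python
# Python 3 sketch: one character, its code point, and two byte encodings.
ch = "\u00f6"  # the letter o-umlaut, written as an escape so the file's own encoding does not matter

code_point = ord(ch)                  # the Unicode code point, 0xf6
utf8_bytes = ch.encode("utf-8")       # two bytes: 0xc3 0xb6
latin1_bytes = ch.encode("latin-1")   # one byte: 0xf6


def decodes_as_ascii(data):
    """Return True if the bytes are valid ASCII, False on UnicodeDecodeError."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


print(hex(code_point), utf8_bytes, latin1_bytes, decodes_as_ascii(utf8_bytes))
```

The ASCII decode fails on the very first byte, 0xc3, because ASCII only defines values up to 0x7f. That byte is the lead byte of the two-byte UTF-8 sequence for U+00F6.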

ANSI Code Coverage Followup

The people behind sixteencolors.net noticed my code coverage project concerning the ANSI video decoder and asked what they could do to help. I had already downloaded 350 / 4000 of their artpacks but didn’t want to download the remainder if I could avoid it. They offered to run my tool against their local collection of files.

Aside: They have all of the artpacks archived at GitHub.

The full corpus of nearly 4000 artpacks contains over 146,000 files. Versus my sampling of 350 artpacks and 13,000 files that covered all but 45 lines of the ansi.c source file, the full corpus has files to exercise… 6 more of those lines. Whee. This means that there are files which exercise the reverse and concealed attributes, all 3 “erase in line” modes, and one more error path (which probably wasn’t a valid file anyway).

Missing features mostly cluster around different video modes, including: 320×200 (25 rows), 640×200 (25 rows), 640×350 (43 rows), and 640×480 (60 rows); on the plus side, nothing tripped the “unsupported screen mode” case. There are no files that switch modes during playback.

I guess statistical sampling theory holds up here: a small set of randomly chosen files would do a fine job covering code. But this experiment is about finding the statistical outliers.

Finding Optimal Code Coverage

A few months ago, I published a procedure for analyzing code coverage of the test suites exercised in FFmpeg and Libav. I used it to add some more tests and I have it on good authority that it has helped other developers fill in some gaps as well (beginning with students helping out with the projects as part of the Google Code-In program). Now I’m wondering about ways to do better.

Current Process
When adding a test that depends on a sample (like a demuxer or decoder test), it’s ideal to add a sample that A) is small, and B) exercises as much of the codebase as possible. When I was studying code coverage statistics for the WC4-Xan video decoder, I noticed that the sample didn’t exercise one of the 2 possible frame types. So I scouted samples until I found one that covered both types, trimmed the sample down, and updated the coverage suite.

I started wondering about a method for finding the optimal test sample for a given piece of code, one that exercises every code path in a module. Okay, so that’s foolhardy in the vast majority of cases (although I was able to add one test spec that pushed a module’s code coverage from 0% all the way to 100% — but the module in question only had 2 exercisable lines). Still, given a large enough corpus of samples, how can I find the smallest set of samples that exercise the complete codebase?

This almost sounds like an NP-complete problem. But why should that stop me from trying to find a solution?

Science Project
Here’s the pitch:

  • Instrument FFmpeg with code coverage support
  • Download lots of media to exercise a particular module
  • Run FFmpeg against each sample and log code coverage statistics
  • Distill the resulting data in some meaningful way in order to obtain more optimal code coverage

That second step sounds harsh: downloading lots and lots of media. Fortunately, there is at least one multimedia format supported by the projects that tends to be extremely small: ANSI. These are files that are designed to display elaborate scrolling graphics using text mode. Further, the FATE sample currently deployed for this test (TRE_IOM5.ANS) only exercises a little less than 50% of the code in libavcodec/ansi.c. I believe this makes the ANSI video decoder a good candidate for this experiment.

Procedure
First, find a site that hosts a lot of ANSI files. Hi, sixteencolors.net. This site has lots of artpacks (on the order of 4000), which are ZIP archives that contain multiple ANSI files (and sometimes some other files). I scraped a list of all the artpack names.

In an effort to be responsible, I randomized the list of artpacks and downloaded periodically and with limited bandwidth ('wget --limit-rate=20k').
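
The download step could be planned along these lines. This is only a sketch: the artpack names and the base URL below are placeholders, not the site's real layout, and `plan_downloads` is a made-up helper. The one real piece is wget's `--limit-rate` option, taken from the command above.

```python
import random


def plan_downloads(artpack_names, base_url, rate="20k", seed=None):
    """Return one wget argument list per artpack, in randomized order."""
    names = list(artpack_names)
    random.Random(seed).shuffle(names)  # randomize the list, as described above
    return [["wget", "--limit-rate=" + rate, base_url + name] for name in names]


# Hypothetical names and URL; the real list came from scraping the site.
commands = plan_downloads(
    ["pack-a.zip", "pack-b.zip", "pack-c.zip"],
    "https://example.invalid/packs/",
    seed=1,
)
for argv in commands:
    print(" ".join(argv))
    # To actually fetch: subprocess.run(argv, check=True), followed by a
    # time.sleep(...) pause so the downloads stay periodic.
```

Building the commands separately from running them makes it easy to inspect the plan before any traffic hits the server.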

Run ‘gcov’ on ansi.c in order to gather the full set of line numbers to be covered.

For each artpack, unpack the contents, run the instrumented FFmpeg on each file inside, run ‘gcov’ on ansi.c, and log statistics including the file’s size, the file’s location (artpack.zip:filename), and a comma-separated list of line numbers touched.
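
As a rough illustration of the logging step, the sketch below pulls line numbers out of a gcov report. It relies on the usual `.gcov` text layout of `count:lineno:source`, where a count of `-` marks a non-executable line and `#####` marks an executable line that never ran. The tab-separated stats line is a made-up format for this example, not necessarily what the actual script logged, and the sample report is invented.

```python
def parse_gcov(text):
    """Return (executable, executed) sets of line numbers from a .gcov report."""
    executable, executed = set(), set()
    for row in text.splitlines():
        parts = row.split(":", 2)
        if len(parts) < 3:
            continue  # not a count:lineno:source row
        count, lineno = parts[0].strip(), parts[1].strip()
        if count == "-" or not lineno.isdigit():
            continue  # header row or non-executable source line
        number = int(lineno)
        executable.add(number)
        if not (count.startswith("#") or count.startswith("=")):
            executed.add(number)  # a numeric count means the line ran
    return executable, executed


def stats_line(size, location, executed):
    """Hypothetical log format: size, artpack.zip:filename, line list (tab-separated)."""
    lines = ",".join(str(n) for n in sorted(executed))
    return "\t".join([str(size), location, lines])


# Invented report fragment for illustration.
sample = """\
        -:    0:Source:libavcodec/ansi.c
        -:   10:/* comment */
        3:   11:    x = 1;
    #####:   12:    y = 2;
        1:   13:    return x;
"""
executable, executed = parse_gcov(sample)
print(stats_line(1234, "pack.zip:FILE.ANS", executed))
```

Running `parse_gcov` once on a report from any run yields the full set of executable lines, since that set does not depend on which sample was decoded.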

Definition of ‘Optimal’
The foregoing procedure worked and yielded useful, raw data. Now I have to figure out how to analyze it.

I think it’s most desirable to have the smallest files (in terms of bytes) that exercise the most lines of code. To that end, I sorted the results by filesize, ascending. A Python script initializes a set of all exercisable line numbers in ansi.c, then iterates through each file’s stats line, adding the file to the list of candidate samples if its set of exercised lines can remove any line numbers from the overall set of lines. Ideally, that set of lines should reduce to an empty set.
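
A minimal sketch of that pass, assuming each stats line has already been parsed into a `(size, location, lines)` tuple. The tuple layout, the function name `pick_samples`, and the sample data are all invented for illustration.

```python
def pick_samples(samples, all_lines):
    """Greedy pass over samples sorted by size, smallest first.

    samples: iterable of (size_in_bytes, location, set_of_line_numbers).
    Returns (chosen_locations, lines_still_uncovered).
    """
    remaining = set(all_lines)
    chosen = []
    for size, location, lines in sorted(samples, key=lambda s: s[0]):
        if remaining & lines:  # this file touches something not yet covered
            chosen.append(location)
            remaining -= lines
        if not remaining:
            break  # everything is covered; no need to look at bigger files
    return chosen, remaining


# Made-up data: four files and five exercisable lines.
samples = [
    (900, "b.zip:BIG.ANS", {1, 2, 3, 4}),
    (100, "a.zip:TINY.ANS", {1, 2}),
    (300, "a.zip:MID.ANS", {2, 3}),
    (200, "c.zip:DUP.ANS", {1}),
]
chosen, uncovered = pick_samples(samples, {1, 2, 3, 4, 5})
print(chosen, uncovered)
```

With this data the pass keeps TINY, MID, and BIG and leaves line 5 uncovered, even though BIG alone covers everything the other two do. A single smallest-first pass favors small files but does not guarantee the smallest possible set.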

I think a second possible approach is to find the single sample that exercises the most code and then proceed with the previously described method.
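
That second approach might look something like the following sketch, again assuming hypothetical `(size, location, lines)` tuples and made-up data. It seeds the candidate list with the single sample covering the most lines (ties going to the smaller file, which is a choice made here for illustration), then makes the smallest-first pass over the rest.

```python
def pick_samples_seeded(samples, all_lines):
    """Seed with the highest-coverage sample, then sweep the rest by size.

    samples: iterable of (size_in_bytes, location, set_of_line_numbers).
    Returns (chosen_locations, lines_still_uncovered).
    """
    samples = list(samples)
    remaining = set(all_lines)
    chosen = []
    ordered = []
    if samples:
        # Most lines covered wins; on a tie, prefer the smaller file.
        seed = max(samples, key=lambda s: (len(s[2] & remaining), -s[0]))
        rest = sorted((s for s in samples if s is not seed), key=lambda s: s[0])
        ordered = [seed] + rest
    for size, location, lines in ordered:
        if remaining & lines:  # only keep files that add new coverage
            chosen.append(location)
            remaining -= lines
    return chosen, remaining


# Made-up data: four files and five exercisable lines.
samples = [
    (900, "b.zip:BIG.ANS", {1, 2, 3, 4}),
    (100, "a.zip:TINY.ANS", {1, 2}),
    (300, "a.zip:MID.ANS", {2, 3}),
    (200, "c.zip:DUP.ANS", {1}),
]
chosen, uncovered = pick_samples_seeded(samples, {1, 2, 3, 4, 5})
print(chosen, uncovered)
```

On this toy data the seeded variant selects only BIG, trading total bytes for a shorter list of samples. Which trade is better depends on whether file count or file size matters more for the test suite.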

Initial Results
So far, I have analyzed 13324 samples from 357 different artpacks provided by sixteencolors.net.

How Many Default Languages?

I was thinking back to my childhood, when my family first owned a computer. It was an MS-DOS-powered IBM PC. The default OS came with 2 programming environments, such as they were: GW-BASIC and batch files. It was a start, I suppose. I guess most any microcomputer you can name from that era came with some kind of BASIC interpreter. That defined the computer’s “out of the box” programmability.

Then I started wondering how this compares to computers (operating systems/distributions, really) these days. So I did a fresh install of the latest Ubuntu Linux release (11.10 as of this writing; x86_32) and looked for programmability (without installing anything else). This is what I came up with:

  1. gcc/C (only the C compiler; other components of the GNU compiler collection are installed separately)
  2. Perl
  3. Python
  4. C#, as furnished by Mono
  5. Bash: can’t forget about the shell as a full-featured programming language (sh is also present, but not t/csh)
  6. JavaScript: since Firefox is installed by default, JS counts
  7. GNU Assembler: thanks to Reimar for the reminder that if gcc is present, gas necessarily needs to be there as well

I checked on C++, Objective-C, Java, Ada, Fortran, Go, Lua, Ruby, Tcl, PHP, R and other languages I could think of, but the above items were the only ones present by default. At the same time, I checked my Mac OS X (10.6) box and it also has Ruby and PHP installed. It has a bunch of other languages, courtesy of Xcode, so I can’t certify anything about its out of the box programmability.
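
A check like this can be roughly automated by searching the PATH for each language's usual command. The sketch below uses Python 3's `shutil.which`. The command names in `CANDIDATES` are assumptions that vary by distribution, and a language that lives inside another program (JavaScript in Firefox, for instance) will not show up this way.

```python
import shutil

# Assumed command names; adjust for the system being surveyed.
CANDIDATES = {
    "C": "gcc",
    "C++": "g++",
    "Perl": "perl",
    "Python": "python",
    "C# (Mono)": "mono",
    "Bash": "bash",
    "GNU Assembler": "as",
    "Ruby": "ruby",
    "PHP": "php",
    "Java": "java",
    "Lua": "lua",
    "Tcl": "tclsh",
}


def installed_languages(candidates, which=shutil.which):
    """Return the sorted language names whose command is found on the PATH."""
    return sorted(lang for lang, cmd in candidates.items() if which(cmd) is not None)


print(installed_languages(CANDIDATES))
```

The `which` parameter exists only so the lookup can be swapped out, for example to test the function without depending on what the local machine has installed.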

Still, I think “embarrassment of riches” pretty well sums it up. I try not to be a crotchety old fogey complaining that kids these days don’t know how good they have it; rather, I’m genuinely excited for anyone who wants to leap into computer programming in this day and age.