Breaking Eggs And Making Omelettes

Topics On Multimedia Technology and Reverse Engineering


Archives:

ANSI Code Coverage Followup

March 8th, 2012 by Multimedia Mike

The people behind sixteencolors.net noticed my code coverage project concerning the ANSI video decoder and asked what they could do to help. I had already downloaded 350 / 4000 of their artpacks but didn’t want to download the remainder if I could avoid it. They offered to run my tool against their local collection of files.

Aside: They have all of the artpacks archived at Github.

The full corpus of nearly 4000 artpacks contains over 146,000 files. Versus my sampling of 350 artpacks and 13,000 files that covered all but 45 lines of the ansi.c source file, the full corpus has files to exercise… 6 more of those lines. Whee. This means that there are files which exercise the reverse and concealed attributes, all 3 “erase in line” modes, and one more error path (which probably wasn’t a valid file anyway).

Missing features mostly cluster around different video modes, including: 320×200 (25 rows), 640×200 (25 rows), 640×350 (43 rows), and 640×480 (60 rows); on the plus side, nothing tripped the “unsupported screen mode” case. There are no files that switch modes during playback.

I guess statistical sampling theory holds out here– a small set of randomly chosen files would do a fine job covering code. But this experiment is about finding the statistical outliers.

Posted in Programming | No Comments »

Finding Optimal Code Coverage

March 6th, 2012 by Multimedia Mike

A few months ago, I published a procedure for analyzing code coverage of the test suites exercised in FFmpeg and Libav. I used it to add some more tests and I have it on good authority that it has helped other developers fill in some gaps as well (beginning with students helping out with the projects as part of the Google Code-In program). Now I’m wondering about ways to do better.

Current Process
When adding a test that depends on a sample (like a demuxer or decoder test), it’s ideal to add a sample that’s A) small, and B) exercises as much of the codebase as possible. When I was studying code coverage statistics for the WC4-Xan video decoder, I noticed that the sample didn’t exercise one of the 2 possible frame types. So I scouted samples until I found one that covered both types, trimmed the sample down, and updated the coverage suite.

I started wondering about a method for finding the optimal test sample for a given piece of code, one that exercises every code path in a module. Okay, so that’s foolhardy in the vast majority of cases (although I was able to add one test spec that pushed a module’s code coverage from 0% all the way to 100% — but the module in question only had 2 exercisable lines). Still, given a large enough corpus of samples, how can I find the smallest set of samples that exercise the complete codebase?

This almost sounds like an NP-complete problem. But why should that stop me from trying to find a solution?

Science Project
Here’s the pitch:

  • Instrument FFmpeg with code coverage support
  • Download lots of media to exercise a particular module
  • Run FFmpeg against each sample and log code coverage statistics
  • Distill the resulting data in some meaningful way in order to obtain more optimal code coverage

That first step sounds harsh– downloading lots and lots of media. Fortunately, there is at least one multimedia format in the projects that tends to be extremely small: ANSI. These are files that are designed to display elaborate scrolling graphics using text mode. Further, the FATE sample currently deployed for this test (TRE_IOM5.ANS) only exercises a little less than 50% of the code in libavcodec/ansi.c. I believe this makes the ANSI video decoder a good candidate for this experiment.

Procedure
First, find a site that hosts a lot ANSI files. Hi, sixteencolors.net. This site has lots (on the order of 4000) artpacks, which are ZIP archives that contain multiple ANSI files (and sometimes some other files). I scraped a list of all the artpack names.

In an effort to be responsible, I randomized the list of artpacks and downloaded periodically and with limited bandwidth ('wget --limit-rate=20k').

Run ‘gcov’ on ansi.c in order to gather the full set of line numbers to be covered.

For each artpack, unpack the contents, run the instrumented FFmpeg on each file inside, run ‘gcov’ on ansi.c, and log statistics including the file’s size, the file’s location (artpack.zip:filename), and a comma-separated list of line numbers touched.

Definition of ‘Optimal’
The foregoing procedure worked and yielded useful, raw data. Now I have to figure out how to analyze it.

I think it’s most desirable to have the smallest files (in terms of bytes) that exercise the most lines of code. To that end, I sorted the results by filesize, ascending. A Python script initializes a set of all exercisable line numbers in ansi.c, then iterates through each each file’s stats line, adding the file to the list of candidate samples if its set of exercised lines can remove any line numbers from the overall set of lines. Ideally, that set of lines should devolve to an empty set.

I think a second possible approach is to find the single sample that exercises the most code and then proceed with the previously described method.

Initial Results
So far, I have analyzed 13324 samples from 357 different artpacks provided by sixteencolors.net. Read the rest of this entry »

Posted in Programming | 2 Comments »

How Many Default Languages?

January 25th, 2012 by Multimedia Mike

I was thinking back to my childhood, when my family first owned a computer. It was an MS-DOS-powered IBM PC. The default OS came with 2 programming environments, such as they were: GW-BASIC and batch files. It was a start, I suppose. I guess most any microcomputer you can name from that era came with some kind of BASIC interpreter. That defined the computer’s “out of the box” programmability.

Then I started wondering how this compares to computers (operating systems/distributions, really) these days. So I installed a fresh version of the latest Ubuntu Linux version (11.10 as of this writing; x86_32) and looked for programmability (without installing anything else). This is what I came up with:

  1. gcc/C (only the C compiler; other components of the GNU compiler collection are installed separately)
  2. Perl
  3. Python
  4. C#, as furnished by Mono
  5. Bash — can’t forget about the shell as a full-featured programming language (sh is also present, but not t/csh)
  6. JavaScript — since Firefox is installed per default, JS counts
  7. GNU Assember — thanks to Reimar for the reminder that if gcc is present, gas necessarily needs to be there as well

I checked on C++, Objective C, Java, Ada, Fortran, Go, Lua, Ruby, Tcl, PHP, R and other languages I could think of, but the above items were the only ones present by default. At the same time, I checked my Mac OS X (10.6) box and it also has Ruby and PHP installed. It has a bunch of other languages, courtesy of Xcode, so I can’t certify anything about its out of the box programmability.

Still, I think “embarrassment of riches” pretty well sums it up. I try not to be crotchety old fogey complaining that kids these days don’t know how good they have it; rather, I’m genuinely excited for anyone who wants to leap into computer programming in this day and age.

Posted in Programming | 7 Comments »

Using lcov With FFmpeg/Libav

November 20th, 2011 by Multimedia Mike

Last year, I delved into code coverage tools and their usage with FFmpeg. I learned about using GNU gcov, which is powerful but pretty raw about the details it provides to you. I wrote a script to help interpret its output and later found another script called gcovr to do the same, only much better.

I later found another tool called lcov which is absolutely amazing for understanding code coverage of your software. I’ve been meaning to use it to further FATE test coverage for the multimedia projects.



Click for larger image

Basic Instructions
Install the lcov tool, of course. In Ubuntu, 'apt-get install lcov' will do the trick.

Build the project with code coverage support, i.e.,

./configure --enable-gpl --samples=/path/to/fate/samples \
 --extra-cflags="-fprofile-arcs -ftest-coverage" \
 --extra-ldflags="-fprofile-arcs -ftest-coverage"
make

Clear the coverage data:

lcov --directory . --zerocounters

Run the software (in this case, the FATE test suite):

make fate

Let lcov work its magic:

lcov --directory . --capture --output-file coverage.info
mkdir html-output
genhtml -o html-output coverage.info

At this point, you can aim your web browser at html-output/index.html to learn everything you could possibly want to know about code coverage of the test suite. You can sort various columns in order to see which modules have the least code coverage. You can drill into individual source files and see highlighted markup demonstrating which lines have been executed.

As you can see from the screenshot above, FFmpeg / Libav are not anywhere close to full coverage. But lcov provides an exquisite roadmap.

Posted in Programming | 8 Comments »

What’s So Hard About Building?

September 9th, 2011 by Multimedia Mike

I finally had a revelation as to why so building software can be so difficult– because build systems are typically built on programming languages that you don’t normally use in your day to day programming activities. If the project is simple enough, the build system usually takes care of the complexities. If there are subtle complexities — and there always are — then you have to figure out how to customize the build system to meet your needs.

First, there’s the Makefile. It’s easy to forget that the syntax which comprises a Makefile pretty well qualifies as a programming language. I wonder if it’s Turing-complete? But writing and maintaining Makefiles manually is arduous and many systems have been created to generate Makefiles for you. At the end of the day, running ‘make’ still requires the presence of a Makefile and in the worst case scenario, you’re going to have to inspect and debug what was automatically generated for that Makefile.

So there is the widespread GNU build system, a.k.a., “the autotools”, named due to its principle components such as autoconf and automake. In this situation, you have no fewer than 3 distinct languages at work. You write your general build instructions using a set of m4 macros (language #1). These get processed by the autotools in order to generate a shell script (language #2) called configure. When this is executed by the user, it eventually generates a Makefile (language #3).

Over the years, a few challengers have attempted to dethrone autotools. One is CMake which configures a project using its own custom programming language that you will need to learn. Configuration generates a standard Makefile. So there are 2 languages involved in this approach.

Another option is SCons, which is Python-based, top to bottom. Only one programming language is involved in the build system; there’s no Makefile generated and run. Until I started writing this, I was guessing that the Python component generated a Makefile, but no.

That actually makes SCons look fairly desirable, at least if your only metric when choosing a build system is to minimize friction against rarely-used programming languages.

I should also make mention of a few others: Apache Ant is a build system in which the build process is described by an XML file. XML doesn’t qualify as a programming language (though that apparently doesn’t stop some people from using it as such). I see there’s also qmake, related to the Qt system. This system uses its own custom syntax.

Posted in Programming | 15 Comments »

Started Programming Young

September 5th, 2011 by Multimedia Mike

I have some of the strangest memories of my struggles to jump into computer programming.

Back To BASIC
I remember doing some Logo programming on Apple II computers at school in 5th grade (1987 timeframe). But that was mostly driving turtle graphics. Then I remember doing some TRS-80 BASIC in 7th grade, circa 1989. Emboldened by what very little I had learned in perhaps the week or 2 we took in a science class to do this, I tried a little GW-BASIC on my family’s “IBM-PC compatible” computer (they were still called that back then). I still remember what my first program consisted of. Even back then I was interested in manipulating graphics and color on a computer screen. Thus:

10 color 1
20 print "This is color 1"
30 color 2
40 print "This is color 2"
...

And so on through 15 colors. Hey, it did the job– it demonstrated the 15 different colors you could set in text mode.

What’s FOR For?
That 7th grade computer unit in science class wasn’t very thick on computer science details. I recall working with a lab partner to transcribe code listings into a computer (and also saving my work to a storage cassette). We also developed form processing programs that would print instructions to input text followed by an “INPUT I$” statement to obtain the user’s output.

I remember there was some situation where we needed a brief delay between input and printing. The teacher told us to use a construct of the form:

10 FOR I = 1 TO 20000
20 NEXT I

We had to calibrate the number based on our empirical assessment of how long it lasted but I recall that the number couldn’t be much higher than about 32000, for reasons that would become clearer much later.

Imagine my confusion when I would read and try to comprehend BASIC program code I would find in magazines. I would of course see that FOR..NEXT construct all over the place but obviously not in the context of introducing deliberate execution delays. Indeed, my understanding of one of the fundamental building blocks of computer programming — iteration — was completely skewed because of this early lesson.

Refactoring
Somewhere along the line, I figured out that the FOR..NEXT could be used to do the same thing a bunch of times, possibly with different values. A few years after I had written that color program, I found it again and realized that I could write it as:

10 for I = 1 to 15
20 color I
30 print I
40 next I

It still took me a few more years to sort out the meaning of WHILE..WEND, though.

Posted in Programming | 3 Comments »

The Fastest Way To Learn Assembly Language

September 3rd, 2011 by Multimedia Mike

I saw an old StackOverflow thread linked from Hacker News asking how to whether it’s worthwhile to learn assembly language and how to go about doing so. I’d like to take a stab at the last question.

The fastest way to learn an assembly language is to reverse engineer something. Seriously, start with something that you know (like a C program that you wrote yourself) and take it apart. The good news is that assembly language is very simple and you will get a lot of practice in a short amount of time with RE.

So here’s how you do it:

  • Take a simple program in C and build it with your tool chain, whether GNU gcc on Linux, Xcode on Mac, or MSVC on Windows. Also, make sure to turn on debugging symbols during compilation (this will help annotate the disassembly).
  • On Linux, use objdump: objdump -d program_binary
  • On Mac, use otool: otool -tV program_binary
  • On Windows: I admit, I’m a bit fuzzy on this one– I’m quite certain there’s a standard MSVC tool that prints the assembly listing.

Anyway, look at the disassembled code and find the main() function. Work from there. Whatever the first instruction is, look it up on Google. You’ll likely find various CPU manuals that will explain the simple operation of the instruction. Look up the next unfamiliar instruction, then the next. Trust me, you’ll become an ASM expert in no time.

Good luck!

Posted in Programming | 6 Comments »

Programming Language Levels

May 19th, 2011 by Multimedia Mike

I’ve been doing this programming thing for some 20 years now. Things sure do change. One change I ponder from time to time is the matter of programming language levels. Allow me to explain.

The 1990s
When I first took computer classes in the early 1990s, my texts would classify computer languages into 3 categories, or levels. The lower the level, the closer to the hardware; the higher the level, the more abstract (and presumably, easier to use). I recall that the levels went something like this:

  • High level: Pascal, BASIC, Logo, Fortran
  • Medium level: C, Forth
  • Low level: Assembly language

Keep in mind that these were the same texts which took the time to explain the history of computers from mainframes -> minicomputers -> a relatively recent phenomenon called microcomputers or “PCs”.

Somewhere in the mid-late 1990s, when I was at university, I was introduced to a new tier:

  • Very high level: Perl, shell scripting

I think there was some debate among my peers about whether C++ and Java were properly classified as high or very high level. The distinction between high and very high, in my observation, seemed to be that very high level languages had more complex data structures (at the very least, a hash / dictionary / associative array / key-value map) built into the language, as well as implicit memory management.

Modern Day
These days, the old hierarchy is apparently forgotten (much like minicomputers). I observe that there is generally a much simpler 2-tier classification:

  • Low level: C, assembly language
  • High level: absolutely every other programming language in wide use today

I find myself wondering where C++ and Objective-C fit in this classification scheme. Then I remember that it doesn’t matter and this is all academic.

Relevancy
I think about this because I have pretty much stuck to low-level programming all of my life, mostly due to my interest in game and multimedia-type programming. But the trends in computing have favored many higher level languages and programming paradigms. I woke up one day and realized that the kind of work I often do — lower level stuff — is not very common.

I’m not here to argue that low or high level is superior. You know I’m all about using the appropriate tool for the job. But I sometimes find myself caught between worlds, having the defend and explain one to the other.

  • On one hand, it’s not unusual for the multitudes of programmers working at the high level to gasp and wonder why I or anyone else would ever use C or assembly language for anything when there are so many beautiful high level languages. I patiently explain that those languages have to be written in some other language (at first) and that they need to run on some operating system and that most assuredly won’t be written in a high level language. For further reading, I refer them to Joel Spolsky’s great essay called Back to Basics which describes why it can be useful to know at least a little bit about how the computer does what it does at the lowest levels.
  • On the other hand, believe it or not, I sometimes have to defend the merits of high level languages to my low level brethren. I’ll often hear variations of, “Any program can be written in C. Using a high level language to achieve the same will create a slow and bloated solution.” I try to explain that the trade-off in time to complete the programming task weighed against the often-negligible performance hit of what is often an I/O-bound operation in the first place makes it worthwhile to use the high level language for a wide variety of tasks.

    Or I just ignore them. That’s actually the best strategy.

Posted in Programming | 5 Comments »

Of ctors and dtors

February 17th, 2011 by Multimedia Mike

I haven’t given up on the Sega Dreamcast programming. I was able to compile a bunch of homebrew code for the DC many years ago and I can’t make it work anymore. Again, I was working with a purpose-built, open source RTOS named KallistiOS (or KOS). I can make the programs compile but not run. I had ELF files left over from years ago which still executed. But when I tried to build new ELF files, no luck– the programs crashed before even reaching my main() function.

I found the problem: ELF files are comprised of a number of sections and 2 of these sections are named ‘.ctors’ and ‘.dtors’ which stand for constructors and destructors. The KOS RTOS performs a manual traversal of .ctors section during program initialization and this is where things go bad. The traversal code doesn’t seem to account for a .ctors section that only contains a single entry. I commented out the function that does the traversal and programs started to work, at least until it was time to exit the program and return control to the program loader. That’s when the counterpart .dtors section traversal code ran and demonstrated the same problem. I’ll exhibit the problematic code at the end of this post.

So I’m finally tinkering with Sega Dreamcast programming once again and with a slightly better grasp of software engineering than the first time I did this.

Portable and Compatible C?
If nothing else, this low-level embedded stuff exposes you to some serious toolchain arcana, the likes of which you will likely never see working strictly in the desktop arena.

Still, this exercise makes me wonder why C code from a decade ago doesn’t compile reliably now. Part of it is because gcc has gotten stricter about the syntax it will accept. In the case of this specific crashing problem, I suspect it comes down to a difference in the way the linker generates the final ELF file. I’ve written a list of items I have had to modify in the KOS codebase in order to get it to compile on more recent gcc versions. I wonder if it would be worth publishing the specifics, or if anyone would ever find the information useful? Oh, who am I kidding? Of course I’ll write it up, perhaps publish a new version of the code, if only because that’s the best chance I have of finding my own work again some years down the road.

Problematic C Code
See if this code makes any sense to you. It somehow traverse a list of 32-bit function pointers (in different directions, depending on constructors or destructors), executing each in turn. However, it appears to fall over if the list of pointers consists of a single entry.
Read the rest of this entry »

Posted in Programming, Sega Dreamcast | 6 Comments »

Heroic Defender of the Stack

January 26th, 2011 by Multimedia Mike

Problem Statement

I have been investigating stack smashing and countermeasures (stack smashing prevention, or SSP). Briefly, stack smashing occurs when a function allocates a static array on the stack and writes past the end of it, onto other local variables and eventually onto other function stack frames. When it comes time to return from the function, the return address has been corrupted and the program ends up some place it really shouldn’t. In the best case, the program just crashes; in the worst case, a malicious party crafts code to exploit this malfunction.

Further, debugging such a problem is especially obnoxious because by the time the program has crashed, it has already trashed any record (on the stack) of how it got into the errant state.

Preventative Countermeasure

GCC has had SSP since version 4.1. The computer inserts SSP as additional code when the -fstack-protector command line switch is specified. Implementation-wise, SSP basically inserts a special value (the literature refers to this as the ‘canary’ as in “canary in the coalmine”) at the top of the stack frame when entering the function, and code before leaving the function to make sure the canary didn’t get stepped on. If something happens to the canary, the program is immediately aborted with a message to stderr about what happened. Further, gcc’s man page on my Ubuntu machine proudly trumpets that this functionality is enabled per default ever since Ubuntu 6.10.

And that’s really all there is to it. Your code is safe from stack smashing by default. Or so the hand-wavy documentation would have you believe.

Not exactly

Exercising the SSP

I wanted to see the SSP in action to make sure it was a real thing. So I wrote some code that smashes the stack in pretty brazen ways so that I could reasonably expect to trigger the SSP (see later in this post for the code). Here’s what I learned that wasn’t in any documentation: Read the rest of this entry »

Posted in Programming | 3 Comments »

« Previous Entries