Reverse Engineering Italian Literature

Some time ago, Diego “Flameeyes” Pettenò tried his hand at reverse engineering a set of really old CD-ROMs containing even older Italian literature. The goal of this RE endeavor would be to extract the useful literature along with any structural metadata (chapters, etc.) and convert it to a more open format suitable for publication at, e.g., Project Gutenberg or Archive.org.

Unfortunately, the structure of the data thwarted the more simplistic analysis attempts (like inspecting for blocks of textual data). This will require deeper RE techniques. Further frustrating the effort, however, is the fact that the binaries that implement the reading program are written for the now-archaic Windows 3.1 operating system.
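
For reference, the kind of simplistic scan mentioned above can be sketched in a few lines of Python. This is only an illustration, not the tool that was actually used: the function name `text_runs` and the 8-byte threshold are assumptions, and it only looks for printable ASCII, so accented Italian characters stored in a legacy code page would break up a run.

```python
import re


def text_runs(data: bytes, min_len: int = 8):
    """Yield (offset, text) for each run of at least min_len printable ASCII bytes."""
    # 0x20..0x7e covers the printable ASCII range (space through tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.start(), match.group().decode("ascii")
```

If the literature were stored as plain text, a scan like this would turn up long runs of readable sentences. When the data is packed or encoded in some way, it finds little or nothing.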

In pursuit of this RE goal, I recently thought of a way to glean more intelligence using DOSBox.

Prior Work
There are 6 discs in the full set (distributed along with 6 sequential issues of a print magazine named L’Espresso). Analysis of the contents of the various discs reveals that many of the files are the same on each disc. It was straightforward to identify the set of files which are unique on each disc. These files all end with the extension “LZn”, where n = 1..6 depending on the disc number. Further, the root directory of each disc has a file indicating the sequence number (1..6) of the CD. Obviously, these are the interesting targets.
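
One way to find the per-disc unique files is to hash every file and keep those whose contents show up on only one disc. The sketch below assumes each disc has been copied into its own directory. The function names are hypothetical, and comparing by SHA-256 digest is my choice here, not necessarily how the original comparison was done.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large files are not loaded all at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def unique_files_per_disc(disc_roots):
    """Map each disc root to the relative paths whose contents appear on no other disc."""
    seen = defaultdict(set)  # digest -> set of disc roots holding that content
    index = {}               # disc root -> list of (relative path, digest)
    for root in disc_roots:
        root = Path(root)
        entries = []
        for path in sorted(root.rglob("*")):
            if path.is_file():
                digest = file_digest(path)
                entries.append((path.relative_to(root).as_posix(), digest))
                seen[digest].add(root)
        index[root] = entries
    return {
        str(root): [rel for rel, digest in entries if len(seen[digest]) == 1]
        for root, entries in index.items()
    }
```

Calling `unique_files_per_disc` with the six disc directories returns, for each disc, the files that exist nowhere else in the set. Comparing by content, not by name, also catches files that share a name but differ between discs.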

The LZ file extensions stand out to an individual skilled in the art of compression– could it be a variation of the venerable LZ compression? That’s actually unlikely because LZ — also seen as LIZ — stands for Letteratura Italiana Zanichelli (Zanichelli’s Italian Literature).

The Unix ‘file’ command was of limited utility, unable to plausibly identify any of the files.

Progress was stalled.

Saying Hello To An Old Frenemy
I have been showing this screenshot to younger coworkers to see if any of them recognize it:


DOSBox running Windows 3.1

Not a single one has seen it before. Senior computer citizen status: Confirmed.

I recently watched an Ancient DOS Games video about Windows 3.1 games. This episode showed Windows 3.1 running under DOSBox. I had heard this was possible but that it took a little work to get running. I had a hunch that someone else had probably already done the hard stuff so I took to the BitTorrent networks and quickly found a download that had the goods ready to go– a directory of Windows 3.1 files that just had to be dropped into a DOSBox directory and they would be ready to run.

Aside: Running OS software procured from a BitTorrent network? Isn’t that an insane security nightmare? I’m not too worried since it effectively runs under a sandboxed virtual machine, courtesy of DOSBox. I suppose there’s the risk of trojan’d OS software infecting binaries that eventually leave the sandbox.

Using DOSBox Like ‘strace’
strace is a tool available on some Unix systems, including Linux, which is able to monitor the system calls that a program makes. In reverse engineering contexts, it can be useful to monitor an opaque, binary program to see the names of the files it opens and how many bytes it reads, and from which locations. I have written examples of this before (wow, almost 10 years ago to the day; now I feel old for the second time in this post).

Here’s the pitch: Make DOSBox perform as strace in order to serve as a platform for reverse engineering Windows 3.1 applications. I formed a mental model about how DOSBox operates — abstracted file system classes with methods for opening and reading files — and then jumped into the source code. Sure enough, the code was exactly as I suspected and a few strategic print statements gave me the data I was looking for.
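
To make the idea concrete, here is a minimal Python stand-in for that kind of instrumentation. It is not DOSBox code (DOSBox is written in C++, and none of its real class or method names are used here). `TracedFile` and the file name `EXAMPLE.LZ1` are hypothetical. The point is just that every open, seek, and read passes through one place that can log it.

```python
import io


class TracedFile:
    """Wrap a binary file object and log each operation, strace-style."""

    def __init__(self, name, raw, log):
        self.name = name
        self._raw = raw
        self._log = log
        self._log.append(f'open("{name}")')

    def seek(self, offset, whence=io.SEEK_SET):
        pos = self._raw.seek(offset, whence)
        self._log.append(f'seek("{self.name}", {offset}, {whence}) = {pos}')
        return pos

    def read(self, size):
        # Record where the read starts, how much was asked for, and how much came back.
        offset = self._raw.tell()
        data = self._raw.read(size)
        self._log.append(
            f'read("{self.name}", offset={offset}, requested={size}) = {len(data)}'
        )
        return data

    def close(self):
        self._log.append(f'close("{self.name}")')
        self._raw.close()


# Demo against an in-memory 64-byte "file" (hypothetical name and contents).
log = []
f = TracedFile("EXAMPLE.LZ1", io.BytesIO(bytes(64)), log)
f.seek(16)
f.read(32)
f.read(32)
f.close()
print("\n".join(log))
```

The resulting log lists which file was touched, at which offsets, and how many bytes were requested versus returned. That is the same kind of information strace gives for a native program.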

Eventually, I even took to running DOSBox under the GNU Debugger (GDB). This hasn’t proven especially useful yet, but it has led to an absurd level of nesting:


GDB runs DOSBox runs Windows 3.1

The target application runs under Windows 3.1, which is running under DOSBox, which is running under GDB. This led to a crazy situation in which DOSBox had the mouse focus when a GDB breakpoint was triggered. At this point, DOSBox held all desktop input focus and couldn’t surrender it because GDB had halted it at the breakpoint. I had no way to interact with the Linux desktop and had to reboot the computer. The next time, I took care to only use the keyboard to navigate the application and trigger the breakpoint and not allow DOSBox to consume the mouse focus.

New Intelligence

Playing With Emscripten and ASM.js

The last 5 years or so have provided a tremendous amount of hype about the capabilities of JavaScript. I think it really kicked off when Google announced their Chrome web browser in September, 2008 along with its V8 JS engine. This seemed to spark an arms race in JS engine performance along with much hyperbole that eventually all software could, would, and/or should be written in straight JavaScript for maximum portability and future-proofing, perhaps aided by Emscripten, a tool which magically transforms C and C++ code into JS. The latest round of rhetoric comes courtesy of something called asm.js which purports to narrow the gap between JS and native code performance.

I haven’t been a believer, to express it charitably. But I wanted to be certain, so I set out to devise my own experiment to test modern JS performance.

Up Front Summary
I was extremely surprised that my experiment demonstrated JS performance FAR beyond my expectations. There might be something to these claims of magnificent JS speed in numerical applications. Basically, here were my thoughts during the process:

  • There’s no way that JavaScript can come anywhere close to C performance for a numerically intensive operation; a simple experiment should demonstrate this.
  • Here’s a straightforward C program to perform a simple yet numerically intensive operation.
  • Let’s compile the C program on gcc and get some baseline performance numbers.
  • Let’s use Emscripten to convert the C program to JavaScript and run it under Chrome.
  • Ha! Pitiful JS performance, just as I expected!
  • Try the same program under Firefox, since Firefox is supposed to have some crazy optimization for asm.js code, allegedly emitted by Emscripten.
  • LOL! Firefox performs even worse than Chrome!
  • Wait a minute… the Emscripten documentation mentioned using optimization levels for generating higher performance JS, so try ‘-O1’.
  • Umm… wow: Chrome’s performance increased dramatically! What about Firefox? Not only is Firefox faster than Chrome, it’s faster than the gcc-generated code!
  • As my faith in C is suddenly shaken to its core, I remembered to compile the gcc version with an explicit optimization level. The native C version pulled ahead of Firefox again, but the Firefox code is still close.
  • Aha! This is just desktop– but what about mobile? One of the leading arguments for converting everything to pure JavaScript is that such programs will magically run perfectly in mobile browsers. So I wager that this is where the experiment will fall over.
  • I proceed to try the same converted program on a variety of mobile platforms.
  • The mobile platforms perform rather admirably as well.
  • I am surprised.

The Experiment
I wanted to run a simple yet numerically-intensive and relevant benchmark, and something I am familiar with. I settled on JPEG image decoding. Again, I wanted to keep this simple, ideally in a single file because I didn’t know how hard it might be to deal with Emscripten. I found NanoJPEG, which is a straightforward JPEG decoder contained in a single C file.

Long Overdue MediaWiki Upgrade

What do I do? What do I do? This library book is 42 years overdue!
I admit that it’s mine, yet I can’t pay the fine,
Should I turn it in or should I hide it again?
What do I do? What do I do?

I internalized the foregoing paean to the perils of procrastination by Shel Silverstein in my formative years. It’s probably why I’ve never paid a single cent in late fees in my entire life.

However, I have been woefully negligent as the steward of the MediaWiki software that drives the world famous MultimediaWiki, the internet’s central repository of obscure technical knowledge related to multimedia. It is currently running on version 1.6 software. The latest version is 1.22.

The Story So Far
According to my records, I first set up the wiki late in 2005. I don’t know which MediaWiki release I was using at the time. I probably conducted a few upgrades in the early days, but that went by the wayside perhaps in 2007. My web host stopped allowing shell access and the MediaWiki upgrade process pretty much requires running a PHP script from a command line. Upgrade time came around and I put off the project. Weeks turned into months turned into years until, according to some notes, the wiki abruptly stopped working in July, 2011. Suddenly, there were PHP errors about “Namespace” being a reserved word.

While I finally laid out a plan to upgrade the wiki after all these years, I eventually found that the problem had been caused when my webhost upgraded from PHP 5.2 -> 5.3. I also learned of a small number of code changes that caused the problem to go away, thus kicking the can down the road once more.

Then a new problem showed up last week. I think it might be related to a new version of PHP again. This time, a few other things on my site broke, and I learned that my webhost now allows me to select a PHP version to use (with the version then set to “auto”, which didn’t yield much information). Rolling back to an earlier version of PHP might have solved the problem easily.

But NO! I made the determination that this goes no further. I want this wiki upgraded.

The Arduous Upgrade Path
There are 2 general upgrade paths I can think of:

Chrome’s New Audio Notifier

Version 32 of Google’s Chrome web browser introduced this nifty feature:


Chrome audio notifier icon

When a browser tab has an element that is producing audio, the browser’s tab shows the above audio notification icon to inform the user. I have seen that people have a few questions about this, specifically:

  1. How does this feature work?
  2. Why wasn’t this done sooner?
  3. Are other browsers going to follow suit?

Short answers: 1) Chrome offers a new plugin API that the Flash Player is now using, as are Chrome’s internal media playing facilities; 2) this feature was contingent on the new plugin infrastructure mentioned in the previous answer; 3) other browsers would require the same infrastructure support.

Longer answers follow…

Plugin History
Plugins were originally based on the Netscape Plugin API. This was developed in the mid-1990s in order to support embedding PDFs into the Netscape web browser. The NPAPI does things like providing graphics contexts for drawing and input processing, and mediating network requests through the browser’s network facilities.

What NPAPI doesn’t do is handle audio. In the early-mid 1990s, audio support was not a widespread consideration in the consumer PC arena. Due to the lack of audio API support, if a plugin wanted to play audio, it had to go outside of the plugin framework.


NPAPI plugin model

There are a few downsides to this approach:

So that last item hopefully answers the question of why it has been so difficult for NPAPI-supporting browsers to implement what seems like simple functionality, such as a per-tab audio notifier.

Plugin Future
Since Google released Chrome in an effort to facilitate advancements on the client side of the internet, they have made numerous efforts to modernize various legacy aspects of web technology. These efforts include the SPDY protocol, Native Client, WebM/WebP, and something called the Pepper Plugin API (PPAPI). This is a more modern take on the classic plugin architecture to supplant the aging NPAPI:


PPAPI plugin model

Right away, we see that the job of the plugin writer is greatly simplified. Where was this API years ago when I was writing my API jungle piece?

The Linux version of Chrome was apparently the first version that packaged the Pepper version of the Flash Player (doing so fixed an obnoxious bug in the Linux Flash Player interaction with GTK). Now, it looks like Windows and Mac have followed suit. Digging into the Chrome directory on a Windows 7 installation:

AppData\Local\Google\Chrome\Application\[version]\PepperFlash\pepflashplayer.dll

This directory exists for version 31 as well, which is still hanging around my system.

So, to re-iterate: Chrome has a new plugin API that plugins use to access the audio API. Chrome knows when the API is accessed and that allows the browser to display the audio notifier on a tab.
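
A toy model makes the mechanism easier to see. The Python sketch below is not the real PPAPI, and every class and method name in it is made up for illustration. It only shows the principle: when a plugin must ask the host for an audio stream, the host can keep a per-tab count and knows which tabs are making noise.

```python
class Browser:
    """Toy host: plugins request audio through it, so it knows which tabs are audible."""

    def __init__(self):
        self._streams = {}  # tab id -> number of currently open audio streams

    def open_audio_stream(self, tab_id):
        # The only way for a plugin to get audio is to come through here.
        self._streams[tab_id] = self._streams.get(tab_id, 0) + 1
        return AudioStream(self, tab_id)

    def _release(self, tab_id):
        self._streams[tab_id] -= 1

    def tab_is_audible(self, tab_id):
        return self._streams.get(tab_id, 0) > 0


class AudioStream:
    """Handle returned to the plugin; closing it tells the host the audio stopped."""

    def __init__(self, browser, tab_id):
        self._browser = browser
        self._tab_id = tab_id
        self._open = True

    def close(self):
        if self._open:  # ignore repeated close() calls
            self._open = False
            self._browser._release(self._tab_id)
```

In this model, a tab bar would call `tab_is_audible` to decide whether to draw the icon. Under the NPAPI arrangement, where the plugin talks to the operating system's audio layer directly, the host has no equivalent of `open_audio_stream` to observe.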

Other Browsers
What about other browsers? “Mozilla is not interested in or working on Pepper at this time. See the Chrome Pepper pages.”