Reverse Engineering Italian Literature

Some time ago, Diego “Flameeyes” Pettenò tried his hand at reverse engineering a set of really old CD-ROMs containing even older Italian literature. The goal of this RE endeavor is to extract the literature along with any structural metadata (chapters, etc.) and convert it to a more open format suitable for publication at, e.g., Project Gutenberg or Archive.org.

Unfortunately, the structure of the data thwarted the simpler analysis attempts (like scanning for blocks of plain text), so this will require deeper RE techniques. Further frustrating the effort is the fact that the binaries that implement the reading program were written for the now-archaic Windows 3.1 operating system.

In pursuit of this RE goal, I recently thought of a way to glean more intelligence using DOSBox.

Prior Work
There are 6 discs in the full set (distributed along with 6 sequential issues of a print magazine named L’Espresso). Analysis of the discs reveals that many of the files are identical across discs, so it was straightforward to identify the files that are unique to each one. These files all end with the extension “LZn”, where n = 1..6 is the disc number. Further, the root directory of each disc has a file indicating the sequence number (1..6) of the CD. Obviously, these are the interesting targets.

The LZ file extensions stand out to an individual skilled in the art of compression: could it be a variation of the venerable LZ compression? That’s actually unlikely because LZ (also seen as LIZ) stands for Letteratura Italiana Zanichelli (Zanichelli’s Italian Literature).

The Unix ‘file’ command was of limited utility, unable to plausibly identify any of the files.

Progress was stalled.

Saying Hello To An Old Frenemy
I have been showing this screenshot to younger coworkers to see if any of them recognize it:


DOSBox running Windows 3.1

Not a single one has seen it before. Senior computer citizen status: Confirmed.

I recently watched an Ancient DOS Games video about Windows 3.1 games. This episode showed Windows 3.1 running under DOSBox. I had heard this was possible but that it took a little work to get running. I had a hunch that someone else had probably already done the hard stuff, so I took to the BitTorrent networks and quickly found a download that had the goods ready to go: a directory of Windows 3.1 files that just had to be dropped into a DOSBox directory and they would be ready to run.

Aside: Running OS software procured from a BitTorrent network? Isn’t that an insane security nightmare? I’m not too worried since it effectively runs under a sandboxed virtual machine, courtesy of DOSBox. I suppose there’s the risk of trojan’d OS software infecting binaries that eventually leave the sandbox.

Using DOSBox Like ‘strace’
strace is a tool available on some Unix systems, including Linux, which is able to monitor the system calls that a program makes. In reverse engineering contexts, it can be useful to monitor an opaque, binary program to see the names of the files it opens and how many bytes it reads, and from which locations. I have written examples of this before (wow, almost 10 years ago to the day; now I feel old for the second time in this post).

Here’s the pitch: Make DOSBox perform as strace in order to serve as a platform for reverse engineering Windows 3.1 applications. I formed a mental model about how DOSBox operates — abstracted file system classes with methods for opening and reading files — and then jumped into the source code. Sure enough, the code was exactly as I suspected and a few strategic print statements gave me the data I was looking for.
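To make the idea concrete, here is a minimal Python sketch of the same technique: wrap a file object so that every read and seek is logged before being passed through. This is not DOSBox's actual code (DOSBox is written in C++, and its class and method names are not reproduced here); `TracedFile` is a hypothetical name, and the log format simply mimics the trace lines shown later in this post.

```python
import io


class TracedFile:
    """Wrap a binary file object and log each read and seek, strace-style.

    Hypothetical illustration only; not the DOSBox file classes.
    """

    def __init__(self, name, fileobj, log=print):
        self.name = name
        self._f = fileobj
        self._log = log

    def read(self, size):
        pos = self._f.tell()          # position before the read
        data = self._f.read(size)
        self._log(
            f"=== Read('{self.name}'): req {size} bytes; "
            f"read {len(data)} bytes from pos 0x{pos:X}"
        )
        return data

    def seek(self, offset, whence=0):
        self._log(
            f"=== Seek('{self.name}'): seek {offset} (0x{offset:X}) bytes, "
            f"type {whence}"
        )
        return self._f.seek(offset, whence)


# Demo on an in-memory stand-in (1024 zero bytes, not real disc contents).
lines = []
traced = TracedFile(r"LIZ98\DBBIBLIO.LZ1", io.BytesIO(bytes(1024)), log=lines.append)
traced.read(337)
traced.read(337)
print("\n".join(lines))
```

The point of the wrapper approach is that the program being studied never has to cooperate: everything it asks of the file system passes through one choke point.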

Eventually, I even took to running DOSBox under the GNU Debugger (GDB). This hasn’t proven especially useful yet, but it has led to an absurd level of nesting:


GDB runs DOSBox runs Windows 3.1

The target application runs under Windows 3.1, which is running under DOSBox, which is running under GDB. This led to a crazy situation in which DOSBox had the mouse focus when a GDB breakpoint was triggered. At this point, DOSBox held all desktop input focus and couldn’t surrender it because the debugger had paused it. I had no way to interact with the Linux desktop and had to reboot the computer. The next time, I took care to use only the keyboard to navigate the application and trigger the breakpoint, never allowing DOSBox to capture the mouse.

New Intelligence

By instrumenting the local file class (virtual HD files) and the ISO file class (CD-ROM files), I was able to watch which programs and dynamic libraries are loaded and which data files the code cares about. I was able to determine that the most interesting programs are called LEGGENDO.EXE (‘reading’) and LEGGENDA.EXE (‘legend’; this has been a great Italian lesson as well as RE puzzle). The former launches the latter, which displays this view of the data we are trying to get at:


LIZ: Authors index

When first run, the program takes an interest in a file called DBBIBLIO (‘database library’, I suspect):

=== Read('LIZ98\DBBIBLIO.LZ1'): req 337 bytes; read 337 bytes from pos 0x0
=== Read('LIZ98\DBBIBLIO.LZ1'): req 337 bytes; read 337 bytes from pos 0x151
=== Read('LIZ98\DBBIBLIO.LZ1'): req 337 bytes; read 337 bytes from pos 0x2A2
[...]

While our earlier cursory investigation could not sort out all of the data files, a few things were obvious. This file looked like it contained 336-byte records. It turns out I was off by one: the records are actually 337 bytes each (0x151, the spacing between the read positions in the trace). The number of records read from disc equals the number of items shown in the UI.
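The record size can be checked mechanically from the trace. The following is a small sketch, assuming the log line format shown above; `parse_reads` and `fixed_record_size` are illustrative names of my own, and the contiguity check assumes the records start at offset 0.

```python
import re

# Matches the "=== Read(...)" trace lines shown in this post.
READ_RE = re.compile(
    r"=== Read\('(?P<name>[^']+)'\): req (?P<req>\d+) bytes; "
    r"read (?P<got>\d+) bytes from pos 0x(?P<pos>[0-9A-Fa-f]+)"
)


def parse_reads(log_text):
    """Return (name, requested, read, position) tuples for each Read line."""
    reads = []
    for line in log_text.splitlines():
        m = READ_RE.search(line)
        if m:
            reads.append(
                (m["name"], int(m["req"]), int(m["got"]), int(m["pos"], 16))
            )
    return reads


def fixed_record_size(reads):
    """Return the record size if all reads are equal-sized and back-to-back
    starting at offset 0; otherwise return None."""
    sizes = {got for _, _, got, _ in reads}
    if len(sizes) != 1:
        return None
    size = sizes.pop()
    contiguous = all(pos == i * size for i, (_, _, _, pos) in enumerate(reads))
    return size if contiguous else None


SAMPLE = r"""
=== Read('LIZ98\DBBIBLIO.LZ1'): req 337 bytes; read 337 bytes from pos 0x0
=== Read('LIZ98\DBBIBLIO.LZ1'): req 337 bytes; read 337 bytes from pos 0x151
=== Read('LIZ98\DBBIBLIO.LZ1'): req 337 bytes; read 337 bytes from pos 0x2A2
"""

reads = parse_reads(SAMPLE)
print(fixed_record_size(reads))  # 337, since 0x151 == 337 and 0x2A2 == 674
```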

Next, the program is interested in a few more files:

*** isoFile(): 'DEPOSITO\BLOKCTC.LZ1', offset 0x27D6000, 2911488 bytes large
=== Read('DEPOSITO\BLOKCTC.LZ1'): req 96 bytes; read 96 bytes from pos 0x0
*** isoFile(): 'DEPOSITO\BLOKCTX0.LZ1', offset 0x2A9D000, 17152 bytes large
=== Read('DEPOSITO\BLOKCTX0.LZ1'): req 128 bytes; read 128 bytes from pos 0x0
=== Seek('DEPOSITO\BLOKCTX0.LZ1'): seek 384 (0x180) bytes, type 0
=== Read('DEPOSITO\BLOKCTX0.LZ1'): req 256 bytes; read 256 bytes from pos 0x180
=== Seek('DEPOSITO\BLOKCTC.LZ1'): seek 1152 (0x480) bytes, type 0
=== Read('DEPOSITO\BLOKCTC.LZ1'): req 32 bytes; read 32 bytes from pos 0x480
=== Read('DEPOSITO\BLOKCTC.LZ1'): req 1504 bytes; read 1504 bytes from pos 0x4A0
[...]

Eventually, it becomes obvious that BLOKCTC has the juicy meat. There are 32-byte records followed by variable-length encoded text sections. Since there is no readable text to be found in these files, the text is either compressed, encrypted, or both. Some rough counting (the program seems to disable copy/paste, which thwarts more precise counting) indicates that the displayed text is larger than the data chunks being read from disc, so compression seems likely. Encryption isn’t out of the question (especially since the program deems it necessary to disable copy and pasting of this public domain literary data), and if it’s in use, the key is presumably being read from one of these files.
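One cheap sanity check on the "no plain text in here" observation is to measure the byte entropy of a chunk's payload. Below is a minimal sketch; the 32-byte header size comes from the trace above, but the header's internal layout is unknown, so `split_chunk` (a hypothetical helper) only slices it off. Note the limits of this test: high entropy is consistent with both compression and encryption and cannot tell them apart, and estimates from chunks as short as 1504 bytes are rough.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Byte entropy in bits per byte: 0.0 for constant data, 8.0 when all
    256 byte values are equally frequent."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def split_chunk(chunk: bytes, header_size: int = 32):
    """Split a chunk into (header, payload). The 32-byte default mirrors the
    reads seen in the trace; what the header means is not known."""
    return chunk[:header_size], chunk[header_size:]


# Synthetic stand-ins, not real disc data.
plain = b"Nel mezzo del cammin di nostra vita " * 40
uniform = bytes(range(256)) * 4

header, payload = split_chunk(bytes(32) + uniform)
print(f"plain text : {shannon_entropy(plain):.2f} bits/byte")
print(f"payload    : {shannon_entropy(payload):.2f} bits/byte")
```

Uncompressed natural-language text scores well below 8 bits per byte, so a payload that scores near the maximum is at least not stored as plain text.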

Blocked On Disassembly
So I’m a bit blocked right now. I know exactly where the data lives, but it’s clear that I need to reverse engineer some binary code. The big problem is that I have no idea how to disassemble Windows 3.1 binaries. These are NE-type executable files. Disassemblers abound for MZ files (MS-DOS executables) and PE files (executables for Windows 95 and beyond). NE files get no respect. It’s difficult (but not impossible) to even find data about the format anymore, and the details are incomplete. It should be noted, however, that the DOSBox-as-strace method described here lends insight into how Windows 3.1 processes NE-type EXEs. You can’t get any more authoritative than that.
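For reference, telling these formats apart is the easy part. As far as I know, NE and PE files both begin with an ordinary MZ stub whose 32-bit little-endian field at offset 0x3C points to the "new" header, which starts with the signature bytes "NE" or "PE\0\0". The sketch below relies on that layout; `exe_kind` is an illustrative name, and the code identifies the container only, it does not parse segments or disassemble anything.

```python
import struct


def exe_kind(data: bytes) -> str:
    """Classify a DOS/Windows executable image by its header signatures."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return "not MZ"
    # Offset of the new-style header, stored at 0x3C in the MZ header.
    (new_header,) = struct.unpack_from("<I", data, 0x3C)
    sig = data[new_header:new_header + 4]
    if sig == b"PE\0\0":
        return "PE"
    if sig[:2] == b"NE":
        return "NE"
    if sig[:2] in (b"LE", b"LX"):
        return "LE/LX"
    return "MZ"  # plain MS-DOS executable; the 0x3C field is not meaningful


# Synthetic header for demonstration (not a real program).
stub = bytearray(0x80)
stub[0:2] = b"MZ"
struct.pack_into("<I", stub, 0x3C, 0x40)
stub[0x40:0x42] = b"NE"
print(exe_kind(bytes(stub)))
```

Running this over LEGGENDO.EXE and LEGGENDA.EXE would be a quick way to confirm which binaries actually need an NE-capable disassembler.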

So far, I have tried the freeware version of IDA Pro. Unfortunately, I haven’t been able to get the program to work on my Windows machine for a long time. Even if I could, I can’t find any evidence that it actually supports NE files (the free version specifically mentions MZ and PE, but does not mention NE or LE).

I found an old copy of Borland’s beloved Turbo Assembler and Debugger package. It has Turbo Debugger for Windows, both regular and 32-bit versions. Unfortunately, the normal version just hangs Windows 3.1 in DOSBox. The 32-bit Turbo Debugger loads just fine but can’t load the NE file.

I’ve also wondered if DOSBox contains any advanced features for trapping program execution and disassembling. I haven’t looked too deeply into this yet.

Future Work
NE files seem to be the executable format that time forgot. I have a crazy brainstorm about repacking NE files as MZ executables so that they could be taken apart with an MZ disassembler. But this will take some experimenting.

If anyone else has any ideas about ripping open these binaries, I would appreciate hearing them.

And I guess I shouldn’t be too surprised to learn that all the literature in this corpus is already freely available and easily downloadable anyway. But you shouldn’t be too surprised if that doesn’t discourage me from trying to crack the format that’s keeping this particular copy of the data locked up.

15 thoughts on “Reverse Engineering Italian Literature”

  1. Luke

    Full version of IDA definitely supports it. So you can either pirate it or upload the files and someone with a copy can send you an IDC…

  2. blitter

    Have a beater PC lying around? Why not install Windows 3.1 and try the normal version of Turbo Debugger there?

  3. Diego Elio Pettenò

    Thanks for keeping at this! I wonder how difficult it would be for me to write a disassembler for NE. I have written parsers for MZ and ELF before, but never NE, and never wrote disassemblers. I guess I have something to look forward to writing now ;)

    Also, yes most of the corpus is freely available, among others on Gutenberg, but not all of it. The main reason why I went back to look at this was that I was looking for a Pirandello story that I couldn’t find there. And the other problem is worse… old Italian is sometimes slightly mismatched to modern (think of the way your country’f fatherf ufed to write their effes), and I’ve been told (although I have not checked first hand) that at least a couple of Gutenberg-hosted “retypes” from the books were trying to correct typos by changing words around.

    So yes I would still be extremely grateful to you if you figure out how to unlock this corpus!

  4. Kostya

    Heh, I remember _installing_ Windows 3.11 on some machines. BTW, aren’t “Dare To Dream” and “Palace Of Deceit” crappy point and click adventures from Epic? I even remember seeing a floppy on which Skifree came (but I liked Chip’s Challenge more).

    As for the project itself – why not use strace on some text to see where it reads it from, then you probably can search for pointers to that file inside index and maybe encryption too (it’s usually easy to recognize when you have a known plaintext). I’d suspect some filesystem-like organisation for those files too.

  5. Cd-MaN

    Perhaps a simpler way: have you considered extracting the text by automating the interface through the Win16 API? I.e., send it the proper keypresses / clicks to open up each section and then use something like GetWindowText to extract the actual content (which should work even if copying is disabled)?

  6. Multimedia Mike Post author

    @Luke: Thanks for the tip. I’ll be on the lookout for someone with a full copy of IDA Pro.

    @Blitter: I do have a spare PC (many, in fact) lying around. I was planning to install Windows 95, 98, or XP on it, mostly for older games, and I expect Windows 3.1 programs to work as well. So that’s another avenue I’m pursuing.

    @Diego: I expect that writing the format parser will be far easier than writing the disassembler, particularly for 16-bit x86 ASM (I’m trying to remember segment:offset addressing). But I think it might be possible to transcode the NE file to MZ by rewriting the header and inserting function stubs where the Windows 3.1 function calls would be. Of course the program wouldn’t run, but I expect an MZ disassembler to be able to produce a deadlisting, which is what I really, really want.

  7. Multimedia Mike Post author

    @Owen: I tried OpenWatcom during this exercise. I downloaded the binary Linux blob and it just segfaulted when run. A bit of a dead end for now.

    @Kostya: I was a big fan of Chip’s Challenge as well. Per your suggestions– Are you talking about scanning the DOSBox program memory for the plaintext? Remember, I can’t actually use strace directly on this target program. I’ve pretty well determined the overall structure of the relevant files; now I’m trying to decompress / decrypt / decode the data chunks inside.

    @Cd-MaN: I’ve been exposed to that technique (Mans suggested it on Diego’s original blog post linked above). It sounds like a lot of work, though. Sure, what I’m doing has been a lot of work as well. I guess the difference is that wiring up the UI calls and capturing data that way would involve a lot of manual effort as I would have to march through the entire corpus on all 6 discs. It strikes me as error-prone (I’m known to make a lot of mistakes).

  8. Phil

    If you can get the executables online (or email them to me) I can run them through my copy of IDA Pro and see what comes out. I’ve used it before on NE files and it produces readable output just fine.

  9. Diego Elio Pettenò

    Mike, disassembly is easier once you have just the instructions — and given I wrote an 8086 emulator at some point, maybe I can write a disassembler at this :D

    Kostya, yes Dare to Dream was a terribly crappy shareware adventure, I had it as well. Likely in the same exact disk as whoever installed it there, as it came with Castle of the Winds, which was a very pleasing (for an 8 years old who didn’t speak any English) hack-n-slash RPG …

  10. David

    I know V-Com’s Sourcer 8 will disassemble NE files, just be prepared for a huge dump though. (Older versions may as well)

    winp -datfile win31.dat
    sr .wdf

    The only issue that may arise, is if the exe is really complex it can run out of memory.

  11. David

    winp -datfile win31.dat (file.exe)
    sr (file).wdf

    @Mike I used angle brackets in my last comment, might want to check that.

  12. Kostya

    Mike, I did not mean automatic comparison in that case, but you know what it reads from those files and what you get on screen.

    And some of us use leaked IDA 6.1 available in the usual places (search for ida 6.1 rdw if you dare).

    Also despite the name I wouldn’t be surprised if they’ve employed some LZ77-based scheme for compression.

  13. Multimedia Mike Post author

    Meanwhile, I am chasing up another possible method: Turns out that DOSBox has an integrated debugger after all. It’s little known and hard to use, but it’s there. I’m trying to make it useful.

Comments are closed.