A long time ago, I procured a 1999 game called Clue Chronicles: Fatal Illusion, based on the classic board game Clue, a.k.a. Cluedo. At the time, I was big into collecting old, unloved PC games so that I could research obscure multimedia formats.
Surveying the 3 CD-ROMs contained in the box packaging revealed only Smacker (SMK) videos for full motion video which was nothing new to me or the multimedia hacking community at the time. Studying the mix of data formats present on the discs, I found a selection of straightforward formats such as WAV for audio and BMP for still images. I generally find myself more fascinated by how computer games are constructed rather than by playing them, and this mix of files has always triggered a strong “I could implement a new engine for this!” feeling in me, perhaps as part of the ScummVM project which already provides the core infrastructure for reimplementing engines for 2D adventure games.
Tying all of the assets together is a custom high-level programming language. I have touched on this before in a blog post over a decade ago. The scripts are in a series of files bearing the extension .ini (usually reserved for configuration scripts, but we’ll let that slide). A representative sample of such a script can be found here:
What Is This Language?
At the time I first analyzed this language, I was still primarily a C/C++-minded programmer, with a decent amount of Perl experience as a high level language, and had just started to explore Python. I assessed this language to be “mildly object oriented with C++-type comments (‘//’) and reliant upon a number of implicit library functions”. Other people saw other properties. When I look at it nowadays, it reminds me a bit more of JavaScript than C++. I think it’s sort of a Rorschach test for programming languages.
Strangely, I sort of had this fear that I would put a lot of effort into figuring out how to parse out the language only for someone to come along and point out that it’s a well-known yet academic language that already has a great deal of supporting code and libraries available as open source. Google for “spanish dolphins far side comic” for an illustration of the feeling this would leave me with.
It doesn’t matter in the end. Even if such libraries exist, how easy would they be to integrate into something like ScummVM? Time to focus on a workable approach to understanding and processing the format.
Problem Scope
So I set about to see if I can write a program to parse the language seen in these INI files. Some questions:
- How large is the corpus of data that I need to be sure to support?
- What parsing approach should I take?
- What is the exact language format?
- Other hidden challenges?
To figure out how large the data corpus is, I counted all of the INI files on all of the discs. There are 138 unique INI files between the 3 discs. However, there are 146 unique INI files after installation. This leads to a hidden challenge described a bit later.
What parsing approach should I take? I worried a bit too much that I might not be doing this the “right” way. I’m trying to ignore doubts like this, like how “SQL Shame” blocked me on a task for a little while a few years ago as I concerned myself that I might not be using the purest, most elegant approach to the problem. I know I covered language parsing a lot time ago in university computer science education and there is a lot of academic literature to the matter. But sometimes, you just have to charge in and experiment and prototype and see what falls out. In doing so, I expect to have a better understanding of the problems that need to solved and the right questions to ask, not unlike that time that I wrote a continuous integration system from scratch because I didn’t actually know that “continuous integration” was the keyword I needed.
Next, what is the exact language format? I realized that parsing the language isn’t the first and foremost problem here– I need to know exactly what the language is. I need to know what the grammar are keywords are. In essence, I need to reverse engineer the language before I write a proper parser for it. I guess that fits in nicely with the historical aim of this blog (reverse engineering).
Now, about the hidden challenges– I mentioned that there are 8 more INI files after the game installs itself. Okay, so what’s the big deal? For some reason, all of the INI files are in plaintext on the CD-ROM but get compressed (apparently, according to file size ratios) when installed to the hard drive. This includes those 8 extra INI files. I thought to look inside the CAB installation archive file on the CD-ROM and the files were there… but all in compressed form. I suspect that one of the files forms the “root” of the program and is the launching point for the game.
Parsing Approach
I took a stab at parsing an INI file. My approach was to first perform lexical analysis on the file and create a list of 4 types: symbols, numbers, strings, and language elements ([]{}()=.,:). Apparently, this is the kind of thing that Lex/Flex are good at. This prototyping tool is written in Python, but when I port this to ScummVM, it might be useful to call upon the services of Lex/Flex, or another lexical analyzer, for there are many. I have a feeling it will be easier to use better tools when I understand the full structure of the language based on the data available.
The purpose of this tool is to explore all the possibilities of the existing corpus of INI files. To that end, I ran all 138 of the plaintext files through it, collected all of the symbols, and massaged the results, assuming that the symbols that occurred most frequently are probably core language features. These are all the symbols which occur more than 1000 times among all the scripts:
6248 false 5734 looping 4390 scripts 3877 layer 3423 sequentialscript 3408 setactive 3360 file 3257 thescreen 3239 true 3008 autoplay 2914 offset 2599 transparent 2441 text 2361 caption 2276 add 2205 ge 2197 smackanimation 2196 graphicscript 2196 graphic 1977 setstate 1642 state 1611 skippable 1576 desc 1413 delayscript 1298 script 1267 seconds 1019 rect
About That Compression
I have sorted out at least these few details of the compression:
bytes 0-3 "COMP" (a pretty strong sign that this is, in fact, compressed data) bytes 4-11 unknown bytes 12-15 size of uncompressed data bytes 16-19 size of compressed data (filesize - 20) bytes 20- compressed payload
The compression ratios are on the same order of gzip. I was hoping that it was stock zlib data. However, I have been unable to prove this. I wrote a Python script that scrubbed through the first 100 bytes of payload data and tried to get Python’s zlib.decompress to initialize– no luck. It’s frustrating to know that I’ll have to reverse engineer a compression algorithm that deals with just 8 total text files if I want to see this effort through to fruition.
Update, January 15, 2019
Some folks expressed interest in trying to sort out the details of the compression format. So I have posted a followup in which I post some samples and go into deeper details about things I have tried:
If you posted an example of a compressed ini file and the matching uncompressed ini file, someone out there might be able to figure out the compression (or recognize that its some existing compression algorithm)
Did you try other similar compression algorithms, like bzip2? https://docs.python.org/2/library/bz2.html
It would be easy to do as you already have the code for gzip, and maybe you get lucky…
Feel free to post some samples so we can have a stab at it!
Thanks for taking an interest in this. I posted a followup with links to samples, as well as what I have tried:
https://multimedia.cx/eggs/re-clue-chronicles-compression/
I have bzip2 a try. No luck there, either. But I also realized my attempt at zlib was probably wrong as well.