Developing MobyCAIRO

I recently published a tool called MobyCAIRO. The ‘CAIRO’ part stands for Computer-Assisted Image ROtation, while the ‘Moby’ prefix refers to its role in helping process artifact image scans to submit to the MobyGames database. The tool is meant to provide an accelerated workflow for rotating and cropping image scans. It works on both Windows and Linux. Hopefully, it can solve similar workflow problems for other people.

As of this writing, MobyCAIRO has not been tested on Mac OS X yet– I expect some issues there that should be easily solvable if someone cares to test it.

The rest of this post describes my motivations and how I arrived at the solution.

Background
I have scanned well in excess of 2100 images for MobyGames and other purposes in the past 16 years or so. The workflow looks like this:

Image workflow

It should be noted that my original workflow featured me manually rotating the artifact on the scanner bed in order to ensure straightness, because I guess I thought that rotate functions in image editing programs constituted dark, unholy magic or something. So my workflow used to be even more arduous:

I can’t believe I had the patience to do this for hundreds of scans

Sometime last year, I was sitting down to perform some more scanning and found myself dreading the oncoming tedium of straightening and cropping the images. This prompted a pivotal question:

Why can’t a computer do this for me?

After all, I have always been a huge proponent of making computers handle the most tedious, repetitive, mind-numbing, and error-prone tasks. So I did some web searching to find if there were any solutions that dealt with this. I also consulted with some like-minded folks who have to cope with the same tedious workflow.

I came up empty-handed. So I endeavored to develop my own solution.

Problem Statement and Prior Work
Continue reading →

Small Time DevOps

When you are a certain type of nerd who has been on the internet for long enough, you might run the risk of accumulating a lot of projects and websites. Website-wise, I have this multimedia.cx domain on which I host a bunch of ancient static multimedia documents as well as this PHP/MySQL-based blog. Further, there are 3 other PHP/MySQL-based blogs hosted on subdomains. Also, there is the wiki, another PHP/MySQL web app. A few other custom PHP- and Python-based apps are running around on the server as well.

While things largely run on auto-pilot, I need to concern myself every now and then with their ongoing upkeep.

If you ask N different people about the meaning of the term ‘DevOps’, you will surely get N different definitions. However, whenever I have to perform VM maintenance, I like to think I am at least dipping my toes into the DevOps domain. At the very least, the job seems to be concerned with making infrastructure setup and upgrades reliable and repeatable.

Even if it’s not fully automated, at the very least, I have generated a lot of lists for how to make things work (I’m a big fan of Trello’s Kanban boards for this), so it gets easier every time (ideally, anyway).

Infrastructure History

For a solid decade, from 2004 to 2014, everything was hosted on shared, cPanel-based web hosting. In mid-2014, I moved from the shared hosting over to my own VPSs, hosted on DigitalOcean. I must have used Ubuntu 14.04 at the time, as I look down down the list of Ubuntu LTS releases. It was with much trepidation that I undertook this task (knowing that anything that might go wrong with the stack, from the OS up to the apps, would all be firmly my fault), but it turned out not to be that bad. The earliest lesson you learn for such a small-time setup is to have a frontend VPS (web server) and a backend VPS (database server). That way, a surge in HTTP requests has no chance of crashing the database server due to depleted memory.

At the end of 2016, I decided to refresh the VMs. I brought them up to Ubuntu 16.04 at the time.

Earlier this year, I decided it would be a good idea to refresh the VMs again since it had been more than 3 years. The VMs were getting long in the tooth. Plus, I had seen an article speculating that Azure, another notable cloud hosting environment, might be getting full. It made me feel like I should grab some resources while I still could (resource-hoarding was in this year).

I decided to use 18.04 for these refreshed VMs, even though 20.04 was available. I think I was a little nervous about 20.04 because I heard weird things about something called snap packages being the new standard for distributing software for the platform and I wasn’t ready to take that plunge.

Which brings me to this month’s VM refresh in which I opted to take the 20.04 plunge.

Oh MediaWiki

I’ve been the maintainer and caretaker of the MultimediaWiki for 15 years now (wow! Where does the time go?). It doesn’t see a lot of updating these days, but I know it still serves as a resource for lots of obscure technical multimedia information. I still get requests for new accounts because someone has uncovered some niche technical data and wants to make sure it gets properly documented.

MediaWiki is quite an amazing bit of software and it undergoes constant development and improvement. According to the version history, I probably started the MultimediaWiki with the 1.5 series. As of this writing, 1.35 is the latest and therefore greatest lineage.

This pace of development can make it a bit of a chore to keep up to date. This was particularly true in the old days of the shared hosting when you didn’t have direct shell access and so it’s something you put off for a long time.

Honestly, to be fair, the upgrade process is pretty straightforward:

Unpack a set of new files on top of the existing tree
Run a PHP script to perform any database table upgrades

Pretty straightforward, assuming that there are no hiccups along the way, right? And the vast majority of the time, that’s the case. Until it’s not. I had an upgrade go south about a year and a half ago (I wasn’t the only MW installation to have the problem at the time, I learned). While I do have proper backups, it still threw me for a loop and I worked for about an hour to restore the previous version of the site. That experience understandably left me a bit gun-shy about upgrading the wiki.

But upgrades must happen, especially when security notices come out. Eventually, I created a Trello template with a solid, 18-step checklist for upgrading MW as soon as a new version shows up. It’s still a chore, just not so nerve-wracking when the steps are all enumerated like that.

As I compose the post, I think I recall my impetus for wanting to refresh from the 16.04 VM. 16.04 used PHP 7.0. I wanted to upgrade to the latest MW, but if I tried to do so, it warned me that it needed PHP 7.4. So I initialized the new 18.04 VM as described above… only to realize that PHP 7.2 is the default on 18.04. You need to go all the way to 20.04 for 7.4 standard. I’m sure it’s possible to install later versions of PHP on 16.04 or 18.04, but I appreciate going with the defaults provided by the distro.

I figured I would just stay with MediaWiki 1.34 series and eschew 1.35 series (requiring PHP 7.4) for the time being… until I started getting emails that 1.34 would go end-of-life soon. Oh, and there are some critical security updates, but those are only for 1.35 (and also 1.31 series which is still stubbornly being maintained for some reason).

So here I am with a fresh Ubuntu 20.04 VM running PHP 7.4 and MediaWiki 1.35 series.

How Much Process?

Anyone who decides to host on VPSs vs, say, shared hosting is (or ought to be) versed on the matter that all your data is your own problem and that glitches sometimes happen and that your VM might just suddenly disappear. (Indeed, I’ve read rants about VMs disappearing and taking entire un-backed-up websites with them, and also watched as the ranters get no sympathy– “yeah, it’s a VM; the data is your responsibility”) So I like to make sure I have enough notes so that I could bring up a new VM quickly if I ever needed to.

But the process is a lot of manual steps. Sometimes I wonder if I need to use some automation software like Ansible in order to bring a new VM to life. Why do that if I only update the VM once every 1-3 years? Well, perhaps I should update more frequently in order to ensure the process is solid?

Seems like a lot of effort for a few websites which really don’t see much traffic in the grand scheme of things. But it still might be an interesting exercise and might be good preparation for some other websites I have in mind.

Besides, if I really wanted to go off the deep end, I would wrap everything up in containers and deploy using D-O’s managed Kubernetes solution.

Reverse Engineering Clue Chronicles Compression

My last post described my exploration into the 1999 computer game Clue Chronicles: Fatal Illusion. Some readers expressed interest in the details so I thought I would post a bit more about how I have investigated and what I have learned.

It’s frustrating to need to reverse engineer a compression algorithm that is only applied to a total of 8 files (out of a total set of ~140), but here we are. Still, I’m glad some others expressed interest in this challenge as it motivated me to author this post, which in turn prompted me to test and challenge some of my assumptions.

Spoiler: Commenter ‘m’ gave me the clue I needed: PKWare Data Compression Library used the implode algorithm rather than deflate. I was able to run this .ini data through an open source explode algorithm found in libmpq and got the correct data out.

Files To Study
I uploaded a selection of files for others to study, should they feel so inclined. These include the main game binary (if anyone has ideas about how to isolate the decompression algorithm from the deadlisting); compressed and uncompressed examples from 2 files (newspaper.ini and Drink.ini); and the compressed version of Clue.ini, which I suspect is the root of the game’s script.

The Story So Far
This ad-hoc scripting language found in the Clue Chronicles game is driven by a series of .ini files that are available in both compressed and uncompressed forms, save for a handful of them which only come in compressed flavor. I have figured out a few obvious details of the compressed file format:

bytes 0-3 "COMP"
bytes 4-11 unknown
bytes 12-15 size of uncompressed data
bytes 16-19 size of compressed data (filesize - 20 bytes)
bytes 20- compressed payload

The average compression ratio is on the same order as what could be achieved by running ‘gzip’ against the uncompressed files and using one of the lower number settings (i.e., favor speed vs. compression size, e.g., ‘gzip -2’ or ‘gzip -3’). Since the zlib/DEFLATE algorithm is quite widespread on every known computing platform, I thought that this would be a good candidate to test.

Exploration
My thinking was that I could load the bytes in the compressed ini file and feed it into Python’s zlib library, sliding through the first 100 bytes to see if any of them “catch” on the zlib decompression algorithm.

Here is the exploration script:
Continue reading →

Parsing The Clue Chronicles

A long time ago, I procured a 1999 game called Clue Chronicles: Fatal Illusion, based on the classic board game Clue, a.k.a. Cluedo. At the time, I was big into collecting old, unloved PC games so that I could research obscure multimedia formats.

Surveying the 3 CD-ROMs contained in the box packaging revealed only Smacker (SMK) videos for full motion video which was nothing new to me or the multimedia hacking community at the time. Studying the mix of data formats present on the discs, I found a selection of straightforward formats such as WAV for audio and BMP for still images. I generally find myself more fascinated by how computer games are constructed rather than by playing them, and this mix of files has always triggered a strong “I could implement a new engine for this!” feeling in me, perhaps as part of the ScummVM project which already provides the core infrastructure for reimplementing engines for 2D adventure games.

Tying all of the assets together is a custom high-level programming language. I have touched on this before in a blog post over a decade ago. The scripts are in a series of files bearing the extension .ini (usually reserved for configuration scripts, but we’ll let that slide). A representative sample of such a script can be found here:

clue-chronicles-scarlet-1.txt

What Is This Language?
At the time I first analyzed this language, I was still primarily a C/C++-minded programmer, with a decent amount of Perl experience as a high level language, and had just started to explore Python. I assessed this language to be “mildly object oriented with C++-type comments (‘//’) and reliant upon a number of implicit library functions”. Other people saw other properties. When I look at it nowadays, it reminds me a bit more of JavaScript than C++. I think it’s sort of a Rorschach test for programming languages.

Strangely, I sort of had this fear that I would put a lot of effort into figuring out how to parse out the language only for someone to come along and point out that it’s a well-known yet academic language that already has a great deal of supporting code and libraries available as open source. Google for “spanish dolphins far side comic” for an illustration of the feeling this would leave me with.

It doesn’t matter in the end. Even if such libraries exist, how easy would they be to integrate into something like ScummVM? Time to focus on a workable approach to understanding and processing the format.

Problem Scope
So I set about to see if I can write a program to parse the language seen in these INI files. Some questions:

How large is the corpus of data that I need to be sure to support?
What parsing approach should I take?
What is the exact language format?
Other hidden challenges?

To figure out how large the data corpus is, I counted all of the INI files on all of the discs. There are 138 unique INI files between the 3 discs. However, there are 146 unique INI files after installation. This leads to a hidden challenge described a bit later.

What parsing approach should I take? I worried a bit too much that I might not be doing this the “right” way. I’m trying to ignore doubts like this, like how “SQL Shame” blocked me on a task for a little while a few years ago as I concerned myself that I might not be using the purest, most elegant approach to the problem. I know I covered language parsing a lot time ago in university computer science education and there is a lot of academic literature to the matter. But sometimes, you just have to charge in and experiment and prototype and see what falls out. In doing so, I expect to have a better understanding of the problems that need to solved and the right questions to ask, not unlike that time that I wrote a continuous integration system from scratch because I didn’t actually know that “continuous integration” was the keyword I needed.

Next, what is the exact language format? I realized that parsing the language isn’t the first and foremost problem here– I need to know exactly what the language is. I need to know what the grammar are keywords are. In essence, I need to reverse engineer the language before I write a proper parser for it. I guess that fits in nicely with the historical aim of this blog (reverse engineering).

Now, about the hidden challenges– I mentioned that there are 8 more INI files after the game installs itself. Okay, so what’s the big deal? For some reason, all of the INI files are in plaintext on the CD-ROM but get compressed (apparently, according to file size ratios) when installed to the hard drive. This includes those 8 extra INI files. I thought to look inside the CAB installation archive file on the CD-ROM and the files were there… but all in compressed form. I suspect that one of the files forms the “root” of the program and is the launching point for the game.

Parsing Approach
I took a stab at parsing an INI file. My approach was to first perform lexical analysis on the file and create a list of 4 types: symbols, numbers, strings, and language elements ([]{}()=.,:). Apparently, this is the kind of thing that Lex/Flex are good at. This prototyping tool is written in Python, but when I port this to ScummVM, it might be useful to call upon the services of Lex/Flex, or another lexical analyzer, for there are many. I have a feeling it will be easier to use better tools when I understand the full structure of the language based on the data available.
Continue reading →

Breaking Eggs And Making Omelettes

Topics On Multimedia Technology and Reverse Engineering

Developing MobyCAIRO

Small Time DevOps

Reverse Engineering Clue Chronicles Compression

Parsing The Clue Chronicles