Re-solving My Search Engine Problem

14 years ago, I created a web database of 8-bit Nintendo Entertainment System games. To make it useful, I developed a very primitive search feature.

A few months ago, I decided to create a web database of video game music. To make it useful, I knew it would need to have a search feature. I realized I needed to solve the exact same problem again.

The last time I solved this problem, I came up with an excruciatingly na├»ve idea. Hey, it worked. I really didn’t want to deploy the same solution again because it felt so silly the first time. Surely there are many better ways to solve it now? Many different workable software solutions that do all the hard work for me?

The first time I attacked this, it was 1998 and hosting resources were scarce. On my primary web host I was able to put static HTML pages, perhaps with server side includes. The web host also offered dynamic scripting capabilities via something called htmlscript (a.k.a. MIVA Script). I had a secondary web host in my ISP which allowed me to host conventional CGI scripts on a Unix host, so that’s where I hosted the search function (Perl CGI script accessing a key/value data store file).

Nowadays, sky’s the limit. Any type of technology you want to deploy should be tractable. Still, a key requirement was that I didn’t want to pay for additional hosting resources for this silly little side project. That leaves me with options that my current shared web hosting plan allows, which includes such advanced features as PHP, Perl and Python scripts. I can also access MySQL.

There are a lot of mature software packages out there which can index and search data and be plugged into a website. But a lot of them would be unworkable on my web hosting plan due to language or library package limitations. Further, a lot of them feel like overkill. At the most basic level, all I really want to do is map a series of video game titles to URLs in a website.

Based on my research, Lucene seems to hold a fair amount of mindshare as an open source indexing and search solution. But I was unsure of my ability to run it on my hosting plan. I think MySQL does some kind of full text search, so I could have probably made a solution around that. Again, it just feels like way more power than I need for this project.

I used Swish-e once about 3 years ago for a little project. I wasn’t confident of my ability to run that on my server either. It has a Perl API but it requires custom modules.

My quest for a search solution grew deep enough that I started perusing a textbook on information retrieval techniques in preparation for possibly writing my own solution from scratch. However, in doing so, I figured out how I might subvert an existing solution to do what I want.

Back to Swish-e
Again, all I wanted to do was pull data out of a database and map that data to a URL in a website. Reading the Swish-e documentation, I learned that the software supports a mode specifically tailored for this. Rather than asking Swish-e to index a series of document files living on disk, you can specify a script for Swish-e to run and the script will generate what appears to be a set of phantom documents for Swish-e to index.

When I ‘add’ a game music file to the game music website, I have scripts that scrape the metadata (game title, system, song titles, composers, company, copyright, the original file name on disk, even the ripper/dumper who extracted the chiptune in the first place) and store it all in an SQLite database. When it’s time to update the database, another script systematically generates a series of pseudo-documents that spell out the metadata for each game and prefix each document with a path name. Searching for a term in the index returns a lists of paths that contain the search term. Thus, it makes sense for that path to be a site URL.

But what about a web script which can search this Swish-e index? That’s when I noticed Swish-e’s C API and came up with a crazy idea: Write the CGI script directly in C. It feels like sheer madness (or at least the height of software insecurity) to write a CGI script directly in C in this day and age. But it works (with the help of cgic for input processing), just as long as I statically link the search script with libswish-e.a (and libz.a). The web host is an x86 machine, after all.

I’m not proud of what I did here– I’m proud of how little I had to do here. The searching CGI script is all of about 30 lines of C code. The one annoyance I experienced while writing it is that I had to consult the Swish-e source code to learn how to get my search results (the “swishdocpath” key — or any other key — for SwishResultPropertyStr() is not documented). Also, the C program just does the simplest job possible, only querying the term in the index and returning the results in plaintext, in order of relevance, to the client-side JavaScript code which requested them. JavaScript gets the job of sorting and grouping the results for presentation.

Tuning the Search
Almost immediately, I noticed that the search engine could not find one of my favorite SNES games, U.N. Squadron. That’s because all of its associated metadata names Area 88, the game’s original title. Thus, I had to modify the metadata database to allow attaching somewhat free-form tags to games in order to compensate. In this case, an alias title would show up in the game’s pseudo-document.

Roman numerals are still a thorn in my side, just as they were 14 years ago in my original iteration. I dealt with it back then by converting all numbers to Roman numerals during the index and searching processes. I’m not willing to do that for this case and I’m still looking for a good solution.

Another annoying problem deals with Mega Man, a popular franchise. The proper spelling is 2 words but it’s common for people to mash it into one word, Megaman (see also: Spider-Man, Spiderman, Spider Man). The index doesn’t gracefully deal with that and I have some hacks in place to cope for the time being.

Positive Results
I’m pleased with the results so far, and so are the users I have heard from. I know one user expressed amazement that a search for Castlevania turned up Akumajou Densetsu, the Japanese version of Castlevania III: Dracula’s Curse. This didn’t surprise me because I manually added a hint for that mapping. (BTW, if you are a fan of Castlevania III, definitely check out the Akumajou Densetsu soundtrack which has an upgraded version of the same soundtrack using special audio channels.)

I was a little more surprised when a user announced that searching for ‘probotector’ correctly turned up Contra: Hard Corps. I looked into why this was. It turns out that the original chiptune filename was extremely descriptive: “Contra – Hard Corps [Probotector] (1994-08-08)(Konami)”. The filenames themselves often carry a bunch of useful metadata which is why it’s important to index those as well.

And of course, many rippers, dumpers, and taggers have labored for over a decade to lovingly tag these songs with as much composer information as possible, which all gets indexed. The search engine gets a lot of compliments for its ability to find many songs written by favorite composers.