Breaking Eggs And Making Omelettes

Topics On Multimedia Technology and Reverse Engineering


Overthinking My Search Engine Problem

December 30th, 2013 by Multimedia Mike

I wrote a search engine for my Game Music Appreciation website, because the site would have been significantly less valuable without it (and I would eventually realize that the search feature is probably the most valuable part of this endeavor). I came up with a search solution that was a bit sketchy, but worked… until it didn’t. I thought of a fix but still searched for more robust and modern solutions (where ‘modern’ is defined as something that doesn’t require compiling a C program into a static CGI script and hoping that it works on a server I can’t debug on).

Finally, I realized that I was overthinking the problem– did you know that a bunch of relational database management systems (RDBMSs) support full text search (FTS)? Okay, maybe you did, but I didn’t know this.

Problem Statement
My goal is to enable users to search the metadata (title, composer, copyright, other tags) attached to various games. To do this, I want to index a series of contrived documents that describe the metadata. 2 examples of these contrived documents, interesting because both of these games have very different titles depending on region, something the search engine needs to account for:

system: Nintendo NES
game: Snoopy's Silly Sports Spectacular
author: None; copyright: 1988 Kemco; dumped by: None
additional tags: Donald Duck.nsf Donald Duck

system: Super Nintendo
game: Arcana
author: Jun Ishikawa, Hirokazu Ando; copyright: 1992 HAL Laboratory; dumped by: Datschge
additional tags: card.rsn.gamemusic Card Master Cardmaster

The index needs to map these documents to various pieces of game music and the search solution needs to efficiently search these documents and find the various game music entries that match a user’s request.

Now that I’ve been looking at it for long enough, I’m able to express the problem surprisingly succinctly. If I had understood that much originally, this probably would have been simpler.

First Solution & Breakage
My original solution was based on SWISH-E. The CGI script was a C program that statically linked the SWISH-E library into a binary that miraculously ran on my web provider. At least, it ran until it decided to stop working a month ago when I added a new feature unrelated to search. It was a very bizarre problem, the details of which would probably bore you to tears. But if you care, the details are all there in the Stack Overflow question I asked on the matter.

While no one could think of a direct answer to the problem, I eventually thought of a roundabout fix. The problem seemed to pertain to the static linking. Since I couldn’t count on the relevant SWISH-E library to be on my host’s system, I uploaded the shared library to the same directory as the CGI script and used dlopen()/dlsym() to fetch the functions I needed. It worked again, but I didn’t know for how long.

Searching For A Hosted Solution
I know that anything is possible in this day and age; while my web host is fairly limited, there are lots of solutions for things like this and you can deploy any technology you want, and for reasonable prices. I figured that there must be a hosted solution out there.

I have long wanted a compelling reason to really dive into Amazon Web Services (AWS) and this sounded like a good opportunity. After all, my script works well enough; if I could just find a simple Linux box out there where I could install the SWISH-E library and compile the CGI script, I should be good to go. AWS has a free tier and I started investigating this approach. But it seems like a rabbit hole with a lot of moving pieces necessary for such a simple task.

I had heard that AWS had something in this area. Sure enough, it’s called CloudSearch. However, I’m somewhat discouraged by the fact that it would cost me around $75 per month to run the smallest type of search instance which is at the core of the service.

Finally, I came to another platform called Heroku. It’s supposed to be super-scalable while having a free tier for hobbyists. I started investigating FTS on Heroku and found this article which recommends using the FTS capabilities of their standard hosted PostgreSQL solution. However, the free tier of Postgres hosting only allows for 10,000 rows of data. Right now, my database has about 5400 rows. I expect it to easily overflow the 10,000 limit as soon as I incorporate the C64 SID music corpus.

However, this Postgres approach planted a seed.

RDBMS Revelation
I have 2 RDBMSs available on my hosting plan– MySQL and SQLite (the former is a separate service while SQLite is built into PHP). I quickly learned that both have FTS capabilities. Since I like using SQLite so much, I elected to leverage its FTS functionality. And it’s just this simple:

CREATE VIRTUAL TABLE gamemusic_metadata_fts USING fts3
( content TEXT, game_id INT, title TEXT );

SELECT id, title FROM gamemusic_metadata_fts WHERE content MATCH "arcana";

The ‘content’ column gets the metadata pseudo-documents. The SQL gets wrapped up in a little PHP so that it queries this small database and turns the result into JSON. The script is then ready as a drop-in replacement for the previous script.

Posted in General | 5 Comments »