Overthinking My Search Engine Problem

I wrote a search engine for my Game Music Appreciation website, because the site would have been significantly less valuable without it (and I would eventually realize that the search feature is probably the most valuable part of this endeavor). I came up with a search solution that was a bit sketchy, but worked… until it didn’t. I thought of a fix but still searched for more robust and modern solutions (where ‘modern’ is defined as something that doesn’t require compiling a C program into a static CGI script and hoping that it works on a server I can’t debug on).

Finally, I realized that I was overthinking the problem– did you know that a bunch of relational database management systems (RDBMSs) support full text search (FTS)? Okay, maybe you did, but I didn’t know this.

Problem Statement
My goal is to enable users to search the metadata (title, composer, copyright, other tags) attached to various games. To do this, I want to index a series of contrived documents that describe the metadata. 2 examples of these contrived documents, interesting because both of these games have very different titles depending on region, something the search engine needs to account for:

system: Nintendo NES
game: Snoopy's Silly Sports Spectacular
author: None; copyright: 1988 Kemco; dumped by: None
additional tags: Donald Duck.nsf Donald Duck

system: Super Nintendo
game: Arcana
author: Jun Ishikawa, Hirokazu Ando; copyright: 1992 HAL Laboratory; dumped by: Datschge
additional tags: card.rsn.gamemusic Card Master Cardmaster

The index needs to map these documents to various pieces of game music and the search solution needs to efficiently search these documents and find the various game music entries that match a user’s request.

Now that I’ve been looking at it for long enough, I’m able to express the problem surprisingly succinctly. If I had understood that much originally, this probably would have been simpler.

First Solution & Breakage
My original solution was based on SWISH-E. The CGI script was a C program that statically linked the SWISH-E library into a binary that miraculously ran on my web provider. At least, it ran until it decided to stop working a month ago when I added a new feature unrelated to search. It was a very bizarre problem, the details of which would probably bore you to tears. But if you care, the details are all there in the Stack Overflow question I asked on the matter.

While no one could think of a direct answer to the problem, I eventually thought of a roundabout fix. The problem seemed to pertain to the static linking. Since I couldn’t count on the relevant SWISH-E library to be on my host’s system, I uploaded the shared library to the same directory as the CGI script and used dlopen()/dlsym() to fetch the functions I needed. It worked again, but I didn’t know for how long.

Searching For A Hosted Solution
I know that anything is possible in this day and age; while my web host is fairly limited, there are lots of solutions for things like this and you can deploy any technology you want, and for reasonable prices. I figured that there must be a hosted solution out there.

I have long wanted a compelling reason to really dive into Amazon Web Services (AWS) and this sounded like a good opportunity. After all, my script works well enough; if I could just find a simple Linux box out there where I could install the SWISH-E library and compile the CGI script, I should be good to go. AWS has a free tier and I started investigating this approach. But it seems like a rabbit hole with a lot of moving pieces necessary for such a simple task.

I had heard that AWS had something in this area. Sure enough, it’s called CloudSearch. However, I’m somewhat discouraged by the fact that it would cost me around $75 per month to run the smallest type of search instance which is at the core of the service.

Finally, I came to another platform called Heroku. It’s supposed to be super-scalable while having a free tier for hobbyists. I started investigating FTS on Heroku and found this article which recommends using the FTS capabilities of their standard hosted PostgreSQL solution. However, the free tier of Postgres hosting only allows for 10,000 rows of data. Right now, my database has about 5400 rows. I expect it to easily overflow the 10,000 limit as soon as I incorporate the C64 SID music corpus.

However, this Postgres approach planted a seed.

RDBMS Revelation
I have 2 RDBMSs available on my hosting plan– MySQL and SQLite (the former is a separate service while SQLite is built into PHP). I quickly learned that both have FTS capabilities. Since I like using SQLite so much, I elected to leverage its FTS functionality. And it’s just this simple:

CREATE VIRTUAL TABLE gamemusic_metadata_fts USING fts3
( content TEXT, game_id INT, title TEXT );

SELECT id, title FROM gamemusic_metadata_fts WHERE content MATCH "arcana";
479|Arcana

The ‘content’ column gets the metadata pseudo-documents. The SQL gets wrapped up in a little PHP so that it queries this small database and turns the result into JSON. The script is then ready as a drop-in replacement for the previous script.

5 thoughts on “Overthinking My Search Engine Problem

  1. Reimar

    Some comments on the original problem.
    Maybe you know these and they might not be interesting to you, but still.
    First, if you link statically you link to your system’s libc. This libc might require a newer kernel than the one available in your web server. In this case the cause would be a libc update on your local system. Compiling on a Debian-stable or -oldstable tends to avoid that issue.
    An even better solution might be to link against a more stable/smaller/robust libc if you do not need the full features, for example dietlibc – an example of one of my projects using it: http://sourceforge.net/p/xwahacker/code/ci/master/tree/Makefile
    (basically you just replace “gcc” with “diet gcc” everywhere).
    In addition, instead if using dlopen/dlsym you could have used -Wl,rpath to specify an additional shared library searchpath (kind of an LD_LIBRARY_PATH encoded into the binary). Of course there is a certain security risk in that (as well as in a dlopen that searches a bit wide), but I suspect security might not actually be the absolute priority for you in this project anyway ;-)

    As a side point: I guess you are aware that a “MATCH” sql statement is a really good way to bring an SQL server to its knees if you ever expect more than a few users, particularly if it’s sqlite :-). I don’t know if it’s even possible to add an index that can be used with a “MATCH” statement. And it’s not quite as bright as what we are nowadays used to from a search engine: fuzzy matches etc. (as an aside, I wished Google would stop taking this so far, most of the time I actually do know what I am writing and meant exactly what I wrote, it’s kind of crossed into “I know better than you what you want” territory…)

  2. Reimar

    @Owen: I guess it shows that
    a) I actually have no clue about this
    b) I even missed the “fts3” thing completely

    ;-)

Comments are closed.