Breaking Eggs And Making Omelettes

Topics On Multimedia Technology and Reverse Engineering


Archives:

My Very Own Search Engine

October 4th, 2006 by Multimedia Mike

Back in 1998, I started a web project called the Internet NES Database to catalog information about games for the greatest video game console system of all time, the Nintendo Entertainment System. I reasoned that a web database needed to be searchable but I couldn’t find any literature on exactly how to create a search engine. So I sat down and thought about it and came up with a solution. I eventually refined the idea and later wrote this article about the design decisions I made but never got around to publishing it anywhere. I came across the article on my hard drive recently and figured I may as well publish it in case someone, somewhere might find it interesting.


8-bit NES

It’s weird to think that I was building an elaborate search mechanism to index approximately 760 items (the total number of games in the database), though I had hoped to expand the database into something MobyGames-like in due time. Keep in mind that at the outset, while the data was managed in a MySQL database, the data was exposed online through a series of static HTML pages; this was 1998 and it was a little difficult and expensive to get database hosting on the internet. Thus, I could not directly query a database, and I had limited CGI scripting facilities.

Introduction

I present the approach that I took in developing a simple search engine for the Internet NES Database as well as the thinking and reasoning behind the design decisions. It’s not highly sophisticated, but if you don’t know anything about search engine operation, this should give you a starting point.

When I started developing my video game database in early 1998 I knew that it would be much more usable if it featured a search engine. I had no idea how to write a search engine. Web searches on topics such as “search engine theory” turned up thousands of pages claiming to reveal the secrets of getting high ranks in popular search engines. I realized that I was on my own. So I sat down and thought really hard about how to write a search engine.

Analysis

The first thing I did was identify exactly what it was that I expected my search engine to do. In other words, I needed to anticipate how my site’s visitors would be using my search engine. I anticipated that they would first think to input the title of their favorite NES game, such as “The Legend of Zelda”. Therefore, if a user inputs “The Legend of Zelda” into the search engine, it needs to be able to find the page with that game.

Of course, if the user had to input the exact string “The Legend of Zelda” with all of the spacing and capitalization shown, the search engine would be of very limited use. A user might forgo capital letters and input “the legend of zelda”. He might omit “the” and simply input “legend of zelda”. He may be in a big hurry and also not know where the CAPS LOCK key is and input “ZELDA”. The search engine still needs to be able to associate those search strings with the title “The Legend of Zelda”.

It became immediately obvious that the search operation needed to be case-insensitive so that “zelda” and “ZELDA” are treated the same. I also needed to associate a lot of title permutations with the title. As shown in the above example, “zelda”, “legend of zelda”, and “the legend of zelda” all needed to be associated with “The Legend of Zelda”. I decided to take this idea to its logical extreme and associate all of these strings with the title:

  • the
  • legend
  • of
  • zelda
  • the legend
  • legend of
  • of zelda
  • the legend of
  • legend of zelda
  • the legend of zelda

Do you see the pattern? Many years after creating the search engine, I learned that this is called a powerset. If you have a set of four symbols, { a, b, c, d } then the powerset is:

  • a
  • b
  • c
  • d
  • ab
  • bc
  • cd
  • abc
  • bcd
  • abcd

If I had known that I was computing something called a powerset I might have been able to find an algorithm to do it. As it was, I spent a long time getting it right.

Examination of the powerset for “The Legend of Zelda” reveals that some of the strings, like “the” and “of”, are pretty useless and can be eliminated. The remaining strings need to be associated with “The Legend of Zelda”. How do we do this? The verb “associated” indicates that this would be a good job for associative arrays, a.k.a. hash arrays, hash tables, database hashes, hashes, etc. We’ll call them hashes in this document.

A hash is a collection of associated key/value pairs. For example, we could have a hash that maps (associates) fruits with their usual colors. An abstract representation would look like this:

key value
apple => red
pear => green
grape => purple
banana => yellow
orange => orange
lemon => yellow
lime => green

If you query this FruitColor hash with the key “banana”, you will get the value “yellow”. If you query with the key “grape”, you will get the value “purple”, and so on.

Now let’s apply the hash concept to the search engine. We’ll map each string in the title powerset (minus words like “the” and “of”) to the title:

key value
legend => The Legend of Zelda
zelda => The Legend of Zelda
the legend => The Legend of Zelda
legend of => The Legend of Zelda
of zelda => The Legend of Zelda
the legend of => The Legend of Zelda
legend of zelda => The Legend of Zelda
the legend of zelda => The Legend of Zelda

If you query this hash with the key “legend of zelda” or just “zelda”, you will get the value “The Legend of Zelda”.

So far, so good. I decided to try another title, such as “Zelda II: The Adventure of Link”. I didn’t like the look of that colon character in there. I decided that in addition to not dealing with uppercase characters, the search engine would also not deal with anything except lowercase characters and numbers. Therefore, the colon character will be stripped out.

Now we lowercase the title, strip out non-alphanumeric characters (including spaces), compute the powerset and map the powerset strings to the title using a hash. The result looks like this:

key value
zelda => Zelda II: The Adventure of Link
ii => Zelda II: The Adventure of Link
adventure => Zelda II: The Adventure of Link
link => Zelda II: The Adventure of Link
zeldaii => Zelda II: The Adventure of Link
iithe => Zelda II: The Adventure of Link
theadventure => Zelda II: The Adventure of Link
adventureof => Zelda II: The Adventure of Link
oflink => Zelda II: The Adventure of Link
=> Zelda II: The Adventure of Link

You get the idea and I don’t feel like typing out the rest of the permutations. That’s a computer’s job. Notice that “zelda” occurs again. It occurred when we were operating on “The Legend of Zelda”. Therefore, the string “zelda” needs to map onto both of the titles, “The Legend of Zelda” and “Zelda II: The Adventure of Link”:

key value
zelda => The Legend of Zelda,Zelda II: The Adventure of Link

Implementation

Then I moved on to the actual implementation. There is a variety of standards and libraries for database hash files on UNIX platforms. I don’t know much about the differences/advantages/disadvantages of one compared to another. Currently, I’m using GNU gdbm. I wrote a Perl script which interfaced to the MySQL database and collected all of the titles and game IDs. I used the IDs because the games are referenced in the database by their IDs rather than their actual titles. For each title string, the Perl script lowercases the string, strips out the non-alphanumeric characters, computes the powerset and maps the significant resulting strings (tosses out “the”, “a”, “and”, “of”, etc.) to the game ID.

The original search.cgi script, which was the basis for the original database search engine, was written in Perl. It simply took the user’s input string, lowercased the string, stripped out non-alphanumeric characters and used it as a key in the database hash file. If the key wasn’t mapped to any values, the search script reported that the search was unsuccessful. If the key was mapped to one game ID, the CGI script used an HTTP “Location:” header to redirect the web browser directly to the relevant game page. Finally, if the key returned multiple game IDs, the IDs were used as keys into another database hash file which mapped game IDs to game titles. Then the search engine would list the titles returned from the search.

Refinements

The search.cgi script also included a facility to log all of the search queries, as well as their success rates, to a file. This was done so I could analyze search queries that database users were actually attempting and identify weaknesses in the search scheme.

One problem that I noticed right away was the number of variations of the word ‘and’ used in game titles. The word could be represented as an ampersand (&) or the letter ‘n’, possibly with an apostrophe before, after, or before and after. For example:

  • Crash ‘n’ The Boys Street Challenge
  • A Boy & His Blob
  • Tombs And Treasure
  • Disney’s Chip ‘n Dale Rescue Rangers
  • Fire N Ice

In the original search scheme, a query of “crash and the boys” would return no results because the search engine database hash contained the string “crashntheboys”, but not “crashandtheboys”. I needed a standard “and”. Therefore, before stripping out non-alphanumeric characters, the Perl script that builds the hash file first replaces occurrences of ” & “, ” n “, ” ‘n “, ” n’ “, and ” ‘n’ ” with the string ” and ” (note that the spaces before and after those strings are important). Similarly, when the search.cgi script prepares a user’s query for the hash, it performs the same replacement.

Examining the log of search queries revealed a huge problem regarding sequel games where the sequel number is denoted as a Roman numeral, e.g., Castlevania II: Simon’s Quest and Castlevania III: Dracula’s Curse. Users would enter the search string “Castlevania 3”. The search engine hash database didn’t contain a string “castlevania3”; it does contain the string “castlevaniaiii”. Therefore, the search engine reported that there were no search results. I felt this was very poor search engine behavior since it’s very unambiguous (from a human perspective) what the user was looking for. I solved this problem by converting all numbers to Roman numerals in both the hash database and the search queries. Therefore, Super Mario Bros. 3 is represented in the hash file as “supermariobrosiii”. The game “1942” is not internally represented as “1942”, but as the Roman numeral “mcmxlii”. When a user inputs the search query “Castlevania 3”, the search engine converts the string into “castlevaniaiii” and queries the hash. The proper page could now be fetched.

Finally, it became clear that although the strict searching mechanism was effective, the search engine needed to have a looser searching mechanism as well. I came to refer to these concepts as the primary and secondary searches. The primary search takes a user’s search query, lowercases it, converts numbers to Roman numerals, strips out all non-lowercase alphabetic characters and uses the resulting string to query for game IDs in the hash file. If that returns 0 or more than 1 results, the search engine attempts a secondary search which consists of lowercasing the search query, converting numbers to Roman numerals and stripping out all non-lowercase alphabetic and non-space characters. Then, the search engine splits the query up by the spaces and queries the hash for each individual word. Then it returns a list of unique results (if there were any). As an example, a confused visitor might query for “Mega Ninja Gaiden” which would fail the primary search. However, the secondary search would return all of the games with “mega” in the title, as well as games with “ninja”, and games with “gaiden”. As a more realistic example, it wouldn’t be uncommon for a visitor to search for “super mario 3” which is a common conversational way of abbreviating Super Mario Bros. 3. Again, even though it’s clear from a human’s perspective what the user is looking for, the primary search would fail miserably. But the secondary search would find the right title (and a number of other titles) if it queried the hash for “super”, then “mario”, then “iii” (remember that “3” will be converted to a Roman numeral).

The Current Search Engine

The latest iteration of the database, online as of January, 2001, contains the capability to search through game keywords and creators as well as titles. There are two hashes for each category: The first hash maps permuted strings of the mangled powerset to IDs and the second maps IDs back to full names or titles. A Perl script named makedb.pl periodically generates all of the database hash files from the MySQL database. The search engine consists of a PHP script named search.php.

Conclusions

I realize that this is not the best search engine design. I haven’t studied the specifics of other search techniques (mostly because when I started my own search engine, there were no examples to study). However, it’s important to recognize that this search scheme works for me and my meager database and I haven’t yet needed to come up with a better scheme. I analyzed the specific problem that I had and developed an adequate solution. Maybe a similar solution will work for you too.

Posted in Nintendo | 2 Comments »

2 Responses

  1. Robert Swain Says:

    The power set theory isn’t quite correct.

    For set S = {a, b, c, d} the power set, P(S) = { {}, {a}, {b}, {c}, {d}, {a,b}, {a,c}, {a,d}, {b,c}, {b,d}, {c,d}, {a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}, {a,b,c,d} }

    For an n element set, the power set contains 2^n elements.

  2. Doug Pederson Says:

    Here is how my search engine evolved.

    This program started out as a simple text search.

    The flowchart, to do the line wrap and match hi-lights
    took 3 days. It worked like a charm. (see the links
    to these charts below)

    Step 1 complete. Search and display text results.

    Next was the ability to search text and call up pictures.
    When a match was found on a line and the following line had
    the path to a picture, then that picture was displayed.

    We scanned in and cataloged 5000+ family pictures and
    used that as a test base.

    Changes were made to view these pictures as a screen saver.
    Then with a random option, because sequential became boring fast.

    Then came VIDEO and Music. Being able to start and stop
    anywhere in a video clip, slow motion and freeze frame
    are the main features. I’ve had a mini dvd camcorder for
    4 years now. Desktop Search makes computers and video fun.

    The very early version (1999) of the search has way less
    functionality but is valuable as a stripped down search.
    (Code documentation, details the evolution of this program)

    To make this app a NET search, or create Linux, Mac and C versions
    should be quite simple. There are just 10,000 lines of Visual Basic 5
    code, with lots of comments and deactivated testing lines. A URL would
    be loaded instead of a picture.

    It’s very scalable. It currently runs as a background job.
    A few thousand copies searching designated data sections,
    should handle the largest of data searches.

    All my notes, emails and net clippings since 1999 amount to 75M
    At 20,000,000 cps it takes just over 3 seconds. (4 year old laptop)

    Another must have feature of any Desktop Search is notepad capabilities.
    If you don’t keep notes what do you keep?

    Before computer search, people had recipe books, phone and address books.
    They kept notes short and to the point. When looking for something they
    had to read / scan the text. With computer search, you can keep much more
    detailed notes and store it all, in one file if you like.

    Keep every bit of printed information that is worthwhile. It won’t
    amount to all that much. Mine’s next to nothing at 75M. No need for
    indexing desktop search; like Google, Yahoo and Microsoft.

    Having all your photos, music and telephone videos readily available
    and this software to play random selections can be entertaining. It’s
    quick to add search detail for these files or video segments. It can
    flash through photos at 10 per second and faster. Sample your videos,
    music, pictures and text in the same screen saver.

    This app is relatively SECURE for text. In display mode the text can
    not be cut or copied. (The extract capability would have to be disabled.)
    Having the original file encrypted and the decrytion taking place on the fly
    adds to security.

    Other features are: Merge, Search and Replace, Selected photo or song
    to copy, directory list, differences between 2 files and some lesser
    functions that make searching useful.

    I have looked at some of the others. The more I looked the more
    I loved this desktop search. The thing I like most is that all my
    searches can be done in full text mode. With search, it is ALWAYS
    the context one is looking for. I know the name “Joe Smith” when
    I search. It is his phone# or a person I met with him that I am
    interested in. Context context context. The other searches give
    you gibberish from their index files, before you see the real data.

    This search is leading the pack in simplicity, expandability, usefulness
    and speed. Gets the data up on the screen fast.

    Shortly I’ll be explaining and demonstrating this search using Video;
    from the very early source code and Flowcharts to the most recent features.

    If this Search makes it to the next level and becomes a Net Search, with
    all it’s multimedia capabilities then look out, to the rest. Random video,
    random music, random jokes, excitement.

    Flowcharts page 1 and 2 below:

    http://www.telusplanet.net/public/stonedan/pict01.jpg

    http://www.telusplanet.net/public/stonedan/pict02.jpg