
Re-solving My Search Engine Problem

14 years ago, I created a web database of 8-bit Nintendo Entertainment System games. To make it useful, I developed a very primitive search feature.

A few months ago, I decided to create a web database of video game music. To make it useful, I knew it would need to have a search feature. I realized I needed to solve the exact same problem again.

Requirements
The last time I solved this problem, I came up with an excruciatingly naïve idea. Hey, it worked. I really didn’t want to deploy the same solution again because it felt so silly the first time. Surely there are many better ways to solve it now? Many different workable software solutions that do all the hard work for me?

The first time I attacked this, it was 1998 and hosting resources were scarce. On my primary web host I was able to put static HTML pages, perhaps with server side includes. The web host also offered dynamic scripting capabilities via something called htmlscript (a.k.a. MIVA Script). I had a secondary web host in my ISP which allowed me to host conventional CGI scripts on a Unix host, so that’s where I hosted the search function (Perl CGI script accessing a key/value data store file).

Nowadays, the sky's the limit. Any type of technology you want to deploy should be tractable. Still, a key requirement was that I didn't want to pay for additional hosting resources for this silly little side project. That leaves me with the options my current shared web hosting plan allows, which include such advanced features as PHP, Perl, and Python scripts. I can also access MySQL.

Candidates
There are a lot of mature software packages out there which can index and search data and be plugged into a website. But a lot of them would be unworkable on my web hosting plan due to language or library package limitations. Further, a lot of them feel like overkill. At the most basic level, all I really want to do is map a series of video game titles to URLs in a website.

Based on my research, Lucene seems to hold a fair amount of mindshare as an open source indexing and search solution. But I was unsure of my ability to run it on my hosting plan. I think MySQL does some kind of full text search, so I could have probably made a solution around that. Again, it just feels like way more power than I need for this project.
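
For the record, a minimal sketch of what that MySQL route might have looked like, assuming a hypothetical games table with a FULLTEXT index on the title column (the table, column, and connection details here are invented for illustration):

import mysql.connector  # assumes the MySQL Connector/Python package is available

# Hypothetical schema:
#   CREATE TABLE games (title VARCHAR(255), url VARCHAR(255),
#                       FULLTEXT KEY (title)) ENGINE=MyISAM;
def search_titles(query):
    conn = mysql.connector.connect(user='web', password='secret',
                                   host='localhost', database='gamemusic')
    cur = conn.cursor()
    # Natural-language full-text match against the title column
    cur.execute("SELECT title, url FROM games "
                "WHERE MATCH(title) AGAINST (%s IN NATURAL LANGUAGE MODE)",
                (query,))
    rows = cur.fetchall()
    conn.close()
    return rows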

I used Swish-e once about 3 years ago for a little project. I wasn’t confident of my ability to run that on my server either. It has a Perl API but it requires custom modules.

My quest for a search solution grew deep enough that I started perusing a textbook on information retrieval techniques in preparation for possibly writing my own solution from scratch. However, in doing so, I figured out how I might subvert an existing solution to do what I want.

Back to Swish-e
Again, all I wanted to do was pull data out of a database and map that data to a URL in a website. Reading the Swish-e documentation, I learned that the software supports a mode specifically tailored for this. Rather than asking Swish-e to index a series of document files living on disk, you can specify a script for Swish-e to run and the script will generate what appears to be a set of phantom documents for Swish-e to index.
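
Concretely, in that mode (-S prog) the external program writes each record to stdout as a tiny header (path, content length, document type), a blank line, and then the document body; Swish-e indexes those records as though they were files. A minimal sketch in Python, with the record list standing in for the real database query and the URL scheme invented for illustration:

#!/usr/bin/env python
# Emit "phantom documents" for Swish-e's -S prog indexing mode:
# header lines, a blank line, then exactly Content-Length bytes of body.
import sys

# Stand-in for rows pulled from the real database: (title, URL on the site)
records = [
    ("Super Mario Bros. 3", "/game/smb3"),
    ("The Legend of Zelda", "/game/zelda"),
]

for title, url in records:
    body = "<html><head><title>%s</title></head><body>%s</body></html>" % (title, title)
    sys.stdout.write("Path-Name: %s\n" % url)
    sys.stdout.write("Content-Length: %d\n" % len(body))
    sys.stdout.write("Document-Type: HTML*\n\n")
    sys.stdout.write(body)

With -S prog, the IndexDir directive (or the -i command line option) names the generator program rather than a directory of files, so each search hit comes back with the site URL as its path.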

Zlib vs. XZ on 2SF

I recently released my Game Music Appreciation website. It allows users to play an enormous range of video game music directly in their browsers. To do this, the site has to host the music. And since I’m a compression bore, I have to know how small I can practically make these music files. I already published the results of my effort to see if XZ could beat RAR (RAR won, but only slightly, and I still went with XZ for the project) on the corpus of Super Nintendo chiptune sets. Next is the corpus of Nintendo DS chiptunes.

Repacking Nintendo DS 2SF
The prevailing chiptune format for storing Nintendo DS songs is the .2sf format. This is a subtype of the Portable Sound Format (PSF). The designers had the foresight to build compression directly into the format. Much of the payload data in a PSF file is compressed with zlib. Since I already incorporated Embedded XZ into the player project, I decided to try repacking the PSF payload data from zlib -> xz.

In an effort to not corrupt standards too much, I changed the ‘PSF’ file signature (seen in the first 3 bytes of a file) to ‘psf’.
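
As a rough illustration of the repack, assuming the standard PSF layout (a 16-byte header: 3-byte signature, version byte, then 32-bit little-endian reserved-area size, compressed-program size, and CRC32 of the compressed program, followed by the reserved area, the zlib-compressed program, and any trailing tag), a sketch might look like this; the CRC handling and other details of the actual repacker may differ:

#!/usr/bin/env python3
# Sketch: repack one PSF-style file's zlib payload as xz and flip the
# signature to lowercase 'psf' to mark the repacked container.
import struct, zlib, lzma

def repack_psf(in_path, out_path):
    data = open(in_path, "rb").read()
    assert data[0:3] == b"PSF"
    version = data[3:4]
    reserved_size, program_size, _crc = struct.unpack("<III", data[4:16])
    reserved = data[16:16 + reserved_size]
    program_z = data[16 + reserved_size:16 + reserved_size + program_size]
    tag = data[16 + reserved_size + program_size:]   # optional [TAG] block, copied verbatim

    program = zlib.decompress(program_z)             # original payload is zlib-packed
    program_xz = lzma.compress(program, preset=9)    # repack as xz

    with open(out_path, "wb") as out:
        out.write(b"psf" + version)                  # lowercase signature marks the xz payload
        out.write(struct.pack("<III", len(reserved), len(program_xz),
                              zlib.crc32(program_xz) & 0xffffffff))
        out.write(reserved)
        out.write(program_xz)
        out.write(tag)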

Results
There are about 900 Nintendo DS games currently represented in my website’s archive. Total size of the original PSF archive, payloads packed with zlib: 2.992 GB. Total size of the same archive with payloads packed as xz: 2.059 GB.

Using xz vs. zlib saved me nearly a gigabyte of storage. That extra storage doesn’t really impact my hosting plan very much (I have 1/2 TB, which is why I’m so nonchalant about hosting the massive MPlayer Samples Archive). However, smaller individual files translate to a better user experience since the files are faster to download.

Here is a pretty picture to illustrate the space savings:

[chart: per-game archive sizes, zlib vs. xz]

The blue occasionally appears to dip below the orange but the data indicates that xz is always more efficient than zlib. Here’s the raw data (comes in vanilla CSV flavor too).

Interface Impact
So the good news for the end user is that the songs are faster to load up front. The downside is that there can be a noticeable delay when changing tracks. Even though all songs are packaged into one file for download, and the entire file is downloaded before playback begins, each song is individually compressed. Thus, changing tracks triggers another decompression operation. I’m toying with the possibility of some sort of background process that decompresses song (n+1) while playing song (n) in order to help compensate for this.

I don’t like the idea of decompressing everything up front because A) it would take even longer to start playing; and B) it would take a huge amount of memory.
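
A middle ground is to decompress just the next track ahead of time. Purely as a conceptual sketch (Python with threads; the real player runs in the browser, so the actual mechanics would differ):

# Conceptual sketch only: decompress track n+1 in the background while
# track n plays, instead of decompressing the whole set up front.
import lzma, threading

class TrackCache:
    def __init__(self, compressed_tracks):
        self.compressed = compressed_tracks   # list of xz-compressed song blobs
        self.decompressed = {}
        self.lock = threading.Lock()

    def _decompress(self, index):
        data = lzma.decompress(self.compressed[index])
        with self.lock:
            self.decompressed[index] = data

    def get(self, index):
        # Decompress the requested track if it is not cached yet
        with self.lock:
            cached = self.decompressed.get(index)
        if cached is None:
            self._decompress(index)
            cached = self.decompressed[index]
        # Kick off decompression of the next track in the background
        if index + 1 < len(self.compressed):
            threading.Thread(target=self._decompress, args=(index + 1,)).start()
        return cached

Only the current and next tracks ever need to sit in memory uncompressed, which keeps both the startup delay and the memory cost bounded.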

Corner Case
There was at least one case in which I found zlib to be better than xz. It looks like zlib’s minimum block size is smaller than xz’s. I found that xz could not compress a few bytes into a block any smaller than about 60-64 bytes, while zlib got it down into the teens. However, in those cases, it was more efficient to just leave the data uncompressed anyway.
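
For a concrete sense of the difference, Python's zlib and lzma modules (as stand-ins for the two compressors; exact byte counts depend on the settings and the container overhead) show the effect in a couple of lines:

# Quick comparison of how zlib and xz handle a payload of just a few bytes
import zlib, lzma

sample = b"tiny payload"                            # 12 bytes of input
print(len(sample), "bytes raw")
print(len(zlib.compress(sample)), "bytes as zlib")  # only a few bytes of framing overhead
print(len(lzma.compress(sample)), "bytes as xz")    # far larger: the xz container dominates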

Winamp and the March of GUI

Ars Technica recently published a 15-year retrospective on the venerable Winamp multimedia player, prompting bouts of nostalgia and revelations of “Huh? That program is still around?” from many readers. I was among them.



I remember first using Winamp in 1997. I remember finding a few of these new files called MP3s online and being able to play the first 20 seconds using the official Fraunhofer Windows player; full playback required the fully licensed version. Then I searched for another player and came up with Winamp. The first version I ever used was v1.05 in the summer of 1997. I remember checking the website often for updates and trying out every single one. I can’t imagine doing that nowadays; programs need to auto-update themselves (which Winamp probably does now; I can’t recall the last time I used the program).

Video Underdog
The last time Winamp came up on my radar was early in 2003 when a new version came with support for a custom, proprietary multimedia audio/video format called Nullsoft Video (NSV). I remember the timeframe because the date is indicated in the earliest revision of my NSV spec document (back when I was maintaining such docs in a series of plaintext files). This was cobbled together from details I and others in the open source multimedia community sorted out from sample files. It was missing quite a few details, though.

Then, Winamp founder Justin Frankel — introduced through a colleague on the xine team — emailed me his official NSV format and told me I was free to incorporate details into my document just as long as it wasn’t obvious that I had the official spec. This put me in an obnoxious position of trying to incorporate details which would have been very difficult to reverse engineer without the official doc. I think I coped with the situation by never really getting around to updating my doc in any meaningful way. Then, one day, the official spec was released to the world anyway, and it is now mirrored here at multimedia.cx.

I don’t think the format ever really caught on in any meaningful way, so not a big deal. (Anytime I say that about a format, I always learn it saw huge adoption in some small but vocal community.)

What’s Wrong With This Picture?
What I really wanted to discuss in this post was the matter of graphical user interfaces and how they have changed in the last 15 years.

Size Discrepancy in the ‘du’ Command

I had a problem today while using the common Unix command ‘du’. As a refresher, ‘du’ stands for disk usage and is a handy tool for understanding how much disk space is being occupied.

I think ‘du’ is probably doing the right thing. The problem might be that I’m getting strange numbers (read: 1/2 the expected figure) when running the tool against directories on vmhgfs, the VMware filesystem.

Science Project
On an Ubuntu Linux VMware session, my home directory is on the main file system, which is ext4. The directory /mnt/hgfs is reported by ‘mount’ to be of type vmhgfs and is shared with the host machine.

Create a directory in the home directory and generate a 10 MiB file:

mkdir /home/melanson/dir
dd if=/dev/urandom of=/home/melanson/dir/random-file bs=1048576 count=10

Create a directory on the shared drive and copy the same file:

mkdir /mnt/hgfs/vmshare/dir
cp /home/melanson/dir/random-file /mnt/hgfs/vmshare/dir

Run ‘du’ on each directory using the -k and -h options:

du -k /home/melanson/dir /mnt/hgfs/vmshare/dir
10244   /home/melanson/dir
5120    /mnt/hgfs/vmshare/dir

du -h /home/melanson/dir /mnt/hgfs/vmshare/dir
11M    /home/melanson/dir
5.0M   /mnt/hgfs/vmshare/dir

I noticed this discrepancy when I was trying to pack a set of files (akin to ‘tar’-ing) living in a directory in the shared location. I was going mad trying to understand why the original directory was only 2 MB as reported by ‘du’ but the final packed file was 4 MB.
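
One way to check which figure a filesystem is handing back is to compare a file’s apparent size (st_size) with its allocated block count (st_blocks), since ‘du’ bases its default output on the latter; a small sketch (GNU du also offers an --apparent-size switch that reports st_size instead):

# Print both sizes for a file: st_size (apparent) vs. st_blocks * 512 (what du counts)
import os, sys

st = os.stat(sys.argv[1])
print("apparent size :", st.st_size, "bytes")
print("allocated size:", st.st_blocks * 512, "bytes")

On the vmhgfs mount, the allocated figure presumably comes back at half the expected value, which would explain the halved ‘du’ output.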

To be fair, the man page for ‘du’ succinctly states that the tool’s purpose is merely to “estimate file space usage”.