Monthly Archives: December 2009

Archivist’s Burden

Because I don’t have enough unfinished projects in my life, and because I have recently fallen in with a crowd of fanatical archivists, I started to ponder the possibility of systematically archiving all of the CD-based video games that I have collected. One day, some historian is going to want to study the proliferation of wedding simulators and licensed Barbie games. Then I noticed that I have an unused 2x750GB RAID brick (configured for redundancy, so 750 GB capacity), and that I haven’t had enough bad experiences with RAID technologies yet. Put the 2 together: why not try archiving a bunch of these games on said RAID brick?

So, that’s the pitch: archive a bunch of game discs onto this RAID brick. Assumption: a ‘disc’ is a string of 2048-byte sectors comprising a CD-ROM mode 1/form 1 data track, plus 1 or more Red Book CD audio tracks if any exist. Some questions:

  • What filesystem should the RAID brick be formatted in?
  • Should the data tracks be compressed? If so, what format?
  • Should the audio tracks be compressed (losslessly)? If so, what format?
  • How should the information be organized on the disc?
  • How should this process be automated?

Be advised that these are problems for which there is no “right” solution. There are many possible solutions, and choosing one is all about evaluating trade-offs. Here is what I am currently planning:

  • I have formatted the brick with Linux ext3. At one point I had the brick formatted with FAT32 so it could easily be shared among many operating systems. Unfortunately, I had a bad experience with data integrity, which I chalk up either to another strike against RAID tech in general or to a faulty FAT32 driver in either Linux or Mac OS X. Further, there’s the issue that I can’t abide Mac OS X’s insistence on pooping metadata files all over every filesystem it touches; I already have that problem with my pocket USB drive. If the brick uses a filesystem that OS X can’t even mount, at least I won’t have to worry about that.
  • The data track should be compressed using one of the usual suspects in Unix-land: gzip, bzip2, or lzma (at least, those are the ready-made solutions; I have a theory about better compression to be expounded upon in a later post). Which is best? Lzma is generally the champ in terms of size reduction, though it’s worth considering the time tradeoff: I was recently compressing some very large files for internet transfer and noticed that lzma took substantially longer than the other options. But since this is for long-term storage, it’s probably worth taking the time to squeeze out the largest compression factor.
  • For compressing the audio tracks, the 2 frontrunners are FLAC and Apple Lossless (ALAC). FLAC is obviously in consideration because it’s free and open source and all that. However, ALAC, while not free and open source in the purest sense, is free and open source enough for this purpose (FFmpeg has both a decoder and a competitive encoder). Archiving with ALAC also makes it possible to easily listen to the music using iTunes, my usual audio program. Thus, I’m leaning towards ALAC here.
  • I need to come up with a directory structure for the master archive disc. I imagine something along the lines of “/game title” for single-disc games and “/game title/disc n” for multi-disc games. Each directory will have “track01.iso” and 0 or more “tracknn.m4a” files depending on the presence of audio tracks. That’s the general pattern, anyway.
  • As for automating this, it will naturally be done with a Python script of my own creation. I wager I could actually make something that queries a disc directly (Python has ioctl libraries) and then reads the raw data and audio tracks; that’s how I implemented the CD playback support in xine. This time, though, I suspect I’ll just be invoking cdparanoia and parsing its output to determine the number of tracks, then invoking it again for the actual ripping. A rough sketch of that approach follows this list.
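Since the plan amounts to driving a few command line tools from Python, here is a minimal sketch of the shape the ripping step might take. Treat everything in it as provisional: the device path, the lzma invocation, and especially the parsing of cdparanoia’s -Q listing are assumptions on my part rather than tested code.

  #!/usr/bin/env python
  # rough sketch of the ripping pipeline; the device path, the lzma call, and
  # the parsing of cdparanoia's -Q listing are assumptions, not tested code

  import os
  import re
  import subprocess
  import sys

  DEVICE = "/dev/cdrom"   # assumption: point this at the actual optical drive

  def query_audio_tracks(device=DEVICE):
      # 'cdparanoia -Q' prints its table of contents on stderr; assume audio
      # track lines look roughly like "  2.    13649 [03:01.74] ..."
      err = subprocess.Popen(["cdparanoia", "-d", device, "-Q"],
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE).communicate()[1]
      return [int(m.group(1)) for m in
              (re.match(r"\s*(\d+)\.\s", line) for line in err.splitlines())
              if m]

  def rip_disc(dest_dir, device=DEVICE):
      if not os.path.isdir(dest_dir):
          os.makedirs(dest_dir)

      # data track: straight 2048-byte sector copy, then squeeze it with lzma
      iso_path = os.path.join(dest_dir, "track01.iso")
      subprocess.check_call(["dd", "if=" + device, "of=" + iso_path, "bs=2048"])
      subprocess.check_call(["lzma", "-9", iso_path])  # leaves track01.iso.lzma

      # audio tracks: rip with cdparanoia, then encode to ALAC with ffmpeg
      for track in query_audio_tracks(device):
          wav = os.path.join(dest_dir, "track%02d.wav" % track)
          m4a = os.path.join(dest_dir, "track%02d.m4a" % track)
          subprocess.check_call(["cdparanoia", "-d", device, str(track), wav])
          subprocess.check_call(["ffmpeg", "-i", wav, "-acodec", "alac", m4a])
          os.remove(wav)

  if __name__ == "__main__":
      rip_disc(sys.argv[1])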

I am thinking that the automated Python script should maintain a centralized SQLite database at the root of the archive disc. This database should keep track of archived disc locations, the ripped tracks, and their before- and after-compression sizes. Further, it should store a record of each file (with relative path) in each disc’s filesystem along with its timestamp and size. Also, the database needs to log information about errors encountered during the ripping process (i.e., store the stderr text from the data copy operation). Bonus points for running the Unix ‘file’ command against each file and storing that output as well; this last bit of information would be useful for finding, e.g., all of the Smacker files in every game in the archive. There are still specialized game-related formats that the ‘file’ utility’s magic database will not recognize, so it may also be useful to record the first, say, 32 bytes of each file in the central database for later identification. Maybe that’s overkill; I’m still tossing around ideas here.
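To make that a little more concrete, here is a rough sketch of what the schema might look like; all of the table and column names are strictly provisional and will probably change once the script actually takes shape.

  import sqlite3

  # provisional schema for the central archive database; nothing here is final
  SCHEMA = """
  CREATE TABLE IF NOT EXISTS discs (
      disc_id      INTEGER PRIMARY KEY,
      title        TEXT,
      disc_number  INTEGER,
      archive_path TEXT,   -- e.g. "game title/disc 2"
      rip_errors   TEXT    -- stderr text captured from the data copy operation
  );

  CREATE TABLE IF NOT EXISTS tracks (
      track_id        INTEGER PRIMARY KEY,
      disc_id         INTEGER REFERENCES discs(disc_id),
      track_number    INTEGER,
      is_data         INTEGER,  -- 1 for the data track, 0 for an audio track
      size_raw        INTEGER,  -- bytes before compression
      size_compressed INTEGER   -- bytes after compression
  );

  CREATE TABLE IF NOT EXISTS files (
      file_id     INTEGER PRIMARY KEY,
      disc_id     INTEGER REFERENCES discs(disc_id),
      path        TEXT,     -- path relative to the disc's filesystem root
      mtime       INTEGER,
      size        INTEGER,
      file_output TEXT,     -- output of the Unix 'file' command
      magic_bytes BLOB      -- first 32 bytes, for identifying exotic formats
  );
  """

  def open_archive_db(db_path):
      conn = sqlite3.connect(db_path)
      conn.executescript(SCHEMA)
      return conn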

Open questions:

  1. How do I rip a data track that is not the first track of an optical disc? My standard method is to use the Unix command ‘dd if=/dev/disc of=file.iso’ but that only works if the first track contains data. I have encountered at least one disc on which the data track isn’t first. Come to think of it, I guess this is standard for “enhanced” audio CDs and this particular demo disc might have fallen into that category.
  2. How do I mount an ISO filesystem for browsing without needing root permissions? Since I want to put as much brainpower into this automated script as possible and I don’t want to run the script as root, it will be necessary to at least get a listing of the directory structure, and preferably to mount the filesystem and run ‘file’ against each file. There is the option of mounting via the loopback device (maybe loosening the permissions on the loopback device would help). FUSE seems like an option in theory, but the last time I checked, there were no ISO-9660 drivers for FUSE. One possible workaround is sketched below.
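About that workaround: the isoinfo tool that ships with cdrtools/genisoimage can list and extract files from an ISO-9660 image as an ordinary user, no mounting required. Here is a minimal sketch of wrapping it from Python; the function names are placeholders, and I have not checked how gracefully this handles Joliet or Rock Ridge extensions.

  import subprocess

  def list_iso(iso_path):
      # 'isoinfo -l' prints an 'ls -lR'-style listing of the image without
      # mounting anything; '-R' or '-J' could be added for Rock Ridge or
      # Joliet names if the plain ISO-9660 names prove insufficient
      return subprocess.Popen(["isoinfo", "-i", iso_path, "-l"],
                              stdout=subprocess.PIPE).communicate()[0]

  def extract_iso_file(iso_path, file_path):
      # dump a single file to stdout, suitable for piping into 'file' or for
      # grabbing the first 32 bytes; the path generally needs to be given in
      # ISO-9660 form, e.g. "/GAME/DATA.BIN;1"
      return subprocess.Popen(["isoinfo", "-i", iso_path, "-x", file_path],
                              stdout=subprocess.PIPE).communicate()[0]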

Archiving a bunch of old games seems so simple. Leave it to me to make it complicated.

One Gluttonous if Statement

I’m still haunted by the slow VP3/Theora decoder in FFmpeg. While profiling reveals that most of the decoding time is spent in a function named unpack_vlcs(), I have been informed that finer-grained profiling indicates that the vast majority of that time is spent evaluating the following if statement:

  if (coeff_counts[fragment_num] > coeff_index)
    continue;

I put counters before and after that statement and ran the good old Big Buck Bunny 1080p movie through the decoder. These numbers indicate, per frame, the number of times the if statement is evaluated and the number of times execution makes it past the statement (i.e., the condition is false and a coefficient actually gets decoded):

[theora @ 0x1004000]3133440, 50643
[theora @ 0x1004000]15360, 722
[theora @ 0x1004000]15360, 720
[theora @ 0x1004000]0, 0
[theora @ 0x1004000]13888, 434
[theora @ 0x1004000]631424, 10711
[theora @ 0x1004000]2001344, 36922
[theora @ 0x1004000]1298752, 22897
[...]

Thus, while decoding the first frame, the if statement is evaluated over 3 million times, but further action (i.e., actually decoding a coefficient) is only taken about 50,000 times. Based on the above sample, each frame sees roughly 20-60 times as many evaluations as are strictly necessary.

Clearly, that if statement needs to go.
