Creating A System For Archiving CD-ROMs | Breaking Eggs And Making Omelettes

Because I don’t have enough unfinished projects in my life, and because I have recently fallen in with a crowd of fanatical archivists, I started to ponder the possibility of systematically archiving all of the CD-based video games that I have collected. One day, some historian is going to want to study the proliferation of wedding simulators and licensed Barbie games. Then I noticed that I have an unused 2x750GB RAID brick (configured for redundancy, so 750 GB capacity), and that I haven’t had enough bad experiences with RAID technologies yet. Put the 2 together: why not try archiving a bunch of these games on said RAID brick?

So, that’s the pitch: archive a bunch of game discs onto this RAID brick. Assumption: a ‘disc’ is a string of 2048-byte sectors comprising a CD-ROM data track (Mode 1, or Mode 2 Form 1), plus 1 or more Red Book CD audio tracks if they exist. Some questions:

What filesystem should the RAID brick be formatted in?
Should the data tracks be compressed? If so, what format?
Should the audio tracks be compressed (losslessly)? If so, what format?
How should the information be organized on the disc?
How should this process be automated?

Be advised that these are problems for which there is no “right” solution. There are many solutions and choosing one is all about evaluating trade-offs. Here are some solutions I am planning for:

I have formatted the brick with Linux Ext3. At one point I had the brick formatted with FAT32 so it could easily be shared between many operating systems. Unfortunately, I had a bad experience with data integrity, which was either another strike against RAID tech in general or the fault of a buggy FAT32 driver in Linux or Mac OS X. Further, there’s the issue that I can’t abide Mac OS X’s insistence on pooping metadata files all over every filesystem it touches. I have that problem with my pocket USB drive. At least if it’s not in a Mac-native filesystem, I won’t have to worry about that.
Compressing the data track should be performed using one of the usual suspects in Unix-land: gzip, bzip2, or lzma (at least, those are the ready-made solutions; I have a theory about better compression to be expounded upon in a later post). Which is best? Lzma is generally the champ in terms of size reduction. It’s useful to consider the time tradeoff. I was recently trying to compress some very large files for internet transfer and noticed that lzma took substantially longer than other options. But since this is for long term storage, it’s probably worth it to squeeze out the largest compression factor.
For compressing the audio tracks, the 2 frontrunners are FLAC and Apple Lossless (ALAC). FLAC is obviously in consideration because it’s free and open source and all that. However, ALAC, while not necessarily free and open source in the purest sense, is still free and open source enough for this purpose (FFmpeg has both a decoder and a competitive encoder). Archiving using ALAC also makes it possible to easily listen to the music using iTunes, my usual audio program. Thus, I’m leaning towards ALAC here.
I need to come up with a directory structure for the master archive disc. I imagine something along the lines of “/game title” for single-disc games and “/game title/disc n” for multi-disc games. Each directory will have “track01.iso” and 0 or more “tracknn.m4a” files depending on the presence of audio tracks. That’s the general pattern, anyway.
As for automating this, it will naturally be done with a Python script of my own creation. I wager I could actually make something that queries a disc directly (Python has ioctl libraries) and then reads the raw data and audio tracks. That’s how I implemented the CD playback support in xine. This time, I suspect I’ll just invoke cdparanoia, parse its output to determine the number of tracks, and then use it for the subsequent ripping.
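The plan above could start out something like the following minimal sketch. Everything named here (`parse_audio_toc`, `disc_dir`, `track_name`, `compress_data_track`) is hypothetical, and the regular expression assumes the table layout that `cdparanoia -Q` prints for audio tracks (track number, length in sectors, starting sector), which should be checked against a real run. The `.xz` container written by Python's `lzma` module is also my assumption; the post only says "lzma".

```python
import lzma
import re
import shutil
from pathlib import Path

# Assumed TOC row layout, e.g.:
# "  1.    16503 [03:40.03]        0 [00:00.00]    no   no  2"
TOC_ROW = re.compile(r"^\s*(\d+)\.\s+(\d+)\s+\[[\d:.]+\]\s+(\d+)\s+\[")


def parse_audio_toc(text):
    """Return (track number, length in sectors, first sector) per TOC row."""
    rows = []
    for line in text.splitlines():
        match = TOC_ROW.match(line)
        if match:
            rows.append(tuple(int(group) for group in match.groups()))
    return rows


def disc_dir(root, title, disc=None):
    """'/game title' for single-disc games, '/game title/disc n' otherwise."""
    path = Path(root) / title
    return path if disc is None else path / f"disc {disc}"


def track_name(number, is_audio):
    """'track01.iso' for a data track, 'tracknn.m4a' for an audio track."""
    return f"track{number:02d}." + ("m4a" if is_audio else "iso")


def compress_data_track(src, dst):
    """Stream src into an .xz file at the highest (slowest) preset."""
    with open(src, "rb") as fin, lzma.open(dst, "wb", preset=9) as fout:
        shutil.copyfileobj(fin, fout)
```

As far as I know, cdparanoia writes its table of contents to stderr, so a wrapper would need to capture that stream rather than stdout before handing the text to `parse_audio_toc`.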

I am thinking that the automated Python script should maintain a centralized SQLite database at the root of the archive disc. This database should keep track of archived disc locations, the ripped tracks, and their before- and after-compression sizes. Further, it should store records of each file (with relative path) in the filesystem along with its timestamp and size. Also, the database needs to log information about errors encountered during the ripping process (i.e., store the stderr text from the data copy operation). Bonus points for running the Unix ‘file’ command against each file and also storing the output. This last bit of information would be useful for finding, e.g., all of the Smacker files in every game in the archive. There are still specialized game-related formats that the ‘file’ type database will not recognize. For this reason, it may also be useful to record the first, say, 32 bytes of a file in the central database as well for later identification. Maybe that’s overkill; I’m still tossing around ideas here.
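A rough sketch of what such a catalog might look like, using Python's built-in `sqlite3` module. The table and column names are placeholders of mine, not a settled design, and `describe` simply shells out to `file --brief` and gives up quietly if the command is missing.

```python
import sqlite3
import subprocess
from pathlib import Path

# Hypothetical schema: one row per disc, per ripped track, per file.
SCHEMA = """
CREATE TABLE IF NOT EXISTS discs (
    id INTEGER PRIMARY KEY,
    path TEXT UNIQUE NOT NULL,   -- location relative to the archive root
    rip_errors TEXT              -- stderr text from the data copy operation
);
CREATE TABLE IF NOT EXISTS tracks (
    disc_id INTEGER NOT NULL REFERENCES discs(id),
    number INTEGER NOT NULL,
    raw_size INTEGER,            -- bytes before compression
    compressed_size INTEGER,     -- bytes after compression
    PRIMARY KEY (disc_id, number)
);
CREATE TABLE IF NOT EXISTS files (
    disc_id INTEGER NOT NULL REFERENCES discs(id),
    path TEXT NOT NULL,          -- relative path inside the disc
    mtime REAL,
    size INTEGER,
    file_type TEXT,              -- output of the 'file' command, if any
    head BLOB,                   -- first 32 bytes, for later identification
    PRIMARY KEY (disc_id, path)
);
"""


def open_catalog(db_path):
    """Open (or create) the catalog database and make sure the tables exist."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def describe(path):
    """Return the 'file' command's description of path, or None on failure."""
    try:
        result = subprocess.run(["file", "--brief", str(path)],
                                capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def record_file(conn, disc_id, root, path):
    """Store path, timestamp, size, 'file' output, and the first 32 bytes."""
    path = Path(path)
    info = path.stat()
    with open(path, "rb") as handle:
        head = handle.read(32)
    conn.execute(
        "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?, ?)",
        (disc_id, str(path.relative_to(root)), info.st_mtime, info.st_size,
         describe(path), head))


def find_by_magic(conn, magic):
    """List stored paths whose leading bytes equal magic (a bytes value)."""
    rows = conn.execute(
        "SELECT path FROM files WHERE substr(head, 1, ?) = ? ORDER BY path",
        (len(magic), magic))
    return [row[0] for row in rows]
```

With the first bytes stored, hunting for Smacker files no longer depends on the ‘file’ database: Smacker files begin with the signature `SMK2` or `SMK4`, so `find_by_magic(conn, b"SMK")` would list them across every disc in the catalog.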

Open questions:

How do I rip a data track that is not the first track of an optical disc? My standard method is to use the Unix command ‘dd if=/dev/disc of=file.iso’ but that only works if the first track contains data. I have encountered at least one disc on which the data track isn’t first. Come to think of it, I guess this is standard for “enhanced” audio CDs and this particular demo disc might have fallen into that category.
How do I mount an ISO filesystem for browsing without needing root permissions? Since I want to put as much brainpower into this automated script as possible and I don’t want to run the script as root, it will be necessary to at least be able to get a listing of the directory structure, and preferable to mount the filesystem and run ‘file’ against each file on it. There is the option of mounting via the loopback device (maybe loosening the permissions of the loopback device will help). FUSE seems like an option, in theory, but the last time I checked, there were no ISO-9660 drivers for FUSE.

Archiving a bunch of old games seems so simple. Leave it to me to make it complicated.

9 thoughts on “Archivist’s Burden”

Reimar December 19, 2009 at 1:48 am

There is “fuseiso”, and it is even part of the latest Ubuntu version so it can’t be very obscure.
I have it installed because contrary to the in-kernel driver it also handles other image formats like NRG etc.
And for ripping, I’d suggest trying to automate one of the programs created for that task, e.g. k3b. I don’t know if/how easily any of those can be used from the command line but it might be possible.
Adam Ehlers Nyholm Thomsen December 19, 2009 at 2:02 am

What about copy protection schemes? Assuming said historians possess some kind of emulator and want to play the games?
lockecole2 December 19, 2009 at 7:16 am

I’d use cdrdao (http://cdrdao.sf.net) to copy the disc contents to your harddisk, then use bchunk (http://he.fi/bchunk/) to split the disc into tracks, which can then be individually processed. Both are easily automatable command-line programs.

The advantage to this is that cdrdao can also correctly copy (with the right mode setting) any Mode-2 tracks that exist, which I have seen with some Japanese games.

You may also want to check out http://redump.org for their recommendations, as they’re a bit crazier about “correct” dumping than you seem to be.
Andrew December 19, 2009 at 10:48 am

Just an FYI about the Mac’s insistence on putting crap files everywhere on any drive attached – while it still will do this, using a file called .metadata_never_index appears to stop the indexing and some other things it does (it still creates the “recycling” folders annoyingly though).

As far as I am aware the filesystem doesn’t matter. It’s done it to me on NTFS and FAT32 drives (never bothered with ext3, being a mainly Windows user – although maybe there’s a good enough ext3 driver for Windows. Shame about driver signing in 64 bit versions though, sigh).

Good info though, I’ll be interested when you get the Python scripts done. The only issue is always copy protection – most of the time, however, you can at least read the data, even if you then won’t be able to use the files on them without the original CDs (for Windows things). I’d also wonder if anyone has done any work on getting commercial DVD readers to read proprietary DVDs from older consoles.
Owen S December 19, 2009 at 11:21 am

I’d personally vote for FLAC for the audio, on the basis that it seems to be more popular (I have come across tracks in FLAC on the internet, but never ALAC).
Z.T. December 19, 2009 at 12:23 pm

Instead of regular lzma, for large files you should try lrzip (http://ck.kolivas.org/apps/lrzip/README.benchmarks)
Multimedia Mike (post author) December 19, 2009 at 2:10 pm

@Reimar: Thanks for the fuseiso tip. That might be just what I need.

@Adam: Honestly, copy protection doesn’t factor very heavily into my collection. I have played hundreds of these games and have rarely encountered copy protection schemes. Back in the old days, some developers used strange tricks to encode sectors on floppy disks such that they were difficult to copy back. That’s not especially common for CD-ROMs (if such techniques even exist).

@lockecole2: Thanks for the links. I am aware of redump; I remember reading it and thinking that they go to way more trouble than I want to.
Reimar December 20, 2009 at 2:28 am

I think copy protection isn’t an issue because most of those games are probably from the time frame where 600 MB of hard disk storage would cost a good deal more than the games did. If the copy in and of itself is more expensive than the original, a special copy protection is quite pointless.
After that, the next step of copy protection was to just link some data multiple times from the ISO header, I think e.g. the Tomb Raider CD thus claimed to contain almost 2 GB of data. I think that was about the time when CD burners became affordable but no really sophisticated software for them was available.
Multimedia Mike (post author) December 20, 2009 at 8:56 pm

@Reimar: The capacity thing was certainly true at the start of the CD-ROM era. Copying a whole disc (assuming the CD-ROM was filled) to an HD was wasteful and CD burners were prohibitively expensive. But I have a fair number of games from 2000 on which lack any form of copy protection. I’m pretty sure that the people who make, e.g., educational titles aren’t as worried about 0-day piracy as some of the hotter, more hyped titles.
