Pursuant to my project idea of archiving a bunch of obscure, old, CD-ROM-based games, I got an idea about possibly compressing ISO-9660 filesystems more efficiently than could be done with standard lossless compressors such as gzip, bzip2, and lzma. This is filed under “outlandish brainstorms” since I don’t intend to get around to doing this anytime in the near future, but I wanted to put the idea out there anyway.
Game developers throughout the era of optical disc-based games have been notoriously lazy about data efficiency. A typical manifestation of this is when one opens a disc to find dozens or hundreds of sizable, uncompressed WAV audio files. The same goes for uncompressed graphical assets. General lossless compressors are able to squeeze some bytes out of uncompressed audio and image data, but specialized algorithms (like FLAC for audio and PNG for images) perform better.
Here’s the pitch: Create a format that analyzes individual files in an ISO filesystem and applies the most appropriate compression algorithm to each. If there’s a 16-bit PCM WAV file, compress it with FLAC. If there’s an uncompressed BMP file, compress it as a PNG. Stuff them all into a new file with an index and a mapping back to their original files.
As an additional constraint, it obviously needs to be possible to reconstruct a bit-exact ISO from one of these compressed images. So the index will need to store information about what sector ranges different files occupied.
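To make that concrete, here is a minimal sketch of what one index record might contain; the struct and field names are hypothetical, not an existing format:

/* Hypothetical index record: ties one compressed payload back to the
   exact sector range it occupied in the original ISO-9660 image. */
#include <stdint.h>

enum payload_codec {
    CODEC_STORE = 0,   /* raw copy for data nothing else handles well */
    CODEC_ZLIB  = 1,   /* general-purpose fallback */
    CODEC_FLAC  = 2,   /* 16-bit PCM WAV payloads */
    CODEC_PNG   = 3    /* uncompressed BMP payloads */
};

struct index_record {
    uint32_t first_sector;    /* first 2048-byte sector in the original ISO */
    uint32_t sector_count;    /* how many sectors the file occupied */
    uint64_t payload_offset;  /* where the compressed bytes live in the archive */
    uint64_t payload_size;    /* compressed length in bytes */
    uint64_t original_size;   /* exact byte length (the last sector is usually padded) */
    uint8_t  codec;           /* one of payload_codec */
};

Walking the records in first_sector order and expanding each payload back into place would reproduce the original image byte for byte.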
The only other attempt I have seen at a specialized ISO compression format is CISO, used for compressing ISO filesystems for storage on flash memory to be read by PSP units. That format uses zlib to compress sector by sector, which has the advantage that a CISO image can be mounted and used in place without decompressing the entire filesystem. That should be possible for this format as well by using a FUSE driver: mount the compressed image, and the driver presents the file tree upon request and decompresses files on demand.
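For the mount-in-place idea, the driver would only need a handful of FUSE callbacks. Here is a bare skeleton against the FUSE 2.x API, with the actual archive lookups left as hypothetical stubs:

/* Skeleton of an on-demand decompression driver.  Only the FUSE plumbing
   is real; the archive index lookups are deliberately left as stubs. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <sys/stat.h>
#include <string.h>
#include <errno.h>

static int iso_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
        return 0;
    }
    /* real driver: look the path up in the archive index */
    return -ENOENT;
}

static int iso_read(const char *path, char *buf, size_t size,
                    off_t offset, struct fuse_file_info *fi)
{
    (void)path; (void)buf; (void)size; (void)offset; (void)fi;
    /* real driver: find the record for this path, run the matching
       decompressor (FLAC/PNG/zlib), and copy out the requested range */
    return -ENOENT;
}

static struct fuse_operations iso_ops = {
    .getattr = iso_getattr,
    .read    = iso_read,
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &iso_ops, NULL);
}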
Many compression algorithms have assorted methods of operation depending on what is appropriate for a particular range of data. This scheme is merely an extension of that principle. I have no idea if this idea would hold water in the general case. But thanks to my archiving brainstorm, I expect I will have quite a lot of data to analyze.
See Also:
- Archivist’s Burden, where I laid out the goals for archiving these old games
- CISO Technology
It’s worth pointing out that this could work and be valuable for *any* archive format or logical filesystem image*, not just ISO-9660.
(Entering my own outlandish brainstorm)
In some circumstances, it might be worthwhile to test for existing archive formats within the image, decompress them, examine their contents, decompress further where possible, check for duplicates, hardlink (a tiny sketch of that step follows below), and recompress.
That would work particularly well for things such as backup discs and archives with huge numbers of redundant files, like curate-and-sell software collections on CD.
* A dd copy of /dev/sda1 might not have the same punch, as you’ll have a lot of noise between files. Though a tool to make a logical copy of an existing filesystem using a filesystem like ext4 or xfs might be interesting in its own way. Set in place, expand, chroot and run? Sounds like it would work better than mkfs, cp and reboot.
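A tiny sketch of the hardlinking step mentioned above, assuming duplicate candidates have already been found by some other means (matching sizes or hashes); the program is purely illustrative:

/* Compare two files byte for byte; if identical, replace the second
   with a hard link to the first.  Duplicate detection is assumed to
   have happened elsewhere. */
#include <stdio.h>
#include <unistd.h>

static int files_identical(const char *a, const char *b)
{
    FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");
    int same = 0, ca, cb;
    if (fa && fb) {
        do {
            ca = fgetc(fa);
            cb = fgetc(fb);
        } while (ca == cb && ca != EOF);
        same = (ca == cb);
    }
    if (fa) fclose(fa);
    if (fb) fclose(fb);
    return same;
}

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s keep-file duplicate-file\n", argv[0]);
        return 1;
    }
    if (files_identical(argv[1], argv[2]) &&
        unlink(argv[2]) == 0 && link(argv[1], argv[2]) == 0)
        printf("hardlinked %s -> %s\n", argv[2], argv[1]);
    return 0;
}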
People have tried this before. You end up with Yet Another Custom Compression Format and marginally smaller files. Seriously, 1 TB USB drives are like $100! Unless this is a lot of fun, your time can’t be worth too much.
Practicality has little bearing on what I brainstorm about on this blog. :-)
I always enjoy reading your posts, Mike; they consistently appeal to my inner geek. The idea itself is still relevant, because the same reasoning can be applied to compressing DVDs and Blu-rays.
Well, given unlimited processing power, you could just try every possible combination of compressing sectors with FLAC, gzip, bzip2, …
However, gzip usually works fairly well for both binaries (usually better than bzip2 here, at least when I last tried) and easily compressible stuff.
And I’m not sure whether using FLAC/ALAC gives any relevant compression advantage; IMO the main advantage is that the files remain directly playable.
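As a rough sketch of that brute-force idea, one could run each chunk through a few general-purpose codecs and keep whichever output is smallest. Only zlib and bzip2 are shown here (FLAC and friends would need format detection on top), and the chunk size is an arbitrary choice; build with -lz -lbz2:

/* Compress each chunk of stdin with zlib and bzip2, report the winner. */
#include <stdio.h>
#include <zlib.h>
#include <bzlib.h>

enum { CHUNK = 64 * 1024 };

static unsigned long try_zlib(const unsigned char *in, unsigned long in_len,
                              unsigned char *out, unsigned long out_cap)
{
    unsigned long out_len = out_cap;
    if (compress2(out, &out_len, in, in_len, 9) != Z_OK)
        return 0;
    return out_len;
}

static unsigned long try_bzip2(const unsigned char *in, unsigned long in_len,
                               unsigned char *out, unsigned long out_cap)
{
    unsigned int out_len = (unsigned int)out_cap;
    if (BZ2_bzBuffToBuffCompress((char *)out, &out_len,
                                 (char *)in, (unsigned int)in_len,
                                 9, 0, 0) != BZ_OK)
        return 0;
    return out_len;
}

int main(void)
{
    static unsigned char in[CHUNK], z[2 * CHUNK], bz[2 * CHUNK];
    size_t n;
    while ((n = fread(in, 1, CHUNK, stdin)) > 0) {
        unsigned long zs  = try_zlib(in, n, z, sizeof(z));
        unsigned long bzs = try_bzip2(in, n, bz, sizeof(bz));
        const char *winner = "store";
        unsigned long best = n;
        if (zs  && zs  < best) { best = zs;  winner = "zlib";  }
        if (bzs && bzs < best) { best = bzs; winner = "bzip2"; }
        printf("chunk of %zu bytes: best = %s (%lu bytes)\n", n, winner, best);
    }
    return 0;
}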
StuffIt does very aggressive compression by identifying specific file formats, including JPEG and other multimedia types. It is a little expensive, but it can be worth your time and spare you the enormous headache of writing that type of software.
@Reimar: Check out our repository of lossless audio samples: http://samples.mplayerhq.hu/A-codecs/lossless/
All of the specialized lossless audio methods significantly outperform the general-purpose bzip2 algorithm for this 1-minute CD audio sample. Then again, there are a lot of games that have a series of rather short sound effects stored as uncompressed WAV files. Using a lossless coding algorithm on these files could hit the point of diminishing returns in a hurry.
Of course, this comparison somewhat hinges on the assumption that the audio data is 16-bit in resolution, which won’t be true for older games. I wonder if there are better ideas for compressing 8-bit PCM data? Maybe a Huffman-coded DPCM technique (similar to what Smacker audio does).
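As a quick way to gauge that idea, here is a small hypothetical filter that runs the DPCM step over raw 8-bit PCM samples read from stdin (it does not parse WAV headers) and reports the zeroth-order entropy of the deltas, i.e. roughly what a Huffman coder over those symbols could hope for; build with -lm:

/* DPCM front end for 8-bit PCM: predict each sample as the previous one,
   record the wrapped difference, and measure how skewed the deltas are.
   A Huffman coder over these delta symbols (not shown) would be the next
   stage, similar in spirit to Smacker audio. */
#include <stdio.h>
#include <stdint.h>
#include <math.h>

int main(void)
{
    uint64_t hist[256] = { 0 };
    uint64_t total = 0;
    int c, prev = 128;                       /* 8-bit PCM is unsigned, centered at 128 */

    while ((c = getchar()) != EOF) {
        uint8_t delta = (uint8_t)(c - prev); /* difference wraps mod 256 */
        hist[delta]++;
        total++;
        prev = c;
    }
    if (total == 0)
        return 0;

    double entropy = 0.0;                    /* bits per delta symbol */
    for (int i = 0; i < 256; i++) {
        if (hist[i]) {
            double p = (double)hist[i] / (double)total;
            entropy -= p * log2(p);
        }
    }
    printf("%llu samples, delta entropy %.2f bits/sample (vs. 8 bits raw)\n",
           (unsigned long long)total, entropy);
    return 0;
}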
I was recently messing around with MAME, and noticed they have their own format for storing compressed, random-access CD and HD images. It has the extension .chd, but I don’t know much more beyond that.
@Michael: Interesting stuff. I see that the relevant files are in tools/chdch.[c|h] as well as lib/util/chd.[c|h]. chd.h lists these compression types:
/* compression types */
#define CHDCOMPRESSION_NONE 0
#define CHDCOMPRESSION_ZLIB 1
#define CHDCOMPRESSION_ZLIB_PLUS 2
#define CHDCOMPRESSION_AV 3
chd.c maps _ZLIB and _ZLIB_PLUS to the standard zlib libraries. _AV maps to an internal compression format that is described in the comments of avcomp.c. Huffman figures heavily, but there’s not much else. The comments indicate that the author attempted some DCTs and prediction methods but didn’t see any improvement.
You could just compress the images with a paq8 variant. I just compressed the luckynight.wav test file with paq8p level 8 and the result was ~6.2 MB (lossless, of course).
The only drawback is that compression (and also decompression due to the way the algorithm works) is slooow and quite memory-intensive.
Meanwhile, 6.2 MB is on par with nearly every other lossless audio compressor, and those are also very fast.
So many trade-offs to consider.