
RAR Is Still A Contender

RAR (Roshal ARchive) is still a popular format in some corners of the internet. In fact, I procured a set of nearly 1500 RAR files that I want to use in a little project. But I didn’t want my program to have to operate directly on the RAR files, which meant that I would need to recompress them to another format. Surely, one of the usual lossless compressors commonplace on Linux these days would perform better. Probably not gzip. Maybe not bzip2 either. Perhaps xz, though?

Conclusion
At first, I concluded that xz beat RAR on every single file in the corpus. But then I studied the comparison again and realized it wasn’t quite apples to apples. So I designed a new experiment.

New conclusion: RAR still beats xz on every sample in this corpus (for the record, the data could be described as executable program data mixed with reduced-quality PCM audio samples).

Methodology
My experiment involved first reprocessing each archive into a new resource archive file format and then compressing only that single file (rather than a set of files) using gzip, bzip2, xz, and rar at their maximum compression settings.

echo filesize,gzip,bzip2,xz,rar,filename > compressed-sizes.csv
# iterate over the corpus; quote paths in case they contain spaces
for f in /path/to/files/*
do
  gzip -9 --stdout "$f" > out.gz
  bzip2 -9 --stdout "$f" > out.bz2
  xz -9 --stdout --check=crc32 "$f" > out.xz
  rar a -m5 out.rar "$f"
  # record sizes in the same order as the CSV header: original, gzip, bzip2, xz, rar
  stat --printf "%s," "$f" out.gz out.bz2 out.xz out.rar >> compressed-sizes.csv
  echo "$f" >> compressed-sizes.csv
  rm -f out.gz out.bz2 out.xz out.rar
done

Note that xz gets the option '--check=crc32' since I’m using the XZ Embedded library, which requires it. It really doesn’t make a huge difference in file size.
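
For reference, here is a minimal sketch of what the decompression side might look like with XZ Embedded’s single-call mode; the function name and buffer handling are my own invention, not the project’s actual integration code, and error handling is abbreviated.

[c]
/* Sketch: decompress an in-memory .xz stream (CRC32 check) with XZ Embedded. */
#include <stddef.h>
#include <stdint.h>
#include "xz.h"

int decompress_xz(const uint8_t *in, size_t in_size,
                  uint8_t *out, size_t out_size)
{
    struct xz_buf b;
    struct xz_dec *s;
    enum xz_ret ret;

    xz_crc32_init();                /* build the internal CRC32 table once */
    s = xz_dec_init(XZ_SINGLE, 0);  /* single-call mode; dict size ignored */
    if (!s)
        return -1;

    b.in = in;    b.in_pos = 0;  b.in_size = in_size;
    b.out = out;  b.out_pos = 0; b.out_size = out_size;

    ret = xz_dec_run(s, &b);        /* decode the whole stream in one shot */
    xz_dec_end(s);

    return ret == XZ_STREAM_END ? (int)b.out_pos : -1;
}
[/c]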

Experimental Results
The preceding command line generates compressed-sizes.csv, which goes into a Google Spreadsheet (export as CSV).

Here are the full results of the bake-off, graphed:

[graph: compressed sizes for all four formats, plus the original file size, across the corpus]

That’s not especially useful. Here are the top 2 contenders, xz and RAR, compared directly:

[graph: xz vs. RAR compressed sizes]

Action
Obviously, I’m unmoved by the data. There is no way I’m leaving these files in their RAR form for this project, marginal space and bandwidth savings be darned. There are other trade-offs in play here. I know there is free source code available for decompressing RAR files, but its license wouldn’t mesh well with the GPL source code libraries that form the core of the same project. Plus, the XZ Embedded code is already integrated and painstakingly debugged.

During this little exercise, I learned of a little site called Maximum Compression, which takes experiments like the foregoing to their logical conclusion by comparing over 200 compression programs on a standard data corpus. According to the site’s summary page, a program called PAQ8PX posts the best overall scores.

XSPF And XML

Every so often, it’s a good idea to surf over to Xiph’s site to see if they have absorbed any other well-meaning multimedia-related free software projects. I’m not sure if XSPF started out as a separate effort, but it’s underneath the Xiph umbrella now. The project is billed as “the XML format for sharing playlists.” Yippee. To continue the “those who can, do…” series: Those who can, do; those who can’t, create metadata formats. Anyway, all the buzzwords are there: XML, open, portable, simple, XML. I’m surprised that it’s not an RFC yet (that I could find). I’m sure it’s only a matter of time. Going forward, all of the free multimedia players will be morally obligated to support XSPF. Be advised.
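
For the curious, a minimal XSPF playlist looks something like this (the track location and title are made-up examples; the structure follows the spec at xspf.org):

[xml]
<?xml version="1.0" encoding="UTF-8"?>
<playlist version="1" xmlns="http://xspf.org/ns/0/">
  <trackList>
    <track>
      <location>file:///music/example.ogg</location>
      <title>Example Track</title>
    </track>
  </trackList>
</playlist>
[/xml]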

Maybe I’m just irritated because XSPF is supposed to be pronounced “spiff” which, to me, defiles the memory of Calvin.

I think those 3 letters — XML — put me off of this idea the most. Every now and then, I have entertained the idea of using XML to store or transport data for my own programs. But then I realize that I may as well just use an arbitrary binary format that is easier to parse. After all, isn’t XML just an arbitrary textual format? Actually, no. Arbitrary textual data would be easier to parse (e.g., records of data separated by carriage returns with individual fields separated by commas or some other character guaranteed not to occur in the regular data; i.e., CSV). XML requires strict structure around the arbitrary textual data.
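
To illustrate the point, splitting a CSV record (this one is invented, shaped like the experiment’s output from earlier) requires nothing more than a tokenizer:

[c]
/* Toy example: one CSV record, parsed with strtok and nothing else. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char record[] = "1048576,231455,215002,199870,198761,sample-0001.dat";

    for (char *field = strtok(record, ","); field != NULL;
         field = strtok(NULL, ","))
        printf("field: %s\n", field);

    return 0;
}
[/c]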

As my esteemed multimedia hacking colleague, Attila Kinali, once articulated, “if you really think that XML is the answer, then you definitely misunderstood the question.” (Attila: Michael and I and the rest of the gang are going to make sure that quote is what you’re best known for.)

I think that XML is intuitively antithetical in the mind of the average multimedia hacker. Such an individual instinctively attempts to encode things with as few bits as possible for the express purpose of making transport or storage of that data more efficient. XML explicitly defies that notion by representing information with far more bits than are necessary; a value that a binary bitstream might pack into a few bits balloons into a full textual element such as '<coefficient>-512</coefficient>'.

Suddenly, I find myself wondering about representing DCT coefficient data using an XML schema — why not express a JPEG as a human-readable XML file?

[xml]
<?xml version="1.0"?>
<!-- sketch only: element names and coefficient values are illustrative -->
<jpeg width="16" height="16">
  <macroblock x="0" y="0">
    <component name="luma">
      <block index="0">
        <dc>-512</dc>
        <ac run="0" level="12"/>
        <ac run="3" level="-1"/>
        <eob/>
      </block>
    </component>
  </macroblock>
</jpeg>
[/xml]

Don’t laugh — it would be extensible. Someone could, for example, add markup to individual macroblocks. Is it any more outlandish than, say, specifying vector drawings as XML?