RAR (Roshal ARchive) is still a popular format in some corners of the internet. In fact, I procured a set of nearly 1500 RAR files that I want to use in a little project. But I didn’t want my program to have to operate directly on the RAR files, which meant that I would need to recompress them to another format. Surely one of the usual lossless compressors commonplace on Linux these days would perform better. Probably not gzip. Maybe not bzip2 either. Perhaps xz, though?
At first, I concluded that xz beat RAR on every single file in the corpus. But then I studied the comparison again and realized it wasn’t quite apples to apples. So I designed a new experiment.
New conclusion: RAR still beats xz on every sample in this corpus (for the record, the data could be described as executable program data mixed with reduced-quality PCM audio samples).
My experiment involved first reprocessing each archive into a new resource archive file format, then compressing only that single file (rather than a set of files) using gzip, bzip2, xz, and rar at their maximum compression settings.
    echo filesize,gzip,bzip2,xz,rar,filename > compressed-sizes.csv
    for f in /path/to/files/*
    do
        gzip -9 --stdout "$f" > out.gz
        bzip2 -9 --stdout "$f" > out.bz2
        xz -9 --stdout --check=crc32 "$f" > out.xz
        rar a -m5 out.rar "$f"
        # print the sizes in the same order as the CSV header (xz before rar)
        stat --printf "%s," "$f" out.gz out.bz2 out.xz out.rar >> compressed-sizes.csv
        echo "$f" >> compressed-sizes.csv
        rm -f out.gz out.bz2 out.xz out.rar
    done
Note that xz gets the option '--check=crc32' since I’m using the XZ Embedded library, which requires it. It really doesn’t make a huge difference in filesize.
Here are the full results of the bake-off, graphed:
That’s not especially useful. Here are the top 2 contenders compared directly:
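The same head-to-head can also be tallied straight from the CSV. What follows is only a rough sketch: count_wins is a hypothetical helper name of my own, and it assumes the columns really are in the order the header line gives (filesize,gzip,bzip2,xz,rar,filename), so that xz is field 4 and rar is field 5.

```shell
# count_wins: hypothetical helper, not part of the original experiment.
# Assumes the CSV columns are filesize,gzip,bzip2,xz,rar,filename.
count_wins() {
    # Skip the header row, then compare the xz ($4) and rar ($5) sizes.
    awk -F, 'NR > 1 {
        if ($5 < $4) rar++
        else if ($4 < $5) xz++
        else tie++
    } END {
        printf "rar=%d xz=%d tie=%d\n", rar, xz, tie
    }' "$1"
}
```

Running `count_wins compressed-sizes.csv` prints one line with the number of files on which each compressor produced the smaller output.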
Obviously, I’m unmoved by the data. There is no way I’m leaving these files in their RAR form for this project, marginal space and bandwidth savings be darned. There are other trade-offs in play here. I know there is free source code available for decompressing RAR files but the license wouldn’t mesh well with GPL source code libraries that form the core of the same project. Plus, the XZ Embedded code is already integrated and painstakingly debugged.
During this little exercise, I learned of a little site called Maximum Compression which takes experiments like the foregoing to their logical conclusion by comparing over 200 compression programs on a standard data corpus. According to the site’s summary page, there’s a library called PAQ8PX which posts the best overall scores.