RAR (Roshal ARchive) is still a popular format in some corners of the internet. In fact, I procured a set of nearly 1500 RAR files that I want to use in a little project. But I didn’t want my program to have to operate directly on the RAR files, which meant that I would need to recompress them to another format. Surely, one of the usual lossless compressors commonplace on Linux these days would perform better. Probably not gzip. Maybe not bzip2 either. Perhaps xz, though?
At first, I concluded that xz beat RAR on every single file in the corpus. But then I studied the comparison again and realized it wasn’t quite apples to apples. So I designed a new experiment.
New conclusion: RAR still beats xz on every sample in this corpus (for the record, the data could be described as executable program data mixed with reduced quality PCM audio samples).
My experiment involved first reprocessing the archive files into a new resource archive file format and only compressing that file (rather than a set of files) using gzip, bzip2, xz, and rar at the maximum compression settings.
echo filesize,gzip,bzip2,xz,rar,filename > compressed-sizes.csv
for f in /path/to/files/*
do
    # compress the same input with each tool at its maximum setting
    gzip -9 --stdout "$f" > out.gz
    bzip2 -9 --stdout "$f" > out.bz2
    xz -9 --stdout --check=crc32 "$f" > out.xz
    rar a -m5 out.rar "$f"
    # sizes are listed in the same column order as the header: xz before rar
    stat --printf "%s," "$f" out.gz out.bz2 out.xz out.rar >> compressed-sizes.csv
    echo "$f" >> compressed-sizes.csv
    rm -f out.gz out.bz2 out.xz out.rar
done
Note that xz gets the option '--check=crc32' since I’m using the XZ Embedded library which requires it. It really doesn’t make a huge difference in filesize.
The preceding command line generates compressed-sizes.csv which goes into a Google Spreadsheet (export as CSV).
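As an alternative to eyeballing the spreadsheet, the xz versus rar head-to-head can be tallied straight from the CSV. This is only a minimal sketch: it assumes the columns really are in the order the header line declares (filesize,gzip,bzip2,xz,rar,filename) and that no filename contains a comma. The function name tally_xz_vs_rar is made up for illustration.

```shell
# Count, per file, which of xz and rar produced the smaller output.
# Assumes columns: filesize,gzip,bzip2,xz,rar,filename (header on line 1).
tally_xz_vs_rar() {
    awk -F, '
        NR == 1  { next }        # skip the header row
        NF < 6   { next }        # skip blank or truncated rows
        $4 <  $5 { xz++ }        # xz output is smaller
        $5 <  $4 { rar++ }       # rar output is smaller
        $4 == $5 { tie++ }
        END { printf "xz=%d rar=%d tie=%d\n", xz, rar, tie }
    ' "$1"
}
```

It would be run as: tally_xz_vs_rar compressed-sizes.csv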
Here are the full results of the bake-off, graphed:
That’s not especially useful. Here are the top 2 contenders compared directly:
Obviously, I’m unmoved by the data. There is no way I’m leaving these files in their RAR form for this project, marginal space and bandwidth savings be darned. There are other trade-offs in play here. I know there is free source code available for decompressing RAR files but the license wouldn’t mesh well with GPL source code libraries that form the core of the same project. Plus, the XZ Embedded code is already integrated and painstakingly debugged.
During this little exercise, I learned of a little site called Maximum Compression which takes experiments like the foregoing to their logical conclusion by comparing over 200 compression programs on a standard data corpus. According to the site’s summary page, there’s a library called PAQ8PX which posts the best overall scores.
1) RAR can employ two different compression methods: LZ77-derived (with optional multimedia preprocessing too!) and a PPM variant. XZ and the like use just LZMA (LZ77-derived with range coding, a form of arithmetic coding).
2) I’d pick .xz also for another reason – decompression speed is much higher for it as I remember. Disk space is cheap nowadays anyway.
I’d actually compare it all to lossless audio codecs :)
Everybody writes their own compression scheme based on a limited set of known methods, complete with its own container (archive) format. And we have things like FLAC (slow compression, fast decompression – like deflate or LZMA) and APE (slow compression, slow decompression – like most of the modeling methods in PAQ).
You can try passing the -e option to xz; this enables the “extreme mode”.
ZPAQ is superior to PAQ8 (http://mattmahoney.net/dc/zpaq.html)
LRzip can use zpaq (http://ck.kolivas.org/apps/lrzip/)
As mentioned, “extreme” mode has a better compression ratio.
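For anyone who wants to check the extreme-mode suggestion against their own files, a rough sketch follows. The -e flag is a real xz option; the two helper names are made up here, and nothing is implied about how the corpus from the post would fare.

```shell
# Print the byte count produced by piping FILE through a compressor.
# Usage: compressed_size FILE COMMAND [ARGS...]
compressed_size() {
    file=$1
    shift
    "$@" < "$file" | wc -c | tr -d ' '
}

# Print "<plain> <extreme>": output sizes for xz -9 and xz -9 -e on one file.
xz_extreme_compare() {
    plain=$(compressed_size "$1" xz -9 --stdout --check=crc32)
    extreme=$(compressed_size "$1" xz -9 -e --stdout --check=crc32)
    echo "$plain $extreme"
}
```

Calling xz_extreme_compare on one of the input files prints the two sizes side by side. Extreme mode takes longer to compress, so it is worth timing on a few samples before running it over the whole set.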
The Unarchiver ( http://unarchiver.c3.cx/ ) has LGPL 2.1+ Objective-C code and utilities for RAR unpacking and I’d recommend it over the non-free UnRar. But of course avoiding RAR is the best choice.
From the spreadsheet, the data seems still to be compressible, so it may not be a matter of better handling the incompressible parts, as could be the case in LZ77 vs LZMA2.
On the contrary, some files have their size divided by more than 10 after compression, which makes them very compressible. In such a situation (mostly with text files), I observed that PPMd algorithm in 7zip is better than LZMA.
Also, 7zip includes a BCJ transform improving executable compression. I don’t know if it is included in the LZMA SDK (provided by 7zip and on which xz relies), but I would think 7zip performs better. I would however agree it is not as ubiquitous, although most linux distros provide an implementation.
But at least that would explain some of xz’s shortfalls relative to rar (and 7zip).
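For what it is worth, the xz command-line tool does expose an x86 BCJ filter (--x86), which has to be followed by an explicit LZMA2 filter. Below is a hedged sketch of trying it. The function name is made up, it assumes the decoder in use was built with BCJ support, and it makes no claim about whether the filter helps on data that mixes executable code with PCM audio.

```shell
# Compress FILE to stdout through an x86 BCJ filter followed by LZMA2.
# A custom filter chain replaces the -0..-9 presets, so the preset is
# handed to the LZMA2 filter directly.
xz_bcj_x86() {
    xz --stdout --check=crc32 --x86 --lzma2=preset=9 "$1"
}
```

Something like xz_bcj_x86 somefile > out.xz would slot into the loop from the post in place of the plain xz -9 line.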
Rar supports a “solid” mode which treats everything as one large stream to maximize dictionary hits, and sorts by content. Try running your tests again with -s and see how much that improves things.
The Windows version has content filtering (i.e. PCM and WAV files are compressed by their deltas and not absolute sample values) so you might want to try a further test with that. I don’t know if the unix variants have that.
I still use rar for my personal archiving because it tacks on parity data that can be used to reconstruct damaged bits. I know that I can use par2 with 7zip archives but that’s an extra step that I don’t want to be bothered with.
7-zip/xz can DRASTICALLY outperform rar if no content-type filtering is required and you bump up the dictionary drastically. I’ve seen real improvements going from a 64M dictionary to a 1G dictionary (although this requires 10G of RAM during compression, and also requires that your audience has 1G of RAM minimum available for decompression). Rar is limited to a 4M dictionary for backward compatibility reasons, something I wish they would address.
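With xz, the dictionary size is an option of the LZMA2 filter, so the effect described here can be tried directly. A small sketch with a made-up function name; the sizes are only examples. Note that xz -9 already uses a 64 MiB dictionary, and a dictionary larger than the input file gains nothing, so this only matters for big inputs.

```shell
# Compress FILE to stdout with preset 9 and an overridden dictionary size.
# DICT uses xz's size syntax, e.g. 64MiB or 1GiB.
# Usage: xz_with_dict FILE DICT
xz_with_dict() {
    xz --stdout --check=crc32 --lzma2=preset=9,dict="$2" "$1"
}
```

For example, xz_with_dict somefile 1GiB > out.xz. The decompressor has to be allowed at least as much dictionary memory as was used for compression, which matters for an embedded decoder.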
I think that 7-zip has solid mode too.
ZPAQ is not a compressor but a description of compressors (e.g. you can use it to describe the various PAQ8 algorithms).
If you want to make 100 easy bucks there is also http://marknelson.us/2006/06/20/million-digit-challenge/ (mark nelson’s website has some useful resources for compression)
Hi, nice comparison but maybe testing LRzip can improve things further.
This work, http://repo-ck.com/bench/lrzip_comparison_to_xz.pdf, compared lrzip to xz and concluded: “On the whole, compression/decompression rates are faster for lrzip.”