Category Archives: Science Projects

RAR Is Still A Contender

RAR (Roshal ARchive) is still a popular format in some corners of the internet. In fact, I procured a set of nearly 1500 RAR files that I want to use in a little project. But I didn’t want my program to have to operate directly on the RAR files, which meant I would need to recompress them into another format. Surely, one of the usual lossless compressors commonplace on Linux these days would perform better. Probably not gzip. Maybe not bzip2 either. Perhaps xz, though?

Conclusion
At first, I concluded that xz beat RAR on every single file in the corpus. But then I studied the comparison again and realized it wasn’t quite apples to apples. So I designed a new experiment.

New conclusion: RAR still beats xz on every sample in this corpus (for the record, the data could be described as executable program data mixed with reduced quality PCM audio samples).

Methodology
My experiment involved first reprocessing the archive files into a new resource archive file format and then compressing only that single file (rather than a set of files) using gzip, bzip2, xz, and rar at their maximum compression settings:

echo filesize,gzip,bzip2,xz,rar,filename > compressed-sizes.csv
for f in /path/to/files/*
do
  gzip -9 --stdout "$f" > out.gz
  bzip2 -9 --stdout "$f" > out.bz2
  xz -9 --stdout --check=crc32 "$f" > out.xz
  rar a -m5 out.rar "$f"
  # column order must match the CSV header written above
  stat --printf "%s," "$f" out.gz out.bz2 out.xz out.rar >> compressed-sizes.csv
  echo "$f" >> compressed-sizes.csv
  rm -f out.gz out.bz2 out.xz out.rar
done

Note that xz gets the option '--check=crc32' since I’m using the XZ Embedded library, which requires it. It really doesn’t make a huge difference in file size.
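Incidentally, the same check setting is reachable from Python’s lzma module; here is a little sketch (not part of the experiment above, and the input path is just an example) showing how to produce equivalent xz data with a CRC32 check at the maximum preset:

import lzma

# Compress one file at the maximum preset with a CRC32 integrity check,
# mirroring 'xz -9 --check=crc32'; the input path is a hypothetical example.
with open("/path/to/files/sample.bin", "rb") as f:
    data = f.read()

compressed = lzma.compress(data,
                           format=lzma.FORMAT_XZ,
                           check=lzma.CHECK_CRC32,
                           preset=9)
print(len(data), "->", len(compressed), "bytes")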

Experimental Results
The preceding command line generates compressed-sizes.csv which goes into a Google Spreadsheet (export as CSV).
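If you just want the head-to-head numbers without a spreadsheet, a few lines of Python can tally them; this is only a sketch and assumes the column layout written by the shell loop above:

import csv

# Count how often rar beats xz, and by how much, using compressed-sizes.csv
# (column names come from the header written by the shell loop above).
wins = 0
ratios = []
with open("compressed-sizes.csv", newline="") as f:
    for row in csv.DictReader(f):
        xz_size = int(row["xz"])
        rar_size = int(row["rar"])
        if rar_size < xz_size:
            wins += 1
        ratios.append(rar_size / xz_size)

print("rar wins:", wins, "of", len(ratios))
print("average rar/xz size ratio:", sum(ratios) / len(ratios))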

Here are the full results of the bake-off, graphed:



That’s not especially useful. Here are the top 2 contenders compared directly:



Action
Obviously, I’m unmoved by the data. There is no way I’m leaving these files in their RAR form for this project, marginal space and bandwidth savings be darned. There are other trade-offs in play here. I know there is free source code available for decompressing RAR files but the license wouldn’t mesh well with GPL source code libraries that form the core of the same project. Plus, the XZ Embedded code is already integrated and painstakingly debugged.

During this little exercise, I learned of a little site called Maximum Compression which takes experiments like the foregoing to their logical conclusion by comparing over 200 compression programs on a standard data corpus. According to the site’s summary page, a program called PAQ8PX posts the best overall scores.

CD-R Read Speed Experiments

I want to know how fast I can really read data from a CD-R. Pursuant to my previous musings on this subject, I was informed that it is inadequate to profile reading just any file from a CD-R since data might be read faster or slower depending on whether the data is closer to the inside or the outside of the disc.

Conclusion / Executive Summary
It is 100% true that reading data from the outside of a CD-R is faster than reading data from the inside. Read on if you care to know the details of how I arrived at this conclusion, and to find out just how much speed advantage there is to reading from the outside rather than the inside.

Science Project Outline

  • Create some sample CD-Rs with various properties
  • Get a variety of optical drives
  • Write a custom program that profiles the read speed

Creating The Test Media
It’s my understanding that not all CD-Rs are created equal. Fortunately, I have 3 spindles of media handy: Some plain-looking Memorex discs, some rather flamboyant Maxell discs, and those 80mm TDK discs:



My approach for burning is to create a single file to be burned into a standard ISO-9660 filesystem. The size of the file will be the advertised capacity of the CD-R minus 1 megabyte for overhead: 699 MB for the 120mm discs, 209 MB for the 80mm disc. The file will contain a repeating sequence of 0..0xFF bytes.
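Generating the payload file is simple; here is a rough Python sketch (the output name and size are illustrative, not the exact command I used):

# Write a file consisting of the repeating byte sequence 0x00..0xFF.
# The size (699 MB for a 120mm disc) and the output name are examples.
SIZE_MB = 699
pattern = bytes(range(256)) * 4096    # exactly 1 megabyte of the repeating sequence

with open("testfile.bin", "wb") as out:
    for _ in range(SIZE_MB):
        out.write(pattern)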

Profiling
I don’t want to leave this to the vagaries of any filesystem handling layer so I will conduct this experiment at the sector level. Profiling program outline:

  • Read the CD-ROM TOC and get the number of sectors that comprise the data track
  • Profile reading the first 20 MB of sectors
  • Profile reading 20 MB of sectors in the middle of the track
  • Profile reading the last 20 MB of sectors

Unfortunately, I couldn’t figure out raw sector reading on modern Linux incarnations (which is annoying since I remember it being pretty straightforward years ago). So I left it to the filesystem after all. New algorithm (a rough code sketch follows the list):

  • Open the single, large file on the CD-R and query the file length
  • Profile reading the first 20 MB of data, 512 kbytes at a time
  • Profile reading 20 MB of data in the middle of the file (starting from filesize / 2 - 10 MB), 512 kbytes at a time
  • Profile reading the last 20 MB of data (starting from filesize - 20 MB), 512 kbytes at a time
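Here is a minimal Python sketch of that algorithm; it is not the actual profile-cdr.c program linked at the end of this post, and the mount path is hypothetical:

import time

CHUNK = 512 * 1024            # read 512 kbytes at a time
SPAN = 20 * 1024 * 1024       # profile 20 MB per area
FILENAME = "/mnt/cdrom/testfile.bin"   # hypothetical mount point for the disc

def profile(f, offset):
    # Read SPAN bytes starting at offset; return the rate in kbytes/sec.
    f.seek(offset)
    start = time.time()
    remaining = SPAN
    while remaining > 0:
        data = f.read(min(CHUNK, remaining))
        if not data:
            break
        remaining -= len(data)
    elapsed = time.time() - start
    return (SPAN - remaining) / 1024 / elapsed

with open(FILENAME, "rb") as f:
    f.seek(0, 2)               # query the file length
    filesize = f.tell()
    print("inner: ", profile(f, 0))
    print("middle:", profile(f, filesize // 2 - SPAN // 2))
    print("outer: ", profile(f, filesize - SPAN))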

Empirical Data
I tested the program in Linux using an LG Slim external multi-drive (seen at the top of the pile in this post) and one of my Sega Dreamcast units. I gathered the median value of 3 runs for each area (inner, middle, and outer). I also conducted a buffer flush in between Linux runs (as root: 'sync; echo 3 > /proc/sys/vm/drop_caches').

LG Slim external multi-drive (reading from inner, middle, and outer areas in kbytes/sec):

  • TDK-80mm: 721, 897, 1048
  • Memorex-120mm: 1601, 2805, 3623
  • Maxell-120mm: 1660, 2806, 3624

So the 120mm discs can range from about 10.5X all the way up to a full 24X on this drive. For whatever reason, the 80mm disc fares a bit worse — even at the inner track — with a range of 4.8X – 7X.

Sega Dreamcast (reading from inner, middle, and outer areas in kbytes/sec):

  • TDK-80mm: 502, 632, 749
  • Memorex-120mm: 499, 889, 1143
  • Maxell-120mm: 500, 890, 1156

It’s interesting that the 80mm disc performed comparably to the 120mm discs in the Dreamcast, in contrast to the LG Slim drive. Also, the results are consistent with my previous profiling experiments, which largely only touched the inner area. The read speeds range from 3.3X – 7.7X. The middle of a 120mm disc reads at about 6X.
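For reference, the “X” figures quoted above are just the measured rates divided by the nominal 1X CD rate of roughly 150 kbytes/sec:

# Convert a measured read rate to a CD "X" rating (1X is about 150 kbytes/sec).
def x_rating(kbytes_per_sec):
    return kbytes_per_sec / 150.0

print(round(x_rating(1601), 1), round(x_rating(3623), 1))   # ~10.7X and ~24.2X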

Implications
A few thoughts regarding these results:

  • Since the very definition of 1X is the minimum speed necessary to stream data from an audio CD, presumably the original 1X CD-ROM drives needed to be capable of reading at 1X even from the inner area. I wonder what the max read speed at the outer edges was. It’s unlikely I would be able to get a 1X drive working easily in this day and age, since the earliest CD-ROM drives required custom controllers.
  • I think 24X is the max rated read speed for CD-Rs, at least for this drive. This implies that the marketing literature only cites the best possible numbers. I guess this is no surprise, similar to how monitors and TVs have always been measured by their diagonal dimension.
  • Given this data, how do you engineer an ISO-9660 filesystem image so that the timing-sensitive multimedia files live on the outermost track? In the Dreamcast case, if you can guarantee your FMV files will live somewhere between the middle and the end of the disc, you should be able to count on a bitrate of at least 900 kbytes/sec.

Source Code
Here is the program I wrote for profiling. Note that the filename is hardcoded (#define FILENAME). Compiling for Linux is a simple 'gcc -Wall profile-cdr.c -o profile-cdr'. Compiling for Dreamcast is performed in the standard KallistiOS manner (people skilled in the art already know what they need to know); the only variation is to compile with the '-D_arch_dreamcast' flag, which the default KOS environment adds anyway.

Continue reading

Monster Battery Power Revisited

So I have this new fat netbook battery and I performed an experiment to determine how long it really lasts. In my last post on the matter, it was suggested that I should rely on the information that gnome-power-manager is giving me. However, I have rarely seen GPM report more than about 2 hours of charge; even on a full battery, it only reports 3h25m when I profiled it as lasting over 5 hours in my typical use. So I started digging to understand how GPM gets its numbers and determine if, perhaps, it’s not getting accurate data from the system.

I started poking around /proc for the data I wanted. You can learn a lot in /proc as long as you know the right question to ask. I had to remember what the power subsystem is called — ACPI — and this led me to /proc/acpi/battery/BAT0/state which has data such as:

present:                 yes
capacity state:          ok
charging state:          charged
present rate:            unknown
remaining capacity:      100 mAh
present voltage:         8326 mV

“Remaining capacity” rated in mAh is a little odd; I would later determine that this should actually be expressed as a percentage (i.e., 100% charge at the time of this reading). Examining the GPM source code, it seems to determine remaining battery time as a function of the current CPU load (queried via /proc/stat) and the battery state, queried via a facility called DeviceKit. I couldn’t immediately find any source code for the latter, but I was able to install a utility called ‘devkit-power’. Mostly, it appears to rehash data already found in the above /proc file.

Curiously, the file /proc/acpi/battery/BAT0/info, which displays essential information about the battery, reports the design capacity of my battery as only 4400 mAh which is true for the original battery; the new monster battery is supposed to be 10400 mAh. I can imagine that all of these data points could be conspiring to under-report my remaining battery life.

Science project: Repeat the previous power-related science project but also parse and track the remaining capacity and present voltage fields from the battery state proc file.
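The parsing side of that is straightforward; here is a rough sketch of pulling those two fields out of the proc file (this is not the actual logging script mentioned at the end of this post):

# Pull "remaining capacity" and "present voltage" out of the battery state
# proc file; a sketch, not the full logging script.
def read_battery_state(path="/proc/acpi/battery/BAT0/state"):
    fields = {}
    with open(path) as f:
        for line in f:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

state = read_battery_state()
print("remaining capacity:", state["remaining capacity"])   # e.g. "100 mAh"
print("present voltage:   ", state["present voltage"])       # e.g. "8326 mV"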

Let’s skip straight to the results (which are consistent with my last set of results in terms of longevity):



So there is definitely something strange going on with the reporting: the 4400 mAh battery reports discharge at a linear rate, while the 10400 mAh battery reports a precipitous dropoff after 60%.

Another curious item is that my script broke at first when there was 20% power remaining, which, as you can imagine, is a really annoying time to discover such a bug. At that point, the “time to empty” reported by devkit-power jumped from 0 seconds to 20 hours (the first state change observed for that field).

Here’s my script, this time elevated from a Bash script to Python. It requires xdotool and devkit-power to be installed (both should be available in your distro’s package manager).
Continue reading

Monster Netbook Battery

I stubbornly refuse to give up my classic Asus Eee PC 701, one of the original netbooks. It’s 2.5 years old now but still serving me well. While these are supposed to be fairly disposable machines, I’m actually using this thing more and more these days (longer commute may have something to do with it). I decided to upgrade the battery from the included one (4400 mAh, rated for 2-2.5 hours). 7200 mAh batteries abounded for this Eee PC model but I decided to go crazy and buy the 10400 mAh battery.

And it’s huge. No one can keep a straight face when gazing upon this beast.



Naturally, I’m curious whether this battery is actually that much better. I searched to find out whether there are any established methodologies for testing battery life. It seems that the most established method is also the most intuitive one, scientifically: find a way to simulate typical usage and measure how long the machine runs before it dies from lack of battery charge.

Methodology

Continue reading