Monthly Archives: May 2011

Dreamcast Archival

Console homebrew communities have always had a precarious relationship with console pirates. The same knowledge and skills useful for creating homebrew programs can usually be parlayed into ripping games and cajoling a console into honoring ripped copies. For this reason, the Dreamcast homebrew community tried hard to distance itself from pirates, rippers, and other unsavory characters.


Lot of 9 volumes of the Official Sega Dreamcast Magazine

Funny how times change. While I toed that same line back in the day when I was marginally a part of the community, I now think I’m performing a service for video game archivists and historians by openly publishing the same information. I know of at least one solution already. But I think it’s possible to do much better.

Pre-existing Art
Famed Japanese game hacker BERO (FFmpeg contributors should recognize his name from a number of Dreamcast-related multimedia contributions, including CRI ADX and SH-4 optimizations) crafted a program called dreamrip, based on libdream, the precursor to KallistiOS (KOS). This is the program I used to extract 4XM multimedia files from Alone in the Dark: The New Nightmare.

Fun facts: The Sega Dreamcast used special optical discs called GD-ROMs. The GD stands for ‘GigaDisc’ which implied that they could hold roughly a gigabyte of data. How long do you think it takes to transfer that much data over a serial cable operating at 115,200 bits/second (on the order of 11 Kbytes/sec)? I seem to recall entire discs requiring on the order of 27-28 hours to archive.
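
The back-of-the-envelope arithmetic, as a trivial C snippet (assuming standard 8N1 serial framing, i.e., 10 bits on the wire for every payload byte):

/* rough transfer-time estimate for ~1 GB over a 115,200 bps serial line */
#include <stdio.h>

int main(void)
{
    const double bytes_per_sec = 115200.0 / 10;      /* ~11 Kbytes/sec, as above */
    const double disc_bytes = 1024.0 * 1024 * 1024;  /* roughly a gigabyte */
    printf("%.1f hours\n", disc_bytes / bytes_per_sec / 3600.0);
    return 0;
}

That works out to about 26 hours for the raw payload alone; presumably, protocol overhead accounts for the rest.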

If only I possessed some expertise in data compression which might expedite this process.

KallistiOS’ Unwitting Help
The KallistiOS (KOS) console-oriented RTOS provides all the software infrastructure necessary for archiving (that’s what we’ll call it in this post) Dreamcast games. KOS exposes the optical disc’s filesystem via the /cd mount point on the VFS. From there, KOS provides functions for communicating with a host computer via ethernet (broadband adapter) or serial line (DC coder’s cable). To this end, KOS exposes another mount point on the VFS named /pc which allows direct access to the host PC’s filesystem.

Thus, it’s pretty straightforward to use KOS to access the files (or raw sectors) of the Dreamcast disc and then send them over the communication line to the host PC. Simple.
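
In code, the general shape is something like this (just a sketch assuming KOS’s fs_open()/fs_read()/fs_write() VFS functions, not dreamrip itself; error handling omitted):

/* Sketch: copy one file from the GD-ROM to the host PC through the KOS VFS */
#include <kos.h>

#define CHUNK_SIZE 16384

static void archive_file(const char *src, const char *dst)
{
    static uint8 buffer[CHUNK_SIZE];
    file_t in, out;
    ssize_t bytes;

    in = fs_open(src, O_RDONLY);     /* e.g., "/cd/1ST_READ.BIN" */
    out = fs_open(dst, O_WRONLY);    /* e.g., "/pc/dump/1ST_READ.BIN" */

    while ((bytes = fs_read(in, buffer, CHUNK_SIZE)) > 0)
        fs_write(out, buffer, bytes);

    fs_close(in);
    fs_close(out);
}

Walk the /cd tree (or read raw sectors instead of files), call something like the above for everything found, and the archival side is done.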

Compressing Before Transfer
Right away, I wonder about compiling 3 different compression libraries: libz, libbz2, and liblzma. The latter 2 are exceptionally CPU-intensive on the compression side. Then again, it doesn’t really matter how long the compressor takes to do its job as long as it can average better than 11 Kbytes/sec on a 200MHz Hitachi SH-4 CPU. KOS can be set up in a preemptive threading mode, which means it should be possible to read and compress sectors while keeping the UART operating at full tilt.
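
For the zlib case, the per-block work is about as simple as it gets; here is a sketch using zlib’s one-shot compress2() on a buffer of sectors (the bz2 and lzma variants would follow the same pattern with their respective APIs):

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define SECTOR_SIZE 2048
#define SECTORS_PER_BLOCK 64   /* 128 Kbytes of raw sectors per compressed block */

int main(void)
{
    /* stand-in for a block of sectors read from /cd */
    unsigned char in[SECTOR_SIZE * SECTORS_PER_BLOCK] = { 0 };
    uLong out_len = compressBound(sizeof(in));
    unsigned char *out = malloc(out_len);

    if (!out)
        return 1;

    /* level 9 = slowest/best; the SH-4 only needs to stay ahead of the
       ~11 Kbytes/sec that the serial line can drain */
    if (compress2(out, &out_len, in, sizeof(in), 9) == Z_OK)
        printf("compressed %lu bytes down to %lu\n",
               (unsigned long)sizeof(in), (unsigned long)out_len);

    free(out);
    return 0;
}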

A 4th compression algorithm should be in play here as well: FLAC. Since some of these discs contain red book CD audio tracks that need archival, lossless audio compression should be useful.

This post serves as a rough overview for possible future experiments. Readers might have further brainstorms.

CD-R Read Speed Experiments

I want to know how fast I can really read data from a CD-R. Pursuant to my previous musings on this subject, I was informed that it is inadequate to profile reading just any file from a CD-R since data might be read faster or slower depending on whether the data is closer to the inside or the outside of the disc.

Conclusion / Executive Summary
It is 100% true that reading data from the outside of a CD-R is faster than reading data from the inside. Read on if you care to know the details of how I arrived at this conclusion, and to find out just how much speed advantage there is to reading from the outside rather than the inside.

Science Project Outline

  • Create some sample CD-Rs with various properties
  • Get a variety of optical drives
  • Write a custom program that profiles the read speed

Creating The Test Media
It’s my understanding that not all CD-Rs are created equal. Fortunately, I have 3 spindles of media handy: Some plain-looking Memorex discs, some rather flamboyant Maxell discs, and those 80mm TDK discs:



My approach for burning is to create a single file to be burned into a standard ISO-9660 filesystem. The size of the file will be the advertised capacity of the CD-R minus 1 megabyte for overhead: 699 MB for the 120mm discs, 209 MB for the 80mm disc. The file will contain a repeating sequence of 0..0xFF bytes.
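
Generating that file takes only a trivial program, something like this (the makepattern name and command line are just my sketch):

/* Write a file consisting of a repeating 0..0xFF byte pattern.
   Usage: ./makepattern <output file> <size in MB>   (e.g., 699 or 209) */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    unsigned char pattern[256];
    long blocks;
    int i;
    FILE *out;

    if (argc < 3)
    {
        fprintf(stderr, "usage: %s <output file> <size in MB>\n", argv[0]);
        return 1;
    }

    for (i = 0; i < 256; i++)
        pattern[i] = i;

    out = fopen(argv[1], "wb");
    if (!out)
        return 1;

    /* number of 256-byte pattern repetitions in the requested size */
    blocks = atol(argv[2]) * 1024 * 1024 / 256;
    while (blocks-- > 0)
        fwrite(pattern, 1, 256, out);

    fclose(out);
    return 0;
}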

Profiling
I don’t want to leave this to the vagaries of any filesystem handling layer so I will conduct this experiment at the sector level. Profiling program outline:

  • Read the CD-ROM TOC and get the number of sectors that comprise the data track
  • Profile reading the first 20 MB of sectors
  • Profile reading 20 MB of sectors in the middle of the track
  • Profile reading the last 20 MB of sectors

Unfortunately, I couldn’t figure out the raw sector reading on modern Linux incarnations (which is annoying since I remember it being pretty straightforward years ago). So I left it to the filesystem after all. New algorithm (sketched in code after the list):

  • Open the single, large file on the CD-R and query the file length
  • Profile reading the first 20 MB of data, 512 kbytes at a time
  • Profile reading 20 MB of data from the middle of the file (starting from filesize / 2 - 10 MB), 512 kbytes at a time
  • Profile reading the last 20 MB of data (starting from filesize - 20 MB), 512 kbytes at a time
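
Sketched in code, the Linux side of that algorithm looks roughly like the following (plain stdio plus gettimeofday(); the filename here is just a placeholder, and the real program discussed under Source Code below also handles the Dreamcast build):

/* Profile CD-R read speed at the start, middle, and end of one large file,
   reading 20 MB in 512-kbyte chunks at each position */
#include <stdio.h>
#include <sys/time.h>

#define CHUNK (512 * 1024)
#define SPAN  (20 * 1024 * 1024)

static double profile_read(FILE *f, long offset)
{
    static unsigned char buffer[CHUNK];
    struct timeval start, end;
    long remaining = SPAN;
    double elapsed;

    fseek(f, offset, SEEK_SET);
    gettimeofday(&start, NULL);
    while (remaining > 0 && fread(buffer, 1, CHUNK, f) > 0)
        remaining -= CHUNK;
    gettimeofday(&end, NULL);

    elapsed = (end.tv_sec - start.tv_sec) +
              (end.tv_usec - start.tv_usec) / 1000000.0;
    return (SPAN / 1024.0) / elapsed;    /* kbytes/sec */
}

int main(void)
{
    FILE *f = fopen("/media/cdrom/bigfile", "rb");   /* placeholder path */
    long filesize;

    if (!f)
        return 1;
    fseek(f, 0, SEEK_END);
    filesize = ftell(f);

    printf("inner:  %.0f kbytes/sec\n", profile_read(f, 0));
    printf("middle: %.0f kbytes/sec\n", profile_read(f, filesize / 2 - SPAN / 2));
    printf("outer:  %.0f kbytes/sec\n", profile_read(f, filesize - SPAN));

    fclose(f);
    return 0;
}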

Empirical Data
I tested the program in Linux using an LG Slim external multi-drive (seen at the top of the pile in this post) and one of my Sega Dreamcast units. I gathered the median value of 3 runs for each area (inner, middle, and outer). I also conducted a buffer flush in between Linux runs (as root: 'sync; echo 3 > /proc/sys/vm/drop_caches').

LG Slim external multi-drive (reading from inner, middle, and outer areas in kbytes/sec):

  • TDK-80mm: 721, 897, 1048
  • Memorex-120mm: 1601, 2805, 3623
  • Maxell-120mm: 1660, 2806, 3624

So, taking 1X to be the canonical 150 kbytes/sec, the 120mm discs range from about 10.5X all the way up to a full 24X on this drive. For whatever reason, the 80mm disc fares a bit worse, even at the inner track, with a range of 4.8X - 7X.

Sega Dreamcast (reading from inner, middle, and outer areas in kbytes/sec):

  • TDK-80mm: 502, 632, 749
  • Memorex-120mm: 499, 889, 1143
  • Maxell-120mm: 500, 890, 1156

It’s interesting that the 80mm disc performed comparably to the 120mm discs in the Dreamcast, in contrast to the LG Slim drive. Also, the results are consistent with my previous profiling experiments, which largely only touched the inner area. The read speeds range from 3.3X – 7.7X. The middle of a 120mm disc reads at about 6X.

Implications
A few thoughts regarding these results:

  • Since the very definition of 1X is the minimum speed necessary to stream data from an audio CD, then presumably, original 1X CD-ROM drives would have needed to be capable of reading 1X from the inner area. I wonder what the max read speed at the outer edges was? It’s unlikely I would be able to get a 1X drive working easily in this day and age since the earliest CD-ROM drives required custom controllers.
  • I think 24X is the max rated read speed for CD-Rs, at least for this drive. This implies that the marketing literature only cites the best possible numbers. I guess this is no surprise, similar to how monitors and TVs have always been measured by their diagonal dimension.
  • Given this data, how do you engineer an ISO-9660 filesystem image so that the timing-sensitive multimedia files live on the outermost track? In the Dreamcast case, if you can guarantee your FMV files will live somewhere between the middle and the end of the disc, you should be able to count on a bitrate of at least 900 kbytes/sec.

Source Code
Here is the program I wrote for profiling. Note that the filename is hardcoded (#define FILENAME). Compiling for Linux is a simple 'gcc -Wall profile-cdr.c -o profile-cdr'. Compiling for Dreamcast is performed in the standard KallistiOS manner (people skilled in the art already know what they need to know); the only variation is to compile with the '-D_arch_dreamcast' flag, which the default KOS environment adds anyway.


Programming Language Levels

I’ve been doing this programming thing for some 20 years now. Things sure do change. One change I ponder from time to time is the matter of programming language levels. Allow me to explain.

The 1990s
When I first took computer classes in the early 1990s, my texts would classify computer languages into 3 categories, or levels. The lower the level, the closer to the hardware; the higher the level, the more abstract (and presumably, easier to use). I recall that the levels went something like this:

  • High level: Pascal, BASIC, Logo, Fortran
  • Medium level: C, Forth
  • Low level: Assembly language

Keep in mind that these were the same texts which took the time to explain the history of computers from mainframes -> minicomputers -> a relatively recent phenomenon called microcomputers or “PCs”.

Somewhere in the mid-late 1990s, when I was at university, I was introduced to a new tier:

  • Very high level: Perl, shell scripting

I think there was some debate among my peers about whether C++ and Java were properly classified as high or very high level. The distinction between high and very high, in my observation, seemed to be that very high level languages had more complex data structures (at the very least, a hash / dictionary / associative array / key-value map) built into the language, as well as implicit memory management.

Modern Day
These days, the old hierarchy is apparently forgotten (much like minicomputers). I observe that there is generally a much simpler 2-tier classification:

  • Low level: C, assembly language
  • High level: absolutely every other programming language in wide use today

I find myself wondering where C++ and Objective-C fit in this classification scheme. Then I remember that it doesn’t matter and this is all academic.

Relevancy
I think about this because I have pretty much stuck to low-level programming all of my life, mostly due to my interest in game and multimedia-type programming. But the trends in computing have favored many higher level languages and programming paradigms. I woke up one day and realized that the kind of work I often do — lower level stuff — is not very common.

I’m not here to argue that low or high level is superior. You know I’m all about using the appropriate tool for the job. But I sometimes find myself caught between worlds, having to defend and explain one to the other.

  • On one hand, it’s not unusual for the multitudes of programmers working at the high level to gasp and wonder why I or anyone else would ever use C or assembly language for anything when there are so many beautiful high level languages. I patiently explain that those languages have to be written in some other language (at first), that they need to run on some operating system, and that the operating system most assuredly won’t be written in a high level language. For further reading, I refer them to Joel Spolsky’s great essay Back to Basics, which describes why it can be useful to know at least a little bit about how the computer does what it does at the lowest levels.
  • On the other hand, believe it or not, I sometimes have to defend the merits of high level languages to my low level brethren. I’ll often hear variations of, “Any program can be written in C. Using a high level language to achieve the same will create a slow and bloated solution.” I try to explain that the time saved in completing the programming task, weighed against the often-negligible performance hit of what is frequently an I/O-bound operation in the first place, makes a high level language worthwhile for a wide variety of tasks.

    Or I just ignore them. That’s actually the best strategy.

Cloaked Archive Wiki

Google’s Chrome browser has made me phenomenally lazy. I don’t even attempt to type proper, complete URLs into the address bar anymore. I just type something vaguely related to the address and let the search engine take over. I saw something weird when I used this method to visit Archive Team’s site:



There’s greater detail when you elect to view more results from the site:



As the administrator of a MediaWiki installation like the one that archiveteam.org runs on, I was a little worried that they might have a spam problem. However, clicking through to any of those out-of-place pages does not indicate anything related to pharmaceuticals. Viewing source also reveals nothing amiss.

I quickly deduced that this is a textbook example of website cloaking. This is when a website reports different content to a search engine than it reports to normal web browsers (humans, presumably). General pseudocode:

if (web_request.user_agent_string == CRAWLER_USER_AGENT)
  return cloaked_data;
else
  return real_data;

You can verify this for yourself using the wget command line utility:


$ wget --quiet --user-agent="Mozilla/5.0" \
http://www.archiveteam.org/index.php?title=Geocities -O - | grep \<title\>
<title>GeoCities - Archiveteam</title>

$ wget --quiet --user-agent="Googlebot/2.1" \
http://www.archiveteam.org/index.php?title=Geocities -O - | grep \<title\>
<title>Cheap xanax | Online Drug Store, Big Discounts</title>

I guess the little web prank worked because the phaux-pharma stuff got indexed. It makes me wonder if there’s a MediaWiki plugin that does this automatically.

For extra fun, here’s a site called the CloakingDetector which purports to be able to detect whether a page employs cloaking. This is just one humble observer’s opinion, but I don’t think the site works too well.