I have been working on that challenge to play back video on the Sega Dreamcast. To review, I asserted that the RoQ format would be a good fit for the Sega Dreamcast hardware. The goal was to play 640×480 video at 30 frames/second. Short version: I have determined that it is possible to decode such video in real time. However, I ran into certain data rate caveats.
First off: Have you ever wondered if the Dreamcast can read an 80mm optical disc? It can! I discovered this when I only had 60 MB of RoQ samples to burn on a disc and a spindle full of these 210MB-capacity 80mm CD-Rs that I never have occasion to use.
New RoQ Library
There are open source RoQ decoders out there but I decided to write a new one. A few reasons: 1) RoQ is so simple that I didn’t think it would take too long; 2) it would be nice to have a RoQ library that is license-compatible (BSD-like) with the rest of the KallistiOS distribution; 3) the idroq.tar.gz distribution, while license-compatible, has enough issues that I didn’t want to correct it.
Thankfully, I was correct about the task not being too difficult: I put together a new RoQ decoder in short order. I’m a bit embarrassed to admit that the part I had the most trouble with was properly converting YUV -> RGB.
About the approach I took: While the original idroq.tar.gz decoder maintains YUV 4:2:0 codebooks (which led to chroma bugs during motion compensation) and FFmpeg’s decoder maintains YUV 4:4:4 codebooks, this decoder is built to convert the YUV 4:2:0 vectors into RGB565 vectors during the vector unpacking phase. Thus, the entire frame is rendered in RGB565 — no lengthy YUV -> RGB conversion after decoding — and all pixels are shuffled around as 16-bit units (minor speedup vs. shuffling everything as bytes).
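A minimal sketch of that unpacking step, assuming full-range (JFIF-style) BT.601 coefficients in 16.16 fixed point and a 6-byte 2×2 vector laid out as four luma samples followed by one U and one V sample. The function names are illustrative and are not taken from the library itself:

```c
#include <stdint.h>

/* Clamp an intermediate value to the 0..255 range of an 8-bit channel. */
static int clamp255(int x)
{
    return x < 0 ? 0 : (x > 255 ? 255 : x);
}

/* Convert one YUV sample to a packed RGB565 pixel.
 * Coefficients are an assumption: JFIF-style BT.601 scaled by 65536. */
uint16_t yuv_to_rgb565(uint8_t y, uint8_t u, uint8_t v)
{
    int cb = (int)u - 128;
    int cr = (int)v - 128;
    int r = clamp255(y + (91881 * cr) / 65536);              /* 1.402 * Cr */
    int g = clamp255(y - (22554 * cb + 46802 * cr) / 65536); /* 0.344 * Cb + 0.714 * Cr */
    int b = clamp255(y + (116130 * cb) / 65536);             /* 1.772 * Cb */

    /* Pack as 5 bits red, 6 bits green, 5 bits blue. */
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Expand one 2x2 vector (Y0 Y1 Y2 Y3 U V) into four RGB565 pixels.
 * All four pixels share the same chroma pair. */
void unpack_2x2_vector(const uint8_t yuv[6], uint16_t rgb565[4])
{
    for (int i = 0; i < 4; i++)
        rgb565[i] = yuv_to_rgb565(yuv[i], yuv[4], yuv[5]);
}
```

Doing the conversion once per codebook entry rather than once per output pixel is where the saving comes from: the codebook is small compared with the number of pixels it ends up painting.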
I also entertained the idea of maintaining YUYV codebooks (since the DC supports that colorspace as a texture format). But I scrapped that idea when I remembered it would lead to the same chroma bleeding problem seen in the original idroq.tar.gz decoder.
Onto The Dreamcast
I developed the library on a Linux computer, allowing it to output a series of PNM files for visual verification and debugging. Dropping it into a basic DC/KOS-compatible program was trivial and the first order of business was profiling.
At first, I profiled the entire decode operation: open file, then read and decode each chunk while tossing away the results. I was roundly disappointed to see that, e.g., an 8.5-second RoQ sample needed a little more than 20 seconds to complete. Not real time. I performed a series of optimizations on the decoding library that netted notable performance gains when profiling on Linux. When I brought these same optimizations over to the DC, decoding time didn’t improve at all. This was my first suspicion that perhaps my assumptions regarding the DC’s optical drive’s data rate were not correct.
Dreamcast Data Rate Profiling
Let’s start with some definitions: In terms of data rate, 1X is the minimum data rate needed to read CD-quality audio from a disc. At that speed, a drive should be able to stream 75 sectors each second. When reading mode 1/form 1 CD-ROM data, each sector has 2048 bytes (2 kbytes), so a single-speed data rate should achieve 150 kbytes/sec.
The Dreamcast is supposed to possess a 12X optical drive. This would imply a maximum data rate of 150 kbytes/sec * 12 = 1800 kbytes/sec.
Rigging up a trivial experiment using the RoQ samples burned on a few different CD-R discs, the best data rate I can see is about 500-525 kbytes/sec, or around 3.5X.
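The arithmetic behind those figures can be sketched as follows, taking a kbyte as 1024 bytes. The function and macro names are made up for illustration:

```c
#include <stdint.h>

#define CD_SECTORS_PER_SEC_1X 75   /* sectors per second at 1X */
#define MODE1_SECTOR_BYTES    2048 /* user data in a mode 1/form 1 sector */

/* Theoretical data rate in bytes/second for a drive running at speed_x. */
uint32_t cd_rate_bytes_per_sec(uint32_t speed_x)
{
    return speed_x * CD_SECTORS_PER_SEC_1X * MODE1_SECTOR_BYTES;
}

/* Express a measured rate (bytes/second) as a multiple of 1X. */
double cd_speed_multiple(double measured_bytes_per_sec)
{
    return measured_bytes_per_sec / (double)cd_rate_bytes_per_sec(1);
}
```

For example, cd_rate_bytes_per_sec(12) gives 1,843,200 bytes/sec (1800 kbytes/sec), while a measured 525 kbytes/sec passed to cd_speed_multiple() works out to 3.5.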
Where’s the discrepancy? My first theory has to do with the fact that not all optical media is created equal. This is why optical drives often advertise a slew of numbers which refer to the best theoretical speed for reading a CD vs. writing a CD-R vs. writing a CD-RW, etc. Perhaps the DC drive can’t read CD-Rs very quickly. To test this theory, I tried streaming a large file from a conventionally mastered CD-ROM. This worked well for the closest CD-ROM I had on hand: I was able to stream data at a rate that works out to about 6.5X.
I smell a science project for another evening: Profiling read speeds from a mastered CD-ROM, burned CD-R, and also a mastered GD-ROM, on each of the 3 Dreamcast consoles I possess (I’ve heard that there’s variance between optical drives depending on manufacturing run).
The Good News
I added a little finer-grained code to profile just the video decoding functions. The good news is that the decoder meets my real time goals: That 8.5-second RoQ sample encoded at 640x480x30fps makes its way through the video decoding functions on the DC in a little less than 5 seconds. If the optical drive can supply the data fast enough, the video decoder can take care of the rest.
The RoQ encoder included with FFmpeg does not honor any bitrate parameters. Instead, I encoded the same file at 320×240. It reportedly decoded in real time and can be streamed in real time as well.
I say “reportedly” because I’m simply working from textual output at this point; the next phase is to hook the decoder up to the display hardware.
12X was the maximum rate for GDROMs; long-time dreamcast pirates (cough) know full well that CDRs were nearly half that speed, which is why a lot of dreamcast pirate versions put all of the data on the outermost tracks, because they’re faster. Even a small game of, say, 150MB still used up an entire 80-minute CDR with the 150MB on the outside for the most speed.
I too wanted to suggest dumping as much data as possible _before_ the video file, filling up the disc so that the video sits only on the outermost part, where you get the fastest speed.
Also, you should look into Mode 2 CDs; they give 171 kB/s per X.
Also according to the wikipedia, the GDRom basically is a normal CDRom unit that just spins at half the speed but reads at full speed, so it has twice the storage.
I wonder if it’s possible to mod the firmware or hardware to let it spin at 2x, so it can’t read GDroms anymore, but CDs at twice the original speed.
A problem might be to make sure the drive is in the GDRom mode so it reads at the desired rate…
FYI, the encoder in lavc can be controlled using -qscale. I’m not sure how well it works, but at least it allows for coarse bitrate control.
Haha sounds like fun.
CPU decode speed is one thing; subjective according to the software.
Data transfer speed is another thing altogether; a hardware limitation.
My concept for the fastest video player on DC possible is basically using the DC as a receiver, over a network for simply displaying decoded data being sent to it from a PC decoding the compressed data.
In this concept, the SH4 CPU speed will never be a limitation. Only the DC’s data transfer speed.
I wish I had a BBA to test some code for this concept.
What I have done is make an un-compressed AVI video player for DC.
A simple outline of a video decoder on DC:
1.) Read compressed video frame ( from CD-Rom )
2.) Decode video packet
3.) Write Decoded Data from RAM to VRAM and DRAW with PVR
A simple outline of my uncompressed AVI player for DC :
1.) Read uncompressed video frame ( from CD-Rom )
2.) Write uncompressed video frame from RAM to VRAM and DRAW with PVR
The maximum speed my uncompressed AVI player for DC has achieved is ~640x480p@24fps.
This was only video; adding audio support would increase resource requirements and therefore slow it down further.
Yes using uncompressed AVI is maxing out a 12xCD-Rom, but these benchmarks have been made using NullDC, which is loading from a .cdi image and does not suffer from cd-rom read speeds limitations.
I think that something like 512x384p@30fps while handling audio is a realistic goal for a very fast video decoder on DC.
-PH3NOM
Nice to see you using the RoQ encoder. Do not hesitate in telling me if you find any bugs.
@Vitor: Certainly. :-) I found a number of streams that make that RoQ encoder hang.
@Jim and Reimar: Thanks for tips on sector location. If I perform the experiment, I’ll take that into account, perhaps by just reading raw sectors.
@Tjoppen: Thanks, qscale worked, though I can’t seem to get the 640×480 video down to a low enough bitrate. But 320×240 video seems to be adequate, even without changing qscale.
@PH3NOM: I’m not sure I follow your concept all the way. However, if you could use a broadband adapter, you could get about 1 megabyte/second throughput, or about twice what I was seeing with this experiment. There’s a bit of hand-waving in your description about video compression– what format would you use? That has been the point of this exercise. Uncompressed video would be out of the question since raw 320x240x30fps 16-bit video would require about 4.5 megabytes/second.
@Mike: I’m not sure I follow your concept all the way. However, if you could use a broadband adapter, you could get about 1 megabyte/second throughput, or about twice what I was seeing with this experiment.
..
Sorry if I didn’t make much sense.
It was an attempt to measure SH4->PVR maximum throughput.
But I have been without a means to test my code on DC for a few months now, so all testing has been done on NullDC.
Because the code has successfully streamed uncompressed 640x480p@24fps AVI, I don’t think NullDC emulates CD read speeds.
At least you could test your code on NullDC. If the CD read speed is the limitation, then your code should run at full speed.
@Mike: There’s a bit of hand-waving in your description about video compression– what format would you use? That has been the point of this exercise.
..
Sorry for the seemingly off-topic. I thought it was a useful exercise to see what the DC can do using uncompressed avi.
I have been working off and on with libxvid on Dreamcast.
I have a complete player at this point, check this thread for info on its progress:
http://dcemulation.org/phpBB/viewtopic.php?f=29&t=100316
It’s currently using the xvid internal yuv->uyvy conversion for use with the PVR_TXRFMT_YUV422. Can you recommend a faster approach?
@PH3NOM: “Because the code has successfully streamed uncompressed 640x480p@24fps AVi” — I’m having a little trouble with this because some basic math tells us that it’s impossible to stream that kind of uncompressed video via either the DC’s optical drive or the DC’s optional 10 megabit ethernet adapter. At least, I think so…
640 * 480 = 307,200 pixels/frame
307,200 pixels/frame * 24 frames/second = 7,372,800 pixels/second.
At issue is the size of a pixel. If we’re talking YUV 4:2:0, that averages 12 bits/pixel:
7,372,800 pixels/second * 12 bits/pixel * (1/8 byte/bits) = 11,059,200 bytes/second
This stretches the boundaries of 10 Mbps ethernet throughput but just might be feasible. Okay.
I haven’t looked at the XviD YUV -> UYVY converter. Does it do proper chroma upsampling (IOW, are there multiplications involved)? If it does, then it could be sped up by doing naive upsampling without any special calculations. That’s what I did in my VP8 decoder.
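As a rough sketch of what naive upsampling means here: each 4:2:0 chroma row is simply reused for two luma rows, with no filtering or multiplication. This assumes tightly packed planar input with even width and height, and the UYVY byte order U0 Y0 V0 Y1. The function name is hypothetical and is not taken from XviD or any other decoder:

```c
#include <stddef.h>
#include <stdint.h>

/* Pack planar YUV 4:2:0 into UYVY (4:2:2) by reusing each chroma row
 * for two consecutive luma rows. Assumes even width and height and
 * planes whose stride equals their width. 'out' must hold
 * width * height * 2 bytes. */
void yuv420p_to_uyvy(const uint8_t *y, const uint8_t *u, const uint8_t *v,
                     int width, int height, uint8_t *out)
{
    for (int row = 0; row < height; row++) {
        const uint8_t *y_row = y + (size_t)row * width;
        const uint8_t *u_row = u + (size_t)(row / 2) * (width / 2);
        const uint8_t *v_row = v + (size_t)(row / 2) * (width / 2);

        for (int col = 0; col < width; col += 2) {
            *out++ = u_row[col / 2]; /* U shared by both pixels */
            *out++ = y_row[col];     /* Y of left pixel */
            *out++ = v_row[col / 2]; /* V shared by both pixels */
            *out++ = y_row[col + 1]; /* Y of right pixel */
        }
    }
}
```

The trade-off is slightly blockier chroma in exchange for an inner loop that only copies bytes.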
Check your maths again, Mike. A 10 Mbps Ethernet link can surely not transfer more than 1.25M bytes/s and in reality throughput is closer to 1M bytes/s. The streaming in question is definitely not possible. You’d need a 100Mbps Ethernet and a good measure of luck to make it happen.
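The numbers in this exchange can be checked with a small sketch like the one below. The helper names are invented for illustration, and the link figure is the theoretical ceiling with no protocol overhead:

```c
#include <stdint.h>

/* Bytes per second needed for uncompressed video at the given
 * resolution, frame rate and average bits per pixel. */
uint64_t video_bytes_per_sec(uint32_t width, uint32_t height,
                             uint32_t fps, uint32_t bits_per_pixel)
{
    return (uint64_t)width * height * fps * bits_per_pixel / 8;
}

/* Theoretical ceiling of a network link, in bytes per second. */
uint64_t link_bytes_per_sec(uint32_t megabits_per_sec)
{
    return (uint64_t)megabits_per_sec * 1000000 / 8;
}
```

video_bytes_per_sec(640, 480, 24, 12) gives 11,059,200 bytes/second, nearly nine times the 1,250,000 bytes/second from link_bytes_per_sec(10), and only just under the 12,500,000 bytes/second ceiling of a 100 Mbps link.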
@Mans: Oops, thanks for keeping me honest. You’re right– I had 100 Mbps on the brain.
@PH3NOM: I stand by my original point– I still don’t understand your experiment. :-)
@PH3NOM: Okay, I finally registered the bit where you say you’re using NullDC (it helps that I saw your screenshot in the forums that included NullDC). If you’re profiling using an emulator with no I/O speed restrictions, all bets are off.
Its a shame the BBA really is that slow. That kills my concept for using the DC as a receiver for video being decoded on PC.
But that was not the point I was trying to make.
I was attempting to measure SH4->PVR maximum throughput, not CD->SH4->PVR throughput.
I can think of a better way to test this. If I get time tonight, I will rig up an example and report back with the results.
@PH3NOM: I remember rigging up some DMA experiments years ago and determined that the DC had enough DMA bandwidth to transfer — I think — 60 textures per second from CPU -> GPU that were each 2 MB large.
Not sure if that helps.
Yeah, thanks. According to your findings, it seems that DMA is ~4x faster than SQ transfers.
I have made another test;
load an uncompressed texture from /cd/ into /RAM/ once.
then send it to /VRAM/ and draw with the PVR 60 times.
measure process time.
Using a 512x512p texture:
function pvr_txr_load() processes 60 frames in 1.088 seconds.
function pvr_txr_load_ex() processes 60 frames in 7.935 seconds.
I don’t know why you were using pvr_txr_load_ex() for your DreamRoQ, when you are familiar with the DMA transfer functions?
@PH3NOM: Were those 512×512 textures 16-bit?
Why did I use pvr_txr_load_ex()? Are you familiar with the term “cargo cult”? :-) I dabbled in some DC programming about 7-8 years ago and still had some code lying around. I reused some of it without understanding all the implications. I probably had a good reason for using it in the code from 7-8 years ago.
Yeah the texture was RGB565, 512x512p, 16bit, 524,288 bytes in size.
Haha I had to google ‘cargo cult’ :-) I’ve never heard of that term before.
I would like to see you update the code, using a proper DMA implementation. If done correctly, the render code should be twice as fast as the current updated render code?
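For reference, the throughput implied by the two 60-frame timings reported earlier in this thread can be worked out with a sketch like this (helper names are illustrative only):

```c
#include <stdint.h>

/* Size in bytes of an uncompressed texture. */
uint32_t texture_bytes(uint32_t width, uint32_t height,
                       uint32_t bytes_per_texel)
{
    return width * height * bytes_per_texel;
}

/* Average upload throughput in bytes/second over a timed run. */
double upload_bytes_per_sec(uint32_t frames, uint32_t bytes_per_frame,
                            double seconds)
{
    return (double)frames * (double)bytes_per_frame / seconds;
}
```

With a 512x512 16-bit texture (524,288 bytes), 60 frames in 1.088 seconds comes to roughly 28.9 million bytes/second, while 60 frames in 7.935 seconds comes to roughly 4.0 million bytes/second, a ratio of about 7.3.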
@PH3NOM: Yep, cargo culting. As in, “Here’s a rendering function call that worked for me in some code from 7 years ago; I’ll use the same code now without thinking about it.” I think I pulled that from some code that was transferring 8-bit paletted textures. I think it’s necessary to use the _ex function when twiddling textures, which is a requirement with 8-bit textures.
I’ll certainly update the rendering code eventually, though it might not be this week. I have some further code optimizations pending as well– fast as the decoder may be, it could still be faster. However, when I did my initial data rate experiments, rendering wasn’t part of the equation at all– I was just checking how fast I could stream the entire file off the CD and the results indicated that I wouldn’t be able to get all the data in time (at least not for the 640×480 files).
@Mike: (at least not for the 640×480 files)
How did you manage a 640×480 encode? I thought this format was limited to powers of 2?
The only restriction that Switchblade enforces regarding resolution is that width and height must both be multiples of 16. If you want to play the resulting RoQ with the official RoQ decoder, then powers of 2 apply.
However, in the case of the DC, since texture dimensions must be powers of 2, that leads to a lot of wasted memory (e.g., decoding a 640×480 RoQ requires a pair of 1024×512 textures). Something like 512×256 is more workable.
You can have a texture with a non-power of two width (it just has to be a multiple of 32), with no padding, using the stride texture mode. It has to be a linear texture (not twiddled), so bilinear and rotations are slower, but you won’t want it twiddled anyways.
Stride width is a global hardware parameter, you can set it to 640 texel width with this C code (I’m not totally sure if the -1 is needed, but it probably is.):
PVR_SET(PVR_TEXTURE_MODULO, 640/32-1);
So if you want a 640*480 texture, you need to set the standard power-of-two size to be larger than the real size (so for a 640*480 real size, you set it to 1024*512), then set the stride bit on the header before you submit it. Something like “texturehdr.mode3 |= PVR_TXRFMT_STRIDE;” should work.
There’s no way to use a non-power-of-two height, but since you don’t need the texture to repeat, you can just manually make sure that the rows from 481 to 512 are not used.
U coordinates work oddly in strided texture mode. You would expect that a U of 1.0 would be the right edge of the texture (X pixel coord of 640). It’s actually based on the power-of-two size, so if you want to map a 640*480 texture onto a quad, the top left corner UVs are (0,0) as normal, but the lower right would be (640.0f/1024.0f, 480.0f/512.0f).
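A small sketch of that UV calculation, kept free of any KOS calls so it can be checked on a PC. The type and function names are hypothetical, and it assumes the declared texture size is the next power of two in each dimension:

```c
#include <stdint.h>

/* Smallest power of two that is >= n. Assumes 1 <= n <= 2^31. */
uint32_t next_pow2(uint32_t n)
{
    uint32_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

typedef struct {
    float u;
    float v;
} uv_t;

/* UV of the lower-right corner of a width x height image stored in a
 * strided texture whose declared size is the next power of two in
 * each dimension. The top-left corner stays at (0, 0). */
uv_t stride_lower_right_uv(uint32_t width, uint32_t height)
{
    uv_t uv;
    uv.u = (float)width / (float)next_pow2(width);
    uv.v = (float)height / (float)next_pow2(height);
    return uv;
}
```

For a 640*480 frame this yields a declared size of 1024*512 and a lower-right UV of (0.625, 0.9375), matching the fractions given above.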
Thanks for that info, TapamN –
I was curious how stride textures work…