Breaking Eggs And Making Omelettes

Topics On Multimedia Technology and Reverse Engineering



February 19th, 2006 by Multimedia Mike

After sorting out Trixter’s 8088 Corruption details some time ago, I started to wonder about FMV on other relatively low-power systems. Let’s consider the Super Nintendo Entertainment System (SNES).


The SNES came out quite some time after the original IBM PC (10 years?). Still, the original IBM PC targeted in Trixter’s experiment had several advantages such as more capacity (10 megabytes of HD space), a marginally more powerful CPU (Intel 8088 @ 4.77 MHz), and a pre-defined vector codebook for the FMV hack.

Let’s start with a modest goal: 1 full minute (60 seconds) of full motion video and audio on the regulation Super Nintendo Entertainment System hardware.

Per my understanding, the largest ROM size for an SNES game is 4 megabytes. (This may be a bit confusing for those who remember reading Nintendo copy bragging about 8- or 16-“meg” cartridges; their “meg” was actually a megabit. Divide by 8 for megabytes.)

Let’s get the audio out of the way up front. The SNES audio hardware (Sony SPC-700) operates at a native playback rate of 32 kHz. That’s 32000 samples/second. Each sample is 16 bits. However, it is stored in a packing format identified as bit-rate reduced (BRR). The short story is that 16 16-bit samples are stored in a 9-byte block (8 bytes/16 nibbles of scaled audio information + 1 block header byte). If we want to store 60 seconds of audio:

  60 seconds * 32000 samples/second * (9 bytes/16 samples) = 1080000 bytes

Hmm. I was actually hoping that would be less than a base-2 megabyte (1048576 bytes). How many total blocks will we need to store this much audio information?

  60 seconds * 32000 samples/second * (1 block/16 samples) = 120000 blocks

Each header byte defines 4 bits for the audio scaler. The other 4 bits will not be used in this application, so throw them away. One header byte can then carry the scalers for 2 consecutive audio blocks, and the full headers can be properly reassembled during playback.

  120000 blocks * 8 audio data bytes/block = 960000 audio data bytes
  120000 blocks * (1 header byte / 2 blocks) = 60000 header bytes
  960000 audio data bytes + 60000 header bytes = 1020000 total bytes for audio data
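The audio arithmetic and the shared-header trick can be sketched in a few lines of Python. This is only an illustration, not real SNES tooling: the function names are made up, and I am assuming the scaler nibble sits in the upper four bits of a standard BRR header with the remaining four bits left at zero.

```python
SAMPLE_RATE = 32000          # native playback rate, samples/second
SAMPLES_PER_BLOCK = 16       # one BRR block decodes to 16 samples
DATA_BYTES_PER_BLOCK = 8     # 16 nibbles of scaled audio data


def audio_budget(seconds):
    """Return (blocks, standard_bytes, packed_bytes) for a clip length."""
    blocks = seconds * SAMPLE_RATE // SAMPLES_PER_BLOCK
    standard = blocks * (DATA_BYTES_PER_BLOCK + 1)   # 9-byte BRR blocks
    header_bytes = (blocks + 1) // 2                 # one header per 2 blocks
    packed = blocks * DATA_BYTES_PER_BLOCK + header_bytes
    return blocks, standard, packed


def pack_scalers(scaler_a, scaler_b):
    """Store the 4-bit scalers of two consecutive blocks in one byte."""
    return ((scaler_a & 0x0F) << 4) | (scaler_b & 0x0F)


def unpack_headers(packed_byte):
    """Rebuild two full header bytes at playback time.

    Assumption: scaler in the high nibble, low nibble zeroed.
    """
    return packed_byte & 0xF0, (packed_byte & 0x0F) << 4


if __name__ == "__main__":
    print(audio_budget(60))
    print([hex(h) for h in unpack_headers(pack_scalers(12, 5))])
```

On the real hardware the reassembly step would have to happen in 65816 or SPC-700 code while the blocks are copied into audio RAM; the Python above only shows the bit shuffling.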

And we come in at just under a regulation base-2 megabyte. Ideally, this leaves 28576 bytes for hand-crafted 65816 ASM code and a full 3 megabytes for coded video data.

As for the video, let’s be ambitious and target 30 frames/second video at the best standard color and resolution that the SNES can do: 256x224x256 colors, palettized. The goal is to encode 60 seconds which means 1800 frames. Looking at these constraints on a per-frame basis and a per-second basis:

  3 * 2^20 bytes = 3145728 bytes
  3145728 bytes / 1800 frames = ~1747 bytes/frame
  3145728 bytes / 60 seconds = ~52428 bytes/second
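For anyone who wants to play with the numbers, here is a rough Python sketch of the same ROM split. The function name and dictionary keys are hypothetical; the constants are the ones used in this post (4 MB ROM, 1020000 bytes of audio, the first megabyte shared by audio and code).

```python
MIB = 2 ** 20                 # one base-2 megabyte
ROM_BYTES = 4 * MIB           # assumed maximum cartridge size
AUDIO_BYTES = 1020000         # packed BRR audio for 60 seconds


def video_budget(seconds=60, fps=30):
    """Split the ROM into code and video budgets, rounding down."""
    code_bytes = MIB - AUDIO_BYTES        # leftover in the first megabyte
    video_bytes = ROM_BYTES - MIB         # the other three megabytes
    frames = seconds * fps
    return {
        "code_bytes": code_bytes,
        "video_bytes": video_bytes,
        "bytes_per_frame": video_bytes // frames,
        "bytes_per_second": video_bytes // seconds,
    }


if __name__ == "__main__":
    for key, value in video_budget().items():
        print(f"{key}: {value}")
```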

The video coding method has to use vector quantization, not just because it sounds neat but because the SNES video hardware is tile-based and we have no choice. If we use the SNES’s famous Mode 7, we will effectively have a codebook of 256 8×8-pixel vectors to choose from. Each frame will be composed of 896 vectors (a 32×28 grid of tiles). Coding each frame raw (or as raw as we could get, since we need to come up with an optimal 256-vector codebook for each frame) would require:

  256 vectors * ( 8 * 8 ) pixels/vector * 1 byte/pixel = 16384 bytes
  256 palette entries * 2 bytes/entry = 512 bytes
  896 vector indices * 1 byte/index = 896 bytes
  (16384 + 512 + 896) bytes = 17792 bytes
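The same raw-frame estimate, written as a small Python sketch so the resolution or tile size can be varied. The function name and parameters are my own invention for illustration; the defaults mirror the figures above.

```python
def raw_frame_bytes(width=256, height=224, tile=8,
                    codebook_size=256, palette_entries=256):
    """Return (tiles_per_frame, total_bytes) for one 'raw' frame."""
    tiles = (width // tile) * (height // tile)   # vector indices per frame
    codebook = codebook_size * tile * tile       # 1 byte per pixel
    palette = palette_entries * 2                # 2 bytes per RGB15 entry
    indices = tiles                              # 1 byte per index
    return tiles, codebook + palette + indices


if __name__ == "__main__":
    print(raw_frame_bytes())
```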

The SNES uses RGB15 colors. Technically, we could save 32 bytes by dropping the unused bit and packing palette entries really tight, but let’s not right now.
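Just to show where the 32 bytes would come from, here is a hedged Python sketch of tight 15-bit packing. The function names and the little-endian bit order are arbitrary choices of mine, not anything the SNES dictates; a real decoder would have to expand the entries back to 16 bits before handing them to the hardware.

```python
def pack_rgb15(entries):
    """Pack 15-bit palette entries back to back with no padding bit."""
    acc = 0
    for i, entry in enumerate(entries):
        acc |= (entry & 0x7FFF) << (15 * i)
    nbytes = (15 * len(entries) + 7) // 8
    return acc.to_bytes(nbytes, "little")


def unpack_rgb15(data, count):
    """Recover `count` 15-bit entries from a tightly packed buffer."""
    acc = int.from_bytes(data, "little")
    return [(acc >> (15 * i)) & 0x7FFF for i in range(count)]


if __name__ == "__main__":
    palette = list(range(256))            # placeholder palette values
    packed = pack_rgb15(palette)
    print(len(packed), 256 * 2 - len(packed))
```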

Anyway, 17792 bytes is what I estimate it would take to encode each frame as raw as possible. That’s about 10 times more than is presently budgeted for each frame.

Another approach is to think in groups of 30. Each second is allotted 52428 bytes to encode 30 frames. Encode the first frame raw, which will take 17792 bytes, leaving 34636 bytes. Assume the remaining 29 frames in the group will also require the full 896 index bytes to be expressed.

  896 * 29 = 25984 bytes

This leaves 8652 bytes in the group budget. We can use these for replacement vectors as needed throughout the remainder of the group:

  8652 bytes * (1 vector/64 bytes) = ~135 vectors
  135 vectors / 29 frames = ~4.7 vectors/frame
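The group-of-30 budget can be sketched the same way. Again, this is a back-of-the-envelope Python helper with hypothetical names, assuming one raw frame per second followed by frames that each carry a full index map.

```python
def group_budget(bytes_per_second=52428, fps=30, raw_frame=17792,
                 index_bytes=896, vector_bytes=64):
    """Return (leftover_bytes, replacement_vectors, vectors_per_frame)."""
    delta_frames = fps - 1                       # frames after the raw one
    leftover = bytes_per_second - raw_frame      # after the raw frame
    leftover -= delta_frames * index_bytes       # after all the index maps
    vectors = leftover // vector_bytes           # whole replacement vectors
    return leftover, vectors, vectors / delta_frames


if __name__ == "__main__":
    leftover, vectors, per_frame = group_budget()
    print(leftover, vectors, round(per_frame, 2))
```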

These are just some cursory brainstorms (that I will likely never implement; that’s why it’s filed under Outlandish Brainstorms). There is also the possibility of sacrificing framerate, dropping down to 15 fps, and gaining a lot more bytes in the budget. Further, audio could be resampled to other acceptable frequencies such as 22050 Hz.
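To get a feel for those trade-offs, here is one more rough Python sketch that recomputes the per-frame budget for a given frame rate and audio sample rate. The names are hypothetical, and it assumes the same 4 MB ROM, the same 28576-byte code allowance, and the shared-header BRR packing described earlier.

```python
ROM_BYTES = 4 * 2 ** 20       # assumed maximum cartridge size
CODE_BYTES = 28576            # code allowance from the 32 kHz case


def packed_audio_bytes(seconds, sample_rate):
    """BRR size with one header byte shared by two 16-sample blocks."""
    samples = seconds * sample_rate
    blocks = -(-samples // 16)            # ceiling division
    headers = -(-blocks // 2)
    return blocks * 8 + headers


def bytes_per_frame(seconds=60, fps=30, sample_rate=32000):
    """Video bytes available per frame after audio and code."""
    video = ROM_BYTES - CODE_BYTES - packed_audio_bytes(seconds, sample_rate)
    return video // (seconds * fps)


if __name__ == "__main__":
    for fps, rate in [(30, 32000), (15, 32000), (30, 22050), (15, 22050)]:
        print(fps, rate, bytes_per_frame(fps=fps, sample_rate=rate))
```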

Posted in Outlandish Brainstorms | 3 Comments »

3 Responses

  1. Alex Says:

    LOL. I would love to see this actually work.

    The largest SNES game that I know of is Tales of Phantasia which is 6MB. It mapped the normal 4MB into the upper address space and then mapped an additional 2MB into the lower address space. This game did not have FMV, but did have recorded voices singing the opening song.

    Also there was Out of this World, which had an opening movie. It was a port of a PC game which was originally done using polygonal graphics. I’m not sure how it was implemented on the SNES, but I guess you could email the author and ask. The author’s website is

  2. Tommy Says:

    Hey, NBA Jam TE for the SNES has actual video during the end credits. It is small and blocky with no sound, though. And they’re right about TOP being 6MBytes.

  3. Rebecca Says:

    Yes, the SNES version of Out of this World has a full opening movie. It was done by storing vectors of all the polygons and playing them back as a vector animation (much like Flash does on websites today). The blitters were hand-optimized 65816 assembly.

    FYI, Out of this World was a 512K ROM.