The Parallelized Elephants In The Room

I think it’s time to face up to the fact that this whole parallelization fad is probably not going to go away. There was a recent thread on ffmpeg-devel regarding the possibility of ‘porting’ FFmpeg to something called the Nvidia Tesla. This discussion rekindled a dormant interest I have regarding what optimization possibilities could be in store for the Cell processor on board the Sony PlayStation 3, and whether there should be effort directed toward making FFmpeg capable of using such features.

[Diagram: two PPE elements in the center, surrounded by six SPE elements]

I finally took some time to read through many of the basic and advanced tutorials on offer and now have a feel for what the system is set up to do. Unfortunately, it’s not always clear what these parallel architectures are capable of, a situation only exacerbated by vague, impenetrable marketing materials. Too many people confuse the Cell architecture with the homogeneous multiprocessor environments that are common today, which is simply not the case. In order to take advantage of the machine’s full power, an app has to be written with a special awareness of the fact that the Cell has a primary core (PPE) and 6 little helper coprocessors (SPEs), as is half-heartedly illustrated above. The PPE is a dual-threaded general-purpose CPU (64-bit PowerPC) and can do anything. Meanwhile, each SPE is a smaller, SIMD-oriented core with its own instruction set (it is not a PowerPC), its own pool of 256 kilobytes of memory (LS), and a special memory controller (MFC) that coordinates contact with the outside world. To take advantage of the SPEs, the PPE has to load programs into their memory space and tell them to execute the code. The Cell also features DMA facilities to efficiently shuttle data between main memory and the SPEs’ local memory, and there are mailbox facilities and interrupts to facilitate communication between the PPE and the SPEs.

I don’t know about a general parallelized architecture for FFmpeg that would take advantage of multiple architectures like Cell and Tesla (because I still can’t figure out how Tesla is supposed to work). However, in a media playback application, it might be possible to assign one SPE the task of decoding perceptual audio. Another SPE might be performing inverse transform operations for a video codec, while another SPE does postprocessing and yet another handles YUV -> RGB conversion. On the opposite end, it seems reasonable that SPEs could be put to work at tasks like motion estimation for video encoding.

Would this qualify as a Google Summer of Code project for FFmpeg? There is precedent for this: see “Development assistant for the ‘Ghost’ audio codec”, which was essentially a lab rat for Monty’s (of Vorbis fame) newer audio coding ideas. Fortunately, a prospective student would not require a PS3 for this project, just a Linux machine, since it seems that IBM has a freely downloadable tool called the Cell Simulation Environment. I’m still working on getting the program running (it’s distributed as an RPM and is most happy on a Red Hat system).

I am a little surprised that there is not a PS3 Media Center project, in the spirit of the Xbox Media Center, at least not that I have been able to locate via web searches. I have been pondering the technical plausibility of such an endeavor. It almost seems as though the PS3 gives the guest OS just enough of a confined playground environment that it can’t possibly blossom into a reasonably high-end environment. While real-time video playback must be possible, is it possible to run at, say, full 1080p resolution at 30 fps? With all of the processing power, I trust that the Cell can handle any kind of video decoding, though I heard an unsubstantiated rumor once that it takes the PPE and 4 SPEs to decode HD H.264 video from a Blu-ray disc. The PS3’s native HD player would have a slight advantage since it would presumably use the video hardware’s full feature set, which likely allows the PS3 to pass through raw 12-bit YUV data to be handled by the video hardware, in one way or another. In Linux under the hypervisor, you basically get to play with a big RGB frame buffer. That means that not only do you have to convert YUV -> RGB, but you also have to shuffle about 2.7x as much raw video data (4 bytes per pixel instead of 1.5) to the video memory for each frame. That works out to roughly 250 MB of data shuffling each second ((1920 * 1080 pixels/frame) * (4 bytes/pixel) * (30 frames/second)). I have read conflicting sources about whether it’s possible for Linux under the PS3 hypervisor to DMA data from main RAM to video RAM. Some sources contend that there is work ongoing while other sources claim that this feature was fixed in later firmware revisions (i.e., no longer possible).
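The frame buffer arithmetic above can be checked with a few lines of C. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and the YUV figure assumes the common 4:2:0 planar layout (1.5 bytes per pixel), which is what "12-bit YUV" usually means.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second needed to push every frame into a packed frame buffer. */
uint64_t framebuffer_bytes_per_second(uint32_t width, uint32_t height,
                                      uint32_t bytes_per_pixel, uint32_t fps)
{
    assert(bytes_per_pixel > 0);
    return (uint64_t)width * height * bytes_per_pixel * fps;
}

/* Same figure for 12-bit planar YUV (assumed 4:2:0): one luma byte per
 * pixel plus two chroma planes at quarter resolution = 3/2 bytes/pixel. */
uint64_t yuv420_bytes_per_second(uint32_t width, uint32_t height, uint32_t fps)
{
    return (uint64_t)width * height * 3 / 2 * fps;
}
```

Calling `framebuffer_bytes_per_second(1920, 1080, 4, 30)` gives 248,832,000 bytes per second, while `yuv420_bytes_per_second(1920, 1080, 30)` gives 93,312,000, so the RGB path moves 8/3 as much data as the YUV path would.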

One possible dealbreaker in the proposal to use the PS3’s guest OS mode to install Linux and a general purpose media player is that, from everything I have read, the hypervisor only allows the guest OS to output stereo audio. This might be a long shot, but perhaps it would be possible to transcode super-stereo (more than 2 channels) audio to Dolby Pro Logic II to be sent out to a capable decoder module. Hey, it’s sort of like true surround sound.

If you are interested in the hard technical details of running Linux on a PlayStation 3 and programming its Cell Processor, this directory seems to be fairly authoritative on the matter. The latest iteration of the tech documents (dated 2008-02-01) is here.

8 thoughts on “The Parallelized Elephants In The Room”

  1. Reimar

    Tesla is still a graphics card with a C compiler built for it and a way to use it without all the ugly OpenGL initialization stuff (actually that is what CUDA is about, Tesla is just a special pre-built hardware for that).
    Unfortunately, their samples only cover the case of an application written completely for CUDA, not how to e.g. implement only certain functions with CUDA while compiling the rest as always (I would assume this must be possible).
    But both CUDA and PS3 have the same problem: there are “huge” transfer delays that must be covered up somehow, so the current “one block at a time” approach to idct and similar used in FFmpeg just will not scale well.
    For PS3, extracting things like CABAC and running them on a separate SPE is an option; for CUDA it is not, because CABAC will crawl like on a Pentium 1 on that anyway ;-)
    I was thinking about reworking the FFmpeg dsputil model to add not only the “do it right now” idct functions but also something like idct_submit and idct_barrier that would allow enqueuing idct ops, thus hiding any latency. Unfortunately I just do not have the kind of time needed for this…
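To make the submit/barrier idea in the comment above concrete, here is a minimal sketch in C. Everything in it is hypothetical: `IdctQueue`, `idct_queue_init`, and the signatures of `idct_submit` and `idct_barrier` are invented for illustration and are not part of FFmpeg's dsputil, and `placeholder_transform` is a stand-in that merely halves each coefficient, not a real IDCT. The queue also runs on the host CPU; on a Cell or CUDA target the barrier is where the caller would wait for the coprocessor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IDCT_QUEUE_MAX 64           /* hypothetical queue depth */

typedef void (*idct_fn)(int16_t *block);  /* transforms one 8x8 block in place */

typedef struct IdctQueue {
    idct_fn  transform;
    int16_t *pending[IDCT_QUEUE_MAX];
    size_t   count;
} IdctQueue;

void idct_queue_init(IdctQueue *q, idct_fn transform)
{
    assert(q != NULL && transform != NULL);
    q->transform = transform;
    q->count = 0;
}

/* Run every queued block and empty the queue. */
void idct_barrier(IdctQueue *q)
{
    for (size_t i = 0; i < q->count; i++)
        q->transform(q->pending[i]);
    q->count = 0;
}

/* Enqueue a block. The caller must not read the block's contents until
 * idct_barrier() has returned. */
void idct_submit(IdctQueue *q, int16_t *block)
{
    if (q->count == IDCT_QUEUE_MAX)
        idct_barrier(q);            /* queue full: flush before adding more */
    q->pending[q->count++] = block;
}

/* Placeholder transform, NOT a real IDCT: halves all 64 coefficients. */
void placeholder_transform(int16_t *block)
{
    for (int i = 0; i < 64; i++)
        block[i] /= 2;
}
```

A decoder using this shape would call `idct_submit()` for each block as it is parsed and `idct_barrier()` once before the results are needed (for example at the end of a macroblock row), which gives a backend the freedom to batch the transfers.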

  2. Multimedia Mike Post author

    True, there is the issue of transfer overhead. However, the SDK suggests strategies for double-buffering (traditionally applied to graphics buffer swapping). DMA a buffer of data over and let the SPE do its thing while DMA’ing a second buffer over.

    The 256 Kbyte pool concerns me, though; all the code and data needs to be stuffed in there.
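The double-buffering pattern described in this comment can be sketched in plain C. This is a shape-only illustration under loud assumptions: `fetch_chunk` is a synchronous `memcpy` standing in for an asynchronous DMA transfer, `CHUNK_BYTES` is an arbitrary size chosen so two buffers fit comfortably in a 256 KB local store, and the "work" is just summing bytes. None of these names come from the Cell SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_BYTES 4096            /* hypothetical chunk size */

/* Stand-in for an asynchronous DMA "get". A real SPE program would start
 * a transfer here and wait for its completion later. */
static void fetch_chunk(uint8_t *local, const uint8_t *main_mem, size_t n)
{
    memcpy(local, main_mem, n);
}

/* Sum all bytes of src, alternating between two local buffers so that the
 * next chunk is requested before the current one is processed. */
uint64_t sum_double_buffered(const uint8_t *src, size_t len)
{
    static uint8_t local[2][CHUNK_BYTES];
    uint64_t total = 0;
    size_t offset = 0;
    int cur = 0;

    assert(src != NULL || len == 0);

    size_t cur_len = len < CHUNK_BYTES ? len : CHUNK_BYTES;
    if (cur_len > 0)
        fetch_chunk(local[cur], src, cur_len);

    while (cur_len > 0) {
        size_t next_off  = offset + cur_len;
        size_t remaining = len - next_off;
        size_t next_len  = remaining < CHUNK_BYTES ? remaining : CHUNK_BYTES;

        /* Request the next chunk into the idle buffer first... */
        if (next_len > 0)
            fetch_chunk(local[cur ^ 1], src + next_off, next_len);

        /* ...then work on the chunk that has already arrived. */
        for (size_t i = 0; i < cur_len; i++)
            total += local[cur][i];

        /* With real asynchronous DMA, the wait for the next chunk goes here. */
        offset  = next_off;
        cur_len = next_len;
        cur    ^= 1;
    }
    return total;
}
```

Because the stand-in copy is synchronous, nothing actually overlaps here; the point is the ordering of "request next, process current, swap", which is what hides transfer latency once the copy is truly asynchronous.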

  3. Reimar

    > DMA a buffer of data over and let the SPE do its thing while
    > DMA’ing a second buffer over.

    Well, of course, that is exactly what I was trying to get at.
    This is not possible currently with FFmpeg, because the code needs the first block _finished_ before _starting_ the second one.

  4. Steve T

    One thing that the PS3 has going for it is that the default media player is actually pretty good. Not perfect, but it’s not as restrictive as the original Xbox was. They even added DivX playback so you can play back all your favorite DivX movies without a reencode.
