No sooner did I press “Publish” on my last post about multithreading FFmpeg’s Theora decoder than I received an email from the theora-dev list announcing the release of Theora 1.1.0. It took many, many years to get the first official version out and only about 10 months for the second, so congratulations on that score. This release includes the much-vaunted Thusnelda encoder, which is supposed to offer substantial encoding improvements over the original 1.0 encoder.
So, fair warning: Be prepared for a new round of “Theora Bests H.264 / HTML5 Video Poised To Conquer Internet” type of stories.
Since I have been doing a bunch of optimizations to the FFmpeg Theora decoder this past week (a.k.a. the Theora decoder that most people actually use), I thought this would be the perfect opportunity to benchmark Theora 1.1 alongside FFmpeg’s decoder. Fortunately, libtheora ships an example tool called dump_video that decodes video directly to YUV4MPEG2, the same format I was using to test FFmpeg’s decoder.
FFmpeg command line:
ffmpeg -threads n -i big_buck_bunny_1080p_stereo.ogg -f yuv4mpegpipe -an -y /dev/null
Libtheora command line:
dump_video big_buck_bunny_1080p_stereo.ogg > /dev/null
The results (on my Core 2 Duo Mac Mini) were thus:
6:44 - FFmpeg, 1 thread
6:09 - FFmpeg, 2 threads *
4:51 - libtheora 1.1
* multithreaded version isn’t complete yet
Mind you, libtheora’s decoder is single-threaded and only has basic MMX SIMD optimizations. After seeing libtheora’s relative performance, I think I blacked out. Or maybe I just went to bed since it was so late; it’s all sort of a blur. I awoke in a confused stupor wondering what I was doing wrong in the FFmpeg Theora decoder. Why is it so slow? Actually, I know why: unpack_vlcs(), which continues to dominate profiling statistics. Perhaps the question I should start with is: how does libtheora unpack VLCs so quickly? That’s a good jumping-off point for a future investigation.
BTW: is ffmpeg still able to decode Thusnelda-encoded video?
Last time I checked, ffmpeg’s Theora decoder didn’t implement every Theora feature.
But some Chromium patches seem to implement a few of them. Will those be merged?
The ffmpeg macros used for PlayStation VLC unpacking [1] are 3 levels deep and impossible for mere mortals to decipher. I’m sure the magic is blazingly fast, which makes me wonder just what it’s doing, and what other tricks and techniques can be used for VLC unpacking.
[1] http://cekirdek.pardus.org.tr/~ismail/ffmpeg-docs/mdec_8c-source.html#l00054
@mat: Good question. The release notes mention that this is the first production Theora encoder to use adaptive quantization, and I’m not sure that FFmpeg’s decoder supports that yet. We have been merging some of Chromium’s fixes, starting with the security fixes (I merged one myself the other day for the Theora decoder).
@Michael: Yeah, that’s nasty. I wonder why that is? There must be something unusual about the VLCs on PS1 games (which were never exactly standardized). VLC reading code isn’t usually that impenetrable. In fact, the entirety of this problematic unpack_vlcs() function fits on a single (tall) web browser page.
Maybe that’s the problem: that it’s not complicated enough.
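For reference, the usual trick for fast VLC reading is table-driven: peek a fixed number of bits, index a table of (symbol, length) pairs, and consume only the code’s true length. Here’s a minimal sketch of the idea; the bit reader and all the names are hypothetical, not FFmpeg’s actual API, and FFmpeg’s real GET_VLC adds multi-level tables for codes longer than the lookup width:

#include <stddef.h>
#include <stdint.h>

/* hypothetical minimal MSB-first bit reader */
typedef struct {
    const uint8_t *buf;
    size_t bitpos;
} bit_reader;

static unsigned peek_bits(const bit_reader *br, int n)
{
    unsigned v = 0;
    for (int i = 0; i < n; i++)
        v = (v << 1) |
            ((br->buf[(br->bitpos + i) >> 3] >> (7 - ((br->bitpos + i) & 7))) & 1);
    return v;
}

#define LOOKUP_BITS 8

typedef struct {
    int16_t symbol;  /* decoded value */
    uint8_t length;  /* true length of this code, in bits */
} vlc_entry;

/* one-level lookup: every table index whose top bits match a short
 * code holds that code’s symbol/length, so a single peek resolves
 * any code up to LOOKUP_BITS long */
static int read_vlc(bit_reader *br, const vlc_entry table[1 << LOOKUP_BITS])
{
    unsigned idx = peek_bits(br, LOOKUP_BITS);
    br->bitpos += table[idx].length; /* consume only the real code */
    return table[idx].symbol;
}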
The problem is the data layout; it spends all its time in the “if (coeff_counts[fragment_num] > coeff_index) continue;” check.
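Roughly, the loop shape is “for each coefficient index, walk the whole coded-fragment list,” which is why that skip check gets hammered. A simplified sketch of that shape (hypothetical names, not the actual vp3.c source):

#include <stdint.h>

/* for each of the 63 AC indices, the decoder touches every coded
 * fragment again, even ones that already hit end-of-block */
static void unpack_all_acs(int num_coded_fragments,
                           const int *coded_fragment_list,
                           uint8_t *coeff_counts)
{
    for (int coeff_index = 1; coeff_index < 64; coeff_index++) {
        for (int i = 0; i < num_coded_fragments; i++) {
            int fragment_num = coded_fragment_list[i];

            /* a zero run or EOB already covered this index */
            if (coeff_counts[fragment_num] > coeff_index)
                continue;

            /* ...read the next VLC and store the coefficient... */
        }
    }
}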
David Conrad was working on that, I think.
@astrange: Thanks for the tip. I think I see where to focus my creativity.
I was, but it turned into a major effort in figuring out how to minimize the cache footprint and eliminate the block/macroblock/superblock mapping tables. My motivation drained when I realized that it was impossible to eliminate those ugly mapping tables due to stupidly minor things in the spec, plus I was busy with GSoC and now classes.
But the main reason libtheora is faster is that it doesn’t sort AC coefficients into the blocks that own them, instead doing the IDCT in Hilbert order so that no reordering is needed.
Also, -f yuv4mpegpipe or -f rawvideo involves a memcpy() on each pixel row, which is slow, whereas dump_video calls fwrite() on each pixel row, which is much faster. A fairer benchmark would be -f null for FFmpeg and excising the fwrite()s from dump_video.
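For example, something along these lines (assuming the same input file as above):

ffmpeg -threads n -i big_buck_bunny_1080p_stereo.ogg -f null -an -y /dev/null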
@dconrad: Thanks for the tip regarding ‘-f null’. That definitely helps to narrow the gap.
Libtheora does IDCT in Hilbert order? That blows my mind. Or is it zigzag order?
The coeffs are already stored in the bitstream in Hilbert order; the trick is to simply decode them into a single array and keep pointers to the first coeff of each zigzag index in each plane, so that you just pull the next coeff for each index for each block.
This unfortunately requires duplicating all the EOB and zero-run logic both in unpack_vlcs and right before you do the IDCT, plus a larger slice size (16 -> 64 pixels), but it’s worth it to avoid the current O(n^2) sorting loop in unpack_vlcs.
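In sketch form (hypothetical names, with the EOB/zero-run bookkeeping omitted), the reassembly right before the IDCT looks something like this:

#include <stdint.h>

/* one cursor per zigzag index, pointing into the flat array that
 * the VLC-unpacking pass decoded the coefficients into */
typedef struct {
    int16_t *next[64];
} coeff_cursors;

/* pull this block’s coefficients just before its IDCT; because the
 * bitstream groups coeffs by zigzag index in Hilbert order, the
 * next value at each index always belongs to the current block */
static void load_block(coeff_cursors *c, int16_t block[64], int coeff_count)
{
    for (int i = 0; i < 64; i++)
        block[i] = (i < coeff_count) ? *c->next[i]++ : 0;
}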
mat, Mike: It really should do now – support for AQ was added in r18986, back in May. Chromium took that from ffmpeg, not the other way around – see http://code.google.com/p/chromium/issues/detail?id=17174
I haven’t checked whether it’s bit-exact but it certainly looks OK.