Google released the third version of their year-old Chrome browser this past week. This reminded me that they incorporate FFmpeg into the software (and thanks to the devs for making various fixes available to us). Chrome uses FFmpeg to decode video, and the accompanying audio, for the HTML5 <video> tag. This always makes me wonder: why would they use FFmpeg’s Theora decoder? It sucks. I should know; I wrote it.
Last year, Reimar discovered that the VP3/Theora decoder spent the vast majority of its time decoding the coefficient stream. He proposed a fix that made it faster. I got a chance to check out the decoder tonight and profile it with OProfile and FFmpeg’s own internal timer facilities. It turns out that the function named unpack_vlcs() is still responsible for 44-50% of the decoding time, depending on machine and sample file. This is mildly disconcerting considering the significant amount of effort I put forth to even make it that fast (it took a lot of VLC magic).
So a function in a multimedia program is slow? Well, throw assembly language and SIMD instructions at the problem! Right? It’s not that simple with entropy decoders.
Reimar had a good idea in his patch and I took it to its logical conclusion: optimize away the arrows, i.e., the structure dereferences. The function insists on repeatedly grabbing items out of arrays hanging off a context structure. So: create local pointers to those same arrays up front and save a pile of dereferences on each of the innumerable iterations.
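Roughly, the change looks like this. This is a minimal sketch of the idea, not the actual unpack_vlcs() code, and the struct and field names are made up:

```c
#include <stdint.h>

/* Hoist the repeated context-structure dereferences out of the loop into
 * local pointers so each iteration touches only locals instead of
 * re-reading s->whatever every time. */
typedef struct {
    int16_t *dct_tokens;     /* hypothetical fields */
    uint8_t *coeff_counts;
    int      fragment_count;
} SketchContext;

static void unpack_loop_sketch(SketchContext *s)
{
    /* one dereference each, performed once before the loop */
    int16_t *dct_tokens   = s->dct_tokens;
    uint8_t *coeff_counts = s->coeff_counts;
    int      count        = s->fragment_count;

    for (int i = 0; i < count; i++) {
        /* placeholder work; the real loop decodes a VLC per fragment */
        dct_tokens[i]++;
        coeff_counts[i]++;
    }
}
```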
Results were positive: both OProfile and the TSC-based internal counter showed notable improvements.
Ideas for further improvements: Multithreading is all the rage for video decoders these days. Unfortunately, entropy decoding is always a serial proposition. However, VP3/Theora is in a unique position to take advantage of another multithreading opportunity: It could call reverse_dc_prediction() in a separate thread after all the DC coefficients are decoded. Finally, an upside to the algorithm’s unorthodox bitstream format! According to my OProfile reports, reverse_dc_prediction() consistently takes around 6-7% of the decode time. So it would probably be of benefit to remove that from the primary thread which would be busy with the AC coefficients.
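Here is a rough pthreads sketch of that idea. The real reverse_dc_prediction() in vp3.c takes different parameters, and everything named *_sketch below is a placeholder, not FFmpeg code:

```c
#include <pthread.h>

typedef struct {
    int dummy;   /* stands in for the decoder context */
} DecoderCtx;

static void reverse_dc_prediction_sketch(DecoderCtx *ctx)
{
    (void)ctx;   /* undo DC prediction using the already-decoded DC values */
}

static void unpack_ac_coefficients_sketch(DecoderCtx *ctx)
{
    (void)ctx;   /* the primary thread stays busy with the AC coefficients */
}

static void *dc_thread(void *arg)
{
    reverse_dc_prediction_sketch(arg);   /* runs concurrently with AC decode */
    return NULL;
}

void decode_frame_sketch(DecoderCtx *ctx)
{
    pthread_t tid;

    /* ...all DC coefficients have just been unpacked from the bitstream... */
    pthread_create(&tid, NULL, dc_thread, ctx);

    unpack_ac_coefficients_sketch(ctx);  /* main thread continues with AC */

    pthread_join(&tid, NULL);            /* both must finish before render_slice() */
}
```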
Taking advantage of multiple threads would likely help with the render_slice() function. One thing at a time, though. Wish me luck with presenting the de-dereferencing patch to the list.
The only thing I don’t like about it is that the compiler is supposed to do that kind of thing without me having to manually bloat the code.
That may have something to do with protecting against aliasing. A few C99 ‘restrict’ qualifiers might be enough to tell GCC that it can go ahead with the optimization. Then again, GCC may just be too stupid.
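For anyone unfamiliar with the qualifier, something along these lines is what’s meant (illustrative only, not FFmpeg code):

```c
#include <stdint.h>

/* With C99 'restrict', the compiler may assume that dst, src, and quant
 * never alias, so it can keep values in registers (or vectorize) instead
 * of re-reading memory after every store. */
void scale_coeffs(int16_t *restrict dst,
                  const int16_t *restrict src,
                  const int16_t *restrict quant,
                  int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * quant[i];
}
```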
If it’s serial, and can’t be parallelized easily, how about looking at pre-fetching, to ensure that when an instruction is ready to execute, its data is already in L1 cache? Is this something that could be improved?
@Boris: I thought that FFmpeg’s bitreading functions already did something along those lines, but I could be wrong. I’ll look into that.
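If they don’t, the suggestion would amount to something like this. Purely speculative: __builtin_prefetch() is a real GCC builtin, but this helper and its parameters are invented and not part of FFmpeg’s bitreader:

```c
#include <stddef.h>
#include <stdint.h>

/* Ask the CPU to pull the next chunk of the coefficient bitstream toward
 * the cache before the decoder actually reads it. */
static inline void prefetch_bitstream(const uint8_t *buf, size_t next_offset)
{
    /* rw = 0 (data will be read), locality = 1 (modest temporal reuse) */
    __builtin_prefetch(buf + next_offset, 0, 1);
}
```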
Meanwhile, I have had some more optimization success by simply reversing DC prediction shortly after all the DC coefficients are decoded from the bitstream, rather than waiting until after all of the AC coefficients are also decoded.