April 11th, 2010 by Multimedia Mike
Some news is making the rounds that Google is funding ARM improvements for the Theora video decoder. It gives the free software faithful renewed hope. However, reading this news makes me wonder: Doesn’t FFmpeg already have ARM optimizations for Theora? In fact, it does, as indicated by the existence of the file libavcodec/arm/vp3dsp_neon.S, which contains optimized IDCT transform/get/put and loop filter functions for the NEON instruction set. I know there are several different types of SIMD for ARM chips and I don’t know if NEON is the most common variety.
The most pressing reason for funding this effort is, of course, license purity.
Posted in Multimedia PressWatch, VP3/Theora | 9 Comments »
February 19th, 2010 by Multimedia Mike
I’ve been hearing it ever since last August:
Google owns On2. They are going to open source all of On2's codecs.
Read the rest of this entry »
Posted in On2/Duck, VP3/Theora | 9 Comments »
February 8th, 2010 by Multimedia Mike
Because I have a short memory, I wanted to write down some of the knowledge and wisdom I’ve collected in the past few months as I have been working on optimizing FFmpeg’s VP3/Theora decoder.
These are some of the general tools:
Read the rest of this entry »
Posted in VP3/Theora | 3 Comments »
December 1st, 2009 by Multimedia Mike
I'm still haunted by the slow VP3/Theora decoder in FFmpeg. While profiling reveals that most of the decoding time is spent in a function named unpack_vlcs(), I have been informed that finer-grained profiling indicates that the vast majority of that time is spent evaluating the following if statement:
if (coeff_counts[fragment_num] > coeff_index)
I put counters before and after that statement and ran the good old Big Buck Bunny 1080p movie through it. These numbers indicate, per frame, the number of times the if statement is evaluated and the number of times it evaluates to true:
[theora @ 0x1004000]3133440, 50643
[theora @ 0x1004000]15360, 722
[theora @ 0x1004000]15360, 720
[theora @ 0x1004000]0, 0
[theora @ 0x1004000]13888, 434
[theora @ 0x1004000]631424, 10711
[theora @ 0x1004000]2001344, 36922
[theora @ 0x1004000]1298752, 22897
Thus, while decoding the first frame, the if statement is evaluated over 3 million times but further action (i.e., decoding a coefficient) is performed only about 50,000 times. Based on the above sample, each frame sees about 2 orders of magnitude more evaluations than are necessary.
Clearly, that if statement needs to go.
Read the rest of this entry »
Posted in Programming, VP3/Theora | 7 Comments »
October 2nd, 2009 by Multimedia Mike
Actually, I need 80 optimal Huffman trees, but let's take it one step at a time.
The VP3 video codec -- the basis of Theora -- employs 80 different Huffman trees. There are 16 for DC coefficients and 16 each for 4 different AC coefficient groups. An individual VP3/Theora frame gets to select 4 different Huffman trees: one for Y-plane DC, one for C-plane DC, one for Y-plane AC, and one for C-plane AC. VP3 hardcodes these tables. Theora allows more flexibility and an encoder is free to either use the default VP3 trees or create its own set and encode them into the header of the container (typically an Ogg file).
Generating an optimal Huffman tree for a particular set of inputs is rather well established; any introduction to Huffman codes covers that much. What I'm curious about is how one would go about creating a set of, e.g., 16 optimal Huffman trees for a given input. The first solution that comes to mind is to treat this as a vector quantization (VQ) problem. I have no idea if this idea holds water, or if it even has any sane basis in mathematics, but when has that ever stopped me from running with a brainstorm?
Here's the pitch:
- Modify FFmpeg's VP3/Theora decoder to print after each frame decode the count of each type of token that was decoded from the stream (for each of the 5 coefficient groups, and for each of the plane types), as well as the number of bits that token was encoded with. This will allow tallying of the actual number of bits used for encoding tokens in each frame.
- Create a separate tool to process the data by applying a basic VQ codebook training algorithm. It will be necessary to treat all of the Y-plane AC tokens as single vectors and do the same with the C-plane AC tokens, even though each AC token vector needs to be composed of 4 separate AC group vectors. Re-use some existing ELBG code for this step.
- Generate Huffman trees from the resulting vectors and count the number of bits per token for each.
- Iterate through the frequency vectors captured from the first step and match them to the codebooks using a standard distance algorithm.
- Tally the bits from using the new vectors and see if there is any improvement versus the default vectors (Huffman tables).
I don't know if I'll have time to get around to trying this experiment in the near future but I wanted to throw it out there anyway. With all of the improvements that the new Theora encoder brings to the table, it seems that the custom Huffman trees feature is one that is left unexercised, per my reading of the documentation and source code. From studying the Big Buck Bunny Theora encodes (my standard Theora test vectors these days), I see that they use the default VP3 tables. The 1080p variant occupied 866 MB. Could there be any notable space savings from generating custom Huffman tables? Or was this a pointless feature to add to the Theora spec?
Posted in VP3/Theora | 5 Comments »
September 25th, 2009 by Multimedia Mike
No sooner did I press "Publish" on my last post pertaining to multithreading FFmpeg's Theora decoder than I received an email from the theora-dev list regarding the release of Theora 1.1.0. It took them many, many years to release the first official version and about 10 months to get the second version out, so congratulations on that matter. This release includes the much-vaunted Thusnelda encoder which is supposed to offer substantial encoding improvements vs. the original 1.0 encoder.
So, fair warning: Be prepared for a new round of "Theora Bests H.264 / HTML5 Video Poised To Conquer Internet" type of stories.
Since I have been doing a bunch of optimizations to the FFmpeg Theora decoder this past week (a.k.a. the Theora decoder that most people actually use), I thought this would be the perfect opportunity to benchmark Theora 1.1 alongside FFmpeg's decoder. Fortunately, libtheora has an example tool called dump_video that decodes video directly to YUV4MPEG2 format, the same way I was testing FFmpeg's decoder.
FFmpeg command line:
ffmpeg -threads n -i big_buck_bunny_1080p_stereo.ogg -f yuv4mpegpipe -an -y /dev/null
Libtheora command line:
dump_video big_buck_bunny_1080p_stereo.ogg > /dev/null
The results (on my Core 2 Duo Mac Mini) were thus:
6:44 - FFmpeg, 1 thread
6:09 - FFmpeg, 2 threads *
4:51 - libtheora 1.1
* multithreaded version isn't complete yet
Mind you, libtheora's decoder is single-threaded and only has basic MMX SIMD optimizations. After seeing libtheora's relative performance, I think I blacked out. Or maybe I just went to bed since it was so late; it's sort of a blur. I awoke in a confused stupor wondering what I'm doing wrong in the FFmpeg Theora decoder. Why is it so slow? Actually, I know why-- unpack_vlcs(), which continues to dominate profiling statistics. Perhaps the question I should start with is: how does libtheora unpack VLCs so quickly? That's a good jumping-off point for a future investigation.
Posted in VP3/Theora | 10 Comments »
September 24th, 2009 by Multimedia Mike
As briefly mentioned in my last Theora post, I think FFmpeg's Theora decoder can exploit multiple CPUs in a few ways: 1) perform all of the DC prediction reversals in a separate thread while the main thread is busy decoding the AC coefficients (meanwhile, I have committed an optimization where the reversal occurs immediately after DC decoding in order to exploit the CPU cache); 2) create n separate threads and assign each one (num_slices / n) slices to decode (where a slice is a row of the image that is 16 pixels high).
So there's the plan. Now, how to take advantage of FFmpeg's threading API (which supports POSIX threads, Win32 threads, BeOS threads, and even OS/2 threads)? Would it surprise you to learn that this aspect is not extensively documented? Time to reverse engineer the API.
I also did some Googling regarding multithreaded FFmpeg. I mostly found forum posts complaining that FFmpeg isn't effectively leveraging however many n cores someone's turbo-charged machine happens to present to the OS, as demonstrated by their CPU monitoring tool. Since I suspect this post will rise in Google's top search hits on the topic, allow me to apologize to searchers in advance by explaining that multimedia processing, while certainly CPU-intensive, does not necessarily lend itself to multithreading/multiprocessing. There are a few bits here and there in the encode or decode processes that can be parallelized but the entire operation overall tends to be rather serial.
So this is the goal:
...to see FFmpeg break through the 99.9% barrier in the CPU monitor. As an aside, it briefly struck me as ironic that people want FFmpeg to use as much of as many available CPUs as possible but scorn the project from my day job for being quite capable of doing the same.
Moving right along, let's see what can be done about exploiting what limited multithreading opportunities that Theora affords.
First off: it's necessary to explicitly enable threading at configure-time (e.g., "--enable-pthreads" for POSIX threads on Unix flavors). Not sure why this is, but there it is.
Read the rest of this entry »
Posted in VP3/Theora | 18 Comments »
September 19th, 2009 by Multimedia Mike
Google released the third version of their year-old Chrome browser this past week. This reminded me that they incorporate FFmpeg into the software (and thanks to the devs for making various fixes available to us). Chrome uses FFmpeg for decoding HTML5/video tag-type video and accompanying audio. This always makes me wonder, why would they use FFmpeg's Theora decoder? It sucks. I should know; I wrote it.
Last year, Reimar discovered that the VP3/Theora decoder spent the vast majority of its time decoding the coefficient stream. He proposed a fix that made it faster. I got a chance to check out the decoder tonight and profile it with OProfile and FFmpeg's own internal timer facilities. It turns out that the function named unpack_vlcs() is still responsible for 44-50% of the decoding time, depending on machine and sample file. This is mildly disconcerting considering the significant amount of effort I put forth to even make it that fast (it took a lot of VLC magic).
So a function in a multimedia program is slow? Well, throw assembly language and SIMD instructions at the problem! Right? It's not that simple with entropy decoders.
Reimar had a good idea in his patch and I took it to its logical conclusion: Optimize away the arrows, i.e., structure dereferences. The function insists on repeatedly grabbing items out of arrays from a context structure. Thus, create local pointers to the same arrays and save a bunch of dereferences through each of the innumerable iterations.
Results were positive-- both OProfile and the TSC-based internal counter showed notable improvements.
Ideas for further improvements: Multithreading is all the rage for video decoders these days. Unfortunately, entropy decoding is always a serial proposition. However, VP3/Theora is in a unique position to take advantage of another multithreading opportunity: It could call reverse_dc_prediction() in a separate thread after all the DC coefficients are decoded. Finally, an upside to the algorithm's unorthodox bitstream format! According to my OProfile reports, reverse_dc_prediction() consistently takes around 6-7% of the decode time. So it would probably be of benefit to remove that from the primary thread which would be busy with the AC coefficients.
Taking advantage of multiple threads would likely help with the render_slice() function. One thing at a time, though. Wish me luck with presenting the de-dereferencing patch to the list.
Posted in Programming, VP3/Theora | 4 Comments »
August 6th, 2009 by Multimedia Mike
I read something in the past few months which noted that, in this day and age, the ultimate phase of any tech startup's business plan is to be purchased by Google. Viewed through that lens, On2 is about to live the dream, even though they existed years before Google, years before most people even knew what a search engine was.
So Google has announced its intention to purchase On2. Wow, it feels like the end of an era. It seems like I've had some relationship with On2 for the entire 9 years I've been into multimedia hacking. Something that got lost in yesterday's coverage and commentary was that On2 started life as the Duck Corporation, also a codec company. During this period, they largely focused on gaming applications. But I'm pretty sure that RAD Game Tools kicked them out of that market with their Smacker and Bink technologies. However, files encoded with the Duck's multimedia codecs were among the first I studied back around 2000-2001. So that always makes me sentimental.
Read the rest of this entry »
Posted in On2/Duck, VP3/Theora | 12 Comments »
November 3rd, 2008 by Multimedia Mike
Wow, it seems like only yesterday that I downloaded the newly open sourced On2 VpVision source code package and started reverse engineering an English language description of the VP3 video coding algorithm. Well, actually, that was nearly 7 years ago. VP3 eventually formed the basis of the Xiph Theora video codec. And today the Theora project is pleased to announce that the codec is finally, well... final. It's out. No more alpha/beta phases. The codec is ready for primetime use and should be conquering the digital media frontier in short order.
You know, just like Vorbis.
Posted in Open Source Multimedia, VP3/Theora | 14 Comments »