Breaking Eggs And Making Omelettes

Topics On Multimedia Technology and Reverse Engineering


Archives:

Profiling and Optimizing Theora

February 8th, 2010 by Multimedia Mike

Because I have a short memory, I wanted to write down some of the knowledge and wisdom I’ve collected in the past few months as I have been working on optimizing FFmpeg’s VP3/Theora decoder.

Profiling Methods
These are some of the general tools:

  • visual inspection via ‘top’ and ‘time’: For interactive programs that are taking a lot of time to run, ‘top’ can give a very coarse evaluation of CPU usage. For standalone programs that just need to crunch numbers to complete a task as quickly as possible, prepend the command with ‘time’ to get a similarly coarse view of performance.
  • oprofile: I like this tool a lot for profiling. Be advised that it will not work in a virtualized session.
  • gprof: If you must. It gives you some decent information. The main point of the profilers is to sniff out general hotspots. For very fine-grained profiling…
  • FFmpeg’s internal START_TIMER/STOP_TIMER macros: These tell you if your alleged optimizations are actually doing the trick. The results may surprise you.
  • Remember to disable inlining for more accurate readings (-fno-inline-functions and -fno-inline-functions-called-once).

Recent Optimizations
Optimizing in the multimedia domain often seems to come down to employing parallelizing SIMD instructions. However, most, if not all, of what can be SIMD-parallelized in VP3/Theora was done years ago. What I find interesting about these recent optimizations is that all have been at the algorithmic level:

  • I turned a monotonous array traversal into a lean list traversal for a giant improvement (no, I will never stop bragging about it).
  • I removed a lot of structure dereference operations. Many observers claim this is something the compiler should manage automatically. But it didn’t, and performing the same operation manually netted a performance improvement.
  • Exploiting general knowledge of CPU memory caching, I moved a subsequent data operation (DC prediction reversal) to occur just after the data was decoded from the bitstream, thereby maximizing the probability that the coefficients will still be cached.
  • Dark Shikari found some code that looked like this:
    if (A && B)
        { ... }
    if (A && C)
        { ... }
    if (A && D)
        { ... }

    and factored out condition A to look like:

    if (A) {
        if (B)
            ...
        if (C)
            ...
        if (D)
            ...
    }

    thereby only evaluating condition A once per iteration. It may look minor but it makes a measurable difference.

Further Areas For Improvement
These are some more ideas for improving the performance of FFmpeg’s Theora decoder. It’s unlikely that I will have time or motivation to work on any of them in the near future. But I will be happy to consult (at least, for the ones I understand) if anyone else would like to take them on:

  • Exploit the fact that many macroblocks move their 4 constituent luma blocks as a cohesive unit. This lends itself to using 16×16 block functions rather than operating on 4 8×8 blocks separately. This is actually tangentially related to SIMD optimization since the decoder can exploit SIMD functions to move 16 bytes at a time rather than 8. However, it will take some algorithmic rearrangement to determine when it’s safe to move a whole 16×16 luma block.
  • Study code around every continue statement. Is the code mindlessly iterating over another large (or even medium) array? If so, think about creating another list.
  • Dark Shikari noticed that there are a lot of paranoid range checks in the code. This is no accident; I was extremely paranoid when I wrote the original code due to hard-won experience, and I just wanted the code to work right at first. I think we’re well past that stage by now, so it may be worth revisiting this code to see whether the many checks are redundant or even necessary, by evaluating whether the guarded states can ever actually occur (accounting for the most maliciously invalid input).
  • Implement partial IDCTs: The original VP3 decoder has special IDCTs to handle fragments where only the DC coefficient is non-zero or where only the first 3 or 10 coefficients are non-zero. A little printf reconnaissance to log coefficient count statistics clearly indicates that the 1-, 3- and 10-coefficient cases collectively occur far more frequently than the full 64-coefficient transform. The question is whether these transforms will necessarily translate to a notable optimization. This does involve SIMD. Thankfully, a lot of the work lives in the original VP3 source code which already sports (Win32) MMX and SSE2 variants for the 1-, 3-, 10-, and 64-coefficient IDCTs and (Mac) AltiVec versions of the 10- and 64-coefficient variations.
  • Dark Shikari advised that unpack_vectors() (which decodes the motion vector segment of a frame) requires an inordinate amount of CPU time and that it “can be made at least twice as fast.” No further details, but Monsieur Shikari generally has a keen eye for these kinds of hot spots. He just likes to speak in riddles, sort of.
  • Decode coefficients in Hilbert order: David Conrad has been advocating this approach for a long time and even has some partially working code to that effect. I admit that I still don’t understand what this approach entails. Dconrad reports that his prototype code demonstrates substantial speedup, even after my recent refactoring.

Posted in VP3/Theora | 3 Comments »

3 Responses

  1. Diego E. “Flameeyes” Pettenò Says:

    You forgot about the perf tool from the kernel. As you use Gentoo, it’s dev-util/perf (yes I got to bump it).

  2. dconrad Says:

    Well, I only have one more bug to fix (introduced yesterday when cleaning up the calculation of when to do 16×16 MC) before I start sending patches for speedups ;) So far, it’s 20-30% faster depending on source and resolution.

    Some other things I’ve tried:
    Rendering in coding order also allows unpack_vectors() to become just a for loop that decodes X number of motion vectors, putting off block assignment and stack maintenance until immediately before MC; the function barely shows up on profiles any more.

    Keyframes by definition have all coded blocks, so reverse_dc_prediction and apply_loop_filter can have simplified intra-only versions.

  3. Multimedia Mike Says:

    Great news; I look forward to profiling the speedups (faster than libtheora yet? YET?)