{"id":2191,"date":"2010-02-08T19:12:33","date_gmt":"2010-02-09T03:12:33","guid":{"rendered":"http:\/\/multimedia.cx\/eggs\/?p=2191"},"modified":"2020-07-25T23:52:30","modified_gmt":"2020-07-26T06:52:30","slug":"profiling-and-optimizing-theora","status":"publish","type":"post","link":"https:\/\/multimedia.cx\/eggs\/profiling-and-optimizing-theora\/","title":{"rendered":"Profiling and Optimizing Theora"},"content":{"rendered":"<p>Because I have a short memory, I wanted to write down some of the knowledge and wisdom I&#8217;ve collected in the past few months as I have been working on optimizing <a href=\"http:\/\/ffmpeg.org\/\">FFmpeg&#8217;s<\/a> VP3\/Theora decoder.<\/p>\n<p><strong>Profiling Methods<\/strong><br \/>\nThese are some of the general tools:<br \/>\n<!--more--><\/p>\n<ul>\n<li>visual inspection via &#8216;top&#8217; and &#8216;time&#8217;: For interactive programs that are taking a lot of time to run, &#8216;top&#8217; can give a very coarse evaluation of CPU usage. For standalone programs that just need to crunch numbers to complete a task as quickly as possible, prepend the command with &#8216;time&#8217; to get a similarly coarse view of performance.<\/li>\n<li>oprofile: I like this tool a lot for profiling. Be advised that it will not work in a virtualized session.<\/li>\n<li>gprof: If you must. It gives you some decent information. The main point of the profilers is to sniff out general hotspots. For very fine-grained profiling&#8230;<\/li>\n<li>FFmpeg&#8217;s internal START_TIMER\/STOP_TIMER() macros: These tell you if your alleged optimizations are actually doing the trick. The results may surprise you.<\/li>\n<li>Remember to disable inlining for more accurate readings (-fno-inline-functions and -fno-inline-functions-called-once).<\/li>\n<\/ul>\n<p><strong>Recent Optimizations<\/strong><br \/>\nOptimizing in the multimedia domain often seems to come down to employing parallelizing SIMD instructions. However, most if not all of what can be SIMD-parallelized for VP3\/Theora has already been done for years in many cases. What I find interesting about these recent optimizations is that all have been at the algorithmic level:<\/p>\n<ul>\n<li>I <a href=\"http:\/\/multimedia.cx\/eggs\/one-gluttonous-if-statement\/\">turned a monotonous array traversal into a lean list traversal<\/a> for a giant improvement (no, I will <em>never<\/em> stop bragging about it).<\/li>\n<li>I <a href=\"http:\/\/multimedia.cx\/eggs\/optimizing-away-arrows\/\">removed a lot of structure dereference operations<\/a>. This is something that many observers state that the compiler should have managed automatically. But it didn&#8217;t and performing the same operation manually netted a performance improvement.<\/li>\n<li>Exploiting general knowledge of CPU memory caching, I moved a subsequent data operation (DC prediction reversal) to occur just after the data was decoded from the bitstream, thereby maximizing the probability that the coefficients will still be cached.<\/li>\n<li><a href=\"http:\/\/x264dev.multimedia.cx\/\">Dark Shikari<\/a> <a href=\"http:\/\/lists.mplayerhq.hu\/pipermail\/ffmpeg-devel\/attachments\/20091202\/0ca54b0a\/attachment.obj\">found some code<\/a> that looked like this:\n<pre>\r\nif (A && B)\r\n   { ... }\r\nif (A && C)\r\n   { ... }\r\nif (A && D)\r\n   { ... }\r\n<\/pre>\n<p>and factored out condition A to look like:<\/p>\n<pre>\r\n  if (A) {\r\n    if (B)\r\n      ...\r\n    if (C)\r\n      ...\r\n    if (D)\r\n      ...\r\n  }\r\n<\/pre>\n<p>thereby only evaluating condition A once per iteration. It may look minor but it makes a measurable difference.\n<\/li>\n<\/ul>\n<p><strong>Further Areas For Improvement<\/strong><br \/>\nThese are some more ideas for improving the performance of FFmpeg&#8217;s Theora decoder. It&#8217;s unlikely that I will have time or motivation to work on any of them in the near future. But I will be happy to consult (at least, for the ones I understand) if anyone else would like to take them on:<\/p>\n<ul>\n<li>Exploit the fact that many macroblocks move their 4 constituent luma blocks as a cohesive unit. This lends itself to using 16&#215;16 block functions rather than operating on 4 8&#215;8 blocks separately. This is actually tangentially related to SIMD optimization since the decoder can exploit SIMD functions to move 16 bytes at a time rather than 8. However, it will take some algorithmic rearrangement to determine when it&#8217;s safe to move a whole 16&#215;16 luma block.<\/li>\n<li>Study code around every continue statement. Is the code mindlessly iterating over another large (or even medium) array? If so, think about creating another list.<\/li>\n<li>Dark Shikari noticed that there are a lot of paranoid range checks in the code. This is no accident&#8211; I was extremely paranoid when I wrote the original code due to hard-won experience. I just wanted the code to <em>work right<\/em> at first. I think we&#8217;re well past that stage by now so it may be worth revisiting this code to see if any of those many checks are redundant or if they&#8217;re even necessary based on evaluating whether they can ever get into invalid states (accounting for the most maliciously invalid input).<\/li>\n<li>Implement partial IDCTs: The original VP3 decoder has special IDCTs to handle fragments where only the DC coefficient is non-zero or where only the first 3 or 10 coefficients are non-zero. A little printf reconnaissance to log coefficient count statistics clearly indicates that the 1-, 3- and 10-coefficient cases collectively occur far more frequently than the full 64-coefficient transform. The question is whether these transforms will necessarily translate to a notable optimization. This does involve SIMD. Thankfully, a lot of the work lives in the original VP3 source code which already sports (Win32) MMX and SSE2 variants for the 1-, 3-, 10-, and 64-coefficient IDCTs and (Mac) AltiVec versions of the 10- and 64-coefficient variations.<\/li>\n<li><a href=\"http:\/\/lists.mplayerhq.hu\/pipermail\/ffmpeg-devel\/2009-December\/079220.html\">Dark Shikari advised<\/a> that unpack_vectors() (which decodes the motion vector segment of a frame) requires an inordinate amount of CPU time and that it &#8220;can be made at least twice as fast.&#8221; No further details, but Monsieur Shikari generally has a keen eye for these kinds of hot spots. He just likes to speak in riddles, sort of.<\/li>\n<li>Decode coefficients in Hilbert order: <a href=\"http:\/\/multimedia.cx\/eggs\/theora-1-1-released\/#comment-150123\">David Conrad<\/a> has been advocating this approach for a long time and even has some partially working code to that effect. I admit that I still don&#8217;t understand what this approach entails. Dconrad reports that his prototype code demonstrates substantial speedup, even after my recent refactoring.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Make VP3\/Theora even faster in FFmpeg<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[33],"tags":[],"class_list":["post-2191","post","type-post","status-publish","format-standard","hentry","category-vp3theora"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts\/2191","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/comments?post=2191"}],"version-history":[{"count":13,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts\/2191\/revisions"}],"predecessor-version":[{"id":4645,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts\/2191\/revisions\/4645"}],"wp:attachment":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/media?parent=2191"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/categories?post=2191"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/tags?post=2191"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}