{"id":1985,"date":"2009-12-01T00:05:50","date_gmt":"2009-12-01T08:05:50","guid":{"rendered":"http:\/\/multimedia.cx\/eggs\/?p=1985"},"modified":"2020-07-25T23:48:13","modified_gmt":"2020-07-26T06:48:13","slug":"one-gluttonous-if-statement","status":"publish","type":"post","link":"https:\/\/multimedia.cx\/eggs\/one-gluttonous-if-statement\/","title":{"rendered":"One Gluttonous if Statement"},"content":{"rendered":"<p>I&#8217;m still haunted by the slow VP3\/Theora decoder in <a href=\"http:\/\/ffmpeg.org\/\">FFmpeg<\/a>. While profiling reveals that most of the decoding time is spent in a function named unpack_vlcs(), I have been informed that finer-grained profiling indicates that the vast majority of <em>that<\/em> time is spent evaluating the following if statement:<\/p>\n<pre>\r\n  if (coeff_counts[fragment_num] > coeff_index)\r\n    continue;\r\n<\/pre>\n<p>I put counters before and after that statement and ran the good old <a href=\"http:\/\/www.bigbuckbunny.org\/\">Big Buck Bunny<\/a> 1080p movie through it. These numbers indicate, per frame, the number of times the if statement is evaluated and the number of times execution falls through it to decode a coefficient:<\/p>\n<pre>\r\n[theora @ 0x1004000]3133440, 50643\r\n[theora @ 0x1004000]15360, 722\r\n[theora @ 0x1004000]15360, 720\r\n[theora @ 0x1004000]0, 0\r\n[theora @ 0x1004000]13888, 434\r\n[theora @ 0x1004000]631424, 10711\r\n[theora @ 0x1004000]2001344, 36922\r\n[theora @ 0x1004000]1298752, 22897\r\n[...]\r\n<\/pre>\n<p>Thus, while decoding the first frame, the if statement is evaluated over 3 million times but further action (i.e., decoding a coefficient) is only performed about 50,000 times. Based on the above sample, each frame sees roughly two orders of magnitude more evaluations than are necessary.<\/p>\n<p><em>Clearly, that if statement needs to go.<\/em><\/p>\n<p><!--more--><\/p>\n<p>A frame of VP3\/Theora video is composed of a series of 8&#215;8-sample fragments spread across the Y, U, and V planes. 
As the VLCs are decoded, the fragments are filled in with up to 64 coefficients each. Most of the coefficients are supposed to be zero (this assumption is a big part of the compression). The algorithm decodes all of the 0th (DC) coefficients for all fragments first, then all the 1st (AC) coefficients, then the 2nd, and so on down to the 63rd coefficients. However, since many fragments encode far fewer than 64 non-zero coefficients, many fragments are marked as end-of-block (EOB) after their last non-zero coefficient is decoded. Whether a fragment has already been completely decoded is exactly the condition that the gluttonous if statement evaluates.<\/p>\n<p><center><br \/>\n<img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/multimedia.cx\/eggs\/wp-content\/uploads\/2009\/11\/theora-fragment-array.png\" alt=\"Theora fragment array\" title=\"Theora fragment array\" width=\"329\" height=\"39\" class=\"aligncenter size-full wp-image-1987\" srcset=\"https:\/\/multimedia.cx\/eggs\/wp-content\/uploads\/2009\/11\/theora-fragment-array.png 329w, https:\/\/multimedia.cx\/eggs\/wp-content\/uploads\/2009\/11\/theora-fragment-array-300x35.png 300w\" sizes=\"auto, (max-width: 329px) 100vw, 329px\" \/><br \/>\n<\/center><\/p>\n<p>The decoder essentially maintains an array indicating &#8212; true or false &#8212; whether a fragment&#8217;s coefficients have yet to be fully decoded (not exactly true, but close enough). Here&#8217;s a crazy idea drawn from concepts presented in any introductory computer science course: use a list instead of an array. 
In doing so, the decoder can drop fragments from the list once they have reached EOB and then traverse only the fragments that still have coefficients in the bitstream.<\/p>\n<p><center><br \/>\n<img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/multimedia.cx\/eggs\/wp-content\/uploads\/2009\/11\/theora-fragment-list.png\" alt=\"Theora fragment list\" title=\"Theora fragment list\" width=\"332\" height=\"76\" class=\"aligncenter size-full wp-image-1988\" srcset=\"https:\/\/multimedia.cx\/eggs\/wp-content\/uploads\/2009\/11\/theora-fragment-list.png 332w, https:\/\/multimedia.cx\/eggs\/wp-content\/uploads\/2009\/11\/theora-fragment-list-300x68.png 300w\" sizes=\"auto, (max-width: 332px) 100vw, 332px\" \/><br \/>\n<\/center><\/p>\n<p>I feel a little silly for not having thought of this sooner, but to be fair, no one else thought of it either. Or perhaps someone did and was justifiably daunted by the cascading data structures involved in the decoding process.<\/p>\n<p>So I implemented the list. I&#8217;ll spare you the fine details except to mention that I can&#8217;t eliminate the if statement entirely, but I did reduce its iterations drastically. As a first-level performance measurement, I checked how many iterations are performed now vs. before. Compare these numbers with those from the beginning of the post:<\/p>\n<pre>\r\n[theora @ 0x1004000]57841, 50643\r\n[theora @ 0x1004000]5280, 722\r\n[theora @ 0x1004000]5280, 720\r\n[theora @ 0x1004000]0, 0\r\n[theora @ 0x1004000]868, 434\r\n[theora @ 0x1004000]17679, 10711\r\n[theora @ 0x1004000]37659, 36922\r\n[theora @ 0x1004000]25435, 22897\r\n<\/pre>\n<p>Tens of thousands of iterations rather than several million? Doing well so far. Let&#8217;s evaluate the performance on a more macro level by measuring wall-clock execution time (best of 2 runs) for decoding the entire Big Buck Bunny 1080p short with FFmpeg, both before and after the optimization, and with libtheora 1.1. 
Command lines:<\/p>\n<pre>\r\ntime ffmpeg -an -i big_buck_bunny_1080p_stereo.ogg -f null -y \/dev\/null\r\ntime dump_video big_buck_bunny_1080p_stereo.ogg > \/dev\/null\r\n<\/pre>\n<p>The results on my Core 2 Duo 2.0 GHz Mac Mini (Mac OS X 10.5) were thus:<\/p>\n<pre>\r\n6:36 - FFmpeg, before list optimization\r\n5:06 - FFmpeg, after list optimization\r\n4:22 - libtheora 1.1\r\n<\/pre>\n<p>So, substantial speedup, if I do say so myself.<\/p>\n<p>A curious item about the data is the user vs. sys times. Here is the raw &#8216;time&#8217; output for each of the best runs:<\/p>\n<pre>\r\nFFmpeg, before VP3 list optimization:\r\nreal    6m36.337s\r\nuser    6m18.659s\r\nsys     0m16.217s\r\n\r\nFFmpeg, after VP3 list optimization:\r\nreal    5m6.914s\r\nuser    4m50.535s\r\nsys     0m15.838s\r\n\r\nlibtheora:\r\nreal    4m22.110s\r\nuser    4m19.744s\r\nsys     0m1.756s\r\n<\/pre>\n<p>Evaluating performance based on user time instead of real time further narrows the gap between FFmpeg and libtheora. I&#8217;m open to theories (and hard data) explaining the wide discrepancy between the sys times of the two programs. I also need to note that this round of testing took into account <a href=\"http:\/\/multimedia.cx\/eggs\/theora-1-1-released\/#comment-150123\">dconrad&#8217;s advice from the last post on this matter<\/a>, which is why I tested FFmpeg using &#8216;-f null&#8217; instead of &#8216;-f yuv4mpegpipe&#8217; and modified libtheora&#8217;s dump_video.c program to comment out this fwrite() call:<\/p>\n<pre>\r\nfwrite(ycbcr[pli].data+ycbcr[pli].stride*i, 1,\r\n  ycbcr[pli].width, outfile);\r\n<\/pre>\n<p>I hope to get this patch approved and into FFmpeg SVN soon. It will be interesting to profile the VP3\/Theora decoder afterwards since, ideally, unpack_vlcs() will not dominate the statistics as it did before. 
We might find some other notable hotspots.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Making FFmpeg&#8217;s VP3\/Theora decoder waaaaaay faster<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[28,33],"tags":[],"class_list":["post-1985","post","type-post","status-publish","format-standard","hentry","category-programming","category-vp3theora"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts\/1985","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/comments?post=1985"}],"version-history":[{"count":18,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts\/1985\/revisions"}],"predecessor-version":[{"id":4643,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts\/1985\/revisions\/4643"}],"wp:attachment":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/media?parent=1985"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/categories?post=1985"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/tags?post=1985"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}