{"id":1850,"date":"2009-09-24T22:29:58","date_gmt":"2009-09-25T05:29:58","guid":{"rendered":"http:\/\/multimedia.cx\/eggs\/?p=1850"},"modified":"2020-07-25T23:46:08","modified_gmt":"2020-07-26T06:46:08","slug":"multithreaded-ffmpeg-programming","status":"publish","type":"post","link":"https:\/\/multimedia.cx\/eggs\/multithreaded-ffmpeg-programming\/","title":{"rendered":"Multithreaded FFmpeg Programming"},"content":{"rendered":"<p>As briefly mentioned in <a href=\"http:\/\/multimedia.cx\/eggs\/optimizing-away-arrows\/\">my last Theora post<\/a>, I think <a href=\"http:\/\/ffmpeg.org\/\">FFmpeg&#8217;s<\/a> Theora decoder can exploit multiple CPUs in a few ways: 1) Perform all of the DC prediction reversals in a separate thread while the main thread is busy decoding the AC coefficients (meanwhile, I have committed an optimization where the reversal occurs immediately after DC decoding in order to exploit CPU cache); 2) create <em>n<\/em> separate threads and assign each <em>(num_slices \/ n)<\/em> slices to decode (where a slice is a row of the image that is 16 pixels high).<\/p>\n<p>So there&#8217;s the plan. Now, how to take advantage of FFmpeg&#8217;s threading API (which supports POSIX threads, Win32 threads, BeOS threads, and even OS\/2 threads)? Would it surprise you to learn that this aspect is not extensively documented? Time to reverse engineer the API.<\/p>\n<p>I also did some Googling regarding multithreaded FFmpeg. I mostly found forum posts complaining that FFmpeg isn&#8217;t effectively leveraging however many <em>n<\/em> cores someone&#8217;s turbo-charged machine happens to present to the OS, as demonstrated by their CPU monitoring tool. Since I suspect this post will rise in Google&#8217;s top search hits on the topic, <strong>allow me to apologize to searchers in advance by explaining<\/strong> that multimedia processing, while certainly CPU-intensive, does not necessarily lend itself to multithreading\/multiprocessing. 
There are a few bits here and there in the encode or decode processes that can be parallelized, but the entire operation overall tends to be rather serial.<\/p>\n<p>So this is the goal:<br \/>\n<center><br \/>\n<img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/multimedia.cx\/eggs\/wp-content\/uploads\/2009\/09\/ffmpeg-more-than-100-percent.png\" alt=\"Mac OS X Activity Monitor showing FFmpeg using more than 100% CPU\" title=\"Mac OS X Activity Monitor showing FFmpeg using more than 100% CPU\" width=\"399\" height=\"75\" class=\"aligncenter size-full wp-image-1857\" srcset=\"https:\/\/multimedia.cx\/eggs\/wp-content\/uploads\/2009\/09\/ffmpeg-more-than-100-percent.png 399w, https:\/\/multimedia.cx\/eggs\/wp-content\/uploads\/2009\/09\/ffmpeg-more-than-100-percent-300x56.png 300w\" sizes=\"auto, (max-width: 399px) 100vw, 399px\" \/><br \/>\n<\/center><br \/>\n&#8230;to see FFmpeg break through the 99.9% barrier in the CPU monitor. As an aside, it briefly struck me as ironic that people <em>want<\/em> FFmpeg to use as much of as many available CPUs as possible but <em>scorn<\/em> the <a href=\"http:\/\/blogs.adobe.com\/penguin.swf\/\">project from my day job<\/a> for being quite capable of doing the same.<\/p>\n<p><center><br \/>\n<img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/multimedia.cx\/eggs\/wp-content\/uploads\/2009\/09\/120px-Biggrin-smiley.svg.png\" alt=\"Big, fat, grinning smiley face\" title=\"Big, fat, grinning smiley face\" width=\"80\" height=\"80\" class=\"aligncenter size-full wp-image-1858\" \/><br \/>\n<\/center><\/p>\n<p>Moving right along, let&#8217;s see what can be done about exploiting the limited multithreading opportunities that Theora affords.<\/p>\n<p>First off: it&#8217;s necessary to explicitly enable threading at configure-time (e.g., &#8220;--enable-pthreads&#8221; for POSIX threads on Unix flavors). 
Not sure why this is, but there it is.<\/p>\n<p><!--more--> <\/p>\n<p>Next: FFmpeg apparently allocates a thread pool during initialization based on the -threads command line parameter. Codecs can exploit these threads by loading up an array of thread-specific context data structures and calling AVCodecContext::execute() with the context array, the number of threads to use, and a function to execute in each of those threads.<\/p>\n<p>Here is a working, simple example of a thread context data structure and a thread function (works circa FFmpeg SVN 20000):<br \/>\n<script src=\"https:\/\/gist.github.com\/multimediamike\/ad44e8c5391da2159677e82c4d034f9a.js\"><\/script><\/p>\n<p>Then, in some function from the main thread:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/multimediamike\/ddc01a77b29e6516a5b99d892218f3d4.js\"><\/script><\/p>\n<p>And here is the relevant output using &#8216;-threads 4&#8217;:<\/p>\n<p>[theora @ 0x1004000]there are 4 threads available<br \/>\n[theora @ 0x1004000]sending work to the threads and waiting&#8230;<br \/>\n[theora @ 0x1004000]sleeping for 3 seconds&#8230;<br \/>\n[theora @ 0x1004000]sleeping for 2 seconds&#8230;<br \/>\n[theora @ 0x1004000]sleeping for 5 seconds&#8230;<br \/>\n[theora @ 0x1004000]sleeping for 5 seconds&#8230;<br \/>\n[theora @ 0x1004000]done sleeping for 2 seconds<br \/>\n[theora @ 0x1004000]done sleeping for 3 seconds<br \/>\n[theora @ 0x1004000]done sleeping for 4 seconds<br \/>\n[theora @ 0x1004000]done sleeping for 5 seconds<br \/>\n[theora @ 0x1004000]thread 0 returned 2<br \/>\n[theora @ 0x1004000]thread 1 returned 3<br \/>\n[theora @ 0x1004000]thread 2 returned 4<br \/>\n[theora @ 0x1004000]thread 3 returned 5<\/p>\n<p>Those &#8220;sleeping for n seconds&#8230;&#8221; lines are a bit suspicious. Could it be that av_log() is not thread-safe (i.e., reentrant)? The rest of the output seems to indicate that the threading is working as expected. 
At least, that&#8217;s my story and I&#8217;m sticking to it unless someone can kindly demonstrate that there&#8217;s a deeper bug. Fortunately, I don&#8217;t intend to do a lot of terminal output from my worker threads.<\/p>\n<p>Emboldened by this simple experiment, I proceeded to multithread the VP3\/Theora renderer. This is the relevant loop:<\/p>\n<p><script src=\"https:\/\/gist.github.com\/multimediamike\/cfaa9e8fe6d9b88097012b42983b5647.js\"><\/script><\/p>\n<p>So the idea is to break that up into n threads where each thread renders (macroblock_height \/ n) slices. I almost have it working and it is measurably faster. Many frames seem to be correct, but other frames don&#8217;t pass my tests. So far, I have been validating optimizations numerically rather than visually: I captured the framecrc output of 3 VP3\/Theora files before I started optimization work and have been comparing it against the framecrc output after each optimization. Some of the CRCs don&#8217;t match when multithreaded.<\/p>\n<p>Here&#8217;s a critical lesson I have learned about optimization: <strong>Don&#8217;t trust any profiling statistics until the code is actually correct.<\/strong> I&#8217;ve seen bugs work both in and against my favor in this respect. (I once had an optimization that mostly worked but was only marginally faster than the original code; when I ironed out a few minor bugs, it was more than 3x faster. On the other end of the spectrum, my first pass at this multithreaded renderer was impressively fast&#8211; until I noticed that it was only rendering the first 1\/n part of the frame.) With that in mind, I am seeing notable speedups with this multithreaded rendering. According to the CPU monitor screenshot above, FFmpeg is using more than 1 CPU (and somehow spawns 7 threads with &#8216;-threads 2&#8217;). 
Decoding the first 60 seconds of the 1080p version of Big Buck Bunny went from 52.0s -&gt; 47.7s (wall clock time) on my Core 2 Duo 2.0 GHz Mac Mini. Be advised that this doesn&#8217;t necessarily translate into realtime HD Theora playback. The command line I am using is:<\/p>\n<pre>\r\nffmpeg -threads 2 -i big_buck_bunny_1080p_stereo.ogg \r\n  -f yuv4mpegpipe -an -t 60 -y \/dev\/null\r\n<\/pre>\n<p>which decodes only the first minute of video (no audio) to raw data and dumps it directly to \/dev\/null.<\/p>\n<p>So, when FFmpeg wants to multithread a task, it kicks off a set of homogeneous functions in n threads (and the main thread blocks while the worker threads are busy). Is this in the same spirit as <a href=\"http:\/\/en.wikipedia.org\/wiki\/Functional_programming\">functional programming<\/a>? That&#8217;s a topic I have been wanting to study more closely for a long time. FFmpeg&#8217;s model, however, makes it a bit difficult to follow through with my idea of processing the DC coefficient reversal in a separate thread after the quantized DC coefficients are decoded. It is possible that I could create 1 function to run in multiple threads that would do different things depending on the context data I pass in. 
I need to be sure that the threads don&#8217;t need to synchronize in any way since FFmpeg&#8217;s threading API has no provisions for that.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I managed to get part of FFmpeg&#8217;s Theora decoder partially multithreaded, and it seems to be mostly correct<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[33],"tags":[],"class_list":["post-1850","post","type-post","status-publish","format-standard","hentry","category-vp3theora"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts\/1850","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/comments?post=1850"}],"version-history":[{"count":19,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts\/1850\/revisions"}],"predecessor-version":[{"id":4641,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts\/1850\/revisions\/4641"}],"wp:attachment":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/media?parent=1850"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/categories?post=1850"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/tags?post=1850"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}