Let’s examine the types of tests I am deploying in the next revision of FATE, their specific syntax, and how they will be executed, both locally and remotely. Read through these specs and see if your idea of how to test FFmpeg is already listed. Otherwise, please leave a comment discussing more tests.
This is a long one…
I staged 19 new FATE tests today which will finally push FATE over 300 individual test specs once I activate them tomorrow. The first 12 were a dozen more fidelity range extension (“FRExt”) H.264 conformance vectors. Thanks to Carl Eugen Hoyos for doing the validation on these vectors and informing me of the right command line options to get the output correct.
These are the other tests I entered:
I’d also like to recognize Michael K. once more for his FATE contributions in the testing “other” platforms category. In the last week, he started contributing FATE results for x86_64 variations of both OpenSolaris and OpenBSD.
No sooner did I press “Publish” on my last post pertaining to multithreading FFmpeg’s Theora decoder, than did I receive an email from the theora-dev list regarding the release of Theora 1.1.0. It took them many, many years to release the first official version and about 10 months to get the second version out, so congratulations on that matter. This release includes the much-vaunted Thusnelda encoder which is supposed to offer substantial encoding improvements vs. the original 1.0 encoder.
So, fair warning: Be prepared for a new round of “Theora Bests H.264 / HTML5 Video Poised To Conquer Internet” type of stories.
Since I have been doing a bunch of optimizations to the FFmpeg Theora decoder this past week (a.k.a. the Theora decoder that most people actually use), I thought this would be the perfect opportunity to benchmark Theora 1.1 alongside FFmpeg’s decoder. Fortunately, libtheora has an example tool called dump_video that decodes video directly to YUV4MPEG2 format, the same way I was testing FFmpeg’s decoder.
FFmpeg command line:
ffmpeg -threads n -i big_buck_bunny_1080p_stereo.ogg -f yuv4mpegpipe -an -y /dev/null
Libtheora command line:
dump_video big_buck_bunny_1080p_stereo.ogg > /dev/null
The results (on my Core 2 Duo Mac Mini) were thus:
6:44 - FFmpeg, 1 thread
6:09 - FFmpeg, 2 threads *
4:51 - libtheora 1.1
* multithreaded version isn’t complete yet
Mind you, libtheora’s decoder is single-threaded and only has basic MMX SIMD optimizations. After seeing libtheora’s relative performance, I think I blacked out. Or maybe I just went to bed since it was so late; it’s sort of a blur. I awoke in a confused stupor wondering what I’m doing wrong in the FFmpeg Theora decoder. Why is it so slow? Actually, I know why: unpack_vlcs(), which continues to dominate profiling statistics. Perhaps the question I should start with is, how does libtheora unpack VLCs so quickly? That’s a good jumping-off point for a future investigation.
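To put the timings above on a common scale, here is a small sketch that converts the mm:ss figures to seconds and expresses each run relative to libtheora. The helper names (to_seconds, relative_to) are made up for this illustration, and Python is used only for brevity; the numbers are the three results quoted in this post.

```python
# Timings exactly as reported in the post (minutes:seconds).
TIMINGS = {
    "FFmpeg, 1 thread": "6:44",
    "FFmpeg, 2 threads": "6:09",
    "libtheora 1.1": "4:51",
}


def to_seconds(mmss):
    """Convert an 'm:ss' string to a whole number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def relative_to(baseline, timings):
    """Return each run's wall-clock time as a multiple of the baseline run."""
    base = to_seconds(timings[baseline])
    return {name: to_seconds(t) / base for name, t in timings.items()}


if __name__ == "__main__":
    ratios = relative_to("libtheora 1.1", TIMINGS)
    for name, ratio in ratios.items():
        print(f"{name}: {to_seconds(TIMINGS[name])} s, {ratio:.2f}x libtheora's time")
```

By this arithmetic, single-threaded FFmpeg takes roughly 1.39 times as long as libtheora on this file, and the second thread buys about a 1.09x speedup over the single-threaded run.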
As briefly mentioned in my last Theora post, I think FFmpeg’s Theora decoder can exploit multiple CPUs in a few ways: 1) Perform all of the DC prediction reversals in a separate thread while the main thread is busy decoding the AC coefficients (meanwhile, I have committed an optimization where the reversal occurs immediately after DC decoding in order to exploit CPU cache); 2) create n separate threads and assign each (num_slices / n) slices to decode (where a slice is a row of the image that is 16 pixels high).
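The second idea, handing each of n threads roughly (num_slices / n) slices, can be sketched as follows. This is only an illustration of the partitioning, not FFmpeg’s code: the function names are hypothetical, decode_slices is a placeholder that does no real decoding, and Python is used for brevity even though its threads will not give true CPU parallelism for this kind of work. The one detail taken from the post is that a slice is a row 16 pixels high.

```python
from concurrent.futures import ThreadPoolExecutor

SLICE_HEIGHT = 16  # per the post: a slice is a row of the image 16 pixels high


def slice_ranges(num_slices, n_threads):
    """Split slice indices 0..num_slices-1 into n_threads contiguous ranges.

    When num_slices is not a multiple of n_threads, the first
    (num_slices % n_threads) ranges each take one extra slice.
    """
    base, extra = divmod(num_slices, n_threads)
    ranges = []
    start = 0
    for i in range(n_threads):
        count = base + (1 if i < extra else 0)
        ranges.append(range(start, start + count))
        start += count
    return ranges


def decode_slices(rows):
    """Hypothetical stand-in for per-slice decoding; returns the indices handled."""
    return list(rows)


def decode_frame(height, n_threads):
    """Partition a frame of the given pixel height across n_threads workers."""
    num_slices = (height + SLICE_HEIGHT - 1) // SLICE_HEIGHT  # round up
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = list(pool.map(decode_slices, slice_ranges(num_slices, n_threads)))
    # pool.map preserves input order, so slices come back in image order.
    return [s for part in parts for s in part]


if __name__ == "__main__":
    # A 1080-pixel-high frame rounds up to 68 slices.
    for r in slice_ranges(68, 2):
        print(f"thread gets slices {r.start}..{r.stop - 1}")
```

Contiguous ranges are used here on the assumption that neighboring slices share cache-friendly data; an interleaved assignment would be an equally valid split.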
So there’s the plan. Now, how to take advantage of FFmpeg’s threading API (which supports POSIX threads, Win32 threads, BeOS threads, and even OS/2 threads)? Would it surprise you to learn that this aspect is not extensively documented? Time to reverse engineer the API.
I also did some Googling regarding multithreaded FFmpeg. I mostly found forum posts complaining that FFmpeg isn’t effectively leveraging however many n cores someone’s turbo-charged machine happens to present to the OS, as demonstrated by their CPU monitoring tool. Since I suspect this post will rise in Google’s top search hits on the topic, allow me to apologize to searchers in advance by explaining that multimedia processing, while certainly CPU-intensive, does not necessarily lend itself to multithreading/multiprocessing. There are a few bits here and there in the encode or decode processes that can be parallelized but the entire operation overall tends to be rather serial.
So this is the goal:
…to see FFmpeg break through the 99.9% barrier in the CPU monitor. As an aside, it briefly struck me as ironic that people want FFmpeg to use as much of as many available CPUs as possible but scorn the project from my day job for being quite capable of doing the same.
Moving right along, let’s see what can be done about exploiting the limited multithreading opportunities that Theora affords.
First off: it’s necessary to explicitly enable threading at configure-time (e.g., “--enable-pthreads” for POSIX threads on Unix flavors). Not sure why this is, but there it is.