Breaking Eggs And Making Omelettes

Topics On Multimedia Technology and Reverse Engineering


Archives:

One Gluttonous if Statement

December 1st, 2009 by Multimedia Mike

I’m still haunted by the slow VP3/Theora decoder in FFmpeg. While profiling reveals that most of the decoding time is spent in a function named unpack_vlcs(), I have been informed that finer-grained profiling indicates that the vast majority of that time is spent evaluating the following if statement:

  1. if (coeff_counts[fragment_num] > coeff_index)
  2.     continue;

I put counters before and after that statement and ran the good old Big Buck Bunny 1080p movie through it. These numbers indicate, per frame, the number of times the if statement is evaluated and the number of times it evaluates to true:

[theora @ 0x1004000]3133440, 50643
[theora @ 0x1004000]15360, 722
[theora @ 0x1004000]15360, 720
[theora @ 0x1004000]0, 0
[theora @ 0x1004000]13888, 434
[theora @ 0x1004000]631424, 10711
[theora @ 0x1004000]2001344, 36922
[theora @ 0x1004000]1298752, 22897
[...]

Thus, while decoding the first frame, the if statement is evaluated over 3 million times but further action (i.e., decoding a coefficient) is only performed 50,000 times. Based on the above sample, each frame sees about 2 orders of magnitude more evaluations than are necessary.

Clearly, that if statement needs to go.

A frame of VP3/Theora video is comprised of a series of 8×8-sample fragments spread across the Y, U, and V frames. As the VLCs are decoded, the fragments are filled in with 64 coefficients. Most of the coefficients are supposed to be zero (this assumption is a big part of the compression). The algorithm decodes all of the 0th (DC) coefficients for all fragments first, then all the 1st (AC) coefficients, then 2nd, and so on down to all 63rd coefficients. However, since many fragments encode well fewer than 64 non-zero coefficients, many fragments are marked as end-of-block (EOB) after decoding the last non-zero coefficient for the fragment. Whether a fragment has been completely decoded yet is the condition that the gluttonous if statement is evaluating.


Theora fragment array

The decoder essentially maintains an array indicating — true or false — whether a fragment’s coefficients have yet to be fully decoded (not exactly true, but close enough). Here’s a crazy idea drawing from concepts presented in any introductory computer science course: use a list instead of an array. In doing so, the decoder can drop fragments from the list once they have reached EOB and then only traverse the fragments that still have coefficients in the bitstream.


Theora fragment list

I feel a little silly for not having thought of this sooner. But to be fair, no one else did, either. Though it’s not unreasonable to think that someone did, but was justifiably daunted by the cascading data structures in place for the decoding process.

So I implemented the list. I’ll spare you the fine details except to mention that I can’t eliminate the if statement entirely. But I did reduce its iterations drastically. As a first level performance measurement, I checked how many iterations are now performed vs. before. Compare these numbers with those from the beginning of the post:

[theora @ 0x1004000]57841, 50643
[theora @ 0x1004000]5280, 722
[theora @ 0x1004000]5280, 720
[theora @ 0x1004000]0, 0
[theora @ 0x1004000]868, 434
[theora @ 0x1004000]17679, 10711
[theora @ 0x1004000]37659, 36922
[theora @ 0x1004000]25435, 22897

Tens of thousands of iterations rather than several million? Doing well so far. Let’s evaluate the performance on more of a macro level by measuring wall clock execution time while decoding the entire Big Buck Bunny 1080p short, best run out of 2, between FFmpeg (before and after optimization) as well as libtheora 1.1. Command lines:

time ffmpeg -an -i big_buck_bunny_1080p_stereo.ogg -f null -y /dev/null
time dump_video big_buck_bunny_1080p_stereo.ogg > /dev/null

The results on my Core 2 Duo 2.0 GHz Mac Mini (Mac OS X 10.5) were thus:

6:36 - FFmpeg, before list optimization
5:06 - FFmpeg, after list optimization
4:22 - libtheora 1.1

So, substantial speedup, if I do say so myself.

A curious item about the data is the user vs. sys times. Here is the raw ‘time’ output for each of the best runs:

FFmpeg, before VP3 list optimization:
real    6m36.337s
user    6m18.659s
sys     0m16.217s

FFmpeg, after VP3 list optimations:
real    5m6.914s
user    4m50.535s
sys     0m15.838s

libtheora:
real    4m22.110s
user    4m19.744s
sys     0m1.756s

Evaluating performance based on user time instead of real time further narrows the gap between FFmpeg and libtheora. I’m open to theories (and hard data) explaining the wide discrepancy between the sys times for each program. I also need to note that this round of testing took into account dconrad’s advice from the last post on this matter which is why I tested FFmpeg using ‘-f null’ instead of ‘-f yuv4mpegpipe’ while modifying libtheora’s dump_video.c program to comment out this fwrite() call:

  1. fwrite(ycbcr[pli].data+ycbcr[pli].stride*i, 1,
  2.   ycbcr[pli].width, outfile);

I hope to get this patch approved and into FFmpeg SVN soon. It will be interesting to profile the VP3/Theora decoder afterwards since, ideally, unpack_vlcs() will not dominate the statistics as before. We might find some other notable hotspots.

Posted in Programming, VP3/Theora | 7 Comments »

7 Responses

  1. Kostya Says:

    Gee, and remember that issue you commented on – about lavc APE decoder being slower than reference one? More fine-tuned profiling is definitely a need.

  2. Phil Says:

    A short strace of your above ffmpeg command reveils a lot of select()’s on FD 1, I guess that is for handling the case when the user presses q to abort.

    For my test file 106942 out of 112057 system-calls have been to the select function, thats more then 95%.

    Another issue is that local files should be read with mmap and the madvise flag set to MADV_SEQUENTIAL, instead regular read()’s are used.

    Just my 2 cents :)

  3. Vitor Says:

    Your patch seems to introduce a valgrind error:

    vitor@vitor-laptop:~/ffmpeg/ffmpeg.4$ valgrind ./ffmpeg_g -i ~/ffmpeg/tests.old/big_buck_bunny_1080p_stereo.ogg -an -vcodec rawvideo -f rawvideo -y /dev/null
    ==12866== Memcheck, a memory error detector
    ==12866== Copyright (C) 2002-2009, and GNU GPL’d, by Julian Seward et al.
    ==12866== Using Valgrind-3.5.0-Debian and LibVEX; rerun with -h for copyright info
    ==12866== Command: ./ffmpeg_g -i /home/vitor/ffmpeg/tests.old/big_buck_bunny_1080p_stereo.ogg -an -vcodec rawvideo -f rawvideo -y /dev/null
    ==12866==
    FFmpeg version SVN-r20680, Copyright (c) 2000-2009 Fabrice Bellard, et al.
    built on Dec 1 2009 13:50:36 with gcc 4.4.1
    configuration: –cc=’ccache gcc’ –cpu=host
    libavutil 50. 5. 1 / 50. 5. 1
    libavcodec 52.42. 0 / 52.42. 0
    libavformat 52.39. 2 / 52.39. 2
    libavdevice 52. 2. 0 / 52. 2. 0
    libswscale 0. 7. 2 / 0. 7. 2
    [theora @ 0x4484480]7 bits left in packet 82
    Input #0, ogg, from ‘/home/vitor/ffmpeg/tests.old/big_buck_bunny_1080p_stereo.ogg’:
    Duration: 00:09:56.45, start: 0.000000, bitrate: 12191 kb/s
    Stream #0.0: Video: theora, yuv420p, 1920×1080, 24 tbr, 24 tbn, 24 tbc
    Stream #0.1: Audio: vorbis, 48000 Hz, stereo, s16, 192 kb/s
    [theora @ 0x4484480]7 bits left in packet 82
    Output #0, rawvideo, to ‘/dev/null’:
    Stream #0.0: Video: rawvideo, yuv420p, 1920×1080, q=2-31, 200 kb/s, 90k tbn, 24 tbc
    Stream mapping:
    Stream #0.0 -> #0.0
    Press [q] to stop encoding
    ==12866== Invalid write of size 4 0kB time=0.04 bitrate= 0.2kbits/s
    ==12866== at 0x8390F59: vp3_decode_frame (vp3.c:746)
    ==12866== by 0x81150F4: avcodec_decode_video2 (utils.c:582)
    ==12866== by 0x806FB0F: output_packet (ffmpeg.c:1322)
    ==12866== Address 0x473203c is 4 bytes before a block of size 195,840 alloc’d
    ==12866== at 0x4023E2E: memalign (vg_replace_malloc.c:532)
    ==12866== by 0x4023E8B: posix_memalign (vg_replace_malloc.c:660)
    ==12866== by 0x8525350: av_malloc (mem.c:66)
    ==12866== by 0x8066DAB: theora_decode_init (vp3.c:2415)
    ==12866== by 0x8116B5D: avcodec_open (utils.c:487)
    ==12866== by 0x80737B9: T.706 (ffmpeg.c:2050)
    ==12866==
    frame= 2 fps= 1 q=0.0 Lsize= 0kB time=0.08 bitrate= 0.1kbits/s
    video:6075kB audio:0kB global headers:0kB muxing overhead -99.999984%
    ==12866==
    ==12866== HEAP SUMMARY:
    ==12866== in use at exit: 3,758 bytes in 3 blocks
    ==12866== total heap usage: 772 allocs, 769 frees, 80,916,757 bytes allocated
    ==12866==
    ==12866== LEAK SUMMARY:
    ==12866== definitely lost: 3,758 bytes in 3 blocks
    ==12866== indirectly lost: 0 bytes in 0 blocks
    ==12866== possibly lost: 0 bytes in 0 blocks
    ==12866== still reachable: 0 bytes in 0 blocks
    ==12866== suppressed: 0 bytes in 0 blocks
    ==12866== Rerun with –leak-check=full to see details of leaked memory
    ==12866==
    ==12866== For counts of detected and suppressed errors, rerun with: -v
    ==12866== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 33 from 8)

  4. compn Says:

    well done!

  5. Multimedia Mike Says:

    Vitor: Thanks for the valgrind intelligence. The statement in question is a pretty silly logic error on my part.

  6. Pengvado Says:

    @Kostya
    What? ffape is 10-25% faster than the reference implementation on x86_64. (10% for low compression levels dominated by entropy decoding, 25% for high compression levels dominated by lpc)

    50% of entropy decoding (30% of the total on the low end) is divisions. And unlike lagarith, I don’t know how to get rid of these.

  7. Multimedia Mike Says:

    @Pengvado: The APE thing references FFmpeg issue 1542 where a user discovered that FFmpeg’s implementation is substantially slower than another one which implies room for improvement.