Category Archives: General

The New Samples Regime

A little while ago, I got a big head over the fact that I owned and controlled the feared and revered MPlayer samples archive. This is the repository that retains more than a decade of multimedia samples.

Conflict
Where once there was one multimedia project (FFmpeg), there are now 2 (also Libav). There were various political and technical snafus regarding the previous infrastructure. I volunteered to take over hosting the vast samples archive (53 GB at the time) at samples.mplayerhq.hu (s.mphq for this post).

However, a brand new server is online at samples.libav.org (s.libav for this post).

Policies
The server at s.libav will be the authoritative samples repository going forward. Why does s.libav receive the honor? Mostly by virtue of having more advanced features. My simple (yet bandwidth-rich) web hosting plan does not provide for rsync or anonymous FTP services, both of which have traditionally been essential for the samples server. In the course of hosting s.mphq for the past few months, a few more discrepancies have come to light– apparently, the symlinks weren’t properly replicated. And perhaps most unusual is that if a directory contains a README file, it won’t be displayed in the directory listing (which frustrated me greatly when I couldn’t find this README file that I carefully and lovingly crafted years ago).

The s.mphq archive will continue to exist — nay, must exist — going forward since there are years’ worth of web links pointing into it. I’ll likely set up a mirroring script that periodically (daily) rsyncs from s.libav to my local machine and then uses lftp (the best facility I have available) to mirror the files up to s.mphq.

Also, since we’re starting fresh with a new upload directory, I think we need to be far more ruthless about policing its content. This means making sure that anything that is uploaded has an accompanying file which explains why it’s there and ideally links the sample to a bug report somewhere. No explanation = sample terminated.

RSS
I think it would be nifty to have an RSS feed that shows the latest samples to appear in the repository. I figure that I can use the Unix ‘find’ command on my local repository in concert with something like PyRSS2Gen to accomplish this goal.

Monetization
In the few months that I have been managing the repository, I have had numerous requests for permission to leech the entire collection in one recursive web-suck. These requests often from commercial organizations who wish to test their multimedia product on a large corpus of diverse samples. Personally, I believe the archive makes a rather poor corpus for such an endeavor, but so be it. Go ahead; hosting this archive barely makes a dent in my fairly low-end web hosting plan. However, at least one person indicated that it might be easier to mail a hard drive to me, have me copy it, and send it back.

This got me thinking about monetization opportunities. Perhaps, I should provide a service to send HDs filled with samples for the cost of the HD, shipping, and a small donation to the multimedia projects. I immediately realized that that is precisely the point at which the vast multimedia samples archive — with all of its media of questionable fair use status — would officially run afoul of copyright laws.

Which brings me to…

Clean Up
I think we need to clean up some samples, starting with the ones that were marked not-readable in the old repository. Apparently, some ‘samples’ were, e.g., full anime videos and were responsible for a large bandwidth burden when linked from various sources.

We multimedia nerds are a hoarding lot, never willing to throw anything away. This will probably the most challenging proposal to implement.

Playing With File

I played with the ‘file’ utility a long time ago because I wanted to make it recognize a large number of multimedia formats. I had trouble getting my changes to take. But I’m prepared to try again after many years.

Aiming at the Corpus
In my local mirror of the MPlayerHQ samples archive, I find 9853 unique files. So I run all of them through the ‘file’ command:

  'find /path/to/samples -type f -print0 | xargs -0 file --no-pad'

My Ubuntu installation has file v5.04. I also tested against 5.07 and the latest, 5.08. Here is the number of files each version was unable to identify (generically marking as ‘data’):

5.04  1521
5.07  1405
5.08  1501

That seems like a regression for v5.08 until I dug into the details and saw quite a few items like this, indicating that the MPEG detection could use some work:

-mov/mov-demux-infinite-loop.mpg: DOS-executable (
+mov/mov-demux-infinite-loop.mpg: data
-image-samples/UNeedQT4.pntg: DOS-executable (
+image-samples/UNeedQT4.pntg: data

Workflow
These are just notes to myself and perhaps anyone else who wants to add new file formats to be identified by the ‘file’ command.

First, download either the latest release from the FTP or clone from Github. Do the usual unpack, ‘./configure’, ‘make’ routine. To use this newly-built version and its associated magic file:

  ./src/file --magic-file magic/magic.mgc <file>

To add a new format for ID, first, run the foregoing command to ensure that it’s not already identified. Then, check over the files in magic/Magdir and see which one might pertain to what you’re doing (it’s unlikely that your format will merit a new file in this directory). For example, for this round, I modified animation, audio, iff, and riff. Add or modify existing specs based on the copious examples in the directory and by consulting the appropriate man page (‘man 5 magic’).

Finally, run ‘make’ again which will regenerate the magic file. Invoke the above command again to use the modified magic file.

Before and After
On a selection of formats taken from the samples archive (renamed and cut down to a kilobyte because detection typically only relies on the first few bytes), here is the “before”:

amv:            RIFF (little-endian) data
armovie:        data
bbc-dirac:      data
interplay-mve:  data
mtv:            data
nintendo-thp:   data
nullsoft-video: data
redcode:        data
sega-film:      data
smacker:        data
trueaudio:      data
vqa:            IFF data
wavpack:        data
wc3-mve:        IFF data
wtv:            data

And the “after”:

amv:            RIFF (little-endian) data, AMV 
armovie:        ARMovie
bbc-dirac:      BBC Dirac Video
interplay-mve:  Interplay MVE Movie
mtv:            MTV Multimedia File
nintendo-thp:   Nintendo THP Multimedia
nullsoft-video: Nullsoft Video
redcode:        REDCode Video
sega-film:      Sega FILM/CPK Multimedia, 320 x 224
smacker:        RAD Game Tools Smacker Multimedia version 2, 320 x 200, 100 frames
trueaudio:      True Audio Lossless Audio
vqa:            IFF data, Westwood Studios VQA Multimedia, 418 video frames, 320 x 200
wavpack:        WavPack Lossless Audio
wc3-mve:        IFF data, Wing Commander III Video, PC version
wtv:            Windows Television DVR Media

After rerunning ‘file’ on the mphq corpus using the modified magic file, only 1329 files remain unidentified (down from 1501).

Going Forward
As mentioned, MPEG detection could probably be strengthened. However, a major weakness is QuickTime/MP4. Many files are not detected, probably owing to the many ways that QuickTime files can begin.

Physical Calculus Education

I have never claimed to be especially proficient at math. I did take Advanced Placement calculus in my senior year of high school. While digging through some boxes, I found an old grade report from that high school year. I wondered what motivated me to save it. Maybe it’s because it offered this clue as to why I can’t perform adequately in math class:



Mystery solved: I did not wear proper P.E. attire to calculus class.

Further SMC Encoding Work

Sometimes, when I don’t feel like doing anything else, I look at that Apple SMC video encoder again.

8-bit Encoding
When I last worked on the encoder, I couldn’t get the 8-color mode working correctly, even though the similar 2- and 4-color modes were working fine. I chalked the problem up to the extreme weirdness in the packing method unique to the 8-color mode. Remarkably, I had that logic correct the first time around. The real problem turned out to be with the 8-color cache and it was due to the vagaries of 64-bit math in C. Bit shifting an unsigned 8-bit quantity implicitly results in a signed 32-bit quantity, or so I discovered.

Anyway, the 8-color encoding works correctly, thus shaving a few more bytes off the encoding size.

Encoding Scheme Oddities
The next step is to encode runs of data. This is where I noticed some algorithmic oddities in the scheme that I never really noticed before. There are 1-, 2-, 4-, 8-, and 16-color modes. Each mode allows encoding from 1-256 blocks of that same encoding. For example, the byte sequence:

  0x62 0x45

Specifies that the next 3 4×4 blocks are encoded with single-color mode (of byte 0x62, high nibble is encoding mode and low nibble is count-1 blocks) and the palette color to be used is 0x45. Further, opcode 0x70 is the same except the following byte allows for specifying more than 16 (i.e., up to 256) blocks shall be encoded in the same matter. In light of this repeat functionality being built into the rendering opcodes, I’m puzzled by the existence of the repeat block opcodes. There are opcodes to repeat the prior block up to 256 times, and there are opcodes to repeat the prior pair of blocks up to 256 times.

So my quandary is: What would the repeat opcodes be used for? I hacked the FFmpeg / Libav SMC decoder to output a histogram of which opcodes are used. The repeat pair opcodes are never seen. However, the single-repeat opcodes are used a few times.

Puzzle Solved?
I’m glad I wrote this post. Just as I was about to hit “Publish”, I think I figured it out. I haven’t mentioned the skip opcodes yet– there are opcodes that specify that 1-256 4×4 blocks are unchanged from the previous frame. Conceivably, a block could be unchanged from the previous frame and then repeated 1-256 times from there.

That’s something I hadn’t thought of up to this point for my proposed algorithm and will require a little more work.

Further reading