Breaking Eggs And Making Omelettes

Topics On Multimedia Technology and Reverse Engineering



The New Samples Regime

November 30th, 2011 by Multimedia Mike

A little while ago, I got a big head over the fact that I owned and controlled the feared and revered MPlayer samples archive. This is the repository that retains more than a decade of multimedia samples.

Conflict
Where once there was one multimedia project (FFmpeg), there are now 2 (also Libav). There were various political and technical snafus regarding the previous infrastructure. I volunteered to take over hosting the vast samples archive (53 GB at the time) at samples.mplayerhq.hu (s.mphq for this post).

However, a brand new server is online at samples.libav.org (s.libav for this post).

Policies
The server at s.libav will be the authoritative samples repository going forward. Why does s.libav receive the honor? Mostly by virtue of having more advanced features. My simple (yet bandwidth-rich) web hosting plan does not provide for rsync or anonymous FTP services, both of which have traditionally been essential for the samples server. In the course of hosting s.mphq for the past few months, a few more discrepancies have come to light. Apparently, the symlinks weren’t properly replicated. And perhaps most unusual is that if a directory contains a README file, it won’t be displayed in the directory listing (which frustrated me greatly when I couldn’t find this README file that I carefully and lovingly crafted years ago).

The s.mphq archive will continue to exist — nay, must exist — going forward since there are years’ worth of web links pointing into it. I’ll likely set up a mirroring script that periodically (daily) rsyncs from s.libav to my local machine and then uses lftp (the best facility I have available) to mirror the files up to s.mphq.
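A rough sketch of what that two-hop mirror might look like, driven from Python. The rsync module path, the local directory, the FTP host, and the remote directory are all placeholders I made up for illustration; only the tools (rsync, then lftp) come from the plan above.

```python
import subprocess

# All four values below are hypothetical placeholders, not real settings.
RSYNC_SOURCE = "rsync://samples.libav.org/samples/"  # assumed rsync module path
LOCAL_DIR = "/srv/samples/"                          # assumed local staging copy
LFTP_HOST = "ftp.example.com"                        # stand-in for the s.mphq upload host
REMOTE_DIR = "/samples"                              # assumed remote directory


def build_commands(source=RSYNC_SOURCE, local_dir=LOCAL_DIR,
                   host=LFTP_HOST, remote_dir=REMOTE_DIR):
    """Return the two commands: pull with rsync, then push with lftp."""
    # -a keeps symlinks, permissions, and timestamps; --delete drops local
    # files that have disappeared from the source.
    rsync_cmd = ["rsync", "-a", "--delete", source, local_dir]
    # 'mirror -R' is lftp's reverse mirror: upload local_dir to remote_dir.
    lftp_script = f"mirror -R --delete {local_dir} {remote_dir}; quit"
    lftp_cmd = ["lftp", "-e", lftp_script, host]
    return [rsync_cmd, lftp_cmd]


def run_mirror(commands, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        runner(cmd, check=True)
```

Calling `run_mirror(build_commands())` from a daily cron job would cover the "periodically" part. Credentials for lftp are deliberately left out here; they would have to come from wherever the hosting account keeps them.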

Also, since we’re starting fresh with a new upload directory, I think we need to be far more ruthless about policing its content. This means making sure that anything that is uploaded has an accompanying file which explains why it’s there and ideally links the sample to a bug report somewhere. No explanation = sample terminated.
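The "no explanation" check could be automated along these lines. This is only a sketch: the convention that a sample named `foo.avi` is explained by a sibling file named `foo.avi.txt` is my own assumption for the example, not an established rule of the upload directory.

```python
from pathlib import Path

# Assumed convention: "<sample name>.txt" sits next to each sample.
EXPLANATION_SUFFIX = ".txt"


def find_unexplained(upload_dir, suffix=EXPLANATION_SUFFIX):
    """Return files under upload_dir that lack an accompanying explanation."""
    upload_dir = Path(upload_dir)
    files = {p for p in upload_dir.rglob("*") if p.is_file()}
    unexplained = []
    for path in sorted(files):
        # Skip files that are themselves the explanation for an existing sample.
        if path.name.endswith(suffix):
            sample = path.with_name(path.name[: -len(suffix)])
            if sample in files:
                continue
        # Anything else needs its own explanation file alongside it.
        if path.with_name(path.name + suffix) not in files:
            unexplained.append(path)
    return unexplained
```

Note that an explanation file whose sample is missing gets reported too, which seems like the right behavior for a ruthless policy. Whether the explanation actually links to a bug report would still need a human eye.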

RSS
I think it would be nifty to have an RSS feed that shows the latest samples to appear in the repository. I figure that I can use the Unix ‘find’ command on my local repository in concert with something like PyRSS2Gen to accomplish this goal.
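Here is a minimal sketch of that idea. To keep it dependency-free I have swapped in `os.walk` for the 'find' command and the standard library's `xml.etree.ElementTree` for PyRSS2Gen; the base URL, channel text, and the limit of 20 items are assumptions for illustration.

```python
import os
from email.utils import formatdate
from urllib.parse import quote
from xml.etree import ElementTree as ET

BASE_URL = "http://samples.libav.org/"  # assumed public URL prefix


def newest_files(root, limit=20):
    """Return (mtime, relative path) pairs for the most recently modified files."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            entries.append((os.path.getmtime(path), os.path.relpath(path, root)))
    entries.sort(reverse=True)  # newest first
    return entries[:limit]


def build_feed(entries, base_url=BASE_URL):
    """Render the entries as an RSS 2.0 document and return it as a string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Latest multimedia samples"
    ET.SubElement(channel, "link").text = base_url
    ET.SubElement(channel, "description").text = "Newest files in the samples repository"
    for mtime, relpath in entries:
        url = base_url + quote(relpath.replace(os.sep, "/"))
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = relpath
        ET.SubElement(item, "link").text = url
        ET.SubElement(item, "guid").text = url
        # RSS expects RFC 822 style dates; formatdate produces one in GMT.
        ET.SubElement(item, "pubDate").text = formatdate(mtime, usegmt=True)
    return ET.tostring(rss, encoding="unicode")
```

Modification time is only a stand-in for "when the sample appeared"; a mirror that preserves timestamps would report when the file was last changed upstream, which may or may not be what feed readers want.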

Monetization
In the few months that I have been managing the repository, I have had numerous requests for permission to leech the entire collection in one recursive web-suck. These requests often come from commercial organizations who wish to test their multimedia product on a large corpus of diverse samples. Personally, I believe the archive makes a rather poor corpus for such an endeavor, but so be it. Go ahead; hosting this archive barely makes a dent in my fairly low-end web hosting plan. However, at least one person indicated that it might be easier to mail a hard drive to me, have me copy it, and send it back.

This got me thinking about monetization opportunities. Perhaps I should provide a service to send HDs filled with samples for the cost of the HD, shipping, and a small donation to the multimedia projects. I immediately realized that that is precisely the point at which the vast multimedia samples archive, with all of its media of questionable fair use status, would officially run afoul of copyright laws.

Which brings me to…

Clean Up
I think we need to clean up some samples, starting with the ones that were marked not-readable in the old repository. Apparently, some ‘samples’ were, e.g., full anime videos and were responsible for a large bandwidth burden when linked from various sources.
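One mechanical way to hunt for full-length videos might be to ask ffprobe for each file's duration and flag anything that runs long. The sketch below assumes ffprobe is installed and on the PATH; the 5-minute threshold is an arbitrary starting point, and files that ffprobe cannot parse are simply skipped rather than flagged.

```python
import json
import subprocess

LIMIT_SECONDS = 5 * 60  # assumed cutoff; tune to taste


def probe_duration(path):
    """Return the container duration in seconds via ffprobe, or None if unknown."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "json", path],
        capture_output=True, text=True)
    if result.returncode != 0:
        return None
    try:
        return float(json.loads(result.stdout)["format"]["duration"])
    except (KeyError, TypeError, ValueError):
        # Covers missing fields and unparseable output.
        return None


def long_samples(paths, limit=LIMIT_SECONDS, probe=probe_duration):
    """Return (path, duration) pairs for files longer than the limit."""
    flagged = []
    for path in paths:
        duration = probe(path)
        if duration is not None and duration > limit:
            flagged.append((path, duration))
    return flagged
```

Plenty of samples in the archive are broken on purpose, so a missing duration says nothing either way; this would only produce a list of candidates for a human to review.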

We multimedia nerds are a hoarding lot, never willing to throw anything away. This will probably be the most challenging proposal to implement.

Posted in General | 6 Comments »

6 Responses

  1. James Says:

    Apologies if I have misunderstood, but if s.mphq will be an exact mirror of s.libav (or a superset of it?) why not configure s.mphq to simply send an HTTP 301 code back for every request, redirecting it to s.libav?

    That would be trivial to do with an Apache mod_rewrite rule, for example.
    Alternatively (and this is where I suspect I’ve missed something) if they’re both hosting the exact same content, why not point the s.mphq DNS entry at s.libav?

    There is of course no harm in having the same content available in two completely distinct locations though.

  2. Multimedia Mike Says:

    @James: Valid questions, all. The first thing to understand is that the people who administer the mphq domain *really* do not like the people involved with the Libav project. So, s.mphq will never point directly to s.libav.

    I could probably reconfigure s.mphq to point to s.libav via 301 codes. However, I think it’s valuable to have 2 distinct servers serving up this data. In the best case, there would be one address and a round robin DNS to help balance traffic load (the original server was reported to serve 2-3 TB of samples traffic/month but I haven’t seen anywhere close to that yet).

  3. Ronald S. Bultje Says:

    In addition to what Mike says, I believe it’s important to have backups and mirrors in case a glitch takes down libav.org (or s.mphq). More mirrors means more merriness (the M5 rule).

  4. compn Says:

    last i checked, distributing copyrighted clips was OK (in usa copyright law) under fair use for educational / non profit (and creating a hd for just shipping and handling fee only is by definition, non profit). but i am not a lawyer and you shouldnt take this as legal advice blah blah blah.

    that said, any full length movies should be made private, since those may ‘harm the original work’ etc.

  5. Steve Says:

    Can you run ffprobe and get a list of everything over 5 minutes? That should clear out the biggest problems there. (If you do that, I would really like to see the script).

  6. Multimedia Mike Says:

    @Steve: Interesting idea, and it doesn’t sound too involved to implement.