Designing A Download Strategy

The uncommon video codecs list mentioned in the last post is amazing. Here are some FourCCs I had never heard of before: 3ivd, abyr, acdv, aura, brco, bt20, bw10, cfcc, cfhd, digi, dpsh, dslv, es07, fire, g2m3, gain, geox, imm4, inmc, mohd, mplo, qivg, suvf, ty0n, xith, xplo, and zdsv. Several have turned out to be variations of other codecs. And some were only rumored to exist, such as aflc as a codec for storing FLIC data in an AVI container, and azpr as an alternate FourCC for rpza; now we have samples. The existence of many of these FourCCs has, in fact, been cataloged on FourCC.org, but I was always reluctant to document them in the MultimediaWiki unless I could find either samples or a binary codec.

But how to obtain all of these samples?

Do you ever download files from the internet? Of course you do. Do you ever download a bunch of files at a time? Maybe. But have you ever had to download a few thousand files?

I have some experience to guide me here. I didn’t get away from dialup internet access until mid-2004, even though I had already been hacking on multimedia for several years, a pursuit that necessitates downloading bulky samples. When I had to download a number of large files over the web, I would make a text file with a series of wget commands:

  wget --continue http://movietrailersite.com/trailer1.mov
  wget --continue http://movietrailersite.com/trailer2.mov

…etc. The --continue option on each line was useful because I could interrupt and later resume the script any time I needed my 3.5 kbytes/sec of downstream capacity for a higher-priority task.

So my first impulse for downloading the list of 5100 crashers was to put ‘wget --continue’ in front of each URL and let it rip. Since there are more than a few slow, confused, or misbehaving servers out there, it wasn’t long before I noticed that I should limit the number of retries to something less forgiving than wget’s default of 20, like 3, using --tries.

Traditionally, when I do bulk file downloads like this, the files all live on the same server. Not so in this case: 5100 files spread across roughly 5100 different sites (probably not quite that many, but still a huge number). Thanks to compn for pointing out that I should parallelize the task in order to make the most of my current downstream bandwidth, which tops out at about 800 kbytes/sec sustained. So I broke the list up into 4 text files running in 4 different terminals, and things moved a lot more efficiently.
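
In case it helps to see the idea spelled out, here is a minimal Python sketch of that preparation step; the input file crasher-urls.txt and the generated script names are made up for illustration, and the only wget options assumed are the ones mentioned above (--continue and --tries=3):

  # Hypothetical sketch: split a list of URLs into 4 wget scripts that can
  # run in 4 separate terminals; all filenames here are made up.
  NUM_SCRIPTS = 4

  with open("crasher-urls.txt") as url_list:
      urls = [line.strip() for line in url_list if line.strip()]

  scripts = [open("download-%d.sh" % i, "w") for i in range(NUM_SCRIPTS)]
  for i, url in enumerate(urls):
      # --continue resumes partial downloads; --tries=3 gives up on bad servers
      scripts[i % NUM_SCRIPTS].write("wget --continue --tries=3 '%s'\n" % url)
  for script in scripts:
      script.close()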

How I processed them will come in a separate post. Now for the challenge of downloading the lists of uncommon A/V codec samples. We received 2 lists of files formatted like this:

  http://somesite.com/sample.mov <hard tab> G2M3 / 0x334D3247, 1152x864
  http://othersite.net/media.wmv <hard tab> G2M3 / 0x334D3247, 1280x964
  http://thatsite.org/file.avi <hard tab> fraps, yuv420p, 1024x768

Again, my first impulse is to shove wget commands in front of each URL. The extra metadata at the end of each line will have to be disposed of. But I also need to sort this media into separate bins for it to really be useful.

I imported the lists into a spreadsheet and tried to manipulate them there. That’s when I remembered that I know even less about spreadsheet programs than I do about shell scripting. The lists are grouped by codec type (lots of G2M3 samples, then lots of fraps samples, etc.), and each block is delimited by a column header line that starts with “url”. At first, I was thinking of turning the list into a glorified shell script that would create a directory named after the codec whenever it encountered a “url” line and then download that group’s files into the directory. But that would have been a lot of manual, error-prone labor. Plus, learning from the previous exercise, the script should be able to download more than one file at once. If I’m going to automate that kind of task, why not just operate directly on the unmodified lists?
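
To make “operate directly on the unmodified lists” a little more concrete, here is a rough sketch of how such a pass over the raw list might look; the filename uncommon-samples.txt is hypothetical and the details certainly differ from my actual script:

  import os

  entries = []  # (url, codec directory) pairs

  with open("uncommon-samples.txt") as sample_list:  # hypothetical filename
      for line in sample_list:
          line = line.strip()
          if not line or line.startswith("url"):
              # skip blank lines and the column header that delimits each group
              continue
          if "\t" not in line:
              continue  # not a URL-plus-metadata line
          url, metadata = line.split("\t", 1)
          # the codec name is the part of the metadata before the first "/"
          # (or the first "," for lines like the fraps example)
          codec = metadata.replace(",", "/").split("/")[0].strip()
          if not os.path.isdir(codec):
              os.mkdir(codec)  # one bin per codec
          entries.append((url, codec))

Each (url, codec) pair can then be handed to the download stage.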

So it came to be: a script named download-lots-of-samples.py, which iterates through the list and downloads up to 4 files in parallel using wget. The parallelization mechanism is rather sloppy (maintain a list of 4 open wget processes; if all the slots are occupied, poll-sleep-poll the process states until one of them finishes), and it might be improved by a native Python HTTP downloader that could leverage the select I/O facility. But I decided to just let wget do what wget does best. Further, the script catalogs the samples by the relevant part of the metadata.
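
A bare-bones version of that poll-sleep-poll loop, fed with the (url, codec) pairs gathered above, might look like this; the 4-process limit and the wget options come from this post, but the function itself is only illustrative and not the actual download-lots-of-samples.py code:

  import subprocess
  import time

  MAX_PARALLEL = 4

  def download_all(entries):
      """Download each (url, directory) pair, running at most 4 wgets at once."""
      active = []  # currently running wget processes
      for url, directory in entries:
          # poll-sleep-poll until one of the 4 slots frees up
          while len(active) >= MAX_PARALLEL:
              active = [p for p in active if p.poll() is None]
              if len(active) >= MAX_PARALLEL:
                  time.sleep(1)
          # -P tells wget which directory to save the file into
          cmd = ["wget", "--continue", "--tries=3", "-P", directory, url]
          active.append(subprocess.Popen(cmd))
      # wait for the last few downloads to finish
      for p in active:
          p.wait()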

A new requirement emerged after I kicked off the script: my cable broadband provider, Comcast, quite infamously imposed a 250 GB/month total bandwidth cap under penalty of summary service termination. And unlike the samples on the crasher list, the samples on the uncommon lists are generally very large and are being served by some very bandwidth-rich servers. I started to worry about exhausting my limit after what appeared to be an entire night of my connection working at full tilt while scarcely making a dent in the lists. I calculate it would take only about 3.5 days at 800 kbytes/sec to hit the cap (250 GB ÷ 800 kbytes/sec ≈ 312,000 seconds, or a bit over 3.5 days).

At first, I hacked the script to blacklist a few of the metadata types that I felt already had enough samples and then restarted it, but that wasn’t a very scalable solution. Instead, the script now automatically skips an entry if enough samples of its type have already been collected (5 is the limit I imposed).
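
That check could be as simple as the following sketch, assuming the one-directory-per-codec layout from the earlier sketches; the limit of 5 is the one I imposed, while the function name is made up:

  import os

  SAMPLE_LIMIT = 5

  def needs_more_samples(directory):
      # skip an entry if its codec's bin already holds 5 or more files
      if not os.path.isdir(directory):
          return True
      return len(os.listdir(directory)) < SAMPLE_LIMIT

Running the check just before launching each wget keeps any single codec type from eating further into the monthly cap.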

For the curious, I have included the script. I don’t know what Flameeyes is talking about when he laments the unreadability of Python. I mean, what could possibly be unclear about the statement 'type = type[:type.find("/")]'? Actually, I can’t believe that I thought of that one, or that it actually does exactly what I needed.
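
For reference, applied to one of the example metadata strings from the list above, that statement simply truncates everything from the first slash onward:

  type = "G2M3 / 0x334D3247, 1152x864"
  type = type[:type.find("/")]   # everything before the first "/": "G2M3 "
  print(type.strip())            # prints "G2M3"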

10 thoughts on “Designing A Download Strategy”

  1. Kostya

    Heh, and I always used wget --limit-rate=32k -i fileslist to download a list of files. And yes, I had to download myriads (that’s Greek for ten thousand) of files – worked fine so far.
    Also “cat list | awk ‘//{print $1;}’ > list2” will leave only the first column in the list.
    Why bother with other scripting languages when standard tools do the same for free :)

  2. Multimedia Mike Post author

    wget is quite the flexible program. But I don’t see a way to download multiple files with one invocation, which would be useful for my problem (at least the first one). Further, I wasn’t really interested in limiting the bandwidth for the second problem, just ensuring that I would never exceed my monthly allotment.

    I tried your awk one-liner. It tells me “-bash: }’: command not found”.

    Anyway, just wait until you see my method for testing the crashers and my reasoning behind it. Makes this exercise look sane. :-)

  3. Corey

    If bash complained, I’ll bet you didn’t get single quotes around awk’s argument, or perhaps the “pretty quotes” (whatever they’re called) didn’t copy and paste correctly.

    Alternatives that may or may not be portable:
    awk '{print $1}' < list > list2
    sed 's/\t.*//' < list > list2
    cut -f1 < list > list2

  4. Reimar

    Huh? If you pass multiple URLs to wget, it downloads them all; the only problem is that it can’t download them in parallel (I used that only two days ago).
    I don’t know how it handles the options in that case, though; I’ve never used that (and the man page does not seem to mention the feature).
    And of course the right solution to the download cap is to support your ISP by paying for one of the overpriced “business” plans :-P

  5. Diego "Flameeyes" Pettenò

    Easy way to parallelise this with just standard shell tools:

    xargs -d '\n' -P8 wget --continue -t3 .... < list

    where list is a newline separated list of URLs.

    The -P option allows xargs to run parallel downloads. I use it heavily when I have to download or process stuff and make would be overkill (but for FFmpeg-based conversion, make is just fine).

  6. Diego "Flameeyes" Pettenò

    That technique might work for getting around artificial per-connection limits, but a) if the remote server doesn’t have enough bandwidth, it’s likely to become slower, and b) if the server is well configured, either the final bandwidth usage is going to be just the same, or you’re going to get banned, since a lot of web admins dislike that, for good reasons.

    Downloading from more servers at once is most likely to maximise bandwidth usage.

  7. Multimedia Mike Post author

    Yeah, I should have caught the pretty-printed quotes. Silly mistake.

    Otherwise, thanks to all for the education on basic shell utilities. That’s why I enjoy posting my sub-optimal solutions: I learn so much in response. :-)

  8. Multimedia Mike Post author

    In fact, I did use --continue all those years. But I had a special acceleration facility: Copy & paste.
