The uncommon video codecs list mentioned in the last post is amazing. Here are some FourCCs I have never heard of before: 3ivd, abyr, acdv, aura, brco, bt20, bw10, cfcc, cfhd, digi, dpsh, dslv, es07, fire, g2m3, gain, geox, imm4, inmc, mohd, mplo, qivg, suvf, ty0n, xith, xplo, and zdsv. There are several that have been found to be variations of other codecs. And there are some that were only rumored to exist, such as aflc as a codec for storing FLIC data in an AVI container, and azpr as an alternate FourCC for rpza. We now have samples. The existence of many of these FourCCs has, in fact, been cataloged on FourCC.org. But I was always reluctant to document the FourCCs in the MultimediaWiki unless I could find either samples or a binary codec.
But how to obtain all of these samples?
Do you ever download files from the internet? Of course you do. Do you ever download a bunch of files at a time? Maybe. But have you ever had to download a few thousand files?
I have some experience to guide me in this. I didn’t get away from dialup internet access until mid-2004, even though I had been hacking on multimedia for several years, which necessitates downloading bulky samples. When I had to download a number of large files over the web, I made a text file with a series of wget commands:
wget --continue http://movietrailersite.com/trailer1.mov
wget --continue http://movietrailersite.com/trailer2.mov
…etc. The --continue option on each line was useful because I could break and subsequently resume the script any time I needed my 3.5 kbytes/sec downstream capacity for a higher-priority task.
So my first impulse for downloading the list of 5100 crashers was to put ‘wget --continue’ in front of each one and let it rip. Since there are more than a few slow, confused, or misbehaving servers out there, it wasn’t long before I noticed that I should limit the number of retries to something less forgiving, like 3, using --tries.
Traditionally, when I do bulk file downloads like this, the files all live on the same server. Not so in this case; 5100 files on roughly 5100 different sites (probably not that many but still a huge amount). Thanks to compn for pointing out that I should parallelize the task in order to maximize my current downstream bandwidth, which peaks at about 800 kbytes/sec sustained. So I broke the list up into 4 text files and ran each one in its own terminal, and things moved a lot more efficiently.
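For anyone who would rather not split the list by hand, here is a minimal sketch of that step. How I actually divided the list is not important; the round-robin dealing, the function names, and the `.0` through `.3` output file names are all assumptions made up for this illustration.

```python
from pathlib import Path


def split_round_robin(lines, parts=4):
    """Deal lines into `parts` lists, one at a time, like dealing cards."""
    buckets = [[] for _ in range(parts)]
    for index, line in enumerate(lines):
        buckets[index % parts].append(line)
    return buckets


def write_chunks(list_path, parts=4):
    """Write <list_path>.0 through <list_path>.(parts-1), one chunk each.

    The output naming scheme is hypothetical; any names would do.
    """
    source = Path(list_path)
    # Ignore blank lines so they do not turn into empty commands.
    lines = [l for l in source.read_text().splitlines() if l.strip()]
    out_paths = []
    for number, bucket in enumerate(split_round_robin(lines, parts)):
        out = source.with_name(f"{source.name}.{number}")
        out.write_text("\n".join(bucket) + "\n")
        out_paths.append(out)
    return out_paths
```

Dealing the lines out round-robin, rather than cutting the list into 4 contiguous blocks, keeps one slow stretch of servers from landing entirely in a single terminal.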
How I processed them comes in a separate post. Now for the challenge of downloading the lists of uncommon A/V codec samples. We received 2 lists of files formatted as such:
http://somesite.com/sample.mov <hard tab> G2M3 / 0x334D3247, 1152x864
http://othersite.net/media.wmv <hard tab> G2M3 / 0x334D3247, 1280x964
http://thatsite.org/file.avi <hard tab> fraps, yuv420p, 1024x768
Again, my first impulse is to shove wget commands in front of each URL. The extra metadata at the end of each line will have to be disposed of. But I also need to sort this media into separate bins for it to really be useful.
I imported the lists into a spreadsheet and tried to manipulate them there. That’s when I remembered that I know even less about spreadsheet programs than I do about shell scripting. The lists are grouped by types of codecs (lots of G2M3 samples, then lots of fraps samples, etc), and each block is delimited by a column header line that starts with “url”. At first, I was thinking of turning the list into a glorified shell script that would create a directory named after the codec for the next group when it encountered a “url” line and then download the group’s files into that directory. But that would be a lot of manual, error-prone labor. Plus, learning from the previous process, the script should be able to download more than one file at once. If I’m going to automate that kind of task, why not just operate directly on the unmodified lists?
So it came to be: a script named download-lots-of-samples.py which iterates through the list and downloads up to 4 files in parallel using wget. The parallelization mechanism is rather sloppy (maintain a list of 4 open wget processes; if all are full, poll-sleep-poll the process states until one of them finishes) which might be improved by a native Python HTTP downloader that could leverage the select I/O facility. But I decided to just let wget do what wget does best. Further, the script catalogs the samples by the relevant part of the metadata.
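The poll-sleep-poll idea can be sketched roughly as follows. This is not the script itself, just an illustration under assumptions: the function names `wget_command`, `reap`, and `run_pool` are invented here, and the half-second polling interval is arbitrary. The wget options used (--continue, --tries, --directory-prefix) are standard ones.

```python
import subprocess
import time

MAX_JOBS = 4  # at most 4 simultaneous downloads, as described above


def wget_command(url, directory):
    """Build a wget command line that saves `url` under `directory`."""
    return ["wget", "--continue", "--tries=3",
            "--directory-prefix", directory, url]


def reap(processes):
    """Split processes into (still running, exit codes of finished ones)."""
    still_running, finished = [], []
    for process in processes:
        code = process.poll()  # None means the process has not exited yet
        if code is None:
            still_running.append(process)
        else:
            finished.append(code)
    return still_running, finished


def run_pool(commands, max_jobs=MAX_JOBS, poll_interval=0.5):
    """Run every command, keeping at most `max_jobs` alive at once."""
    running, exit_codes = [], []
    for command in commands:
        # Poll, sleep, poll again until one of the slots frees up.
        while len(running) >= max_jobs:
            running, done = reap(running)
            exit_codes.extend(done)
            if len(running) >= max_jobs:
                time.sleep(poll_interval)
        running.append(subprocess.Popen(command))
    # All commands are launched; wait for the stragglers.
    for process in running:
        exit_codes.append(process.wait())
    return exit_codes
```

Feeding it something like `run_pool(wget_command(url, label) for url, label in entries)` would keep 4 downloads going until the list runs out, where `entries` stands in for whatever (URL, directory) pairs have been pulled from the list.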
A new requirement emerged after I kicked off the script: My cable broadband provider, Comcast, quite infamously imposed a 250 GB/month total bandwidth cap under penalty of summary service termination. And unlike the samples on the crasher list, the samples on the uncommon lists are generally very large and are being served by some very bandwidth-rich servers. I started to worry about the possibility of exhausting my limit after what appeared to be an entire night of my connection working at full tear and scarcely making a dent in the lists. I calculate it would only take about 3.5 days of 800 kbytes/sec to hit the limit.
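The back-of-the-envelope version of that calculation, assuming decimal units (250 GB as 250 × 10^9 bytes and 800 kbytes/sec as 800 × 10^3 bytes per second) and a connection that never drops below its peak rate:

```python
CAP_BYTES = 250 * 10**9           # 250 GB monthly cap (decimal GB assumed)
RATE_BYTES_PER_SEC = 800 * 10**3  # 800 kbytes/sec sustained (decimal kB assumed)
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_reach_cap(cap_bytes, rate_bytes_per_sec):
    """Days of continuous transfer at the given rate needed to hit the cap."""
    return cap_bytes / rate_bytes_per_sec / SECONDS_PER_DAY


print(round(days_to_reach_cap(CAP_BYTES, RATE_BYTES_PER_SEC), 1))  # prints 3.6
```

That comes to a little over three and a half days of saturating the line, in the same ballpark as the estimate above.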
At first, I hacked the script to blacklist a few of the metadata types I felt had enough samples already and then restarted, but that wasn’t a very scalable solution. Instead, the script now automatically skips entries if there are already enough samples collected (5 is the limit I imposed).
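A rough sketch of that skipping logic, working straight from the tab-separated list format shown earlier. The details are assumptions for illustration: the names `parse_entry` and `select_downloads` are invented, the label is taken to be whatever precedes the first "/" or "," in the metadata, and the counting here happens within a single pass over the list rather than by looking at what has already been collected.

```python
from collections import Counter

MAX_SAMPLES_PER_TYPE = 5  # the per-type limit mentioned above


def parse_entry(line):
    """Return (url, label) for a data line.

    Returns None for blank lines and for the "url" column-header lines
    that separate one codec group from the next.
    """
    line = line.rstrip("\n")
    if not line.strip() or line.lower().startswith("url"):
        return None
    url, _, metadata = line.partition("\t")
    # Assumption: the label is the text before the first "/" or ",".
    label = metadata.replace("/", ",").split(",")[0].strip().lower()
    return url.strip(), label or "unknown"


def select_downloads(lines, limit=MAX_SAMPLES_PER_TYPE):
    """Keep at most `limit` entries per label, in list order."""
    counts = Counter()
    chosen = []
    for line in lines:
        entry = parse_entry(line)
        if entry is None:
            continue
        url, label = entry
        if counts[label] >= limit:
            continue  # enough samples of this type already
        counts[label] += 1
        chosen.append((url, label))
    return chosen
```

With the sample lines shown earlier, the first two entries would be labeled "g2m3" and the third "fraps", and the label doubles as a directory name for sorting the samples into bins.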
For the curious, I have included the script. I don’t know what Flameeyes is talking about when he laments the unreadability of Python. I mean, what could possibly be unclear about the statement ‘type = type[:type.find("/")]’? Actually, I can’t believe that I thought of that one, or that it actually does exactly what I needed.