I played with the ‘file’ utility a long time ago because I wanted to make it recognize a large number of multimedia formats. I had trouble getting my changes to take. But I’m prepared to try again after many years.
Aiming at the Corpus
In my local mirror of the MPlayerHQ samples archive, I find 9853 unique files. So I run all of them through the ‘file’ command:
find /path/to/samples -type f -print0 | xargs -0 file --no-pad
My Ubuntu installation has file v5.04. I also tested against 5.07 and the latest, 5.08. Here is the number of files each version was unable to identify (that is, files it reported generically as ‘data’):
5.04  1521
5.07  1405
5.08  1501
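Counts like these can be pulled out of saved ‘file’ output with a short shell function. This is only a minimal sketch, not necessarily how the numbers here were produced: the function name count_unidentified and the file name results.txt are made up for illustration, and it assumes the "name: description" layout that --no-pad produces.

```shell
# Hypothetical helper: count the lines of saved `file` output whose
# description is exactly "data" (i.e. the file was not identified).
#
# Example of producing the input (path is a placeholder):
#   find /path/to/samples -type f -print0 | xargs -0 file --no-pad > results.txt
count_unidentified() {
    # An unidentified file shows up as "name: data", so match ": data" at
    # the end of the line. Descriptions such as "IFF data" do not match.
    # grep -c still prints 0 when nothing matches, but exits non-zero,
    # hence the "|| true".
    grep -c ': data$' "$1" || true
}
```

Running count_unidentified results.txt once per ‘file’ version gives one number per version to compare.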
The 5.08 count looked like a regression until I dug into the details and saw quite a few items like the following, where a file that had been labeled a DOS executable is now reported as plain ‘data’. This indicates that the MPEG detection could use some work:
-mov/mov-demux-infinite-loop.mpg: DOS-executable (
+mov/mov-demux-infinite-loop.mpg: data
-image-samples/UNeedQT4.pntg: DOS-executable (
+image-samples/UNeedQT4.pntg: data
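A listing of this kind can be produced by diffing the saved output of two ‘file’ versions. The following is a rough sketch under assumptions: the function name changed_between is hypothetical, and each input is assumed to hold one "name: description" line per file.

```shell
# Hypothetical helper: show only the files whose description changed
# between two saved runs of `file`.
# Usage: changed_between old-results.txt new-results.txt
changed_between() {
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    # Sort both runs so that diff lines up the same file names.
    sort "$1" > "$old_sorted"
    sort "$2" > "$new_sorted"
    # Keep the -/+ lines and drop the ---/+++ headers, hunk markers and
    # unchanged context. (A file name starting with - or + would be
    # dropped too; that is a known limitation of this sketch.)
    diff -u "$old_sorted" "$new_sorted" | grep '^[-+][^-+]' || true
    rm -f "$old_sorted" "$new_sorted"
}
```

Lines starting with - come from the first run and lines starting with + from the second, so a file that became ‘data’ appears as a -/+ pair.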
Workflow
These are just notes to myself and perhaps anyone else who wants to add new file formats to be identified by the ‘file’ command.
First, either download the latest release from the FTP site or clone the source from GitHub. Do the usual unpack, ‘./configure’, ‘make’ routine. To use this newly built version and its associated magic file:
./src/file --magic-file magic/magic.mgc <file>
To add detection for a new format, first run the foregoing command on a sample to make sure that it isn’t already identified. Then look over the files in magic/Magdir and see which one best fits the format (it’s unlikely that your format will merit a new file in this directory). For example, for this round, I modified animation, audio, iff, and riff. Add new specs or modify existing ones based on the copious examples in the directory and by consulting the appropriate man page (‘man 5 magic’).
Finally, run ‘make’ again, which regenerates the magic file. Invoke the above command again to use the modified magic file.
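For a sense of what such an entry looks like, here is an illustrative fragment in magic(5) syntax. It is not copied from my actual changes. It assumes the commonly documented signatures for these formats: ‘TTA1’ for True Audio, ‘wvpk’ for WavPack, and ‘SMK2’ followed by little-endian 32-bit width, height and frame count for Smacker.

```
# Illustrative entries only (assumed signatures, see the note above).
# Columns: offset, type, test value, message to print.

# True Audio: 4-byte signature at the start of the file
0	string	TTA1	True Audio Lossless Audio

# WavPack: each block starts with this 4-byte signature
0	string	wvpk	WavPack Lossless Audio

# Smacker: signature, then width, height and frame count.
# A leading > marks a continuation that is only tried if the line above
# matched; "x" matches any value; \b suppresses the separating space.
0	string	SMK2	RAD Game Tools Smacker Multimedia version 2
>4	lelong	x	\b, %d x
>8	lelong	x	%d
>12	lelong	x	\b, %d frames
```

The continuation lines are what let a single entry report details such as dimensions and frame counts alongside the format name.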
Before and After
On a selection of formats taken from the samples archive (renamed and cut down to a kilobyte because detection typically only relies on the first few bytes), here is the “before”:
amv: RIFF (little-endian) data
armovie: data
bbc-dirac: data
interplay-mve: data
mtv: data
nintendo-thp: data
nullsoft-video: data
redcode: data
sega-film: data
smacker: data
trueaudio: data
vqa: IFF data
wavpack: data
wc3-mve: IFF data
wtv: data
And the “after”:
amv: RIFF (little-endian) data, AMV
armovie: ARMovie
bbc-dirac: BBC Dirac Video
interplay-mve: Interplay MVE Movie
mtv: MTV Multimedia File
nintendo-thp: Nintendo THP Multimedia
nullsoft-video: Nullsoft Video
redcode: REDCode Video
sega-film: Sega FILM/CPK Multimedia, 320 x 224
smacker: RAD Game Tools Smacker Multimedia version 2, 320 x 200, 100 frames
trueaudio: True Audio Lossless Audio
vqa: IFF data, Westwood Studios VQA Multimedia, 418 video frames, 320 x 200
wavpack: WavPack Lossless Audio
wc3-mve: IFF data, Wing Commander III Video, PC version
wtv: Windows Television DVR Media
After rerunning ‘file’ on the MPlayerHQ corpus using the modified magic file, only 1329 files remain unidentified (down from the 1501 that 5.08 left unidentified).
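Cut-down samples like the ones in the before/after comparison can be made by keeping only the first kilobyte of each file. This is a sketch under assumptions: the function name trim_samples and the directory arguments are hypothetical, it keeps the original file names rather than renaming anything, and it relies on ‘head -c’, which GNU and BSD head both support.

```shell
# Hypothetical helper: copy the first 1024 bytes of every regular file
# in SRC_DIR into DEST_DIR under the same name.
# Usage: trim_samples SRC_DIR DEST_DIR
trim_samples() {
    mkdir -p "$2"
    for f in "$1"/*; do
        # Skip directories and anything else that is not a regular file.
        [ -f "$f" ] || continue
        # Files shorter than 1024 bytes are copied whole.
        head -c 1024 "$f" > "$2/$(basename "$f")"
    done
}
```

A kilobyte covers the formats listed here because their signatures sit in the first few bytes; a format identified by data deeper in the file would need a larger cut.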
Going Forward
As mentioned, MPEG detection could probably be strengthened. A bigger weakness, though, is QuickTime/MP4: many files go undetected, probably owing to the many ways that a QuickTime file can begin.