Playing With File

I played with the ‘file’ utility a long time ago because I wanted to make it recognize a large number of multimedia formats. I had trouble getting my changes to take. But I’m prepared to try again after many years.

Aiming at the Corpus
In my local mirror of the MPlayerHQ samples archive, I find 9853 unique files. So I run all of them through the ‘file’ command:

  find /path/to/samples -type f -print0 | xargs -0 file --no-pad

My Ubuntu installation has file v5.04. I also tested against 5.07 and the latest, 5.08. Here is the number of files each version was unable to identify (marking them generically as ‘data’):

5.04  1521
5.07  1405
5.08  1501

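That looked like a regression in v5.08 until I dug into the details and saw quite a few items like this, where a file previously labeled as a DOS executable is now reported as ‘data’. So the older counts were partly flattered by misidentifications, and the MPEG detection could use some work: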

-mov/mov-demux-infinite-loop.mpg: DOS-executable (
+mov/mov-demux-infinite-loop.mpg: data
-image-samples/UNeedQT4.pntg: DOS-executable (
+image-samples/UNeedQT4.pntg: data
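
A comparison like the one above can be produced by diffing the saved output of two versions. What follows is a minimal Python sketch, not the script used for this post. It assumes each run was saved in the `path: description` form that `file --no-pad` prints and that no path contains `: `. The function names are made up for illustration.

```python
def parse_file_output(text):
    """Map each path to its description, given saved `file --no-pad` output."""
    results = {}
    for line in text.splitlines():
        # Assumption: the first ": " separates the path from the description.
        path, sep, description = line.partition(": ")
        if sep:
            results[path] = description.strip()
    return results


def unidentified(results):
    """Return the paths that `file` could only describe as generic 'data'."""
    return {path for path, desc in results.items() if desc == "data"}


def newly_unidentified(old_results, new_results):
    """Paths that the new version calls 'data' but the old version did not."""
    return sorted(unidentified(new_results) - unidentified(old_results))


if __name__ == "__main__":
    # Tiny made-up outputs standing in for two saved runs.
    old = parse_file_output("a.mpg: some format\nb.bin: data\n")
    new = parse_file_output("a.mpg: data\nb.bin: data\n")
    print(len(unidentified(old)), len(unidentified(new)))
    print(newly_unidentified(old, new))
```

Taking `len(unidentified(...))` for each saved run gives the per-version tally, and `newly_unidentified` lists the files worth inspecting by hand before calling something a regression.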

Workflow
These are just notes to myself and perhaps anyone else who wants to add new file formats to be identified by the ‘file’ command.

First, download the latest release from the FTP site or clone the repository from GitHub. Do the usual unpack, ‘./configure’, ‘make’ routine. To use this newly built version and its associated magic file:

  ./src/file --magic-file magic/magic.mgc <file>

To add a new format for identification, first run the command above to ensure that it’s not already identified. Then look over the files in magic/Magdir and see which one might pertain to what you’re doing (it’s unlikely that your format will merit a new file in this directory). For example, for this round, I modified animation, audio, iff, and riff. Add new specs or modify existing ones based on the copious examples in the directory and by consulting the appropriate man page (‘man 5 magic’).

Finally, run ‘make’ again, which will regenerate the magic file. Invoke the above command again to use the modified magic file.
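
Since the edit, ‘make’, re-test loop gets repeated a lot, it may help to wrap the invocation. This is only a sketch in Python: the helper names and the `build_dir` parameter are hypothetical, and the only thing taken from the workflow above is the `./src/file --magic-file magic/magic.mgc <file>` command line.

```python
import subprocess
from pathlib import Path


def local_file_command(sample, build_dir="."):
    """Build the argv for the freshly built `file` and its compiled magic file."""
    build = Path(build_dir)
    return [
        str(build / "src" / "file"),
        "--magic-file", str(build / "magic" / "magic.mgc"),
        str(sample),
    ]


def identify(sample, build_dir="."):
    """Run the local build on one sample and return its one-line description."""
    completed = subprocess.run(
        local_file_command(sample, build_dir),
        capture_output=True, text=True, check=True,
    )
    return completed.stdout.strip()
```

Calling `identify()` on each test sample after every ‘make’ gives a quick check that a new spec matches and that nothing else changed.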

Before and After
On a selection of formats taken from the samples archive (renamed and cut down to a kilobyte because detection typically only relies on the first few bytes), here is the “before”:

amv:            RIFF (little-endian) data
armovie:        data
bbc-dirac:      data
interplay-mve:  data
mtv:            data
nintendo-thp:   data
nullsoft-video: data
redcode:        data
sega-film:      data
smacker:        data
trueaudio:      data
vqa:            IFF data
wavpack:        data
wc3-mve:        IFF data
wtv:            data

And the “after”:

amv:            RIFF (little-endian) data, AMV 
armovie:        ARMovie
bbc-dirac:      BBC Dirac Video
interplay-mve:  Interplay MVE Movie
mtv:            MTV Multimedia File
nintendo-thp:   Nintendo THP Multimedia
nullsoft-video: Nullsoft Video
redcode:        REDCode Video
sega-film:      Sega FILM/CPK Multimedia, 320 x 224
smacker:        RAD Game Tools Smacker Multimedia version 2, 320 x 200, 100 frames
trueaudio:      True Audio Lossless Audio
vqa:            IFF data, Westwood Studios VQA Multimedia, 418 video frames, 320 x 200
wavpack:        WavPack Lossless Audio
wc3-mve:        IFF data, Wing Commander III Video, PC version
wtv:            Windows Television DVR Media
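
For anyone wanting to build a similar set of cut-down samples, here is a minimal Python sketch of the trimming step. It is an assumption about how one might do it, not the method used for this post, and the function name and default limit are illustrative.

```python
from pathlib import Path


def cut_down(source, dest, limit=1024):
    """Copy at most `limit` leading bytes of `source` to `dest`."""
    with open(source, "rb") as src:
        head = src.read(limit)
    Path(dest).write_bytes(head)
    # Report how many bytes were actually kept (less than `limit` for tiny files).
    return len(head)
```

A kilobyte is enough for signatures near the start of the file, but a spec that reads fields at larger offsets would need a larger `limit`.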

After rerunning ‘file’ on the MPlayerHQ corpus using the modified magic file, only 1329 files remain unidentified (down from 1501).

Going Forward
As mentioned, MPEG detection could probably be strengthened. However, a major weakness is QuickTime/MP4. Many files are not detected, probably owing to the many ways that QuickTime files can begin.
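
Part of the difficulty is structural: QuickTime and MP4 files are built from atoms, each starting with a 4-byte big-endian size followed by a 4-byte type, so the recognizable bytes sit at offset 4 and depend on which atom happens to come first. The Python sketch below illustrates checking that type field against a handful of common top-level atom types. The list is an assumption for illustration and is not exhaustive, and this is not how ‘file’ itself implements the check.

```python
# Assumed, non-exhaustive set of atom types that commonly start a file.
COMMON_FIRST_ATOMS = {b"ftyp", b"moov", b"mdat", b"free", b"skip", b"wide", b"pnot"}


def first_atom_type(header):
    """Return the 4-byte atom type at offset 4, or None if the header is too short."""
    if len(header) < 8:
        return None
    return header[4:8]


def looks_like_quicktime(header):
    """Heuristic: does the first atom's type field hold a known atom type?"""
    return first_atom_type(header) in COMMON_FIRST_ATOMS


def major_brand(header):
    """If the first atom is 'ftyp', return the 4-byte major brand that follows it."""
    if first_atom_type(header) == b"ftyp" and len(header) >= 12:
        return header[8:12]
    return None
```

Each candidate type is only 32 bits of signature at a non-zero offset, so a check like this is a heuristic and can produce false positives on arbitrary data.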

7 thoughts on “Playing With File”

  1. Multimedia Mike Post author

    This violates file’s guidelines which want at least 32 bits of uniqueness. I guess it’s a trade-off.

    Now that I think about it, I was unaware that the COM format had any signature at all. I thought it was a flat binary file that got loaded into a 64K memory segment. But my knowledge could be a bit out of date.

  2. Alex Converse

    Another comment from me, I’m such a loser.

    It seems like quicktime/mp4 should be easy to detect
    ….ftyp, ….moov, ….mdat, ….free, ….wide should cover 99.9% of files. Am I missing something? Can you give some examples of undetectable mov/mp4 files? And as far as quicktime vs mp4/j2k/etc, major brand after ftyp should cover that, anything without ftyp is pretty underspecified anyway.

  3. Multimedia Mike Post author

    Last I checked up on the QuickTime matter, many of these cases were covered but were commented out for one reason or another.

    If I were to guess, I might think that they were holding out for a better solution for getting more useful data from a QuickTime file instead of just “it’s a QuickTime file”. The file magic syntax is flexible but it still looks like a lot of work to parse and print out further information.

  4. Daniel

    DOS COM doesn’t actually have any signature – it’s just 16-bit x86 code starting at the very beginning of the file.

    ‘file’ is pretty much a joke for some formats… most of the “signatures” are just random bytes that happen to work for a single sample, and a lot of them seem to be written by inspecting a specific variant of a file type rather than reading a spec (like the Windows PE signatures, which work for 64-bit EXEs generated by MSVC but not the perfectly-valid ones generated by MinGW-w64).

    The COM signatures are rather silly; the single-byte matches at offset 0 are various x86 instructions (jmp, mov ax, etc.), and then there are some two-byte matches at random file offsets (from specific sample files, as mentioned in the magic file comments) for ‘int 21h’ (the DOS interrupt). These are very weak matches…

  5. Multimedia Mike Post author

    @Daniel: Thanks for the verification regarding the COM sig.

    I agree that ‘file’ has some pretty sketchy ‘signatures’ in its database gleaned from very small sample sets (an earlier version of ‘file’ identified any Bink file as a movie specifically from a certain Civilization game). It’s a good starting point, nonetheless.
