{"id":3564,"date":"2011-09-07T21:29:30","date_gmt":"2011-09-08T04:29:30","guid":{"rendered":"http:\/\/multimedia.cx\/eggs\/?p=3564"},"modified":"2011-09-07T21:36:59","modified_gmt":"2011-09-08T04:36:59","slug":"playing-with-file","status":"publish","type":"post","link":"https:\/\/multimedia.cx\/eggs\/playing-with-file\/","title":{"rendered":"Playing With File"},"content":{"rendered":"<p>I played with <a href=\"http:\/\/en.wikipedia.org\/wiki\/File_(command)\">the &#8216;file&#8217; utility<\/a> a long time ago because I wanted to make it recognize a large number of multimedia formats. I had trouble getting my changes to take. But I&#8217;m prepared to try again after many years.<\/p>\n<p><strong>Aiming at the Corpus<\/strong><br \/>\nIn my <a href=\"http:\/\/multimedia.cx\/eggs\/curator-of-the-samples-archive\/\">local mirror of the MPlayerHQ samples archive<\/a>, I found 9853 unique files, so I ran all of them through the &#8216;file&#8217; command:<\/p>\n<pre>\r\n  find \/path\/to\/samples -type f -print0 | xargs -0 file --no-pad\r\n<\/pre>\n<p>My Ubuntu installation has file v5.04. I also tested against 5.07 and the latest, 5.08. 
Here is the number of files each version was unable to identify (generically marking them as &#8216;data&#8217;):<\/p>\n<pre>\r\n5.04  1521\r\n5.07  1405\r\n5.08  1501\r\n<\/pre>\n<p>That seemed like a regression for v5.08 until I dug into the details and saw quite a few items like this, indicating that the MPEG detection could use some work:<\/p>\n<pre>\r\n-mov\/mov-demux-infinite-loop.mpg: DOS-executable (\r\n+mov\/mov-demux-infinite-loop.mpg: data\r\n-image-samples\/UNeedQT4.pntg: DOS-executable (\r\n+image-samples\/UNeedQT4.pntg: data\r\n<\/pre>\n<p><strong>Workflow<\/strong><br \/>\nThese are just notes to myself and perhaps anyone else who wants to add new file formats to be identified by the &#8216;file&#8217; command.<\/p>\n<p>First, download either the <a href=\"ftp:\/\/ftp.astron.com\/pub\/file\/\">latest release from the FTP site<\/a> or <a href=\"https:\/\/github.com\/glensc\/file\">clone from GitHub<\/a>. Do the usual unpack, &#8216;.\/configure&#8217;, &#8216;make&#8217; routine. To use this newly-built version and its associated magic file:<\/p>\n<pre>\r\n  .\/src\/file --magic-file magic\/magic.mgc &lt;file&gt;\r\n<\/pre>\n<p>To add a new format for identification, first run the foregoing command to ensure that it&#8217;s not already identified. Then, check over the files in magic\/Magdir and see which one might pertain to what you&#8217;re doing (it&#8217;s unlikely that your format will merit a new file in this directory). For example, for this round, I modified animation, audio, iff, and riff. Add or modify existing specs based on the copious examples in the directory and by consulting the appropriate man page (&#8216;man 5 magic&#8217;).<\/p>\n<p>Finally, run &#8216;make&#8217; again, which will regenerate the magic file. 
Invoke the above command again to use the modified magic file.<\/p>\n<p><strong>Before and After<\/strong><br \/>\nOn a selection of formats taken from the samples archive (renamed and cut down to a kilobyte because detection typically only relies on the first few bytes), here is the &#8220;before&#8221;:<\/p>\n<pre>\r\namv:            RIFF (little-endian) data\r\narmovie:        data\r\nbbc-dirac:      data\r\ninterplay-mve:  data\r\nmtv:            data\r\nnintendo-thp:   data\r\nnullsoft-video: data\r\nredcode:        data\r\nsega-film:      data\r\nsmacker:        data\r\ntrueaudio:      data\r\nvqa:            IFF data\r\nwavpack:        data\r\nwc3-mve:        IFF data\r\nwtv:            data\r\n<\/pre>\n<p>And the &#8220;after&#8221;:<\/p>\n<pre>\r\namv:            RIFF (little-endian) data, AMV \r\narmovie:        ARMovie\r\nbbc-dirac:      BBC Dirac Video\r\ninterplay-mve:  Interplay MVE Movie\r\nmtv:            MTV Multimedia File\r\nnintendo-thp:   Nintendo THP Multimedia\r\nnullsoft-video: Nullsoft Video\r\nredcode:        REDCode Video\r\nsega-film:      Sega FILM\/CPK Multimedia, 320 x 224\r\nsmacker:        RAD Game Tools Smacker Multimedia version 2, 320 x 200, 100 frames\r\ntrueaudio:      True Audio Lossless Audio\r\nvqa:            IFF data, Westwood Studios VQA Multimedia, 418 video frames, 320 x 200\r\nwavpack:        WavPack Lossless Audio\r\nwc3-mve:        IFF data, Wing Commander III Video, PC version\r\nwtv:            Windows Television DVR Media\r\n<\/pre>\n<p>After rerunning &#8216;file&#8217; on the mphq corpus using the modified magic file, only 1329 files remain unidentified (down from 1501).<\/p>\n<p><strong>Going Forward<\/strong><br \/>\nAs mentioned, MPEG detection could probably be strengthened. However, a major weakness is QuickTime\/MP4. 
Many files are not detected, probably owing to the many ways that QuickTime files can begin.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Making the &#8216;file&#8217; utility recognize a few more multimedia formats<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-3564","post","type-post","status-publish","format-standard","hentry","category-general"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts\/3564","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/comments?post=3564"}],"version-history":[{"count":11,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts\/3564\/revisions"}],"predecessor-version":[{"id":3576,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts\/3564\/revisions\/3576"}],"wp:attachment":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/media?parent=3564"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/categories?post=3564"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/tags?post=3564"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}