{"id":3077,"date":"2010-12-31T19:40:34","date_gmt":"2011-01-01T03:40:34","guid":{"rendered":"http:\/\/multimedia.cx\/eggs\/?p=3077"},"modified":"2010-12-31T19:40:34","modified_gmt":"2011-01-01T03:40:34","slug":"a-better-process-runner","status":"publish","type":"post","link":"https:\/\/multimedia.cx\/eggs\/a-better-process-runner\/","title":{"rendered":"A Better Process Runner"},"content":{"rendered":"<p>I was recently processing a huge corpus of data. It went like this: For each file in a large set, run <code>'cmdline-tool &lt;file&gt;'<\/code>, capture the output, and log results to a database, including whether the tool crashed. I wrote it in Python. I have done this exact type of thing enough times in Python that I&#8217;m starting to notice a pattern.<\/p>\n<p>Every time I start writing such a program, I always begin with <a href=\"http:\/\/docs.python.org\/library\/commands.html\">Python&#8217;s commands module<\/a> because it&#8217;s the easiest thing to use. Then I always have to abandon the module when I am reminded, the hard way, that whatever &#8216;cmdline-tool&#8217; is, it might run amok and try to execute forever. That&#8217;s when I import (rather, copy over) my process runner from FATE, the one that is able to kill a process after it has been running too long. I have used this module enough times that I wonder if I should spin it off into a new Python module.<\/p>\n<p>Or maybe I&#8217;m going about this the wrong way. Perhaps when the data set reaches a certain size, I&#8217;m really supposed to throw it at some kind of distributed cluster rather than task it to a Python script (a multithreaded one, to be sure, but one that runs on a single machine). Running the job on a distributed architecture wouldn&#8217;t obviate the need for such early termination. But hopefully, such architectures already have that functionality built in. 
It&#8217;s something to research in the new year.<\/p>\n<p>I guess there are also process limits, enforced by the shell (via <code>ulimit<\/code>). I don&#8217;t think I have ever gotten those to work correctly, though.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I was recently processing a huge corpus of data. It went like this: For each file in a large set, run &#8216;cmdline-tool &lt;file&gt;&#8217;, capture the output, and log results to a database, including whether the tool crashed. I wrote it in Python. I have done this exact type of thing enough times in Python [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[],"class_list":["post-3077","post","type-post","status-publish","format-standard","hentry","category-python"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts\/3077","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/comments?post=3077"}],"version-history":[{"count":2,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts\/3077\/revisions"}],"predecessor-version":[{"id":3079,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/posts\/3077\/revisions\/3079"}],"wp:attachment":[{"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/media?parent=3077"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/categories?post=3077"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/multimedia.cx\/eggs\/wp-json\/wp\/v2\/tags?post=3077"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
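The run-capture-kill pattern the post describes can be sketched with Python's `subprocess` module and its `timeout` parameter (Python 3.5+; the `commands` module the post links to is Python 2 only and was removed in Python 3). This is a minimal illustration of the idea, not the author's FATE process runner; the function name and the timeout value are placeholders.

```python
# Sketch of the pattern: run 'cmdline-tool <file>' with a wall-clock
# limit, capture its output, and note whether it crashed or had to be
# killed for running too long.
import subprocess

def run_with_timeout(cmd, timeout_secs=60):
    """Run cmd (a list of args); return (returncode, output, timed_out)."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the captured output
            timeout=timeout_secs,
        )
        return proc.returncode, proc.stdout, False
    except subprocess.TimeoutExpired as exc:
        # subprocess.run() kills the child when the timeout expires;
        # exc.output holds whatever partial output was captured.
        return None, exc.output or b"", True

rc, out, timed_out = run_with_timeout(["echo", "hello"], timeout_secs=5)
```

Because `subprocess.run()` kills the child itself when the timeout expires, the "try to execute forever" case is handled without a hand-rolled watchdog thread; logging the `(returncode, output, timed_out)` triple to a database, and detecting a crash from a negative return code (killed by signal), is left to the caller.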