Process of Confusion

I am working hard at designing a better FATE right now. But first things first: I’m revisiting an old problem and hoping to conclusively determine certain process-related behavior.

I first described the problem in this post and claimed in this post that I had hacked around the problem. Here’s the thing: When I spin off a new process to run an FFmpeg command line, Python’s process object specifies a PID. Who does this PID belong to? The natural assumption would be that it belongs to FFmpeg. However, I learned empirically that it actually belongs to the shell interpreter that launches the FFmpeg command line, and that FFmpeg itself gets a PID 1 greater than the shell interpreter’s. So my quick and dirty solution was to assume that the actual FFmpeg PID was 1 greater than the PID returned from Python’s subprocess.Popen() call.

Bad assumption. The above holds true for Linux but not for Mac OS X, where the FFmpeg command line has the returned PID. I’m not sure what Windows does.
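
One way to check, on any given platform, which process the returned PID belongs to is to have the child report its own PID and compare the two. The following is a minimal sketch, assuming Python 3 on a POSIX system; the helper name pid_comparison and the use of the Python interpreter as the child program are illustrative choices, not part of FATE.

```python
import shlex
import subprocess
import sys

# Hypothetical child program: prints its own PID and exits. sys.executable
# is used so the sketch does not depend on any other binary being installed.
REPORT_PID = [sys.executable, "-c", "import os; print(os.getpid())"]


def pid_comparison(use_shell):
    """Return (PID reported by Popen, PID the child reports for itself)."""
    if use_shell:
        # A single string, run through the platform's shell.
        command = " ".join(shlex.quote(arg) for arg in REPORT_PID)
    else:
        # A sequence of arguments, executed directly with no shell.
        command = REPORT_PID
    process = subprocess.Popen(command, shell=use_shell,
                               stdout=subprocess.PIPE)
    stdout, _ = process.communicate()
    return process.pid, int(stdout.decode().strip())


if __name__ == "__main__":
    for use_shell in (False, True):
        popen_pid, child_pid = pid_comparison(use_shell)
        print("shell=%s: Popen PID %d, child PID %d, same: %s"
              % (use_shell, popen_pid, child_pid, popen_pid == child_pid))
```

With shell=False the two PIDs match. With shell=True the result depends on whether the shell forks a child for the command or replaces itself with it, which is the platform difference described above.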

This all matters for the timeout killer. FATE guards against the possibility of infinite loops by specifying a timeout for each test. Timeouts don’t do much good when they send TERM and KILL signals to the wrong PID. I tested my process runner carefully when first writing FATE (on Linux) and everything worked okay using the same PID returned by the API. I think that was because I was testing the process runner using the built-in ‘sleep’ shell command. This time, I wrote a separate program called ‘hangaround’ that takes a number of seconds to hang around before exiting. This is my testing methodology:

From another command line:

$ ps ax|grep hangaround
21433 pts/2    S+     0:00 /bin/sh -c ./hangaround 30
21434 pts/2    S+     0:00 ./hangaround 30
21436 pts/0    R+     0:00 grep hangaround

That’s Linux; for Mac OS X:

>>> process.pid
82079

$ ps ax|grep hangaround
82079 s005  S+     0:00.01 ./hangaround 30
82084 s006  R+     0:00.00 grep hangaround

So, the upshot is that I’m a little confused about how I’m going to create a general solution to work around this problem, one that doesn’t occur very often but makes FATE fail hard when it does show up.


20 thoughts on “Process of Confusion”

  1. SvdB

    Maybe specifying “exec ./hangaround 30” as the command would work? It should on Linux, but I don’t know what would happen on OSX. Worst case, you could detect whether you need to add “exec”, and adjust your command accordingly.

    And how about the process group? Does Popen() create a new one? Otherwise, you could just kill the thing using the pgid, instead of using the pid.

    Another option would be to make ffmpeg listen to SIGHUP, which should be sent when the parent dies. That way, you could just kill the shell. Though I’m not sure what would happen if there is no controlling terminal.

    Just a couple of things off the top of my head; I haven’t actually tested these ideas.
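
The process-group suggestion can be sketched as follows. This is an illustration only, assuming Python 3 on a POSIX system; run_with_timeout and grace_seconds are hypothetical names, not FATE code. Passing start_new_session=True makes the child call setsid(), so it leads a new process group whose ID equals the PID that Popen returns.

```python
import os
import signal
import subprocess


def run_with_timeout(command, timeout_seconds, grace_seconds=2.0):
    """Run a shell command line; signal its whole process group on timeout.

    Returns (returncode, stdout, stderr, timed_out).
    """
    process = subprocess.Popen(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True,  # child becomes leader of a new group
    )
    timed_out = False
    try:
        stdout, stderr = process.communicate(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        timed_out = True
        # The group ID is the leader's PID. The leader is either the shell
        # or the program the shell replaced itself with, so the signal
        # reaches the program in both cases.
        os.killpg(process.pid, signal.SIGTERM)
        try:
            stdout, stderr = process.communicate(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            # TERM was ignored; escalate to KILL.
            os.killpg(process.pid, signal.SIGKILL)
            stdout, stderr = process.communicate()
    return process.returncode, stdout, stderr, timed_out


if __name__ == "__main__":
    print(run_with_timeout("sleep 30", 0.5))
```

Because the signal goes to the group rather than to one PID, it no longer matters whether the shell sits between Python and the program. This does nothing for a process hanging on the far side of an SSH connection.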

  2. Michael Kostylev

    A general solution looks like reinventing a small part of procps. Furthermore, a hanging process might be on the remote side, so killing the local [rs]sh wouldn’t be enough.

  3. Cd-MaN

    Looking over the subprocess documentation ( http://docs.python.org/library/subprocess.html ), it seems to take an “executable” argument. So I would assume that, to avoid spawning an extra shell, you would run it like this:

    subprocess.Popen("./hangaround 30",
                     executable="./hangaround",
                     shell=True,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.PIPE)

  4. Tomer Gabel

    If all else fails, you can always return the PID via the shell command’s return code/stdout, something like

    ./hangaround 30& && exit $$

    Not sure that’s the correct command line syntax, but I’m pretty sure something like it would work correctly.

  5. Mans

    Here’s a more radical idea: add an option to ffmpeg for limiting the runtime. This would be more widely useful than just for FATE.

  6. Short Circuit

    On Win32, the kosher approach is to Wait (as in WaitForSingleObject) on the process handle for the process you’re waiting on. WaitForSingleObject takes a timeout (in milliseconds), and will return WAIT_TIMEOUT if the timeout expires before the process exits. If the process exits first, you would get WAIT_OBJECT_0 for a return value.

    To get the process handle, you could enumerate the list of running processes, find your ffmpeg process, and get the process handle for that process.

    These functions are syscalls, so if you have Python load NT’s kernel DLL and get the exported functions, you’ll have access to them. Chances are, ActivePython already has a module for it.

    Sucks that you might have to do it differently on Win32 vs other systems, though. You might try looking at NT’s POSIX subsystem. I remember looking at that and being thoroughly disgusted by it, but I believe of the problems I faced came from having relatively non-technical users.

  7. Raymond Tau

    Well, I’ve tried this on Linux. If you pass the args as a sequence of parameters, and use shell=False, the PID returned is correct.

    That is, something like this:
    import subprocess
    process = subprocess.Popen(["sleep", "30"],
                               # shell=True,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    print(process.pid)

  8. Multimedia Mike Post author

    @Mans: Your idea of limiting the runtime sort of assumes that FFmpeg will never get into an errant infinite loop which is exactly what I’m trying to guard against. This has happened at least once on FATE’s watch. Unfortunately, the full {MAKETEST} regression suite, with its 1-hour timeout, was — and continues to be — the weak link here. I can’t wait to break that up into smaller, saner tests with minimal timeouts.

  9. Multimedia Mike Post author

    @Raymond: Indeed, I should have mentioned that there is the option of breaking up the args into a sequence so that the shell isn’t invoked. However, there’s the minor issue of, well… I just don’t want to manually break up the command line strings into sequences of arguments. It sounds error-prone, unless there’s a standard Python API to take care of it. Is there a standard Python API to take care of it? Sure, it seems like a matter of breaking up the string where there is whitespace. But further down the road, I plan to test metadata options which will require options like:

    ffmpeg -i file -title "Hey, is this thing on?" …

    Breaking up by spaces isn’t the right thing to do there.

    Alternatively, I could re-architect FATE slightly to specify tests as sequences of parameters instead of just long strings. But that also ranks quite high on my list of approaches I’d rather not take.
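
On the question of a standard API: the standard library's shlex module splits a command line using POSIX shell quoting rules. A minimal sketch, assuming the test specs stay as single strings and use only shell-style quoting (shlex.split does not interpret pipes, redirects, or variable expansion):

```python
import shlex

command_line = 'ffmpeg -i file -title "Hey, is this thing on?"'

# shlex.split honors the quotes, so the title stays a single argument.
args = shlex.split(command_line)
print(args)
# ['ffmpeg', '-i', 'file', '-title', 'Hey, is this thing on?']
```

The resulting list could be passed to subprocess.Popen() with shell=False, so that the returned PID belongs to the program itself rather than to a shell.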

  10. Multimedia Mike Post author

    Here’s another brainstorm: Ask FFmpeg to print out its own PID (perhaps only when a new option is specified). FATE can parse this and use it in the timeout killer. As a bonus, this would reveal the PID of a process running on a remote platform via SSH and facilitate a timeout kill via a separate SSH process.

  11. SvdB

    “Here’s another brainstorm: Ask FFmpeg to print out its own PID (perhaps only when a new option is specified).”

    You wouldn’t need to modify FFmpeg for that. You could invoke a shell script which starts ffmpeg in the background, and then returns its PID:
    ./hangaround 30 &
    echo $!
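
Reading that PID back from Python might look like the sketch below. It assumes a POSIX shell and Python 3; launch_in_background is a hypothetical helper, and the redirection to /dev/null is only there so that the call returns as soon as the PID has been printed. A real test runner would need to route the command's output somewhere useful instead.

```python
import os
import signal
import subprocess


def launch_in_background(command):
    """Start a simple command in the background; return its PID."""
    # "$!" expands to the PID of the most recent background job. The job's
    # own output is discarded so that only the PID arrives on the pipe.
    wrapped = "%s >/dev/null 2>&1 & echo $!" % command
    shell = subprocess.Popen(wrapped, shell=True, stdout=subprocess.PIPE)
    stdout, _ = shell.communicate()
    return int(stdout.decode().strip())


if __name__ == "__main__":
    pid = launch_in_background("sleep 30")
    print("background PID:", pid)
    os.kill(pid, signal.SIGTERM)  # what a timeout killer would do
```

One trade-off of this approach is that the background process is no longer a direct child of Python, so its exit status cannot be collected with wait().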

  12. Multimedia Mike Post author

    I really don’t like wrapping these test specs in too much shell magic. Every layer introduces possible problems when, e.g., running remotely, running on Windows, or running on some other systems that we haven’t considered yet (BeOS/Haiku?).

  13. Mans

    @Mike: Read about setrlimit(). You can ask the OS to kill a process after it has used a set amount of CPU time. Maybe I should just send a patch…

  14. Multimedia Mike Post author

    @Mans: I’ll check out setrlimit(), especially since I feel I’m running short on other workable ideas. Do you suppose it will work for Windows?

  15. Mans

    No idea. I can tell you that it must be done on the system running ffmpeg, not the one running the FATE script.

  16. Multimedia Mike Post author

    It turns out that Python has a module for interfacing to getrlimit() and setrlimit(). Wouldn’t you know, the module is only advertised to be available on Unix. Also, per my reading of the documentation, setrlimit() allows restriction on the amount of CPU time a process gets. I’m pretty sure that refers to total CPU running time, not wall clock time. If FFmpeg got into some I/O blocked state, it would likely never hit the timeout. Further, this would be useless (if my theory is correct) in the remote execution case where SSH is just sitting around waiting for the remote process to finish, maybe processing a few NO-OP packets here and there.
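
The CPU-time versus wall-clock distinction can be seen in a small sketch using the resource module. It assumes Python 3 on a Unix system; run_with_cpu_limit is a hypothetical helper. The limit is applied in the child between fork and exec via preexec_fn, so it affects only the launched program.

```python
import resource
import subprocess
import sys


def run_with_cpu_limit(argv, cpu_seconds):
    """Run argv with a cap on CPU time (not wall-clock time)."""
    def limit_cpu():
        # Soft limit: the kernel sends SIGXCPU once it is exceeded.
        # Hard limit: one second later, as a backstop.
        resource.setrlimit(resource.RLIMIT_CPU,
                           (cpu_seconds, cpu_seconds + 1))

    process = subprocess.Popen(argv, preexec_fn=limit_cpu)
    return process.wait()


if __name__ == "__main__":
    # A busy loop burns CPU and is stopped by a signal after about a second.
    spin = [sys.executable, "-c", "while True: pass"]
    print("busy loop exit status:", run_with_cpu_limit(spin, 1))

    # A sleeping process uses almost no CPU, so the limit never triggers.
    nap = [sys.executable, "-c", "import time; time.sleep(3)"]
    print("sleeper exit status:", run_with_cpu_limit(nap, 1))
```

The busy loop ends with a negative exit status (killed by a signal), while the sleeper runs for its full three seconds and exits normally. A process blocked on I/O behaves like the sleeper, which is why a CPU limit alone cannot replace a wall-clock timeout.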

  17. Reimar

    Well, for remote execution you’d have to do even more it seems, you’d have to kill the remote ffmpeg and the local ssh.
    Of course the thing is that “normally” killing the shell should also kill ssh and/or ffmpeg, so it might make more sense to investigate how you can ensure that this happens and just kill the shell.

  18. Raymond Tau

    @Multimedia: Actually, I am not certain whether it works on Windows or not. However, I guess it should work on most *nix machines.
