I am working hard at designing a better FATE right now. But first things first: I’m revisiting an old problem and hoping to conclusively determine certain process-related behavior.
I first described the problem in this post and claimed in this post that I had hacked around the problem. Here’s the thing: When I spin off a new process to run an FFmpeg command line, Python’s process object specifies a PID. Who does this PID belong to? The natural assumption would be that it belongs to FFmpeg. However, I learned empirically that it actually belongs to a shell interpreter that is launching the FFmpeg command line, which has a PID 1 greater than the shell interpreter. So my quick and dirty solution was to assume that the actual FFmpeg PID was 1 greater than the PID returned from Python’s subprocess.Popen() call.
Bad assumption. The above holds true for Linux but not for Mac OS X, where the FFmpeg command line has the returned PID. I’m not sure what Windows does.
This all matters for the timeout killer. FATE guards against the possibility of infinite loops by specifying a timeout for each test. Timeouts don’t do much good when they trigger TERM and KILL signals to the wrong PID. I tested my process runner carefully when first writing FATE (on Linux) and everything worked okay with using the same PID returned by the API. I think that was because I was testing the process runner using the built-in ‘sleep’ shell command. This time, I wrote a separate program called ‘hangaround’ that takes a number of seconds to hang around before exiting. This is my testing methodology:
From another command line:
$ ps ax|grep hangaround
21433 pts/2 S+ 0:00 /bin/sh -c ./hangaround 30
21434 pts/2 S+ 0:00 ./hangaround 30
21436 pts/0 R+ 0:00 grep hangaround
That’s Linux; for Mac OS X:
>>> process.pid
82079
$ ps ax|grep hangaround
82079 s005 S+ 0:00.01 ./hangaround 30
82084 s006 R+ 0:00.00 grep hangaround
So, the upshot is that I’m a little confused about how I’m going to create a general solution to work around this problem, one that doesn’t occur very often but makes FATE fail hard when it does show up.
Followup:
- Process Runner Redux: Making some hard decisions regarding these problems
Maybe specifying “exec ./hangaround 30” as the command would work? It should on Linux, but I don’t know what would happen on OSX. Worst case, you could detect whether you need to add “exec”, and adjust your command accordingly.
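A minimal sketch of the "exec" idea, in Python 3 and assuming a POSIX shell behind shell=True. The helper name spawn_with_exec is hypothetical, and the child here is a Python one-liner that prints its own PID (standing in for hangaround) so the two PIDs can be compared directly:

```python
import shlex
import subprocess
import sys


def spawn_with_exec(command):
    """Run `command` through the shell, but have the shell exec it.

    With "exec", a POSIX shell replaces itself with the command instead of
    forking a child, so the PID Popen reports is the command's own PID.
    """
    return subprocess.Popen("exec " + command, shell=True,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)


# The child prints its own PID so it can be compared with Popen's view.
child_code = "import os; print(os.getpid())"
command = "%s -c %s" % (shlex.quote(sys.executable), shlex.quote(child_code))

process = spawn_with_exec(command)
out, _ = process.communicate()
reported_pid = int(out.decode().strip())
print(process.pid == reported_pid)
```

This only helps when the command line is a single simple command; a pipeline or a ";"-separated list would still leave a shell in between.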
And how about the process group? Does Popen() create a new one? Otherwise, you could just kill the thing using the pgid, instead of using the pid.
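By default Popen() does not create a new process group; the child inherits the caller's. A rough sketch of the process-group approach in Python 3 follows, assuming a POSIX system. The function name run_with_timeout and the 5-second grace period are illustrative choices, not anything FATE does. In Python 2, preexec_fn=os.setsid would play the role of start_new_session=True.

```python
import os
import signal
import subprocess


def run_with_timeout(command, timeout):
    """Run a shell command; on timeout, signal its whole process group."""
    # start_new_session=True makes the child call setsid(), so the shell
    # becomes the leader of a new process group whose ID equals its PID.
    process = subprocess.Popen(command, shell=True, start_new_session=True,
                               stdout=subprocess.DEVNULL,
                               stderr=subprocess.DEVNULL)
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Signal the shell and everything it spawned in one call.
        os.killpg(process.pid, signal.SIGTERM)
        try:
            return process.wait(timeout=5)
        except subprocess.TimeoutExpired:
            os.killpg(process.pid, signal.SIGKILL)
            return process.wait()
```

Because the signal goes to the group, it no longer matters whether the PID Popen returned belongs to the shell or to the command itself.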
Another option would be to make ffmpeg listen to SIGHUP, which should be sent when the parent dies. That way, you could just kill the shell. Though I’m not sure what would happen if there is no controlling terminal.
Just a couple of things off the top of my head; I haven’t actually tested these ideas.
A general solution looks like reinventing a small part of procps. Furthermore, a hanging process might be on the remote side, so killing the local [rs]sh wouldn’t be enough.
Looking over the subprocess documentation ( http://docs.python.org/library/subprocess.html ), it seems to take an “executable” argument. So I would assume that, to avoid spawning an extra shell, you would run it like this:
subprocess.Popen("./hangaround 30",
                 executable="./hangaround",
                 shell=True,
                 stdout=subprocess.PIPE,
                 stderr=subprocess.PIPE)
If all else fails, you can always return the PID via the shell command’s return code/stdout, something like
./hangaround 30& && exit $$
Not sure that’s the correct command line syntax, but I’m pretty sure something like it would work correctly.
Here’s a more radical idea: add an option to ffmpeg for limiting the runtime. This would be more widely useful than just for FATE.
@Tomer: that won’t work since the process exit status is limited to 8 bits.
On Win32, the kosher approach is to Wait (as in WaitForSingleObject) on the process handle for the process you’re waiting on. WaitForSingleObject takes a timeout (in milliseconds), and will return WAIT_TIMEOUT if the timeout expires before the process exits. If the process exits first, you would get WAIT_OBJECT_0 for a return value.
To get the process handle, you could enumerate the list of running processes, find your ffmpeg process, and get the process handle for that process.
These functions are syscalls, so if you have Python load NT’s kernel DLL and get the exported functions, you’ll have access to them. Chances are, ActivePython already has a module for it.
Sucks that you might have to do it differently on Win32 vs other systems, though. You might try looking at NT’s POSIX subsystem. I remember looking at that and being thoroughly disgusted by it, but I believe of the problems I faced came from having relatively non-technical users.
Well, I’ve tried this on Linux. If you pass the args as a sequence of parameters, and use shell=False, the PID returned is correct.
That is, something like this:
import subprocess
process = subprocess.Popen(["sleep", "30"],
                           # shell=True,
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE)
print(process.pid)
@Mans: Your idea of limiting the runtime sort of assumes that FFmpeg will never get into an errant infinite loop which is exactly what I’m trying to guard against. This has happened at least once on FATE’s watch. Unfortunately, the full {MAKETEST} regression suite, with its 1-hour timeout, was — and continues to be — the weak link here. I can’t wait to break that up into smaller, saner tests with minimal timeouts.
@Raymond: Indeed, I should have mentioned that there is the option of breaking up the args into a sequence so that the shell isn’t invoked. However, there’s the minor issue of, well… I just don’t want to manually break up the command line strings into sequences of arguments. It sounds error-prone, unless there’s a standard Python API to take care of it. Is there a standard Python API to take care of it? Sure, it seems like a matter of breaking up the string where there is whitespace. But further down the road, I plan to test metadata options which will require options like:
ffmpeg -i file -title "Hey, is this thing on?" …
Breaking up by spaces isn’t the right thing to do there.
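For what it’s worth, the standard library does have something for this: shlex.split() splits a string using shell-like quoting rules. A small sketch, reusing the example command above without the trailing elision:

```python
import shlex

# A quoted argument containing spaces stays together as one list element.
command = 'ffmpeg -i file -title "Hey, is this thing on?"'
args = shlex.split(command)
print(args)
# ['ffmpeg', '-i', 'file', '-title', 'Hey, is this thing on?']
```

The resulting list could then be handed to subprocess.Popen() without shell=True, so the test specs could stay as plain strings.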
Alternatively, I could re-architect FATE slightly to specify tests as sequences of parameters instead of just long strings. But that also ranks quite high on my list of approaches I’d rather not take.
Here’s another brainstorm: Ask FFmpeg to print out its own PID (perhaps only when a new option is specified). FATE can parse this and use it in the timeout killer. As a bonus, this would reveal the PID of a process running on a remote platform via SSH and facilitate a timeout kill via a separate SSH process.
You wouldn’t need to modify FFmpeg for that. You could invoke a shell script which starts ffmpeg in the background, and then returns its PID:
./hangaround 30 &
echo $!
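From the Python side, that wrapper might be driven roughly like this. It is a Python 3 sketch assuming a POSIX shell; spawn_and_report_pid is a made-up helper name, and sleep stands in for hangaround:

```python
import os
import signal
import subprocess


def spawn_and_report_pid(command):
    """Start `command` in the background of a shell and return its PID."""
    # Redirect the background job's output so it does not hold our pipe
    # open; "$!" expands to the PID of the most recent background job.
    wrapper = "%s >/dev/null 2>&1 & echo $!" % command
    shell = subprocess.Popen(wrapper, shell=True, stdout=subprocess.PIPE)
    out, _ = shell.communicate()
    return int(out.decode().strip())


pid = spawn_and_report_pid("sleep 30")
os.kill(pid, signal.SIGTERM)  # a timeout killer can now target the real PID
```

The catch is that the shell exits right away, so the command’s exit status and output are no longer available through the Popen object and would have to be collected some other way.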
I really don’t like wrapping these test specs in too much shell magic. Every layer introduces possible problems when, e.g., running remotely, running on Windows, or running on some other systems that we haven’t considered yet (BeOS/Haiku?).
@Mike: Read about setrlimit(). You can ask the OS to kill a process after it has used a set amount of CPU time. Maybe I should just send a patch…
@Mans: I’ll check out setrlimit(), especially since I feel I’m running short on other workable ideas. Do you suppose it will work for Windows?
No idea. I can tell you that it must be done on the system running ffmpeg, not the one running the FATE script.
It turns out that Python has a module for interfacing to getrlimit() and setrlimit(). Wouldn’t you know, the module is only advertised to be available on Unix. Also, per my reading of the documentation, setrlimit() allows restriction on the amount of CPU time a process gets. I’m pretty sure that refers to total CPU running time, not wall clock time. If FFmpeg got into some I/O blocked state, it would likely never hit the timeout. Further, this would be useless (if my theory is correct) in the remote execution case where SSH is just sitting around waiting for the remote process to finish, maybe processing a few NO-OP packets here and there.
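To make the CPU-time point concrete, here is a rough Python 3 sketch using the resource module (Unix only, as noted). The helper name run_with_cpu_limit is hypothetical, and it assumes the existing hard limit is at least as large as the requested soft limit:

```python
import resource
import signal
import subprocess
import sys


def run_with_cpu_limit(args, cpu_seconds):
    """Run `args` (a sequence, no shell) under a soft CPU-time limit."""
    def limit_cpu():
        # Runs in the child between fork() and exec(). Keep the existing
        # hard limit and lower only the soft one; crossing the soft limit
        # makes the kernel send SIGXCPU to the process.
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, hard))

    process = subprocess.Popen(args, preexec_fn=limit_cpu)
    return process.wait()


# A busy loop burns CPU time and is stopped after roughly one CPU-second.
busy = run_with_cpu_limit([sys.executable, "-c", "while True: pass"], 1)
print(busy == -signal.SIGXCPU)
```

A child that is asleep or blocked on I/O accrues almost no CPU time, so this limit would not catch it; that case still needs a wall-clock timeout.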
Well, for remote execution you’d have to do even more, it seems: you’d have to kill the remote ffmpeg and the local ssh.
Of course the thing is that “normally” killing the shell should also kill ssh and/or ffmpeg, so it might make more sense to investigate how you can ensure that this happens and just kill the shell.
@Multimedia: Actually, I am not certain if it works in Windows or not. However, I guess it should work on most *nix machines.
I sent a patch for a -timelimit option. Let’s see how that goes down.