Process of Confusion | Breaking Eggs And Making Omelettes

I am working hard at designing a better FATE right now. But first things first: I’m revisiting an old problem and hoping to conclusively determine certain process-related behavior.

I first described the problem in this post and claimed in this post that I had hacked around it. Here’s the thing: When I spin off a new process to run an FFmpeg command line, Python’s process object specifies a PID. Who does this PID belong to? The natural assumption would be that it belongs to FFmpeg. However, I learned empirically that it actually belongs to the shell interpreter that launches the FFmpeg command line, and that the FFmpeg process itself has a PID 1 greater than the shell’s. So my quick and dirty solution was to assume that the actual FFmpeg PID was 1 greater than the PID returned from Python’s subprocess.Popen() call.

Bad assumption. The above holds true for Linux but not for Mac OS X, where the FFmpeg command line has the returned PID. I’m not sure what Windows does.

This all matters for the timeout killer. FATE guards against the possibility of infinite loops by specifying a timeout for each test. Timeouts don’t do much good when they send TERM and KILL signals to the wrong PID. I tested my process runner carefully when first writing FATE (on Linux) and everything worked okay using the same PID returned by the API. I think that was because I was testing the process runner using the built-in ‘sleep’ shell command. This time, I wrote a separate program called ‘hangaround’ that takes a number of seconds to hang around before exiting. This is my testing methodology:

From another command line:

$ ps ax|grep hangaround
21433 pts/2    S+     0:00 /bin/sh -c ./hangaround 30
21434 pts/2    S+     0:00 ./hangaround 30
21436 pts/0    R+     0:00 grep hangaround

That’s Linux; for Mac OS X:

>>> process.pid
82079

$ ps ax|grep hangaround
82079 s005  S+     0:00.01 ./hangaround 30
82084 s006  R+     0:00.00 grep hangaround

So, the upshot is that I’m a little confused about how I’m going to create a general solution to work around this problem, one that doesn’t occur very often but makes FATE fail hard when it does show up.

Followup:

Process Runner Redux: Making some hard decisions regarding these problems

20 thoughts on “Process of Confusion”

SvdB November 12, 2009 at 1:00 am

Maybe specifying “exec ./hangaround 30” as the command would work? It should on Linux, but I don’t know what would happen on OSX. Worst case, you could detect whether you need to add “exec”, and adjust your command accordingly.

And how about the process group? Does Popen() create a new one? Otherwise, you could just kill the thing using the pgid, instead of using the pid.

Another option would be to make ffmpeg listen to SIGHUP, which should be sent when the parent dies. That way, you could just kill the shell. Though I’m not sure what would happen if there is no controlling terminal.

Just a couple of things off the top of my head; I haven’t actually tested these ideas.
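The process group idea can be sketched in Python roughly as follows. This is a minimal sketch, not FATE's actual process runner: the function name run_with_group_timeout and the 5-second grace period are made up for illustration, it assumes a POSIX system, and it relies on the start_new_session argument and Popen.wait(timeout=...) found in current Python 3.

```python
import os
import signal
import subprocess


def run_with_group_timeout(command, timeout):
    """Run a shell command line; signal its whole process group on timeout.

    Hypothetical helper for illustration, POSIX only.
    """
    # start_new_session=True makes the child call setsid(), so the shell
    # becomes the leader of a new process group whose id equals process.pid.
    process = subprocess.Popen(command, shell=True, start_new_session=True)
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Signal the group rather than the single PID, so it no longer
        # matters whether the PID belongs to the shell or to its child.
        os.killpg(process.pid, signal.SIGTERM)
        try:
            return process.wait(timeout=5)  # arbitrary grace period
        except subprocess.TimeoutExpired:
            os.killpg(process.pid, signal.SIGKILL)
            return process.wait()
```

Calling run_with_group_timeout("./hangaround 30", 5) would then signal both the shell and hangaround after 5 seconds, on Linux and Mac OS X alike. It does nothing for a process hanging on the remote side of an SSH session.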

Michael Kostylev November 12, 2009 at 1:59 am

A general solution looks like reinventing a small part of procps. Furthermore, a hanging process might be on the remote side, so killing the local [rs]sh wouldn’t be enough.

Cd-MaN November 12, 2009 at 4:13 am

Looking over the subprocess documentation ( http://docs.python.org/library/subprocess.html ), it seems to take an “executable” argument. So I would assume that, to avoid spawning an extra shell, you would run it like this:

subprocess.Popen("./hangaround 30",
                 executable="./hangaround",
                 shell=True,
                 stdout=subprocess.PIPE,
                 stderr=subprocess.PIPE)

Tomer Gabel November 12, 2009 at 4:56 am

If all else fails, you can always return the PID via the shell command’s return code/stdout, something like

./hangaround 30& && exit $$

Not sure that’s the correct command line syntax, but I’m pretty sure something like it would work correctly.

Mans November 12, 2009 at 7:01 am

Here’s a more radical idea: add an option to ffmpeg for limiting the runtime. This would be more widely useful than just for FATE.

Mans November 12, 2009 at 7:02 am

@Tomer: that won’t work since the process exit status is limited to 8 bits.

Short Circuit November 12, 2009 at 7:19 am

On Win32, the kosher approach is to Wait (as in WaitForSingleObject) on the process handle for the process you’re waiting on. WaitForSingleObject takes a timeout (in milliseconds), and will return WAIT_TIMEOUT if the timeout expires before the process. If the process expires first, you would get WAIT_OBJECT_0 for a return value.

To get the process handle, you could enumerate the list of running processes, find your ffmpeg process, and get the process handle for that process.

These functions are syscalls, so if you have Python load NT’s kernel DLL and get the exported functions, you’ll have access to them. Chances are, ActivePython already has a module for it.

Sucks that you might have to do it differently on Win32 vs other systems, though. You might try looking at NT’s POSIX subsystem. I remember looking at that and being thoroughly disgusted by it, but I believe of the problems I faced came from having relatively non-technical users.
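For comparison, here is a minimal sketch of the same wait-with-timeout pattern using only Python's subprocess module. It assumes a current Python 3, where Popen.wait() accepts a timeout, and it passes the command as an argument list so that no shell sits between the script and the child. The helper name run_with_timeout is made up for illustration.

```python
import subprocess
import sys


def run_with_timeout(args, timeout):
    """Run args (a list, no shell) and kill the child if it outlives timeout.

    Hypothetical helper for illustration.
    """
    process = subprocess.Popen(args)
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # kill() sends SIGKILL on POSIX and calls TerminateProcess() on Windows.
        process.kill()
        return process.wait()


if __name__ == "__main__":
    # Stand-in for a hung test: a child interpreter that sleeps for 30 seconds.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hung, timeout=1))
```

Because there is no intermediate shell, process.pid here is the child's own PID, so there should be no need to enumerate running processes to find it.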

Raymond Tau November 12, 2009 at 8:46 am

Well, I’ve tried this on Linux. If you pass the args as a sequence of parameters, and use shell=False, the PID returned is correct.

That is, something like this:
import subprocess
process = subprocess.Popen(["sleep", "30"],
                           # shell=True,
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE)
print(process.pid)

Multimedia Mike Post author November 12, 2009 at 9:39 am

@Mans: Your idea of limiting the runtime sort of assumes that FFmpeg will never get into an errant infinite loop, which is exactly what I’m trying to guard against. This has happened at least once on FATE’s watch. Unfortunately, the full {MAKETEST} regression suite, with its 1-hour timeout, was, and continues to be, the weak link here. I can’t wait to break that up into smaller, saner tests with minimal timeouts.

Multimedia Mike Post author November 12, 2009 at 9:40 am

@Raymond: Indeed, I should have mentioned that there is the option of breaking up the args into a sequence so that the shell isn’t invoked. However, there’s the minor issue of, well… I just don’t want to manually break up the command line strings into sequences of arguments. It sounds error-prone, unless there’s a standard Python API to take care of it. Is there a standard Python API to take care of it? Sure, it seems like a matter of breaking up the string where there is whitespace. But further down the road, I plan to test metadata options which will require options like:

ffmpeg -i file -title "Hey, is this thing on?" …

Breaking up by spaces isn’t the right thing to do there.

Alternatively, I could re-architect FATE slightly to specify tests as sequences of parameters instead of just long strings. But that also ranks quite high on my list of approaches I’d rather not take.
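On the question of a standard API: Python's standard library has shlex.split(), which splits a command line string using POSIX shell quoting rules. A minimal sketch, using the example command above with the elided trailing options left out:

```python
import shlex

# The quoted title stays together as a single argument.
command = 'ffmpeg -i file -title "Hey, is this thing on?"'
args = shlex.split(command)
print(args)
# ['ffmpeg', '-i', 'file', '-title', 'Hey, is this thing on?']
```

The resulting list can be passed to subprocess.Popen() without shell=True. Note that shlex follows Unix shell conventions, so command lines written with Windows quoting in mind may split differently.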

Multimedia Mike Post author November 12, 2009 at 11:21 am

Here’s another brainstorm: Ask FFmpeg to print out its own PID (perhaps only when a new option is specified). FATE can parse this and use it in the timeout killer. As a bonus, this would reveal the PID of a process running on a remote platform via SSH and facilitate a timeout kill via a separate SSH process.

SvdB November 12, 2009 at 3:17 pm

Here’s another brainstorm: Ask FFmpeg to print out its own PID (perhaps only when a new option is specified).

You wouldn’t need to modify FFmpeg for that. You could invoke a shell script which starts ffmpeg in the background, and then returns its PID:
./hangaround 30 &
echo $!

Multimedia Mike Post author November 12, 2009 at 3:24 pm

I really don’t like wrapping these test specs in too much shell magic. Every layer introduces possible problems when, e.g., running remotely, running on Windows, or running on some other systems that we haven’t considered yet (BeOS/Haiku?).

Mans November 12, 2009 at 4:25 pm

@Mike: Read about setrlimit(). You can ask the OS to kill a process after it has used a set amount of CPU time. Maybe I should just send a patch…

Multimedia Mike Post author November 12, 2009 at 5:25 pm

@Mans: I’ll check out setrlimit(), especially since I feel I’m running short on other workable ideas. Do you suppose it will work for Windows?

Mans November 12, 2009 at 5:27 pm

No idea. I can tell you that it must be done on the system running ffmpeg, not the one running the FATE script.

Multimedia Mike Post author November 12, 2009 at 5:35 pm

It turns out that Python has a module for interfacing to getrlimit() and setrlimit(). Wouldn’t you know, the module is only advertised to be available on Unix. Also, per my reading of the documentation, setrlimit() allows restriction on the amount of CPU time a process gets. I’m pretty sure that refers to total CPU running time, not wall clock time. If FFmpeg got into some I/O blocked state, it would likely never hit the timeout. Further, this would be useless (if my theory is correct) in the remote execution case where SSH is just sitting around waiting for the remote process to finish, maybe processing a few NO-OP packets here and there.
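A minimal sketch of that approach, assuming a Unix system and using the resource module together with Popen's preexec_fn hook. The helper name limit_cpu and the 1-second limit are made up for illustration, and a child Python interpreter spinning in a loop stands in for a runaway FFmpeg:

```python
import resource
import signal
import subprocess
import sys


def limit_cpu(seconds):
    """Return a function that caps CPU time; it runs in the child before exec.

    Hypothetical helper for illustration, Unix only.
    """
    def apply_limit():
        # Lower only the soft limit and leave the hard limit untouched.
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        resource.setrlimit(resource.RLIMIT_CPU, (seconds, hard))
    return apply_limit


# A child that burns CPU forever stands in for an FFmpeg infinite loop.
spinner = [sys.executable, "-c", "while True: pass"]
process = subprocess.Popen(spinner, preexec_fn=limit_cpu(1))
returncode = process.wait()

# A negative return code means the child was killed by that signal number.
print(returncode == -signal.SIGXCPU)
```

The limit counts CPU seconds, so a child that merely sleeps or blocks on I/O accumulates almost no CPU time and would not be stopped by it, which is the wall clock concern raised above.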

Reimar November 12, 2009 at 6:08 pm

Well, for remote execution you’d have to do even more, it seems: you’d have to kill the remote ffmpeg and the local ssh.
Of course the thing is that “normally” killing the shell should also kill ssh and/or ffmpeg, so it might make more sense to investigate how you can ensure that this happens and just kill the shell.

Raymond Tau November 13, 2009 at 4:31 am

@Multimedia: Actually, I am not certain whether it works on Windows or not. However, I guess it should work on most *nix machines.

Mans November 14, 2009 at 10:04 am

I sent a patch for a -timelimit option. Let’s see how that goes down.
