Category Archives: Python

My Own Offline RSS Reader (Part 2)

About that “true” offline RSS reader that I pitched in my last post, I’ll have you know that I made a minimally functioning system based on that outline.

These are the primary challenges/unknowns that I assessed from the outset:

  1. Manipulating relative URLs of supporting files
  2. Parsing HTML in Python
  3. Searching and replacing within the HTML file
  4. Downloaded .js files that include other .js files

For #1, Python’s urlparse library works wonders. For #2 and #3, look no further than Python’s HTMLParser module. This blog post helped me greatly. I have chosen not to address #4 at this time. I’m not downloading any JavaScript files right now; the CSS and supporting images are mostly adequate.
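As an illustration of challenges #1 and #2, here is a minimal sketch, not the code from the actual reader. It assumes Python 3, where the urlparse and HTMLParser modules mentioned above live at urllib.parse and html.parser. The class name SupportingFileFinder, the helper find_supporting_files, and the example.com URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SupportingFileFinder(HTMLParser):
    """Collect absolute URLs of the stylesheets and images a page refers to."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            # urljoin resolves a relative reference against the page's URL.
            self.urls.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.urls.append(urljoin(self.base_url, attrs["href"]))


def find_supporting_files(html, base_url):
    """Return the absolute URLs of supporting files referenced by html."""
    finder = SupportingFileFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.urls
```

Feeding it a page fetched from http://example.com/blog/post/index.html would turn a reference such as ../css/site.css into http://example.com/blog/css/site.css, which is the URL to download.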

Further, it turned out not to be necessary to manually build an XML parser. Whenever I encountered a task that felt like it was going to be too much work — like manually parsing the XML feeds using Python’s low-level XML systems — a little searching revealed that all the hard work was already done. In the case of parsing the RSS files, the task was rendered trivial thanks to FeedParser.

Brief TODO list, for my own reference:

  • Index the database tables in a sane manner
  • Deal with exceptions thrown by malformed HTML
  • Update the post table to indicate that a post has been “read” when it is accessed
  • Implement HTTP redirection (since some RSS feeds apparently do that)
  • Implement cache control so that the browser will properly refresh feed lists
  • Add a stylesheet that will allow the server to control the appearance of links depending on whether or not the posts have been read
  • Take into account non-ASCII encoding (really need to train myself to do this from the get-go)
  • Forge user agent and referrer strings in HTTP requests, for good measure
  • Slap some kind of UI prettiness on top of the whole affair; I’m thinking an accordion widget containing tables might work well and I think there are a number of JavaScript libraries that could make that happen
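For the user agent and referrer item, a minimal sketch using Python 3's urllib.request might look like the following. The function name build_request and the default user agent string are made up for illustration. As far as the redirection item goes, urllib.request.urlopen follows HTTP redirects by default, so a request built this way should not need extra redirect handling.

```python
import urllib.request


def build_request(url, referrer=None,
                  user_agent="Mozilla/5.0 (compatible; offline-rss-reader)"):
    """Build a Request carrying custom User-Agent and Referer headers."""
    headers = {"User-Agent": user_agent}
    if referrer:
        # The HTTP header name really is spelled "Referer".
        headers["Referer"] = referrer
    return urllib.request.Request(url, headers=headers)
```

The resulting object can be passed straight to urllib.request.urlopen() in place of a bare URL.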

Once I get that far, I’ll probably put some code out there. Based on what I have read, I’m not the only person who is looking for a solution like this.

I eventually released this software. Find it on GitHub.

Process Runner Redux

Pursuant to yesterday’s conundrum of creating a portable process runner in Python for FATE that can be reliably killed when exceeding time constraints, I settled on a solution. As Raymond Tau reminded us in the ensuing discussion, Python won’t use a shell to launch the process if the program can supply the command and its arguments as a sequence data structure. I knew this but was intentionally avoiding it. It seems like a simple problem to break up a command line into a sequence of arguments: just split on spaces. However, I hope to test metadata options eventually which could include arguments such as ‘-title “Hey, is this thing on?”‘ where splitting on spaces clearly isn’t the right solution.

I got frustrated enough with the problem that I decided to split on spaces anyway. Hey, I control this system from top to bottom, so new rule: No command line arguments in test specs will have spaces with quotes around them. I already enforce the rule that no sample files can have spaces in their filenames since that causes trouble with remote testing. When I get to the part about testing metadata, said metadata will take the form of ‘-title “HeyIsThisThingOn?”‘ (which will then fail to catch myriad bugs related to FFmpeg’s incorrect handling of whitespace in metadata arguments, but this is all about trade-offs).
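For comparison, and not as the approach adopted here, Python's standard shlex module splits a command line the way a POSIX shell would, keeping quoted arguments intact. The command line below is only an illustration.

```python
import shlex

# Illustrative command line containing a quoted argument with spaces.
command = 'ffmpeg -i input.avi -title "Hey, is this thing on?" output.avi'

# Splitting on whitespace breaks the quoted title into several pieces.
naive = command.split()

# shlex.split honors the quotes and yields the title as one argument.
parsed = shlex.split(command)
```

Here parsed holds six arguments with the title as a single string, while naive scatters the title across five separate entries.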

So the revised Python process runner seems to work correctly on Linux. The hangaround.c program simulates a badly misbehaving program by eating the TERM signal and must be dealt with using the KILL signal. The last line in these examples is a tuple containing return code, stdout, stderr, and CPU time. For Linux:

$ ./upr.py 
['./hangaround', '40']
process ID = 2645
timeout, sending TERM
timeout, really killing
[-9, '', '', 0]

The unmodified code works the same on Mac OS X:

$ ./upr.py
['./hangaround', '40']
process ID = 94866
timeout, sending TERM
timeout, really killing
[-9, '', '', 0]
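A runner with this TERM-then-KILL behavior can be sketched as follows. This is not the upr.py script itself. It assumes Python 3.3 or later for the timeout argument, and the name run_with_timeout and the grace parameter are hypothetical. CPU time measurement is left out.

```python
import subprocess


def run_with_timeout(args, timeout, grace=2.0):
    """Run args (a sequence, so no shell is involved) under a time limit.

    Returns (return code, stdout, stderr). On POSIX a negative return code
    means the process died from that signal number.
    """
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Timeout: ask politely first (SIGTERM on POSIX).
        proc.terminate()
        try:
            out, err = proc.communicate(timeout=grace)
        except subprocess.TimeoutExpired:
            # The process ignored TERM, so really kill it (SIGKILL on POSIX).
            proc.kill()
            out, err = proc.communicate()
    return proc.returncode, out, err
```

Something like run_with_timeout(['./hangaround', '40'], timeout=10) should come back with a return code of -9 on a POSIX system when the program ignores TERM, matching the transcripts above.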

Now a bigger test: Running the upr.py script on Linux in order to launch the hangaround process remotely on Mac OS X via SSH:

$ ./upr.py 
['/usr/bin/ssh', 'foster-home', './hangaround', '40']
process ID = 2673
timeout, sending TERM
[143, '', '', 50]

So that’s good… sort of. Monitoring the process on the other end reveals that hangaround is still doing just that, even after SSH goes away. This occurs whether or not hangaround is ignoring the TERM signal. This is still suboptimal.

It would be possible to open a separate SSH session to send a TERM or KILL signal to the original process… except that I wouldn’t know the PID of the remote process. Or could I? I’m open to Unix shell magic tricks on this problem since anything responding to SSH requests is probably going to be acceptably Unix-like. I would rather not go the ‘killall ffmpeg’ route because that could interfere with some multiprocessing ideas I’m working on.

Here’s a brute force brainstorm: When operating in remote-SSH mode, prefix the command with ‘ln -s ffmpeg ffmpeg-<unique-key>’ and then execute the symbolic link instead of the main binary. Then the script should be able to open a separate SSH session and execute ‘killall ffmpeg-<unique-key>’ without interfering with other processes. Outlandish but possibly workable.
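Another possible trick for learning the remote PID, offered as an untested assumption rather than anything adopted here: have the remote shell print its own PID with ‘echo $$’ and then ‘exec’ the command, so that the command takes over the shell's PID. The sketch below assumes the remote login shell is POSIX-like. The helper names are hypothetical, and start_local only demonstrates the idea through a local /bin/sh rather than over SSH.

```python
import shlex
import subprocess


def pid_reporting_script(argv):
    """Shell snippet that prints the shell's PID, then replaces the shell with argv."""
    return "echo $$; exec " + " ".join(shlex.quote(arg) for arg in argv)


def ssh_command(host, argv):
    """Argument list for running argv on host; ssh hands the last string to the remote shell."""
    return ["/usr/bin/ssh", host, pid_reporting_script(argv)]


def start_local(argv):
    """Run the same snippet through a local /bin/sh and return (process, reported PID)."""
    proc = subprocess.Popen(["/bin/sh", "-c", pid_reporting_script(argv)],
                            stdout=subprocess.PIPE)
    reported = int(proc.stdout.readline().decode().strip())
    return proc, reported
```

With the first line of output captured, a second SSH session could in principle run ‘kill <pid>’ against that exact process instead of resorting to killall.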

Process of Confusion

I am working hard at designing a better FATE right now. But first things first: I’m revisiting an old problem and hoping to conclusively determine certain process-related behavior.

I first described the problem in this post and claimed in this post that I had hacked around the problem. Here’s the thing: When I spin off a new process to run an FFmpeg command line, Python’s process object specifies a PID. Who does this PID belong to? The natural assumption would be that it belongs to FFmpeg. However, I learned empirically that it actually belongs to a shell interpreter that is launching the FFmpeg command line, and the FFmpeg process has a PID 1 greater than the shell interpreter’s. So my quick and dirty solution was to assume that the actual FFmpeg PID was 1 greater than the PID returned from Python’s subprocess.Popen() call.

Bad assumption. The above holds true for Linux but not for Mac OS X, where the FFmpeg command line has the returned PID. I’m not sure what Windows does.

This all matters for the timeout killer. FATE guards against the possibility of infinite loops by specifying a timeout for each test. Timeouts don’t do much good when they trigger TERM and KILL signals to the wrong PID. I tested my process runner carefully when first writing FATE (on Linux) and everything worked okay with using the same PID returned by the API. I think that was because I was testing the process runner using the built-in ‘sleep’ shell command. This time, I wrote a separate program called ‘hangaround’ that takes a number of seconds to hang around before exiting. This is my testing methodology:

From another command line:

$ ps ax|grep hangaround
21433 pts/2    S+     0:00 /bin/sh -c ./hangaround 30
21434 pts/2    S+     0:00 ./hangaround 30
21436 pts/0    R+     0:00 grep hangaround

That’s Linux; for Mac OS X:

>>> process.pid
82079

$ ps ax|grep hangaround
82079 s005  S+     0:00.01 ./hangaround 30
82084 s006  R+     0:00.00 grep hangaround

So, the upshot is that I’m a little confused about how I’m going to create a general solution to work around this problem, one that doesn’t occur very often but makes FATE fail hard when it does show up.
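One way to sidestep the ambiguity, assuming the command can be supplied as an argument list rather than a single string: when subprocess.Popen() gets a list and no shell is requested, no /bin/sh wrapper is started, so the returned PID should belong to the program itself. The sketch below checks this by having a child Python process report its own PID. The launch helper is a hypothetical name.

```python
import subprocess
import sys


def launch(args):
    """Start args (a list, so no shell wrapper) and return the Popen object."""
    return subprocess.Popen(args, stdout=subprocess.PIPE)


# The child simply prints its own process ID.
child_code = "import os; print(os.getpid())"
proc = launch([sys.executable, "-c", child_code])
reported_pid = int(proc.communicate()[0].decode().strip())

# With no shell in between, the two PIDs are expected to match.
pids_match = (reported_pid == proc.pid)
```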

Followup:

Python Bit Classes

Here’s a little project of absolutely no use to anyone (a specialty of mine, as if you didn’t know): pure Python classes for writing and reading bitstreams. This was just one of those things where I was sitting around wondering what it would take to accomplish, and a cursory Google search didn’t reveal anything useful (though it’s probably out there somewhere), so I sat down and pounded out the code.

To what end? Oh, I don’t know. Reimplement FFmpeg in Python; go crazy. Behold brute force bit banging in Python:
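The original listing is not reproduced on this page. As a stand-in, here is a minimal sketch of what such classes could look like. The names BitWriter and BitReader, the method names, and the most-significant-bit-first ordering are all assumptions rather than the original design.

```python
class BitWriter:
    """Accumulate bits, most significant bit first, into a byte string."""

    def __init__(self):
        self._bytes = bytearray()
        self._acc = 0      # bits collected for the byte in progress
        self._nbits = 0    # how many bits of _acc are valid

    def write(self, value, count):
        """Append the low `count` bits of value, highest bit first."""
        for shift in range(count - 1, -1, -1):
            self._acc = (self._acc << 1) | ((value >> shift) & 1)
            self._nbits += 1
            if self._nbits == 8:
                self._bytes.append(self._acc)
                self._acc = 0
                self._nbits = 0

    def getvalue(self):
        """Return the bytes written so far, zero-padding a trailing partial byte."""
        out = bytearray(self._bytes)
        if self._nbits:
            out.append(self._acc << (8 - self._nbits))
        return bytes(out)


class BitReader:
    """Read bits, most significant bit first, from a bytes object."""

    def __init__(self, data):
        self._data = data
        self._pos = 0  # absolute bit position in the stream

    def read(self, count):
        """Return the next `count` bits as an unsigned integer."""
        value = 0
        for _ in range(count):
            byte = self._data[self._pos >> 3]
            bit = (byte >> (7 - (self._pos & 7))) & 1
            value = (value << 1) | bit
            self._pos += 1
        return value
```

Writing one bit at a time is about as brute force as it gets, but it keeps the logic easy to check: whatever sequence of widths goes into BitWriter.write() can be read back with the same widths through BitReader.read().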
