My Own Offline RSS Reader (Part 2) | Breaking Eggs And Making Omelettes

About that “true” offline RSS reader that I pitched in my last post, I’ll have you know that I made a minimally functioning system based on that outline.

These are the primary challenges/unknowns that I assessed from the outset:

1. Manipulating relative URLs of supporting files
2. Parsing HTML in Python
3. Searching and replacing within the HTML file
4. Downloaded .js files that include other .js files

For #1, Python’s urlparse library works wonders. For #2 and #3, look no further than Python’s HTMLParser module. This blog post helped me greatly. I have chosen not to address #4 at this time: I’m not downloading any JavaScript files right now, and the CSS and supporting images are mostly adequate.
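To make #1 and #2 a little more concrete, here is a minimal sketch of the idea. It is not the code from my reader. It uses the Python 3 descendants of those modules (urllib.parse and html.parser), and the names ResourceCollector and collect_resources, as well as the example.com URLs in the usage note, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect absolute URLs of the images and stylesheets a page refers to.

    Hypothetical helper for illustration only.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; names are lowercased.
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            # urljoin resolves a relative reference against the page URL.
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        elif (tag == "link" and attrs.get("rel") == "stylesheet"
              and attrs.get("href")):
            self.resources.append(urljoin(self.base_url, attrs["href"]))


def collect_resources(html, base_url):
    """Parse html and return the supporting files it needs, as absolute URLs."""
    parser = ResourceCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.resources
```

Calling collect_resources(page_text, "http://example.com/blog/2010/post.html") on a page that references "../css/main.css" would yield "http://example.com/blog/css/main.css", which is the URL to fetch and store alongside the post.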

Further, it turned out not to be necessary to manually build an XML parser. Whenever I encountered a task that felt like it was going to be too much work — like manually parsing the XML feeds using Python’s low-level XML systems — a little searching revealed that all the hard work was already done. In the case of parsing the RSS files, the task was rendered trivial thanks to FeedParser.

Brief TODO list, for my own reference:

Index the database tables in a sane manner
Deal with exceptions thrown by malformed HTML
Update the post table to indicate that a post has been “read” when it is accessed
Implement HTTP redirection (since some RSS feeds apparently do that)
Implement cache control so that the browser will properly refresh feed lists
Add a stylesheet that will allow the server to control the appearance of links depending on whether or not the posts have been read
Take into account non-ASCII encoding (really need to train myself to do this from the get-go)
Forge user agent and referrer strings in HTTP requests, for good measure
Slap some kind of UI prettiness on top of the whole affair. I’m thinking an accordion widget containing tables might work well, and I think there are a number of JavaScript libraries that could make that happen.
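As a rough illustration of the indexing and “read” items on that list, here is a minimal sketch. It assumes SQLite via Python’s sqlite3 module and a hypothetical posts table; the table layout, the column names, and the helpers open_db, mark_read, and unread_urls are all invented for this example and say nothing about the real schema.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the hypothetical posts table plus an index for unread lookups."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        " id INTEGER PRIMARY KEY,"
        " feed_id INTEGER NOT NULL,"
        " url TEXT NOT NULL UNIQUE,"
        " is_read INTEGER NOT NULL DEFAULT 0)"
    )
    # Index the columns that the "unread posts in this feed" query filters on.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_posts_feed_read"
        " ON posts (feed_id, is_read)"
    )
    conn.commit()
    return conn


def mark_read(conn, post_id):
    """Flag a post as read; meant to be called when the post is served."""
    with conn:  # commits on success, rolls back on error
        conn.execute("UPDATE posts SET is_read = 1 WHERE id = ?", (post_id,))


def unread_urls(conn, feed_id):
    """Return the URLs of unread posts for one feed, in insertion order."""
    rows = conn.execute(
        "SELECT url FROM posts WHERE feed_id = ? AND is_read = 0 ORDER BY id",
        (feed_id,),
    )
    return [url for (url,) in rows]
```

The same is_read flag could later drive the stylesheet item: the server would emit a different class on links to posts that have already been read.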

Once I get that far, I’ll probably put some code out there. Based on what I have read, I’m not the only person who is looking for a solution like this.

I eventually released this software. Find it on Github.

7 thoughts on “My Own Offline RSS Reader (Part 2)”

Steven Robertson March 29, 2010 at 8:55 am

BeautifulSoup is often a better choice than HTMLParser; it does a spectacular job of handling malformed HTML and Unicode properly.

http://www.crummy.com/software/BeautifulSoup/

Multimedia Mike Post author March 29, 2010 at 9:24 am

I saw Beautiful Soup when I was doing research for this little project. I’m not sure if it operates as simply as HTMLParser. The latter required a rather minimal amount of code (malformed HTML notwithstanding, but I might be able to hack around that).

Adam Ehlers Nyholm Thomsen March 29, 2010 at 1:04 pm

Beautiful Soup often requires very little code to get this sort of quick and dirty stuff done. However, on a side note, from this: http://www.crummy.com/software/BeautifulSoup/3.1-problems.html I would guess that Beautiful Soup isn’t really a building block you want to build a new piece of software on top of.

Z.T. March 29, 2010 at 2:01 pm

httplib2 handles HTTP redirects, caching, compression, etc. transparently.

Aurélien Bompard April 5, 2010 at 9:37 am

A few months ago I wrote something that may interest you: it’s a Python script that takes an RSS feed as an argument, downloads each entry with wget into a separate folder, and creates an HTML index for those folders.

I’ve been using it for a few months, and it works extremely well. I rsync the folder to my portable device and bookmark the index page, and that’s all there is to it. It’s not in perfect shape code-wise, so I have not put it on the web until now, but I can do it if you’re interested.

OK since someone seemed interested, I’ve put the script on the web: http://gitorious.org/abompard-scripts/abompard-scripts/blobs/master/rss-mirror.py

Feel free to contact me by email if you need it, or using my website’s contact form: http://aurelien.bompard.org/contact

Erik May 25, 2010 at 10:01 am

Hi –

A better solution, in my opinion:

1) From list of feeds, extract the entry URLs;
2) Open each entry in Firefox to seed the cache;
3) Use offline feed reader. Use offline mode in Firefox.

A few snags: you probably don’t want to see Firefox open all those windows. You can use xvfb to hide it. You might try using something like POW (a personal web server for Firefox) to script Firefox so that it opens and then closes each window.
