About that “true” offline RSS reader that I pitched in my last post, I’ll have you know that I made a minimally functioning system based on that outline.
These are the primary challenges/unknowns that I assessed from the outset:
1. Manipulating relative URLs of supporting files
2. Parsing HTML in Python
3. Searching and replacing within the HTML file
4. Downloaded .js files that include other .js files
For #1, Python’s urlparse module works wonders. For #2 and #3, look no further than Python’s HTMLParser module. This blog post helped me greatly. I have chosen not to address #4 at this time. I’m not downloading any JavaScript files right now; the CSS and supporting images are mostly adequate.
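To make #1 through #3 a bit more concrete, here is a minimal sketch of the idea rather than the reader's actual code. It assumes Python 3, where urlparse lives in urllib.parse and HTMLParser lives in html.parser. The class name LinkRewriter and the helper rewrite_links are made up for illustration, and the sketch only makes src/href values absolute; mapping them to locally saved files would be the next step.

```python
import html
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkRewriter(HTMLParser):
    """Re-emit an HTML document with src/href attributes made absolute."""

    URL_ATTRS = {"src", "href"}

    def __init__(self, base_url):
        # convert_charrefs=False keeps entities such as &amp; intact in text.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def _attrs(self, attrs):
        parts = []
        for name, value in attrs:
            if value is None:  # bare attribute, e.g. <input disabled>
                parts.append(f" {name}")
                continue
            if name in self.URL_ATTRS:
                value = urljoin(self.base_url, value)
            # The parser unescapes attribute values, so escape them again.
            parts.append(f' {name}="{html.escape(value, quote=True)}"')
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)} />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rewrite_links(page_html, base_url):
    """Hypothetical helper: return page_html with absolute src/href URLs."""
    parser = LinkRewriter(base_url)
    parser.feed(page_html)
    parser.close()
    return "".join(parser.out)
```

Calling rewrite_links(page, "http://example.com/blog/post.html") would turn a relative src such as ../img/a.png into http://example.com/img/a.png.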
It also turned out that I didn’t need to build an XML parser by hand. Whenever I encountered a task that felt like it was going to be too much work (like manually parsing the XML feeds using Python’s low-level XML systems), a little searching revealed that all the hard work was already done. In the case of parsing the RSS files, FeedParser rendered the task trivial.
Brief TODO list, for my own reference:
- Index the database tables in a sane manner
- Deal with exceptions thrown by malformed HTML
- Update the post table to indicate that a post has been “read” when it is accessed
- Implement HTTP redirection (since some RSS feeds apparently do that)
- Implement cache control so that the browser will properly refresh feed lists
- Add a stylesheet that will allow the server to control the appearance of links depending on whether or not the posts have been read
- Take into account non-ASCII encoding (really need to train myself to do this from the get-go)
- Forge user agent and referrer strings in HTTP requests, for good measure
- Slap some kind of UI prettiness on top of the whole affair; I’m thinking an accordion widget containing tables might work well, and I think there are a number of JavaScript libraries that could make that happen
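A couple of these TODO items (user agent and referrer strings, non-ASCII encodings) can be sketched with nothing but the standard library. This is a rough illustration assuming Python 3 and urllib.request, not code from the reader itself; build_request and decode_body are hypothetical names. Note that the HTTP header really is spelled "Referer", and that urllib.request.urlopen follows HTTP redirects by default.

```python
from urllib.request import Request


def build_request(url, user_agent, referrer=None):
    """Build (but do not send) a request with custom identification headers."""
    headers = {"User-Agent": user_agent}
    if referrer:
        headers["Referer"] = referrer  # the header name's historical spelling
    return Request(url, headers=headers)


def decode_body(raw, charset=None):
    """Decode response bytes, falling back to UTF-8 when the charset is
    missing or unknown. Undecodable bytes are replaced, not raised on."""
    try:
        return raw.decode(charset or "utf-8", errors="replace")
    except LookupError:  # the server named a codec Python does not know
        return raw.decode("utf-8", errors="replace")
```

The request would then be passed to urllib.request.urlopen, and the charset for decode_body could come from response.headers.get_content_charset().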
Once I get that far, I’ll probably put some code out there. Based on what I have read, I’m not the only person who is looking for a solution like this.
I eventually released this software. Find it on GitHub.
Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) is often a better choice than HTMLParser; it does a spectacular job of handling malformed HTML and Unicode properly.
I saw Beautiful Soup when I was doing research for this little project. I’m not sure if it operates as simply as HTMLParser. The latter required a rather minimal amount of code (malformed HTML notwithstanding, but I might be able to hack around that).
Beautiful Soup often requires very little code to get this sort of quick and dirty stuff done. On a side note, though, judging from http://www.crummy.com/software/BeautifulSoup/3.1-problems.html I would guess that Beautiful Soup isn’t really a building block you want to build a new piece of software on top of.
httplib2 handles HTTP redirects, caching, compression, etc. transparently.
A few months ago I wrote something which may interest you: it’s a Python script which takes an RSS feed as an argument, downloads each entry with wget into a separate folder, and creates an HTML index for those folders.
I’ve been using it for a few months, and it works extremely well. I rsync the folder to my portable device and bookmark the index page, and that’s all there is to it. It’s not in perfect shape code-wise, so I have not put it on the web so far, but I can do so if you’re interested.
OK, since someone seemed interested, I’ve put the script on the web: http://gitorious.org/abompard-scripts/abompard-scripts/blobs/master/rss-mirror.py
Feel free to contact me by email if you need it, or use my website’s contact form: http://aurelien.bompard.org/contact
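For readers who just want the shape of that mirror-and-index workflow, here is a hedged sketch. It is not the linked rss-mirror.py; the function names and folder layout are invented for illustration. It only builds the wget command line and the index page, leaving the actual download (for example via subprocess.run) to the caller.

```python
import html


def wget_command(entry_url, dest_dir):
    """Command line that mirrors one entry, with its images and CSS,
    into dest_dir and rewrites links for offline viewing."""
    return [
        "wget",
        "--page-requisites",   # also fetch images, stylesheets, etc.
        "--convert-links",     # make links work from the local copy
        "--directory-prefix", str(dest_dir),
        entry_url,
    ]


def build_index(entries):
    """Build an HTML index from (title, folder) pairs, one link per entry."""
    items = "\n".join(
        f'<li><a href="{html.escape(folder, quote=True)}/">'
        f"{html.escape(title)}</a></li>"
        for title, folder in entries
    )
    return f"<html><body><ul>\n{items}\n</ul></body></html>"
```

The index is what would get bookmarked on the portable device after the rsync step.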
Hi –
A better solution, in my opinion:
1) From list of feeds, extract the entry URLs;
2) Open each entry in Firefox to seed the cache;
3) Use offline feed reader. Use offline mode in Firefox.
A few snags: you probably don’t want to see Firefox open all those windows. You can use Xvfb to hide it. You might try using something like POW (a personal web server for Firefox) to script Firefox so that it opens and then closes each window.
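Step 1 of that recipe is easy to sketch. The version below is a standard-library stand-in that assumes a well-formed RSS 2.0 feed with item/link elements; a library such as FeedParser, mentioned in the post, would be the sturdier choice for Atom or malformed feeds. The function name entry_urls is hypothetical.

```python
import xml.etree.ElementTree as ET


def entry_urls(rss_xml):
    """Return the <link> URL of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    urls = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            urls.append(link.strip())
    return urls
```

Each returned URL could then be handed to the browser (step 2) to seed its cache.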