Breaking Eggs And Making Omelettes

Topics On Multimedia Technology and Reverse Engineering



My Own Offline RSS Reader

March 28th, 2010 by Multimedia Mike

My current living situation saddles me with a rather lengthy commute. On the bright side, that means more time to work on my old Asus Eee PC 701 (still can’t think of a reason to get a better netbook). It would be neat if I could read RSS feeds offline using Ubuntu-based Linux on this thing. But with all the Linux software I can find, that’s just not to be. I think the best hope I had was Google Reader in offline mode using Google Gears. But I couldn’t get it installed in Firefox and the Linux version of Gears doesn’t support the Linux version of Chrome.

I did a bunch of searching besides, and all I could find were forum posts with similar laments: Offline RSS readers don’t allow you to read things offline. Actually, to be fair, I think these offline RSS readers operate exactly as advertised: They allow you to read the RSS feeds offline. The problem is that an RSS feed doesn’t usually contain much meat, just a title, a synopsis, and a link to the main content. What I (and, I suspect, most people) want in an “offline reader” is a program that follows those links, downloads the HTML pages, and downloads any supporting images and stylesheets, all for later browsing.

I didn’t want to have to reinvent this particular wheel, but here goes.

Here’s the pitch: Create a text file with a list of RSS feeds. Create a Python script that retrieves each feed. Use Python’s XML capabilities (which I have already had success with) to iterate through each item in an RSS feed. For each item, extract the corresponding link. Fetch the linked page and parse through the HTML. For each CSS, JS, or IMG reference, download that data as well. Compute a hash of that supporting data and replace the reference in the HTML with that hash. Dump the supporting data in a local SQLite database, keyed by its hash (you knew that was coming). Dump the modified HTML page into that database as well.
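
To make the pitch concrete, here is a minimal Python sketch of the core steps, using only the standard library. Everything named here is an assumption for illustration rather than the eventual implementation: the two-table schema (`assets`, `pages`), the `/asset/` prefix placed in front of each hash, the choice of SHA-1, and the `fetch` callable, which stands in for whatever actually downloads a URL and returns bytes.

```python
import hashlib
import sqlite3
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin


def open_db(path):
    """Open the database and create the (hypothetical) tables if needed."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS assets (hash TEXT PRIMARY KEY, data BLOB)")
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")
    return db


def item_links(rss_xml):
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = (item.findtext("link") for item in root.iter("item"))
    return [link.strip() for link in links if link]


class AssetFinder(HTMLParser):
    """Collect the URLs of images, scripts, and stylesheets in a page."""

    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.assets.append(attrs["src"])
        elif (tag == "link" and attrs.get("href")
              and "stylesheet" in (attrs.get("rel") or "").lower()):
            self.assets.append(attrs["href"])


def store_page(db, page_url, page_html, fetch):
    """Store a page and its assets; asset references become hashes."""
    finder = AssetFinder()
    finder.feed(page_html)
    for ref in finder.assets:
        data = fetch(urljoin(page_url, ref))  # resolve relative references
        digest = hashlib.sha1(data).hexdigest()
        db.execute("INSERT OR IGNORE INTO assets VALUES (?, ?)", (digest, data))
        # Naive textual substitution: good enough for a sketch, but it will
        # miss references that are entity-escaped in the raw HTML.
        page_html = page_html.replace(ref, "/asset/" + digest)
    db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (page_url, page_html))
    db.commit()
```

In real use, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read()`, called once per feed, once per linked page, and once per asset. Keying assets by hash means an image or stylesheet shared by many pages is only stored once.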

Part 2 is to create a Python-based webserver that serves up this data from a localhost address.
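
A rough idea of what that webserver could look like, again with only the standard library. This is a sketch under assumptions, not a finished design: it presumes a SQLite database with hypothetical tables `assets(hash, data)` and `pages(url, html)`, and it invents two URL shapes, `/asset/<hash>` for supporting data and `/page/<percent-encoded original URL>` for stored pages.

```python
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import unquote


def lookup(db, path):
    """Map a request path to (content_type, body), or None if not stored."""
    if path.startswith("/asset/"):
        key = path[len("/asset/"):]
        row = db.execute("SELECT data FROM assets WHERE hash = ?", (key,)).fetchone()
        return ("application/octet-stream", row[0]) if row else None
    if path.startswith("/page/"):
        url = unquote(path[len("/page/"):])
        row = db.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
        return ("text/html; charset=utf-8", row[0].encode("utf-8")) if row else None
    return None


def make_handler(db):
    """Build a request handler class bound to one database connection."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            found = lookup(db, self.path)
            if found is None:
                self.send_error(404)
                return
            content_type, body = found
            self.send_response(200)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


def serve(db_path, port=8000):
    """Serve the stored pages on localhost only; blocks until interrupted."""
    db = sqlite3.connect(db_path)
    HTTPServer(("127.0.0.1", port), make_handler(db)).serve_forever()
```

One known gap: every asset goes out as `application/octet-stream`, and browsers may refuse to apply a stylesheet served that way. A fuller version would probably record each asset’s original Content-Type alongside its data and send that back instead.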

One nifty aspect of this idea is that my Eee PC does not have to do the actual RSS updating. If the relevant scripts and the SQLite database are stored on a Flash drive, the updating process can be run on any system with standard Python.

See Also:

  • Part 2, where I get this idea to a minimally functioning state
  • GhettoRSS, what I eventually called the software when I released it

Posted in Outlandish Brainstorms | 7 Comments »

7 Responses

  1. nine Says:

    On this note, please amend your RSS feed to include the full text of your entries. It’s much easier to read the entirety of the post in Google Reader than having to open a new tab for each item I want to read!
    Much easier just to open tabs on the items I want to comment on.

  2. Multimedia Mike Says:

    Thanks for the feedback. I just tried to configure WordPress to do full text in feeds but it didn’t take (it’s still just showing the summary).

    Edit: It looks like the full text is showing now; might have just needed a cache refresh.

  3. Mans Says:

    You may find feeds like those from http://daringfireball.net/ tricky to parse for the right links, the title usually linking directly to some web page he is commenting on, and another link within the RSS entry body pointing at the DF post itself.

  4. Multimedia Mike Says:

    Thanks for the lead, Mans. I’m looking at all manner of strangeness in different RSS feeds right now. Let’s hear it for standards.

  5. Multimedia Mike Says:

    You’re right, Mans: I just tried DF’s RSS with my tool and it downloads the external articles. Each post has several links. The first one is marked ‘alternate’ (leads off-site) while the other refers to the site’s custom URL-shortener.

  6. Bobby Says:

    If you insist on throwing things into SQLite, I doubt this helps you, but doesn’t wget already handle downloading a page and all the various files it references while updating the links in the page for you?

  7. Multimedia Mike Says:

    @Bobby: It’s supposed to. In fact, that’s the first thing I tried. But I couldn’t make it work correctly.