For my latest silly little project, I created an offline database containing information about Apple’s iPhone App Store, made it browseable offline with full text search, and even generated some nifty charts and tables about the apps. Wanna see?
Go here for all of the juicy data.
Introduction
Have you ever gotten one little spark of an idea and started to research and prototype it, only to have it snowball into something absolutely unrecognizable in a short period of time? As I write up this idea, I can scarcely remember how I started down the path of creating an offline, web browser-accessible version of Apple’s iPhone App Store. On the way, I learned a bunch about modern web programming and even full text search.
What’s wrong with me anyway? Why can’t I do simple exercises to come up to speed on certain well-established concepts? I think normal people would start out by developing trivial websites showcasing pictures of their cats when trying to get up to speed on modern web development. But no, not me. No, I just have to punish myself by dreaming up the most outlandish scenario in which to learn some technology, purely as an ancillary goal to a bizarre primary focus.
The Pitch
Look, here’s how it happened: Remember, I contribute heavily to this video game database called MobyGames. The database excels in archiving data about arcane systems and obscure, archaic titles. That, and Barbie games, thanks to my tireless efforts. The database has challenges keeping up with all the latest and greatest releases for all kinds of systems. However, this whole iPhone App Store thing is really throwing us a curveball. In just a little over a year, over 13,000 unique titles categorized as “games” have accumulated for the iPhone/iPod Touch system. At the time of this writing, MobyGames has cataloged a meager 111 of these titles.
Pundits and critics have been observing during the first year of the App Store’s life that there are lots of rather simple casual games populating the store. E.g., there are lots of Tetris clones. Thus, I got an idea of attempting to catalog all of the Tetris clones efficiently. How to do this? Well, it’s fruitless to try to search for “tetris” or “tetris clone” in Apple’s App Store. All you find are the officially licensed Tetris titles. A search for “falling blocks” is not much more productive.
Further, I want to be able to track on an ongoing basis just how many games are in the App Store that are not in MobyGames. This is not especially straightforward to learn through the App Store interface. However, sites like Apptism and AppBeacon seem to have cataloged all of the store’s apps just fine.
How do they do it? Does the App Store provide an API? Not likely. Would it be possible to crawl and screen-scrape the Apptism site and build a database of all the apps? Not too far-fetched. But I figure they probably have a method to automatically collect new App Store information. So I got to thinking.
The Process
Long story short, I went to work with a protocol analyzer (ngrep) and a high level language (Python) and in very short order, I was crawling the official Apple App Store and building an offline database of information.
Next Steps
Great, so now I have an offline database of iPhone app metadata including their release dates, sellers, versions, App Store pages represented in XML, 100×100 JPEG icons, and prices. Now what? The first thing I thought to do was to aggregate statistics about apps. And when deciding on how best to present the information, that’s when I learned about Google’s Visualization API which finally movitated me to learn basic JavaScript. That’s also the point that I realized Google Viz would be a perfect fit for FATE. Which is how the latest revision came to pass.
So, I present — with the help of Google’s Visualization API — some pretty graphs and tables regarding Apple’s iPhone App Store. There is a page for each category and one page for the overall store (many apps are classified in multiple categories). Each page shows app count vs. price, 10 most expensive apps, 10 largest apps, and other items. There are also some stats regarding sellers, e.g., which sellers are selling the most apps.
Some general observations about the data: There is no shortage of stories in the tech press regarding the trouble people experience trying to get their single shiny iPhone app approved. How must those people feel when they see the sellers who have literally hundreds of apps for sale in the store right now? Alternatively, do the people behind Arctic Gerbil — an entity whose App Store titles number in excess of 1500 — read such articles and wonder how anyone could possibly have so much trouble getting an app through the process?
Further, it’s quite evident that low-priced apps reign supreme unless you are offering something very unique. And I’m also not too sure where numbers like “65,000 apps” come from. My database only lists around 51,000, but there might be items that I’m not allowed to access from the region based on IP address (same thing goes for music, I’m pretty sure).
The most striking thing to learn by skimming a list of all titles in the App Store is that, whatever number of total apps that Apple might toss about in their press releases, it’s probably fair to cut it down by a certain percentage since so many apps have lite/free/demo counterparts. Each of these is entitled to its own unique App Store record. And while there are a lot of applications that do internationalization just fine in 1 package, the people behind the iPhone port of the old adventure game Flight of the Amazon Queen felt it necessary to release 6 separate versions, one for each language (all editions advertise English speech and native language subtitles).
It would be nice to pictorially represent data pertaining to release dates. E.g., graph the growth of titles in the App Store over the past year. I can’t do that, though. Apps are frequently updated and the release date I have for an app only reflects the release date of the most recent app update. I don’t know an automatic method for collecting the original release date.
Pruning The Tree
So I also need to create a list of all games that can be imported into the MobyGames missing games tracking spreadsheet. I also need to devise a system that will allow me to keep the list updated on an ongoing basis.
So I created a tool that can export the titles and some other relevant data into a CSV file to be imported into a spreadsheet and keep the same CSV data regularly updated. Unfortunately, this is when I discovered that for all the bluster about how web-based apps are going to conquer the world, there are still some limitations. It turns out that Google’s Spreadsheet doesn’t really shine with 13,000 rows of data (forget about sorting). So scratch that idea; I’ll need another way to track this data, probably something that leverages the offline database I already have.
Browsing
So where did this offline browsing idea come from? I remembered that Python’s standard library includes facilities for easily establishing a simple web server. Since I had all the metadata in a neat little database, along with all the XML and JPEG data, I figured that a few strategic regular expressions would allow me to transform the XML data into a web page that a web browser can parse. Fortunately, before I had an opportunity to mess with the RegExps, I had a small adventure involving XML parsing with Python which simplified this endeavor.
That’s a Linux machine running Google Chrome browsing the App Store — well, sort of. So, yay, I managed to make an even uglier version of the App Store, which you probably didn’t think was possible.
One more thing: How to search this text effectively in order to find all descriptions that contain, e.g., “tetris” or “falling blocks”? Even though I once wrote a naive search engine for a web site of mine, I still feel woefully unqualified in this task. Fortunately, the older and grumpier I get, the more I’m willing to seek out prepackaged solutions that can already do exactly what I want. Remember– it’s invaluable to know the right question before seeking an answer. Luckily, I learned the right question early this time. It seems that what I want in this case is something called “full text search” and there are already many existing open source packages which attempt to solve this problem.
The kind of thing I have in mind for this task is to ask a search package to initialize some kind of search database (intelligently handling the XML markup) which would then allow me to make queries later. This contrasts with the goals of some packages which seem to be standalone servers. I auditioned the following open source packages: Lucene, mNoGoSearch, Sphinx, SWISH-E, and Xapian.
SWISH-E turned out to be just what I needed. One command line tool indexed the entire collection of XML files. I hooked up a querying command line tool to the little Python web server for a search engine.
Conclusions
What an interesting journey. I finally moved beyond 1997 in terms of web programming (though the iOpenAppStoreMetaData pages don’t fully reflect that just yet), conquered my fear of XML (if not my severe loathing for the same), and became more adept at searching out pre-packaged solutions that can serve as foundations to my grander schemes. Believe me, this isn’t the end; this has only helped to generate more ideas.
The Leftovers
So I learned a lot from this little exercise. As usual, old Mikey has to learn at least one lesson the hard way that most people probably understand intuitively. This time, the lesson was: Don’t store enormous amounts of data (like JPEGs and large XML files) directly in your SQL table. I don’t care how well the non-blob columns are indexed– it’s going to take an absurdly long time to run any queries on this data. After I reprocessed the collection of data such that the XML and JPEG data no longer lived in the SQLite database, the system was blindingly fast.
Your median price calculation is wrong. On http://multimedia.cx/iOpenAppStoreMetaData/7001.html well over 400 of the 801 apps are priced at $0.99, but the median is listed as $4.99.
Also, showing the standard deviation may be a nice addition.
My personal experience on the development side of databases is small, but I suspect that your problem with large data in a database has more to do with SQLite than relational databases in general. Though I’d probably prefer keep my large data externally anyhow, if just for easy access outside of the application.
As for transforming XML documents to HTML with RegExps: “yuck”. Nuff said.
[quote cite=”Mike]Have you ever gotten one little spark of an idea and started to research and prototype it, only to have it snowball into something absolutely unrecognizable in a short period of time[/quote]
Yes. In fact, I think my projects progress fastest when they “evolve” rather than when I design them up front.
Ceterum censeo you need a preview button with the comment form.
I’ll look into the preview button, especially at the request of one of my most frequent commenters. :-) There are probably dozens of plugins and extensions to do this in WordPress.
My understanding of “median” is a little bit off. I’ll get that corrected soon.
I couldn’t think of an easy way to calculate the correct median, so I took the easier way out and removed it from the pages. I also got rid of the mode since it’s pretty obvious from looking at the graph.
Where is the difficulty in calculating the median? It’s nothing more than sorting all the elements and picking the middle one (or average of the middle two if there is an even number).
Is it that you need to be able to do it from SQL? A quick search with Google for “sql median” gave me this page, which has an almost ready to use query: http://www.1keydata.com/sql/sql-median.html
And thanks for adding a preview button. I’ve seen several WordPress blogs with one (or something similar, like a preview which is updated while you type), so I know that the plugins exist.
The problem is that this is a sophisticated SQL database with 2 tables instead of the pedagogical 1. I have to JOIN the appstore table with the genre table when breaking things down by genre and I don’t know how to eliminate duplicate rows (since apps can be placed in many genres).
I should probably just post the entire SQLite database in order to live up to the “Open” part of iOpenAppStoreMetaData.
Hey, neat stuff, I’m amazed by the quick-nature of this effort, and the amount of work you did :D You should have mentioned this on the list, I’m going to post this as some nice news :)
It actually wasn’t that quick — I started early in July. :-) It was an on and off project, like most of the things I do.
Hi, I would love to get the code for downloading all the app details from the appstore, or the protocol you discovered that is used. I would like to create a database myself, and run some queries on it. I am particularly interested in the biggest publishers, and your list only goes up to 30.
thanks, John
@John: I hope to make a few revisions to the system in the near future and then publish the SQLite database file on an ongoing basis. That should be more useful to people than publishing the harvesting utilities, which can take a long time to do their job and would only stress Apple’s servers (and cause them to implement roadblocks to doing this automatically).