Breaking Eggs And Making Omelettes

Topics On Multimedia Technology and Reverse Engineering



Method For Crawling Google

May 27th, 2011 by Multimedia Mike

I wanted to crawl Google in order to harvest a large corpus of certain types of data as yielded by a certain search term (we’ll call it “term” for this exercise). Google doesn’t appear to offer any API to automatically harvest their search results (why would they?). So I sat down and thought about how to do it. This is the solution I came up with.



FAQ
Q: Is this legal / ethical / compliant with Google’s terms of service?
A: Does it look like I care? Moving right along…

Manual Crawling Process
For this exercise, I essentially automated the task that would be performed by a human. It goes something like this:

  1. Search for “term”
  2. On the first page of results, download each of the 10 results returned
  3. Click on the next page of results
  4. Go to step 2, until Google doesn’t return any more pages of search results

Google returns up to 1000 results for a given search term. Fetching them 10 at a time is less than efficient. Fortunately, the search URL can easily be tweaked to return up to 100 results per page.
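As a rough sketch of that URL tweak: the endpoint and the `q`, `num`, and `start` parameter names below are assumptions (the `num=100` form is the one a commenter mentions further down), and the helper names are made up for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names; the post does not spell them out.
SEARCH_ENDPOINT = "https://www.google.com/search"


def build_search_url(query, start=0, per_page=100):
    """Return a search URL asking for `per_page` results beginning at `start`."""
    params = {"q": query, "num": per_page, "start": start}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def page_urls(query, max_results=1000, per_page=100):
    """Return the URLs needed to walk every page of one search."""
    return [build_search_url(query, start, per_page)
            for start in range(0, max_results, per_page)]
```

At 100 results per page, the 1000-result ceiling is covered in 10 requests instead of 100.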

Expanding Reach
Problem: 1000 results for the “term” search isn’t that many. I need a way to expand the search. I’m not aiming for relevancy; I’m just searching for random examples of some data that occurs around the internet.

My solution for this is to refine the search using the “site:” operator. For example, you can ask Google to search for “term” at all Canadian domains using “site:.ca”. So, the manual process now involves harvesting up to 1000 results for every single internet top level domain (TLD). But many TLDs can be more granular than that. For example, there are 50 sub-domains under .us, one for each state (e.g., .ca.us, .ny.us). Those all need to be searched independently. The same goes for the sub-domains under TLDs that don’t allow registrations directly under the main TLD, such as .uk (search under .co.uk, .ac.uk, etc.).

Another extension is to combine “term” searches with other terms that are likely to have a rich correlation with “term”. For example, if “term” is relevant to various scientific fields, search for “term” in conjunction with various scientific disciplines.
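Both expansions amount to generating more query strings from the same base term. A minimal sketch, where `expand_queries` and its arguments are hypothetical names and the suffix and companion lists are supplied by the caller:

```python
def expand_queries(term, site_suffixes, companion_terms=()):
    """Build one query per domain suffix, plus one per companion term."""
    # "term site:.ca", "term site:.co.uk", ...
    by_site = [f"{term} site:{suffix}" for suffix in site_suffixes]
    # "term physics", "term chemistry", ...
    by_companion = [f"{term} {extra}" for extra in companion_terms]
    return by_site + by_companion
```

For example, `expand_queries("term", [".ca", ".co.uk"], ["physics"])` yields three separate searches, each of which can return its own 1000 results.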

Algorithmically
My solution is to create an SQLite database that contains a table of search seeds. Each seed is essentially a “site:” string combined with a starting index.

Each TLD and sub-TLD is inserted as a searchseed record with a starting index of 0.

A script performs the following crawling algorithm:

  • Fetch the next record from the searchseed table which has not been crawled
  • Fetch search result page from Google
  • Scrape URLs from page and insert each into URL table
  • Mark the searchseed record as having been crawled
  • If the results page indicates there are more results for this search, insert a new searchseed for the same seed but with a starting index 100 higher
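The loop above might look something like this in Python with the standard `sqlite3` module. This is only a sketch: the column names are guesses, and `fetch_page` is a hypothetical callable standing in for the fetch-and-scrape steps, returning a list of URLs and a flag for whether more results exist.

```python
import sqlite3

PAGE_SIZE = 100  # results requested per page, as described in the post


def open_db(path=":memory:"):
    """Create the two tables; column names here are assumptions."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS searchseed (
            id INTEGER PRIMARY KEY,
            site TEXT NOT NULL,
            start INTEGER NOT NULL DEFAULT 0,
            crawled INTEGER NOT NULL DEFAULT 0,
            UNIQUE (site, start)
        );
        CREATE TABLE IF NOT EXISTS url (
            id INTEGER PRIMARY KEY,
            url TEXT NOT NULL UNIQUE
        );
    """)
    return db


def add_seed(db, site, start=0):
    """Insert a seed unless the same (site, start) pair already exists."""
    db.execute("INSERT OR IGNORE INTO searchseed (site, start) VALUES (?, ?)",
               (site, start))


def crawl_one(db, fetch_page):
    """Process one uncrawled seed; return False when none remain."""
    row = db.execute("SELECT id, site, start FROM searchseed "
                     "WHERE crawled = 0 ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return False
    seed_id, site, start = row
    urls, has_more = fetch_page(site, start)
    # Duplicate URLs are silently skipped thanks to the UNIQUE constraint.
    db.executemany("INSERT OR IGNORE INTO url (url) VALUES (?)",
                   [(u,) for u in urls])
    db.execute("UPDATE searchseed SET crawled = 1 WHERE id = ?", (seed_id,))
    if has_more:
        add_seed(db, site, start + PAGE_SIZE)
    db.commit()
    return True
```

Calling `crawl_one` in a loop until it returns `False` drains the seed table. Because progress lives in the database, the script can be stopped and restarted without losing its place.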

Digging Into Sites
Sometimes, Google notes that certain sites are particularly rich sources of “term” and offers to let you search that site for “term”. This basically links to another search for “term site:somesite”. That site gets its own search seed and the program might harvest up to 1000 URLs from that site alone.

Harvesting the Data
Armed with a database of URLs, employ the following algorithm:

  • Fetch a random URL from the database which has yet to be downloaded
  • Try to download it
  • For goodness sake, have a mechanism in place to detect whether the download process has stalled and automatically kill it after a certain period of time
  • Store the data and update the database, noting where the information was stored and that it is already downloaded

This step is easy to parallelize by simply executing multiple copies of the script. It is useful to update the URL table to indicate that one process is already trying to download a URL so multiple processes don’t duplicate work.
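Here is one way the harvesting step could be sketched, again assuming SQLite. The `status` and `path` columns, the 30-second timeout, and the file naming are all illustrative choices, not details from the original script. Note that the `timeout` passed to `urlopen` applies to each blocking socket operation, so it catches a stalled connection rather than capping total download time.

```python
import os
import sqlite3
import urllib.request

DOWNLOAD_TIMEOUT = 30  # seconds without progress before giving up (arbitrary)


def open_url_db(path=":memory:"):
    """status is one of: new, claimed, done, failed (an assumed scheme)."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS url (
        id INTEGER PRIMARY KEY,
        url TEXT NOT NULL UNIQUE,
        status TEXT NOT NULL DEFAULT 'new',
        path TEXT)""")
    return db


def claim_random_url(db):
    """Pick a random unclaimed URL and mark it so other workers skip it."""
    while True:
        row = db.execute("SELECT id, url FROM url WHERE status = 'new' "
                         "ORDER BY RANDOM() LIMIT 1").fetchone()
        if row is None:
            return None
        # The status check makes this a compare-and-set: if another process
        # claimed the row first, rowcount is 0 and we pick again.
        cur = db.execute("UPDATE url SET status = 'claimed' "
                         "WHERE id = ? AND status = 'new'", (row[0],))
        db.commit()
        if cur.rowcount == 1:
            return row


def download(url, timeout=DOWNLOAD_TIMEOUT):
    """Fetch the body, raising if the connection stalls past the timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def harvest_one(db, out_dir, fetch=download):
    """Download one URL; return False when nothing is left to claim."""
    claimed = claim_random_url(db)
    if claimed is None:
        return False
    url_id, url = claimed
    try:
        data = fetch(url)
    except Exception:
        db.execute("UPDATE url SET status = 'failed' WHERE id = ?", (url_id,))
        db.commit()
        return True
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{url_id}.bin")
    with open(path, "wb") as f:
        f.write(data)
    db.execute("UPDATE url SET status = 'done', path = ? WHERE id = ?",
               (path, url_id))
    db.commit()
    return True
```

Several copies of a script that loops on `harvest_one` can share one database file, since each worker only processes rows it managed to claim.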

Acting Human
A few factors here:

  • Google allegedly doesn’t like automated programs crawling its search results. Thus, at the very least, don’t let your script advertise itself as an automated program. At a basic level, this means forging the User-Agent: HTTP header. By default, Python’s urllib2 identifies itself with a User-Agent of the form “Python-urllib/x.y”. Change this to a well-known browser string.
  • Be patient; don’t fire off these search requests as quickly as possible. My crawling algorithm inserts a random delay of a few seconds in between each request. This can still yield hundreds of useful URLs per minute.
  • On harvesting the data: Even though you can parallelize this and download data as quickly as your connection can handle, it’s a good idea to randomize the URLs. If you hypothetically had 4 download processes running at once and they got to a point in the URL table which had many URLs from a single site, the server might be configured to reject too many simultaneous requests from a single client.
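The first two points can be sketched with `urllib.request`, the Python 3 successor to urllib2. The browser string below is only an illustrative placeholder, and the 15-25 second default range is taken from the author’s follow-up comment rather than from the post body.

```python
import random
import time
import urllib.request

# Illustrative placeholder; copy a current string from a real browser instead.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_7; en-us) "
    "AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"
)


def make_request(url, user_agent=BROWSER_USER_AGENT):
    """Build a request whose User-Agent replaces the Python-urllib default."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def polite_delay(low=15.0, high=25.0, sleep=time.sleep):
    """Sleep for a random interval between requests and return its length."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay
```

A crawl loop would call `urllib.request.urlopen(make_request(url))` for each search page and `polite_delay()` between iterations.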

Conclusion
Anyway, that’s just the way I would (and did) do it. What did I do with all the data? That’s a subject for a different post.



6 Responses

  1. Mathias Says:

    Did you have a look at Scroogle? They use Google to let you search anonymously and give you 100 results back at a time. I’m not sure if they describe how they do it, but they might.

  2. Diego E. Pettenò Says:

    I’d be surprised if just setting the user agent to one of a real browser is enough to fool Google. It definitely isn’t enough to fool me. My modsec custom ruleset is designed exactly to keep crawlers with fake user agent strings away from my site…

  3. James Says:

    https://ajax.googleapis.com/ajax/services/search/web?v=1.0&rsz=large&q=term&start=0

    Might have been worth a look. It’s limited to the top 64 results, eight at a time.

    You are likely to run afoul of their anti TOS-violation code if you hit it too fast, though that will apply to a normal Google search too. Alternately, adding &num=100 to a normal Google query might have sped things up a little.

  4. Multimedia Mike Says:

    To reiterate, the method outlined was absolutely successful and I was able to harvest hundreds of thousands of useful URLs. Google never blocked my crawler. I used Python’s urllib2 and the only header I modified was “User-Agent:”, cloned to simulate Apple’s Safari, according to my old code. My script had a random delay of 15-25 seconds between searches.

    @Diego: Is that a challenge? :-)

    @James: Yeah, I did just that (might not have been clear in the original post)– I requested 100 URLs at a time. Up to 10 searches for each term leading to possibly 1000 URLs/search.

  5. AD Says:

    I’ve never really had any use for storing a list of internet search results. But I can see that this could be a valuable exercise.

    Often I will want/need to search the web for some obscure result(s) within a pretty broad range on a given topic. The advanced search options or boolean operators in general can give you a narrowed search field, but over time I’ve still had the best success in determining relevant search results by scanning the content descriptions. So far Google’s results still don’t look beyond the actual keywords and into the thematic elements of the pages themselves. Google doesn’t know when someone is talking about a movie when they’re not using the word movie or even if they aren’t mentioning the specific title in question (perhaps on purpose… hey, this is just an example).

    While I’ve been finding that Google’s search results have been getting a lot worse lately, they still seem better than the alternatives. Lately there are more and more pages of results that include exactly the same content descriptions but with different URLs… There are many, many Wiki pages and IMDB clones in countless languages that aren’t offering up actual original content. Why isn’t Google good at filtering this out? So much of it is obviously shameless advertisers/phishers using other people’s content to lure in websurfers.

    So would crawling web results be useful? It sure would. A secondary program that could sort the thousands of results to my specs would be more than welcome. How hard would it be to group together all the entries with identical descriptions?

    Maintaining a list of relevant URLs, combined with being able to cross-reference this week’s search with the one I did a week/month/year ago, would also tell me which pages are actually more likely to be new compared to something that got a new timestamp because of a blog comment or because it’s the same content but with new advertising.

  6. Multimedia Mike Says:

    @AD: You might be onto something here.