Category Archives: Uncategorized

How to get Google search results for academic research

A few years ago, before I was a Googler, I was a grad student doing research on information retrieval. I wanted to compare the results of Google and other search engines with folksonomies from social bookmarking sites. It sounds pretty simple – Google does lots of internal search quality studies, so it’s not too surprising that outside researchers would want to execute lots of queries and use the results in their data.

The way I did it was… not optimal, to say the least. I wrote a bunch of PHP code, spaced out participant sessions, etc. to make sure I could get results back. Google tries to make sure that spammers aren’t scraping search results to generate webspam, so any kind of scraping with cURL, Beautiful Soup, etc. can result in a big pile of failure.

The way I did it wasn’t the right way or the easy way, so when I got the job I made a mental note to ask around for the best way to get search results. Then I forgot all about it until an email exchange with Gary Warner of CyberCrime & Doing Time fame.

It turns out Google has a great University research program and API. You have to apply for registration and let us know who you are, what school you’re affiliated with, and what you plan to study. Assuming everything checks out, you’ll get access to a pretty nice API. There’s some example Python code, but you could just as easily use PHP, Java, or whatever to consume the XML responses.
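To give a rough idea of what consuming XML responses looks like, here is a minimal Python sketch. It is not the program's actual example code or response format: the sample document, the element names (`results`, `result`, `url`, `title`), the `rank` attribute, and the `parse_results` helper are all made up for illustration. Only the standard library's `xml.etree.ElementTree` calls are real.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body. The real API's schema may look nothing
# like this; the element and attribute names are assumptions.
SAMPLE_RESPONSE = """
<results query="folksonomy">
  <result rank="1">
    <url>http://example.com/a</url>
    <title>Example A</title>
  </result>
  <result rank="2">
    <url>http://example.org/b</url>
    <title>Example B</title>
  </result>
</results>
"""


def parse_results(xml_text):
    """Turn an XML result list into a list of plain dicts, ordered by rank."""
    root = ET.fromstring(xml_text.strip())
    parsed = []
    for node in root.findall("result"):
        parsed.append({
            "rank": int(node.get("rank")),
            "url": node.findtext("url"),
            "title": node.findtext("title"),
        })
    # Sort explicitly so downstream comparisons do not depend on document order.
    return sorted(parsed, key=lambda item: item["rank"])


if __name__ == "__main__":
    for item in parse_results(SAMPLE_RESPONSE):
        print(item["rank"], item["url"], item["title"])
```

Once the results are plain dicts, it is easy to dump them to CSV or compare them against another ranked list, which is the kind of thing I needed for my own study.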

And that research I was doing? I recently noticed that my paper has been cited 7 or 8 times, according to Google Scholar. I used to joke that I had written the least influential paper in the history of academic publishing, but I guess I can’t claim the title anymore. Scopus only shows 4 citations so I will remain humble anyway.

New search engine – Cuil search

Reid posted a review already, but I thought I’d add my two cents about this new search engine, Cuil.

First off, it’s great to see more companies making a serious go at web search. I don’t speak for my employer (standard disclaimers apply), but I personally am always happy to see new attempts at information retrieval on the web. More competition can only make things better for users. Heck, I’ve even cooked up a bit of a search system based on my research into IR with tagging systems and folksonomies myself, though it’s too much of a toy to release to the public.

Second, it’s a bit underwhelming to see a ton of press coverage of a new search engine, load up the site and do a simple vanity search, only to see this:

[Screenshot: Problems with Cuil search]

I know I’m not exactly the most famous person in the world, but I do have a website. Really this is just the result of scaling problems – too many people hitting this brand new service at the same time. I can’t complain too much since if I ever released my little search system, it would fail at 4 concurrent users or so. But I also don’t think I could get the amount of press that they’ve managed to get either.

Third, I don’t know much about their architecture and algorithms, but I thought this bit from their About Us page was kind of interesting:

The Internet has grown exponentially in the last fifteen years but search engines have not kept up—until now. Cuil searches more pages on the Web than anyone else—three times as many as Google and ten times as many as Microsoft.

Do they really think the main problem of web search is too few items in the index?

If you want to read more, Read/Write Web has a good review.

Radio2.0 – will pay royalties to independent musicians

A very cool online radio / music social networking site just announced that it will pay royalties directly to independent musicians who upload their songs.

This is pretty important, for the same reason that Google’s Adsense was important (though probably a few orders of magnitude smaller impact). The Internet does a few things really, really well – quickly build network effects, encourage the creation of lots of long tail and niche content, etc. It also has the potential to cut out the middleman in economic transactions and help pay small-audience writers, artists, and musicians, so long as there’s a viable monetization system.

Adsense is that monetization system for a huge number of web sites, and hopefully things like this royalty program and CDBaby will be the engine that drives more interesting music online.

By the way, I started the Geek Music group a few years ago. Feel free to join; your listening habits will help us determine the best music to put on when writing code.