Category Archives: Uncategorized

How to get Google search results for academic research

January 12, 2010 | Blog, Uncategorized | academic research, Beautiful Soup, cURL, folksonomies, Google, Google Scholar, information-retrieval, research, screen scraping, search, web search, webspam | Jason Morrison

A few years ago, before I was a Googler, I was a grad student doing research on information retrieval. I wanted to compare the results of Google and other search engines with folksonomies from social bookmarking sites. It sounds pretty simple – Google does lots of internal search quality studies, so it’s not too surprising that outside researchers would want to execute lots of queries and use the results in their data.

The way I did it was… not optimal, to say the least. I wrote a bunch of PHP code, spaced out participant sessions, etc. to make sure I could get results back. Google tries to make sure that spammers aren’t scraping search results to generate webspam, so any kind of scraping with cURL, Beautiful Soup, etc. can result in a big pile of failure.

The way I did it wasn’t the right way or the easy way, so when I got the job I made a mental note to ask around for the best way to get search results. Then I forgot all about it until an email exchange with Gary Warner of CyberCrime & Doing Time fame.

It turns out Google has a great University research program and API. You have to apply for registration and let us know who you are, what school you’re affiliated with, and what you plan to study. Assuming everything checks out, you’ll get access to a pretty nice API. There’s some example Python code, but you could just as easily use PHP, Java, or whatever to consume the XML responses.
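To give a rough idea of what consuming XML results looks like, here is a minimal Python sketch. The response format is entirely made up for illustration: the element names (`results`, `result`, `title`, `url`), the `rank` attribute, and the `parse_results` helper are my own assumptions, not the real schema or client code of the research API, so check the program's documentation for the actual format.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body. These element and attribute names are
# invented for this example and are NOT the real API's schema.
SAMPLE_XML = """<results query="folksonomy">
  <result rank="1">
    <title>Folksonomy</title>
    <url>http://example.com/folksonomy</url>
  </result>
  <result rank="2">
    <title>Social bookmarking</title>
    <url>http://example.com/bookmarking</url>
  </result>
</results>"""


def parse_results(xml_text):
    """Return a list of (rank, title, url) tuples from the sample format."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall("result"):
        rank = int(node.get("rank"))
        title = node.findtext("title", default="")
        url = node.findtext("url", default="")
        rows.append((rank, title, url))
    return rows


results = parse_results(SAMPLE_XML)
for rank, title, url in results:
    print(rank, title, url)
```

Once the results are in a plain list like this, it is easy to dump them into a CSV file or a database and compare rankings across sources.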

And that research I was doing? I recently noticed that my paper has been cited 7 or 8 times, according to Google Scholar. I used to joke that I had written the least influential paper in the history of academic publishing, but I guess I can’t claim the title anymore. Scopus only shows 4 citations so I will remain humble anyway.

New search engine – Cuil search

July 28, 2008 | Uncategorized | Cuil, error, Google, information-retrieval, scalability, search-engines, vanity search | Jason Morrison

Reid posted a review already, but I thought I’d add my two cents about this new search engine, Cuil.

First off, it’s great to see more companies making a serious go at web search. I don’t speak for my employer (standard disclaimers apply), but I personally am always happy to see new attempts at information retrieval on the web. More competition can only make things better for users. Heck, I’ve even cooked up a bit of a search system based on my research into IR with tagging systems and folksonomies myself, though it’s too much of a toy to release to the public.

Second, it’s a bit underwhelming to see a ton of press coverage of a new search engine, load up the site and do a simple vanity search, only to see this:

I know I’m not exactly the most famous person in the world, but I do have a website. Really this is just the result of scaling problems – too many people hitting this brand new service at the same time. I can’t complain too much since if I ever released my little search system, it would fail at 4 concurrent users or so. But I also don’t think I could get the amount of press that they’ve managed to get either.

Third, I don’t know much about their architecture and algorithms, but I thought this bit from their About Us page was kind of interesting:

The Internet has grown exponentially in the last fifteen years but search engines have not kept up—until now. Cuil searches more pages on the Web than anyone else—three times as many as Google and ten times as many as Microsoft.

Do they really think the main problem of web search is too few items in the index?

If you want to read more, Read/Write Web has a good review.

Radio2.0 – Last.fm will pay royalties to independent musicians

July 12, 2008 | Uncategorized | Adsense, geek music, Google, last.fm, long tail, monetization, niche content, online radio, social networking | Jason Morrison

Last.fm, a very cool online radio / music social networking site, just announced that it will pay royalties directly to independent musicians who upload their songs.

This is pretty important, for the same reason that Google’s Adsense was important (though probably a few orders of magnitude smaller impact). The Internet does a few things really, really well – quickly build network effects, encourage the creation of lots of long tail and niche content, etc. It also has the potential to cut out the middleman in economic transactions and help pay small-audience writers, artists, and musicians, so long as there’s a viable monetization system.

Adsense is that monetization system for a huge number of web sites, and hopefully things like Last.fm’s royalty program and CDBaby will be the engine that drives more interesting music online.

By the way, I started the Geek Music group a few years ago. Feel free to join; your listening habits will help us determine the best music to put on when writing code.

JasonMorrison.net

Usability, web development, and design
