How to get Google search results for academic research

A few years ago, before I was a Googler, I was a grad student doing research on information retrieval. I wanted to compare the results of Google and other search engines with folksonomies from social bookmarking sites. It sounds pretty simple – Google does lots of internal search quality studies, so it’s not too surprising that outside researchers would want to execute lots of queries and use the results in their data.

The way I did it was… not optimal, to say the least. I wrote a bunch of PHP code, spaced out participant sessions, etc. to make sure I could get results back. Google tries to make sure that spammers aren’t scraping search results to generate webspam, so any kind of scraping with cURL, Beautiful Soup, etc. can result in a big pile of failure.

The way I did it wasn’t the right way or the easy way, so when I got the job I made a mental note to ask around for the best way to get search results. Then I forgot all about it until an email exchange with Gary Warner of CyberCrime & Doing Time fame.

It turns out Google has a great University research program and API. You have to apply for registration and let us know who you are, what school you’re affiliated with, and what you plan to study. Assuming everything checks out you’ll get access to a pretty nice API. There’s some example Python code but you could just as easily use PHP, Java, or whatever to consume the XML responses.
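To give a rough idea of what "consuming the XML responses" looks like, here is a minimal Python sketch that turns an XML result list into plain dictionaries you could feed into an analysis. This post doesn't describe the actual response format, so the sample XML, the element and attribute names (`results`, `result`, `rank`, `title`, `url`), and the `parse_results` helper are all made up for illustration. Treat it as the shape of the approach, not the real API.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample response. The real API's XML schema is not
# described in this post, so these element names are assumptions.
SAMPLE_XML = """<results query="folksonomy">
  <result rank="1"><title>Example A</title><url>http://example.com/a</url></result>
  <result rank="2"><title>Example B</title><url>http://example.com/b</url></result>
</results>"""


def parse_results(xml_text):
    """Parse an XML result list into a list of dicts, one per result."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall("result"):
        rows.append({
            "rank": int(node.get("rank")),
            "title": node.findtext("title", default=""),
            "url": node.findtext("url", default=""),
        })
    return rows


results = parse_results(SAMPLE_XML)
for row in results:
    print(row["rank"], row["title"], row["url"])
```

Once the results are in a list like this, comparing them against bookmarks from another source is mostly a matter of matching on URL.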

And that research I was doing? I recently noticed that my paper has been cited 7 or 8 times, according to Google Scholar. I used to joke that I had written the least influential paper in the history of academic publishing, but I guess I can’t claim the title anymore. Scopus only shows 4 citations so I will remain humble anyway.

Doing my small part to preserve digital history

Years ago, in an undergrad course, one of the school’s librarians gave a talk about the big risk of the move to digital publishing – historical preservation. We know what the ancient Greeks thought in part because their words were carved into stone – would we be so lucky if they had used floppy disks?

I wasn’t completely convinced that the situation was so dire then, and I’m still not really worried.  The production and storage of information continues to grow exponentially, and I think the real problem for future archeologists will be dealing with information overload rather than some hypothetical gap in the written record.  But I have been thinking a lot about my own digital history lately so I spent part of this weekend looking at old papers from college and publishing them on my site.

I don’t think my meager efforts will be much help to future historians (much less reverse the entropy of the universe), but I did find some interesting stuff that I probably should have posted for the world to see a long time ago.

For example:

The more I dig up and paste into my WordPress archives the more I realize a few things. First, there’s a distinct lack of content between undergrad and grad school – I’m doing a much better job of writing without assignments now than I did then. Second, a hard drive crash in 2003 resulted in a gap in my saved emails – this hurts more now that I’m looking back through things. Finally, I need to make a point, for the rest of my life, to just put things out there. It seems like such a shame that I put work into these docs just to have them rot on my hard drive.

I know some of my co-workers, Reid and Wysz, have gone through the process of resurrecting old content on their current websites. Anyone else thinking about doing something similar? What prompted you to do so? Or, what prevented you?

Notes: Looking for information

Case, D.O. (2002). Looking for information: A survey of research on information seeking, needs, and behavior.  New York: Academic Press.  Chapter 9: Methods: Examples by type.

In this chapter Case reviews the different methodologies employed by researchers studying information seeking, use, and sense-making. Although he notes a few overall studies that cast a wide net, finding overall proportions, this chapter is not a survey of all the literature. It instead gathers relevant examples of different types of research. The types of research included case studies, formal and field experiments, mail and Internet surveys, face-to-face and phone interviews, focus groups, ethnographic and phenomenological studies, diaries, historical studies and content analysis. There were also multiple-method studies and meta-analyses. Case writes about some of the limitations of the different methodologies. For example, case studies have limited variables, focusing on one item or event to the exclusion of others, and they are limited in terms of time as well. The author concludes that most studies assume people make rational choices and that specific variables are more important than context. Qualitative measures are becoming more popular, but their results cannot be generalized.

The author did a particularly good job in finding studies to examine. The best example of this is the experiments. Very few laboratory experiments have been conducted specifically on information use, but there have been many on consumer behavior, and here we see consumer behavior studies that involved information gathering for decision making. Another choice I found particularly interesting was the historical research by Colin Richmond that looked at the dissemination of information in England during the Hundred Years’ War. Usually when I think of historical research in social science I think of things like comparing content analysis of newspapers of the 1950s and today. It was interesting to see things from a historian’s point of view, and also a good reminder that people did not just start needing information with the invention of the Internet. A good, though dense, book on this topic is A Social History of Knowledge by Peter Burke.

The most immediate application of this chapter is in suggesting methodologies to use in different situations. When I’m doing research, I tend to have a bias toward sources that conducted experiments or did survey research. Reading through these cases reminded me of the usefulness of things like case studies and content analysis. Another interesting application of the chapter is in suggesting topics for further study. Although the author doesn’t really build to any general conclusion on the research topics at hand (there is no overall theme to the research), looking at the different conclusions of the different types of studies suggests some interesting questions. For example, since the study by Covell, Uman and Manning suggested that doctors report using books or journals first but in reality turn to colleagues first, how can we reexamine the studies that relied on self-reporting, such as the case study or the surveys? Perhaps some of the tactics used in the consumer research experiments would be a valuable addition.