Tag Archives: web search

How to get Google search results for academic research

A few years ago, before I was a Googler, I was a grad student doing research on information retrieval. I wanted to compare the results of Google and other search engines with folksonomies from social bookmarking sites. It sounds pretty simple – Google does lots of internal search quality studies, so it’s not too surprising that outside researchers would want to execute lots of queries and use the results in their data.

The way I did it was… not optimal, to say the least. I wrote a bunch of PHP code, spaced out participant sessions, etc. to make sure I could get results back. Google tries to make sure that spammers aren’t scraping search results to generate webspam, so any kind of scraping with cURL, Beautiful Soup, etc. can result in a big pile of failure.

The way I did it wasn’t the right way or the easy way, so when I got the job I made a mental note to ask around for the best way to get search results. Then I forgot all about it until an email exchange with Gary Warner of CyberCrime & Doing Time fame.

It turns out Google has a great University research program and API. You have to apply for registration and let us know who you are, what school you’re affiliated with, and what you plan to study. Assuming everything checks out, you’ll get access to a pretty nice API. There’s some example Python code, but you could just as easily use PHP, Java, or whatever to consume the XML responses.
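For what it's worth, consuming an XML response takes only a few lines in most languages. Here's a minimal Python sketch of the idea. Note that the element names (`results`, `result`, `url`, `title`), the `rank` attribute, and the sample document are all made up for illustration; they are not the actual schema of the research API, so check the program's own documentation for the real format.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body. The element and attribute names here are
# invented for illustration and are NOT the real API schema.
SAMPLE_RESPONSE = """<results query="folksonomy">
  <result rank="1">
    <url>http://example.com/a</url>
    <title>First result</title>
  </result>
  <result rank="2">
    <url>http://example.org/b</url>
    <title>Second result</title>
  </result>
</results>
"""


def parse_results(xml_text):
    """Return a list of (rank, url, title) tuples from a response body."""
    root = ET.fromstring(xml_text)
    parsed = []
    for result in root.findall("result"):
        rank = int(result.get("rank"))      # attribute on <result>
        url = result.findtext("url")        # text of the child <url>
        title = result.findtext("title")    # text of the child <title>
        parsed.append((rank, url, title))
    return parsed


if __name__ == "__main__":
    for rank, url, title in parse_results(SAMPLE_RESPONSE):
        print(rank, url, title)
```

Once the results are in plain tuples like this, comparing them against another data set (bookmarking tags, say) is ordinary list handling rather than HTML scraping.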

And that research I was doing? I recently noticed that my paper has been cited 7 or 8 times, according to Google Scholar. I used to joke that I had written the least influential paper in the history of academic publishing, but I guess I can’t claim the title anymore. Scopus only shows 4 citations so I will remain humble anyway.

How to link to an individual question in Google Moderator

The Obama administration has just finished “Open for Questions”, where the President answered questions suggested and voted on by the general public over the web. This is pretty cool – political openness, interaction, and democracy via the web. It’s also interesting to me because the site uses Google Moderator, a product we use at work all the time.

What’s not quite so cool is that Moderator apparently doesn’t play well with the rest of the web. I’m not sure why it was designed this way (and if I did know, I probably couldn’t tell you anyway). The design is the exact opposite of unobtrusive javascript. That’s fine for highly interactive web apps but it would be nice to see the mostly text content in Moderator made searchable just like any other collection of web pages.

Obsolescence and obscurity in digital cameras

[Photo: University Hall Tower at OWU]

I’m planning on buying a new DSLR, and as I looked through old photos from college today I started to think about my first digital camera, a Philips ESP50.  Here’s a page with some specs, translated from German.

I remember buying the camera, logged in to eBay from my parents’ house late at night the day after Christmas.  I think I ended up paying something like $250 for it.

This was before the megapixel war, when 640 by 480 was considered a viable resolution.  This camera applied torturous levels of JPG compression to fit images on the 4MB disk.  At the time, though, it seemed like a good deal.  Film cost money, and developing film cost money, and most of the year I was a ramen-noodle-eating college student.  Probably the biggest reason to go digital was the tiny little screen on the back – you could actually tell if you got the shot, instead of waiting to get back a bunch of blurry prints.

The camera is painfully obsolete now, and even then it was somewhat obscure.  The thing is, the Web was a pretty amazing place even back in 1998 – there were lots of web pages about this camera.  I remember reading at least a couple reviews, and searches for it on WebCrawler or Alta Vista or whatever I used back then came up with retailers, other auction sites, etc.  Look for information about this camera now, and it seems that it has been largely forgotten:

And that’s about it.

I wonder, is this the destiny of all cameras?  Will I do a search for my Nikon Coolpix 5700 in 2014 and come up with just as little, or has the Web expanded so quickly that the copious product reviews, blog posts, and technical discussions on photography forums outweigh the force of entropy?  I wonder if the Internet has gained any stability as it has matured – do pages tend to stick around longer, or is linkrot a constant of the universe?

Future generations will hardly feel deprived if they miss out on information about some crappy old digicam.  Still, you never know what kind of information will end up being useful to someone at some point, and this same problem extends to all the information on the Web – from reviews of obsolete products to the human genome.  If a website goes under and deletes a thousand blogs, it won’t exactly make the news.  But our great-grandchildren might look at that stuff the way we look at letters from the Civil War.

The only solutions I have are more effort behind projects like archive.org, increased data portability, and rational intellectual property laws that don’t make saving 70-year-old content from deletion into a federal crime.

For discussion, how do you deal with ancient equipment, keeping around old web content, or even archiving old email?