Archive for the ‘Uncategorized’ Category


How to get Google search results for academic research

Tuesday, January 12th, 2010

A few years ago, before I was a Googler, I was a grad student doing research on information retrieval. I wanted to compare the results of Google and other search engines with folksonomies from social bookmarking sites. It sounds pretty simple – Google does lots of internal search quality studies, so it’s not too surprising that outside researchers would want to execute lots of queries and use the results in their data.

The way I did it was… not optimal, to say the least. I wrote a bunch of PHP code, spaced out participant sessions, etc. to make sure I could get results back. Google tries to make sure that spammers aren’t scraping search results to generate webspam, so any kind of scraping with cURL, Beautiful Soup, etc. can result in a big pile of failure.

The way I did it wasn’t the right way or the easy way, so when I got the job I made a mental note to ask around for the best way to get search results. Then I forgot all about it until an email exchange with Gary Warner of CyberCrime & Doing Time fame.

It turns out Google has a great University research program and API. You have to apply for registration and let us know who you are, what school you’re affiliated with, and what you plan to study. Assuming everything checks out, you’ll get access to a pretty nice API. There’s some example Python code, but you could just as easily use PHP, Java, or whatever to consume the XML responses.

And that research I was doing? I recently noticed that my paper has been cited 7 or 8 times, according to Google Scholar. I used to joke that I had written the least influential paper in the history of academic publishing, but I guess I can’t claim the title anymore. Scopus only shows 4 citations so I will remain humble anyway.

New search engine – Cuil search

Monday, July 28th, 2008

Reid posted a review already, but I thought I’d add my two cents about this new search engine, Cuil.

First off, it’s great to see more companies making a serious go at web search. I don’t speak for my employer (standard disclaimers apply), but I personally am always happy to see new attempts at information retrieval on the web. More competition can only make things better for users. Heck, I’ve even cooked up a bit of a search system based on my research into IR with tagging systems and folksonomies myself, though it’s too much of a toy to release to the public.

Second, it’s a bit underwhelming to see a ton of press coverage of a new search engine, load up the site and do a simple vanity search, only to see this:

Problems with Cuil search

I know I’m not exactly the most famous person in the world, but I do have a website. Really this is just the result of scaling problems – too many people hitting this brand new service at the same time. I can’t complain too much since if I ever released my little search system, it would fail at 4 concurrent users or so. But I also don’t think I could get the amount of press that they’ve managed to get either.

Third point, I don’t know much about their architecture and algorithms but from the about us page I thought this was kind of interesting:

The Internet has grown exponentially in the last fifteen years but search engines have not kept up—until now. Cuil searches more pages on the Web than anyone else—three times as many as Google and ten times as many as Microsoft.

Do they really think the main problem of web search is too few items in the index?

If you want to read more, Read/Write Web has a good review.

Radio2.0 – Last.fm will pay royalties to independent musicians

Saturday, July 12th, 2008

Last.fm, a very cool online radio / music social networking site, just announced that it will pay royalties directly to independent musicians who upload their songs.

This is pretty important, for the same reason that Google’s Adsense was important (though probably a few orders of magnitude smaller impact). The Internet does a few things really, really well – quickly build network effects, encourage the creation of lots of long tail and niche content, etc. It also has the potential to cut out the middleman in economic transactions and help pay small-audience writers, artists, and musicians, so long as there’s a viable monetization system.

Adsense is that monetization system for a huge number of web sites, and hopefully things like Last.fm’s royalty program and CDBaby will be the engine that drives more interesting music online.

By the way, I started the Geek Music group a few years ago.  Feel free to join; your listening habits will help us determine the best music to put on when writing code.

Embedding Google Docs and Spreadsheets into your Blog Posts

Sunday, July 6th, 2008

I just wrote a post about buying a new camera, and because I want to compare specs on several different cameras and lenses, I’m going to need a spreadsheet.  Luckily there are some great online spreadsheet programs to choose from.  I’m going to use this as an opportunity to explore how to use Google Docs and Spreadsheets in blog posts.

Before you get started I’m assuming you already have a Google Docs spreadsheet ready to go.

1.  You can always just link to the document. By default your docs will be private, so you’ll need to make them available to your readers.  To do so you’ll need to either go to the Share tab and check “Anyone can view this document WITHOUT LOGGING IN at:” or go to the Publish tab and publish the doc. Either way you’ll get a regular URL to post, like this one:  http://spreadsheets.google.com/ccc?key=ppevxmL24UqmeiZSbqIU1DQ&hl=en

Links aren’t very exciting though, so how can you embed into a post instead?

2.  You can embed the content into the post.  If you’re wondering how to do it in WordPress, one solution I’ve come across is the Inline Google Docs plugin at Broken Watch.  This plugin gets the actual text/html of the spreadsheet and places it inline in your post.  So if you have a wide blog template, or a spreadsheet with relatively few columns, it should blend right in.  On the other hand, there’s no editing or other fun.

Here’s an example of what the output looks like:

NOTE: I had to disable this; it was throwing errors once I upgraded to WordPress 2.7. Your mileage may vary.

3.  You can put the doc directly in the page with an iframe. This works really, really well with Google Presentations but is a bit trickier with a doc and even less optimal with a spreadsheet. You’ll get the best-looking results if you publish the document and use the published URL in the iframe. On the other hand if you use the shared URL collaborators should be able to make changes right in your blog post.

You’ll want to create some code like this:

<iframe src="http://spreadsheets.google.com/pub?key=ppevxmL24UqmeiZSbqIU1DQ" width="500" height="400"></iframe>

Make sure you put the code in the “HTML” editing mode of WordPress rather than “Visual” mode.  As a result you can see some of the info I’ve gathered about possible camera / lens combinations in the spreadsheet below.

The main issue here is the relatively small iframe window size. If you use a wider blog template this technique might work really well.
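If you embed more than one spreadsheet, or want to experiment with the window size, it can help to generate the markup rather than type it by hand. The following is only a minimal sketch: the function name `spreadsheet_iframe` is made up for illustration, and the URL pattern is simply copied from the example above, so it assumes your published URL has the same shape.

```python
from html import escape
from urllib.parse import urlencode


def spreadsheet_iframe(key, width=500, height=400):
    """Build iframe markup for a published spreadsheet (hypothetical helper)."""
    # Assumption: the published URL follows the pattern used in this post.
    src = "http://spreadsheets.google.com/pub?" + urlencode({"key": key})
    # Escape the URL for use inside an HTML attribute; use plain straight
    # quotes, since curly quotes will break the embed.
    return '<iframe src="{}" width="{}" height="{}"></iframe>'.format(
        escape(src, quote=True), int(width), int(height)
    )


# A wider window for a wider blog template.
print(spreadsheet_iframe("ppevxmL24UqmeiZSbqIU1DQ", width=640))
```

Paste the printed line into the “HTML” editing mode, and adjust `width` and `height` to suit your template.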

Why bother? Spreadsheets aren’t the most exciting thing in the world for most people, but play around with all the features of Google Docs and Spreadsheets and you’ll see why this can be pretty cool.  You can embed questionnaires and surveys, cool charts and graphs with Gadgets, and anything else you can think of.

Google Earth vs. Reality, Revisited

Friday, June 6th, 2008

Last week I compared some real-life photos with the same scene in Google Earth.  Since I’m a bit of a computer/mapping/photography geek, I couldn’t resist doing a few more.  That actually ended up being a pretty popular post, with thousands of pageviews, which just goes to show I’m not the only combination computer/mapping/photography geek out there.

Here’s a view of San Francisco from Coit Tower on Telegraph Hill.  Follow this link to see larger versions in Flickr.  This one is even better than the two from last week – look how well the streets, buildings, and Golden Gate Bridge match with the photo.

Google Earth vs. Reality - San Francisco from Coit Tower

Now I’ll go a little more international.  Here’s a photo from the site of ancient Mycenae in Greece.  This is above the famous Lion Gate looking out at the hills surrounding the Argolid plain.  See larger versions in Flickr.  The aerial photograph that Google Earth maps to the topography isn’t as detailed as the real life photo, but even the borders of the olive groves line up.

Google Earth vs. Reality - Mycenae, Greece

These next two are not as identical as the San Francisco cityscapes, but are still impressive because of how well they evoke the real life scenes without 3-d buildings.

The first is from the Acropolis in Athens, looking out over the surrounding neighborhood.  Larger versions in Flickr.

Google Earth vs. Reality - Athens from the Acropolis

Here’s another shot from the Acropolis showing the new Acropolis Museum.  Larger versions in Flickr.

Google Earth vs. Reality - Athens and the new Acropolis Museum

If you feel like making some comparisons of your own, please let me know in the comments below – I’d love to see what other people could come up with.

Scientific proof that Reddit should add a tagging system

Tuesday, June 3rd, 2008

First, a disclaimer: the title of this post is obviously exaggerated. Proof is an awfully big word to throw around, and although I employed pretty good experiment design practices and statistical checks, I can’t really prove that Reddit should do this or that. But I can show that what they are doing now is not working, at least when it comes to search.

So, I got an email the other day letting me know that my article, Tagging and Searching: Search Retrieval Effectiveness of Folksonomies on the World Wide Web, is being published in the July 2008 issue of Information Processing and Management (here’s the official DOI link to the article). In the study I compared search performance between traditional search engines (like Google), subject directories (like Open Directory), and social bookmarking systems (like Reddit) and their folksonomies.

What’s a folksonomy? The word is a play on the term taxonomy – a taxonomy is a system of organizing and categorizing things, like the Dewey Decimal System. Taxonomies usually follow very strict rules and are controlled by experts. A folksonomy is a system of organization built by large numbers of regular users, who add things to the collection, evaluate them, and usually tag them with keywords.

IR system precision at result cutoffs 1-20

In my study, the social bookmarking systems with tagging systems did surprisingly well – Del.icio.us was more precise than Open Directory, and at a cutoff of 20 results its precision was fairly close to that of the search engines.

Reddit, however, did not fare so well. It consistently had the lowest precision, meaning that searches returned very few relevant results. There could be many reasons for this, but the biggest difference between Reddit and the others is the lack of tags.
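For readers who haven’t run into precision at a cutoff before, here is a minimal sketch of the calculation. It is not the code used in the study: the function name and the relevance judgments below are made up for illustration, and it assumes the common convention of dividing by the cutoff even when a system returns fewer results than that.

```python
def precision_at_k(relevant_flags, k):
    """Fraction of the top-k results that were judged relevant.

    relevant_flags: list of booleans in ranked order, True meaning the
    result at that position was judged relevant to the query.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Assumption: missing results count against the system, so divide by k
    # rather than by the number of results actually returned.
    return sum(1 for flag in relevant_flags[:k] if flag) / k


# Hypothetical judgments for one query, in ranked order.
judgments = [True, True, False, True, False,
             False, True, False, False, False]

print(precision_at_k(judgments, 5))   # 3 relevant in the top 5
print(precision_at_k(judgments, 10))  # 4 relevant in the top 10
```

A system with low precision, in these terms, is one where few of the flags near the top of the list are True.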

Now, it’s possible that the folks at Reddit have no interest in search, or information retrieval in general. I think Reddit is very effective at bringing out new and interesting links on a daily basis and encouraging commentary (just my opinion, no stats to back that up). But I think it’s a big missed opportunity not to add tagging and see where it leads.

(One last disclaimer: this post is my personal opinion as someone who enjoys using Reddit and does not reflect on my employer. This post refers to research done independently as a grad student.)

XHTML 2 vs HTML 5 and the href Attribute

Monday, June 2nd, 2008

Spider web window - common motif in the Winchester House

I wrote a little earlier about what I was looking forward to in HTML 5.  I haven’t had a chance to really collect my thoughts about XHTML 2 vs HTML 5; to be honest, I’d be happy to see progress on both fronts.  I do have to say I lost interest in XHTML 2 early on when it seemed they were throwing some baby out with the bathwater.  HTML is not the cleanest, most elegant language, but the ease of picking it up is part of why the web grew so quickly, even if that has forced browsers to cope with millions of pages of clunky, broken HTML.

Eric Meyer has at least one point in XHTML 2’s favor – the ability to add an href attribute to anything, making it a link.  In addition to making the <a> tag jealous, this would let you do some pretty cool stuff like turn an entire table row into a link in a dynamic data reporting web app without a lot of Javascript or duplicated tags.

By the way Eric is a fellow member of the Cleveland Web Standards Association and a great speaker.  If you get a chance to see a talk by him you should really check it out.

Joel Swerdlow – 1 Billion Cokes a Day: World Culture at the Millennium

Monday, November 8th, 1999

Swerdlow’s lecture opened with an interesting anecdote – right now, there are six billion people on the planet.  Somehow Coca-Cola calculates that these 6 billion consume 37 billion drinks a day.  Coke sells 1 billion Cokes a day, and their stated corporate goal is to get the other 36.

Creepy?  I think so.  Swerdlow presented six points about the state of the world as of the end of the second millennium:

  • Human population – doubled since World War II;
  • Destruction of biodiversity – 99 percent of all species are dead, and though nature killed most, humans are now perpetrating one of the largest mass extinctions ever;
  • The physical earth – global warming, destruction of ozone, etc.;
  • Exploration – we’ve been to most of the earth’s surface, but sea and space remain;
  • Science – advancing faster than ever;
  • Culture – 90 percent of languages spoken now will not be spoken by the end of the next century – but so what?

Therefore he presents four questions:

What is culture?  It’s the behaviors, facts, patterns, etc. that people pass on.  National Geographic decided to look at the largest/most important cities at the years 1, 1000 and 2000 A.D. – Alexandria, Cordoba and New York respectively.  In the first millennium, there was one main new idea changing world culture – monotheism.  In the second, Swerdlow sees two: modern science and human equality.  More specifically, there are four important, overarching changes going on right now:

1.  The end of the remote.  The U.N. estimates that only 1/18 of the world’s population are in indigenous cultures.  The Nazis had a plan to stop the mixing and migration of people from different cultures, but they never got a chance to implement it.  They would have laid out zones in which no one would be allowed in or out.

On the other hand, there are medicines known to so-called primitive cultures that science has yet to discover.  So what makes a culture primitive?

2.  The growth of cities.  Swerdlow said that when his father was born in North Dakota, only two percent of the population lived in cities, but now 50 percent do.  China, he said, has 100-200 million rural people in cities looking for work.  Cities are a human invention and most of what we call culture comes from them.  Other human ideas that have been picked up by some cultures and not others include the wheel, spaces between words and even reading silently.  So why do some ideas catch on?

3.  Modern science.  Why did modern science arrive in Europe in the 1600s-1800s and not elsewhere?  In the 1400s China had a fleet of treasure ships four times longer than Columbus’ ships, but turned away from the outside world.  So is technology resistible?

Swerdlow could only think of three cases where it has been.  First, of course, is the Chinese turn away from navigation.  Not only did the government stop building boats, but people were forbidden to and books were burned.  Second, guns were introduced in Japan in the 1500s and by 1600 they had the most sophisticated guns in the world.  They decided the weapons were too dangerous and gave them up.  Finally, water separated Tasmania from Australia about 10,000 years ago, but by the time Europeans arrived in the 1700s the Tasmanians had given up even stone tools.

4.  The spread of American culture.  According to Swerdlow, this is one of the first times in history a culture has spread so quickly without troops.  He was in a rural area near Calcutta during a harvest festival.  Elsewhere the movie Titanic was playing, and though the festival drew more people, it wouldn’t be that way for long.  So what does it mean that American culture is spreading?

Swerdlow could find one theme that may be driving the spread of American culture – equality.

The first thing I did to find further info was check out the National Geographic site.