Posts Tagged ‘information-retrieval’

How to get Google search results for academic research

Tuesday, January 12th, 2010

A few years ago, before I was a Googler, I was a grad student doing research on information retrieval. I wanted to compare the results of Google and other search engines with folksonomies from social bookmarking sites. It sounds pretty simple – Google does lots of internal search quality studies, so it’s not too surprising that outside researchers would want to execute lots of queries and use the results in their data.

The way I did it was… not optimal, to say the least. I wrote a bunch of PHP code, spaced out participant sessions, etc. to make sure I could get results back. Google tries to make sure that spammers aren’t scraping search results to generate webspam, so any kind of scraping with cURL, Beautiful Soup, etc. can result in a big pile of failure.

The way I did it wasn’t the right way or the easy way, so when I got the job I made a mental note to ask around for the best way to get search results. Then I forgot all about it until an email exchange with Gary Warner of CyberCrime & Doing Time fame.

It turns out Google has a great University research program and API. You have to apply for registration and let us know who you are, what school you’re affiliated with, and what you plan to study. Assuming everything checks out you’ll get access to a pretty nice API. There’s some example Python code but you could just as easily use PHP, Java, or whatever to consume the XML responses.

And that research I was doing? I recently noticed that my paper has been cited 7 or 8 times, according to Google Scholar. I used to joke that I had written the least influential paper in the history of academic publishing, but I guess I can’t claim the title anymore. Scopus only shows 4 citations so I will remain humble anyway.

Doing my small part to preserve digital history

Monday, August 18th, 2008

Years ago, in an undergrad course, one of the school’s librarians gave a talk about the big risk of the move to digital publishing – historical preservation. We know what the ancient Greeks thought in part because their words were carved into stone – would we be so lucky if they had used floppy disks?

I wasn’t completely convinced that the situation was so dire then, and I’m still not really worried.  The production and storage of information continues to grow exponentially, and I think the real problem for future archeologists will be dealing with information overload rather than some hypothetical gap in the written record.  But I have been thinking a lot about my own digital history lately so I spent part of this weekend looking at old papers from college and publishing them on my site.

I don’t think my meager efforts will be much help to future historians (much less reverse the entropy of the universe), but I did find some interesting stuff that I probably should have posted for the world to see a long time ago.

For example:

The more I dig up and paste into my WordPress archives the more I realize a few things.  First, a distinct lack of content between undergrad and grad school – I’m doing a much better job of writing without assignments now than I did then.  Second, a hard drive crash in 2003 resulted in a gap in my saved emails – this hurts more now that I’m looking back through things.  Finally, I need to make a point, for the rest of my life, to just put things out there. It seems like such a shame that I put work into these docs just to have them rot on my hard drive.

I know some of my co-workers, Reid and Wysz, have gone through the process of resurrecting old content to their current website.  Anyone else thinking about doing something similar?  What prompted you to do so?  Or, what prevented you?

Obsolescence and obscurity in digital cameras

Sunday, August 3rd, 2008

I’m planning on buying a new DSLR, and as I looked through old photos from college today I started to think about my first digital camera, a Philips ESP50. Here’s a page with some specs, translated from German.

I remember buying the camera, logged in to eBay from my parents’ house late at night the day after Christmas.  I think I ended up paying something like $250 for it.

This was before the megapixel war, when 640 by 480 was considered a viable resolution. This camera applied torturous levels of JPG compression to fit images on the 4MB disk. At the time, though, it seemed like a good deal. Film cost money, and developing film cost money, and most of the year I was a ramen-noodle-eating college student. Probably the biggest reason to go digital was the tiny little screen on the back – you could actually tell if you got the shot, instead of waiting to get back a bunch of blurry prints.

The camera is painfully obsolete now, and even then it was somewhat obscure.  The thing is, the Web was a pretty amazing place even back in 1998 – there were lots of web pages about this camera.  I remember reading at least a couple reviews, and searches for it on WebCrawler or Alta Vista or whatever I used back then came up with retailers, other auction sites, etc.  Look for information about this camera now, and it seems that it has been largely forgotten:

And that’s about it.

I wonder, is this the destiny of all cameras?  Will I do a search for my Nikon Coolpix 5700 in 2014 and come up with just as little, or has the Web expanded so quickly that the copious product reviews, blog posts, and technical discussions on photography forums outweigh the force of entropy?  I wonder if the Internet has gained any stability as it has matured – do pages tend to stick around longer, or is linkrot a constant of the universe?

Future generations will hardly feel deprived if they miss out on information about some crappy old digicam.  Still, you never know what kind of information will end up being useful to someone at some point, and this same problem extends to all the information on the Web – from reviews of obsolete products to the human genome.  If a website goes under and deletes a thousand blogs, it won’t exactly make the news.  But our great-grandchildren might look at that stuff the way we look at letters from the Civil War.

The only solutions I have are more effort behind projects like archive.org, increased data portability, and rational intellectual property laws that don’t make saving 70-year-old content from deletion into a federal crime.

For discussion, how do you deal with ancient equipment, keeping around old web content, or even archiving old email?

New search engine – Cuil search

Monday, July 28th, 2008

Reid posted a review already, but I thought I’d add my two cents about this new search engine, Cuil.

First off, it’s great to see more companies making a serious go at web search. I don’t speak for my employer (standard disclaimers apply), but I personally am always happy to see new attempts at information retrieval on the web. More competition can only make things better for users. Heck, I’ve even cooked up a bit of a search system based on my research into IR with tagging systems and folksonomies myself, though it’s too much of a toy to release to the public.

Second, it’s a bit underwhelming to see a ton of press coverage of a new search engine, load up the site and do a simple vanity search, only to see this:

Problems with Cuil search

I know I’m not exactly the most famous person in the world, but I do have a website. Really this is just the result of scaling problems – too many people hitting this brand new service at the same time. I can’t complain too much since if I ever released my little search system, it would fail at 4 concurrent users or so. But I also don’t think I could get the amount of press that they’ve managed to get either.

Third point, I don’t know much about their architecture and algorithms but from the about us page I thought this was kind of interesting:

The Internet has grown exponentially in the last fifteen years but search engines have not kept up—until now. Cuil searches more pages on the Web than anyone else—three times as many as Google and ten times as many as Microsoft.

Do they really think the main problem of web search is too few items in the index?

If you want to read more, Read/Write Web has a good review.

Scientific proof that Reddit should add a tagging system

Tuesday, June 3rd, 2008

First, a disclaimer: the title of this post is obviously exaggerated. Proof is an awfully big word to throw around, and although I employed pretty good experiment design practices and statistical checks, I can’t really prove that Reddit should do this or that. But I can show that what they are doing now is not working, at least when it comes to search.

So, I got an email the other day letting me know that my article, Tagging and Searching: Search Retrieval Effectiveness of Folksonomies on the World Wide Web, is being published in the July 2008 issue of Information Processing and Management (here’s the official DOI link to the article). In the study I compared search performance between traditional search engines (like Google), subject directories (like Open Directory), and social bookmarking systems (like Reddit) and their folksonomies.

What’s a folksonomy? The word is a play on the term taxonomy – a taxonomy is a system of organizing and categorizing things, like the Dewey Decimal System. Taxonomies usually follow very strict rules and are controlled by experts. A folksonomy is a system of organization built by large numbers of regular users, who add things to the collection, evaluate them, and usually tag them with keywords.

IR-system-precision-1-20

In my study, the social bookmarking systems with tagging systems did surprisingly well – Del.icio.us was more precise than Open Directory, and at a cutoff of 20 results its precision was fairly close to that of the search engines.

Reddit, however, did not fare so well. It consistently had the lowest precision, meaning that searches returned very few relevant results. There could be many reasons for this, but the biggest difference between Reddit and the others is the lack of tags.
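As a rough illustration of the measure discussed above, here is a minimal Python sketch of precision at a cutoff. It is not the code used in the study: the function name, the made-up relevance judgments, and the choice to divide by the cutoff k (so that missing results count as misses) are all assumptions made for the example.

```python
def precision_at_k(relevance, k):
    """Fraction of the first k results that were judged relevant.

    relevance: list of booleans in rank order (True = judged relevant).
    Assumption: we divide by k, so a system that returns fewer than
    k results is penalized for the missing ones.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


# Hypothetical judgments for two unnamed systems, best-ranked result first.
judgments = {
    "system_a": [True, True, False, True, False],
    "system_b": [False, True, False, False, False],
}

for name, relevance in judgments.items():
    scores = [round(precision_at_k(relevance, k), 2) for k in (1, 3, 5)]
    print(name, scores)
```

Computing the score at several cutoffs, as the loop does, is what makes a comparison like "at a cutoff of 20 results" possible: the same ranked list can look strong at the top and weaker further down.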

Now, it’s possible that the folks at Reddit have no interest in search, or information retrieval in general. I think Reddit is very effective at bringing out new and interesting links on a daily basis and encouraging commentary (just my opinion, no stats to back that up). But I think it’s a big missed opportunity not to add tagging and see where it leads.

(One last disclaimer: this post is my personal opinion as someone who enjoys using Reddit and does not reflect on my employer. This post refers to research done independently as a grad student.)

New WordPress plugin available – put tag clouds everywhere with Altocumulus

Tuesday, November 6th, 2007

If you’ve gone to any of my Category pages on this blog (my Academic papers, for example), you might have noticed I have a tag cloud with just the tags related to that category.  After I figured out how to do it I packaged it into a WordPress Plugin, called Altocumulus.

This goes along with my research interests in folksonomies and information retrieval.  I haven’t had the chance to study tag clouds empirically but my guess is that one giant tag cloud for an entire web site or blog might be more cool looking than useful for navigation.  I think that making use of tag relationships a bit more might show the strength of folksonomies for navigation.  So now, if you click to see my design pages, you can see the kinds of topics my designs cover.
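The idea of a per-category tag cloud can be sketched in a few lines of Python. This is not the plugin's code (Altocumulus is a WordPress plugin); the post structure, the function name, and the font-size range are hypothetical, chosen only to show the counting and scaling steps.

```python
from collections import Counter


def category_tag_cloud(posts, category, min_size=10, max_size=24):
    """Map each tag used in one category to a font size.

    posts: list of dicts with "categories" and "tags" lists (hypothetical shape).
    Tags are counted only across posts in the given category, then
    scaled linearly between min_size and max_size.
    """
    counts = Counter(
        tag
        for post in posts
        if category in post["categories"]
        for tag in post["tags"]
    )
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = high - low
    sizes = {}
    for tag, n in counts.items():
        if span == 0:
            sizes[tag] = float(max_size)  # every tag is equally common
        else:
            sizes[tag] = min_size + (n - low) * (max_size - min_size) / span
    return sizes


posts = [
    {"categories": ["design"], "tags": ["css", "layout"]},
    {"categories": ["design"], "tags": ["css"]},
    {"categories": ["papers"], "tags": ["ir"]},
]
print(category_tag_cloud(posts, "design"))
```

Filtering by category before counting is the whole trick: tags from unrelated posts never enter the cloud, so the sizes reflect what matters within that one section of the site.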

For another example of this in action, take a look at Unsought Input, for example the Innovation page.

Go ahead and download version 0.1 now.   It requires WordPress 2.3 or higher.  This is my first WordPress plugin so I’m sure I’ll figure out ways to make it better over time.  If you have any bugs, pointers, or suggestions please leave them in the comments below.

Tagging and Searching: Search Retrieval Effectiveness of Folksonomies on the World Wide Web

Wednesday, October 31st, 2007

To complete my MS in Information Architecture and Knowledge Management at Kent State I did some research on folksonomies and how they can support information retrieval.  I compared social bookmarking systems with search engines and directories.  I’m hoping to see the results published in an academic journal.   In the meantime, you can see a pre-publication copy of my results:

Tagging and searching [pdf, 989K]

Notes: Design of interfaces for information seeking

Tuesday, June 28th, 2005

Marchionini, G., & Komlodi, A.  (1998). Design of interfaces for information seeking. Annual Review of Information Science and Technology (ARIST), 21, 89-130.

In this chapter Marchionini and Komlodi examine the state of user interfaces for information seeking. Interfaces are defined as the conjunctions and boundaries where different physical and conceptual human constructs meet, and are at the center of information science in fields such as human-computer interaction (HCI) and human factors. The chapter looks at advances in technology and research, summarizes the developments of the first two generations of user interfaces, and examines current (as of 1998) developments in the field. One way to look at the chapter is shown in figure 1, with technology, information seeking, and interface design research and development shifting from mainframes to PCs to the web, from professionals to literate end users to universal access, and from ASCII characters to graphics to multimedia respectively. Some early developments remain important today, such as the components of an interactive system – task, user, terminal and content (with context added later). Another milestone was the development of the GOMS (goals, operators, methods and selection rules) model, the first formal model of HCI. Two themes throughout the chapter are the interdependent nature of research in this area and the importance of human-centered concepts and design.

This is a really good summary of the history of HCI with an eye specifically toward searching and information use. It’s not surprising that many of the names we have seen on articles this semester show up here as well. The only real regret I have is that there are no pictures. User interfaces often rely on visual display for interaction, so in addition to all the description it would be really interesting to see examples of the different generations of user interfaces. One other criticism is that little attention is paid to the interfaces of video games; I have read a lot of articles about interface design that ignore this field as well.

Although it is a little out of date, there’s a lot to be taken from this chapter’s historical perspective. I found three things in particular that were talked about in relationship to third-generation user interfaces that were particularly interesting. First was the move toward universal access or ubiquitous computing. It is in some ways a measure of success that researchers now worry about the lack of computers in Sub-Saharan Africa—this wouldn’t be a problem if information seeking computer interfaces were not so available, useful, and approachable. Second was the notion that the advance of the web in some ways slowed the advance of user interface design, although the apparent disadvantage quickly disappeared. This is something I’ve run into in a different form as a web designer—clients complaining that their web site did not look exactly like their brochure. Again, in some ways this was an embarrassment of riches—the web site cost nothing to distribute, could be found by search engines, acted as a storefront, but the lack of a particular font face was a step backward? Finally, the notion that the whole field is really interdisciplinary is important to always keep in mind.

Notes: Automatic performance evaluation of web search engines

Sunday, June 26th, 2005

Can, F., Nuray, R., & Sevdik, A. B. (2004). Automatic performance evaluation of web search engines. Information Processing & Management, 40(3), 495-514.

Although virtually all Internet users utilize search engines to find information on the web, evaluation of search engines is often difficult. A large number of searches would need to be tested and each one would need to be judged subjectively by human participants. The authors of this paper have devised a new way to test search engines and have tested their method against evaluations done by human judges, and found their automatic Web search engine evaluation method (AWSEEM) significantly predicted the subjective judgments. In the human-evaluation control, users were given a list of resources called up by the various search engines with no idea which engine each came from and were asked to rank the relevance of each. In AWSEEM, each query was run and the top 200 results for each engine were compiled into a collection of vectors which are then ranked by their similarity to the “user information-needs” (including the question, the query, and a description of the need). The system then looks at the top 20 ranked pages for each engine and counts how many are in the top s (50 and 100 are used) commonly retrieved pages. These are assumed to be relevant.
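The procedure summarized above can be sketched roughly in Python. This is an illustration based only on the summary in these notes, not the authors' implementation: the bag-of-words vectors, cosine similarity, function names, and toy page texts are all assumptions.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (a stand-in for the paper's vector model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pseudo_relevant(pages, need, s):
    """Return the ids of the s pooled pages most similar to the information need.

    pages: dict of page id -> page text, pooled from every engine's results.
    """
    need_vec = vectorize(need)
    ranked = sorted(
        pages,
        key=lambda pid: cosine(vectorize(pages[pid]), need_vec),
        reverse=True,
    )
    return set(ranked[:s])


def engine_precision(engine_results, relevant, top_n=20):
    """Share of an engine's first top_n results that are in the pseudo-relevant set."""
    top = engine_results[:top_n]
    return sum(pid in relevant for pid in top) / len(top) if top else 0.0


# Toy data: three pooled pages and two engines' rankings (all hypothetical).
pages = {
    "p1": "folksonomy tagging search",
    "p2": "cooking recipes pasta",
    "p3": "tagging systems for search engines",
}
relevant = pseudo_relevant(pages, "search with tagging and folksonomy", s=2)
print(engine_precision(["p1", "p2"], relevant, top_n=2))
print(engine_precision(["p3", "p1"], relevant, top_n=2))
```

No human judgment enters the loop after the information need is written down, which is both the appeal of the method and the source of the concern raised in the notes below: the pseudo-relevant set is only as good as the pooled results and the similarity measure.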

One possible issue with this system is that it requires a little more human interaction than first assumed: the query providers must provide more than just a query. A bigger issue, though, is the choice of measure for relevancy. AWSEEM assumes that if a result appears in the results of multiple engines, it is relevant. This may be reasonable, but does raise the question: what if all the engines studied are wrong? For a simple example, searching for my own name online will retrieve a large number of results that are the same in many search engines but have nothing to do with the particular Jason Morrison who sits here typing this. Another interesting thing to note is that they did not find much of a statistically significant difference between the performance of the different search engines using either method (although more so with the human-judgment method). Very few scholarly articles (and even fewer popular press articles) bother to test for statistical significance when pitting search engines against each other. Is it possible that the very notion of the “best” search engine has been statistically meaningless for some time?

The authors make a good point about the difficulty in using real users for search engine evaluation. An automated approach is one answer, but there is another—the problem is that too much time and effort is required of a small number of users. Instead, if tiny amounts of time and effort were spread across thousands or millions of users, similar results could be achieved while still using subjective measures. For example, if every time a user got results on any search engine they were presented with a simple “rate these results on a scale of 1 to 5 stars” input, they could quickly and effortlessly contribute data toward a shootout-type study. Cooperation of the search engines would not necessarily be needed, if one could use a university’s proxy to substitute or add the input for popular search engines, for example, or if a generic search page was set up to produce results from randomized (double-blind) engines. It would be interesting to try this, AWSEEM, and individual evaluation in one study to see if there was a statistical correlation.

Notes: Why are online catalogs still hard to use?

Wednesday, June 22nd, 2005

Borgman, C.L. (1996). Why are online catalogs still hard to use? Journal of the American Society for Information Science, 47 (7): 493-503. 

In this 1996 study, Borgman revisits a 1986 study of online library catalogs. In the original study, computer interfaces and online catalogs were still fairly new—the study looked at how the design of traditional card catalogs could inform the design of new online catalogs. By the time of this study online catalogs were common but still not easy to use. Three kinds of knowledge were seen as necessary for online catalog searching: conceptual knowledge about the information retrieval process in general, semantic knowledge of how to query the particular system, and technical knowledge including basic computer skills. Semantic knowledge and technical knowledge differ here in the same way as semantic and syntactic knowledge in computer science. The study also covers specific concepts like action, access points, search terms, boolean logic, and file organization. In the short term, Borgman recommends training and help facilities to help users gain the skills they need to use current systems. In the long run, though, libraries must employ the findings of information-seeking process research if they are ever going to create usable interfaces.

The study does point out a number of reasons why online catalogs are difficult for users, whether it’s because they lack computer skills or semantic knowledge. One good example is from a common type of query language. Even if the user knows that “FI” means “find” and “AU” means “author,” they may not know whether to use “FI AU ROBERT M. HAYES,” “FI AU R M HAYES,” “FI AU HAYES, ROBERT M,” etc., and how the results will differ. Unfortunately the article lacks clear instructions or examples of how to make the systems better. The conclusion that different types of training materials could be helpful seems to me like a bandage rather than a cure.
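To make the author-name ambiguity concrete, here is a small Python sketch of one way a system could treat those three query forms as equivalent. It is not something the article proposes; the function name and the reduce-to-surname-plus-initials rule are assumptions, and the rule is deliberately crude (it would also merge different authors who share a surname and initials).

```python
def author_key(name):
    """Reduce an author name to (surname, initials) so variant forms compare equal.

    Handles "Given Names Surname" and "Surname, Given Names".
    """
    cleaned = name.replace(".", " ").strip().upper()
    if "," in cleaned:
        # "HAYES, ROBERT M" style: surname comes before the comma.
        surname, given = (part.strip() for part in cleaned.split(",", 1))
        given_parts = given.split()
    else:
        # "ROBERT M HAYES" style: assume the last word is the surname.
        parts = cleaned.split()
        if not parts:
            raise ValueError("empty author name")
        surname, given_parts = parts[-1], parts[:-1]
    initials = "".join(part[0] for part in given_parts)
    return surname, initials


variants = ["ROBERT M. HAYES", "R M HAYES", "HAYES, ROBERT M"]
print({author_key(v) for v in variants})
```

A catalog that normalized queries this way would return the same records for all three forms, at the cost of some precision; whether that trade is acceptable is exactly the kind of design question the article leaves open.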

I think a lot of the criticisms are still true, but that modern cataloging and searching systems have become easier. I’m not so sure it’s because catalog designers have started applying information-seeking research in their interfaces, though. It almost seems like library systems are being made easier in self-defense. Users are getting more and more used to a Google or Yahoo type interface: a simple search box that looks at full text and uses advanced algorithms to find relevant results. I think part of this is due to the fact that people in the library field have experience with complicated, powerful structured search systems and are used to a lot of manual encoding of records. Web developers, lacking this background, have been more free to think in terms of searching massive amounts of unstructured data and automating the collection and indexing process. I also think that simple things such as showing the results, including summaries of each item, in a scrollable, clickable list, have helped a great deal to support the information seeking process. Things like search history and “back” and “forward” buttons, “search within these results,” automatic spell checking, etc. are becoming pretty standard as well.