Tag Archives: methodologies


Notes: Automatic performance evaluation of web search engines

Can, F., Nuray, R., & Sevdik, A. B. (2004). Automatic performance evaluation of web search engines. Information Processing & Management, 40(3), 495-514.

Although virtually all Internet users rely on search engines to find information on the web, evaluating search engines is difficult: a large number of searches would need to be tested, and each one would need to be judged subjectively by human participants. The authors of this paper devised a new automatic Web search engine evaluation method (AWSEEM), tested it against evaluations done by human judges, and found that it significantly predicted the subjective judgments. In the human-evaluation control, users were given a list of resources returned by the various search engines, with no indication of which engine each came from, and were asked to rate the relevance of each. In AWSEEM, each query is run and the top 200 results from each engine are compiled into a collection of vectors, which are then ranked by their similarity to the user information need (including the question, the query, and a description of the need). The system then looks at the top 20 ranked pages for each engine and counts how many fall within the top s (s = 50 and 100 are used) commonly retrieved pages, which are assumed to be relevant.
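To make that procedure concrete, here is a rough Python sketch of how an AWSEEM-style score could be computed. It is a loose reading of the description above, not the authors’ implementation: the naive term-frequency vectors, the cosine similarity, and the way the pseudo-relevant pool combines “commonly retrieved” with “similar to the information need” are all my own simplifications.

    # Rough sketch of the AWSEEM idea as described above (my simplification,
    # not the authors' code).
    from collections import Counter
    from math import sqrt

    def tf_vector(text):
        """Very naive term-frequency vector; the paper uses proper IR weighting."""
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(count * b.get(term, 0) for term, count in a.items())
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def awseem_like_scores(engine_results, need_text, s=50, top_k=20):
        """engine_results: {engine: [(url, page_text), ...]}  (top ~200 per engine).
        need_text: the question, the query, and the description of the need."""
        need_vec = tf_vector(need_text)

        # Pool every retrieved page and remember how many engines returned it.
        pool = {}                  # url -> page text
        retrieved_by = Counter()   # url -> number of engines that returned it
        for results in engine_results.values():
            for url, text in results:
                pool[url] = text
                retrieved_by[url] += 1

        # Rank commonly retrieved pages by similarity to the information need;
        # the top s of these are treated as (pseudo-)relevant.
        common = [u for u in pool if retrieved_by[u] > 1]
        common.sort(key=lambda u: cosine(tf_vector(pool[u]), need_vec), reverse=True)
        pseudo_relevant = set(common[:s])

        # Score each engine by how many of its top-ranked pages hit that set.
        scores = {}
        for engine, results in engine_results.items():
            ranked = sorted(results, key=lambda r: cosine(tf_vector(r[1]), need_vec),
                            reverse=True)
            hits = sum(1 for url, _ in ranked[:top_k] if url in pseudo_relevant)
            scores[engine] = hits / top_k
        return scores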

One possible issue with this system is that it requires a little more human interaction than first assumed: the query providers must supply more than just a query. A bigger issue, though, is the choice of measure for relevance. AWSEEM assumes that if a result appears in the results of multiple engines, it is relevant. This may be reasonable, but it raises the question: what if all the engines studied are wrong? For a simple example, searching for my own name online retrieves a large number of results that are the same across many search engines but have nothing to do with the particular Jason Morrison who sits here typing this. It is also worth noting that the authors did not find much of a statistically significant difference between the performance of the different search engines using either method (though somewhat more with the human-judgment method). Very few scholarly articles (and even fewer popular press articles) bother to test for significance when pitting search engines against each other. Is it possible that the very notion of the “best” search engine has been statistically meaningless for some time?
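To see what a significance check between two engines might look like, here is a purely illustrative sketch comparing hypothetical per-query scores with a paired Wilcoxon test. This is not necessarily the statistical procedure the authors used, the scores are made up, and it assumes scipy is available.

    # Illustrative only: a paired significance test over per-query scores for
    # two engines (not necessarily the test used in the paper).
    from scipy import stats

    # Hypothetical per-query relevance scores for two engines (same 10 queries).
    engine_a = [0.40, 0.55, 0.30, 0.60, 0.45, 0.50, 0.35, 0.65, 0.40, 0.55]
    engine_b = [0.45, 0.50, 0.35, 0.55, 0.50, 0.45, 0.40, 0.60, 0.45, 0.50]

    # Wilcoxon signed-rank test: does one engine consistently beat the other?
    stat, p_value = stats.wilcoxon(engine_a, engine_b)
    print(f"Wilcoxon statistic={stat:.2f}, p={p_value:.3f}")
    # A large p-value means we cannot call either engine "best" -- the
    # situation the notes above describe.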

The authors make a good point about the difficulty of using real users for search engine evaluation. An automated approach is one answer, but there is another: the underlying problem is that too much time and effort is required of a small number of users. If tiny amounts of time and effort were instead spread across thousands or millions of users, similar results could be achieved while still using subjective measures. For example, if every time users got results on any search engine they were presented with a simple “rate these results on a scale of 1 to 5 stars” input, they could quickly and effortlessly contribute data toward a shootout-type study. Cooperation of the search engines would not necessarily be needed if, for example, a university’s proxy could substitute or add the rating input for popular search engines, or if a generic search page were set up to serve results from randomized (double-blind) engines. It would be interesting to try this approach, AWSEEM, and individual evaluation in a single study to see whether their results correlate.
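As a thought experiment, here is a toy sketch of how such a distributed, blinded rating study might log and aggregate its data. The engine names, the star-rating input, and the blinding scheme are assumptions for illustration, not an existing system.

    # Toy sketch of aggregating blinded 1-5 star ratings from many users.
    # The data layout and names are assumptions for illustration only.
    import random
    from collections import defaultdict
    from statistics import mean

    ENGINES = ["engine_a", "engine_b", "engine_c"]

    def assign_engine():
        """Double-blind step: neither the user nor the rating form knows which
        engine produced the results; the assignment is only logged server-side."""
        return random.choice(ENGINES)

    def aggregate(ratings):
        """ratings: iterable of (engine, stars) pairs logged by the proxy or
        generic search page."""
        by_engine = defaultdict(list)
        for engine, stars in ratings:
            by_engine[engine].append(stars)
        return {engine: (mean(stars), len(stars)) for engine, stars in by_engine.items()}

    # Simulated log: thousands of tiny contributions instead of a few long sessions.
    log = [(assign_engine(), random.randint(1, 5)) for _ in range(10_000)]
    for engine, (avg, n) in aggregate(log).items():
        print(f"{engine}: mean rating {avg:.2f} over {n} ratings")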

Notes: Looking for information

Case, D.O. (2002). Looking for information: A survey of research on information seeking, needs, and behavior. New York: Academic Press. Chapter 9: Methods: Examples by type.

In this chapter Case reviews the different methodologies employed in research on information seeking, use, and sense-making. Although he notes a few studies that cast a wide net and report overall proportions, the chapter is not a survey of all the literature; instead it gathers representative examples of different types of research. The types covered include case studies, formal and field experiments, mail and Internet surveys, face-to-face and phone interviews, focus groups, ethnographic and phenomenological studies, diaries, historical studies, and content analysis, along with multiple-method studies and meta-analyses. Case also writes about some of the limitations of the different methodologies; for example, case studies have limited variables, focusing on one item or event to the exclusion of others, and they are limited in terms of time as well. The author concludes that most studies assume people make rational choices and that specific variables matter more than context. Qualitative measures are becoming more popular, but their findings cannot be generalized as readily.

The author did a particularly good job of finding studies to examine. The best example of this is the experiments: very few laboratory experiments have been conducted specifically on information use, but there have been many on consumer behavior, and here we see consumer behavior studies that involved information gathering for decision making. Another choice I found particularly interesting was the historical research by Colin Richmond on the dissemination of information in England during the Hundred Years’ War. Usually when I think of historical research in social science I think of things like comparing content analysis of newspapers of the 1950s and today. It was interesting to see things from a historian’s point of view, and it was also a good reminder that people did not just start needing information with the invention of the Internet. A good, though dense, book on this topic is A Social History of Knowledge by Peter Burke.

The most immediate application of this chapter is in suggesting methodologies to use in different situations. When I’m doing research, I tend to have a bias toward sources that conducted experiments or did survey research; reading through these cases reminded me of the usefulness of things like case studies and content analysis. Another interesting application of the chapter is in suggesting topics for further study. Although the author doesn’t really build to any general conclusion on the research topics at hand (there is no overall theme to the research), looking at the different conclusions of the different types of studies suggests some interesting questions. For example, since the study by Covell, Uman, and Manning suggested that doctors report using books or journals first but in reality turn to colleagues first, how can we reexamine studies that relied on self-reporting, such as the case studies or the surveys? Perhaps some of the tactics used in the consumer research experiments would be a valuable addition.