Posts Tagged ‘Taxonomies’

Scientific proof that Reddit should add a tagging system

Tuesday, June 3rd, 2008

First, a disclaimer: the title of this post is obviously exaggerated. Proof is an awfully big word to throw around, and although I employed pretty good experimental design practices and statistical checks, I can’t really prove that Reddit should do this or that. But I can show that what they are doing now is not working, at least when it comes to search.

So, I got an email the other day letting me know that my article, Tagging and Searching: Search Retrieval Effectiveness of Folksonomies on the World Wide Web, is being published in the July 2008 issue of Information Processing and Management (here’s the official DOI link to the article). In the study I compared search performance between traditional search engines (like Google), subject directories (like Open Directory), and social bookmarking systems (like Reddit) and their folksonomies.

What’s a folksonomy? The word is a play on the term taxonomy. A taxonomy is a system of organizing and categorizing things, like the Dewey Decimal System. Taxonomies usually follow very strict rules and are controlled by experts. A folksonomy, by contrast, is a system of organization built by large numbers of regular users, who add things to the collection, evaluate them, and usually tag them with keywords.

[Figure: IR system precision at cutoffs 1-20]

In my study, the social bookmarking systems with tagging systems did surprisingly well: Del.icio.us was more precise than Open Directory, and at a cutoff of 20 results its precision was fairly close to that of the search engines.

Reddit, however, did not fare so well. It consistently had the lowest precision, meaning that searches returned very few relevant results. There could be many reasons for this, but the biggest difference between Reddit and the others is the lack of tags.
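For anyone who hasn't run into precision at a cutoff before, here is a minimal sketch of the idea in Python. It is not the code from the study: the function name, the sample judgments, and the choice to divide by the cutoff itself (so missing results count against a system) are all my assumptions for illustration.

```python
def precision_at_k(relevant_flags, k):
    """Fraction of the top-k ranked results that were judged relevant.

    relevant_flags: booleans in ranked order (True = judged relevant).
    Assumption: if fewer than k results came back, the missing slots
    count as not relevant, so we always divide by k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevant_flags[:k]
    return sum(top_k) / k


# Hypothetical relevance judgments for one query, in ranked order.
judgments = [True, False, True, True, False]

p_at_1 = precision_at_k(judgments, 1)
p_at_5 = precision_at_k(judgments, 5)
p_at_20 = precision_at_k(judgments, 20)
```

A system that returns only a handful of relevant links for a query will score low at a cutoff of 20 no matter how good those few links are, which is one way a site can end up at the bottom of a chart like this.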

Now, it’s possible that the folks at Reddit have no interest in search, or information retrieval in general. I think Reddit is very effective at bringing out new and interesting links on a daily basis and encouraging commentary (just my opinion, no stats to back that up). But I think it’s a big missed opportunity not to add tagging and see where it leads.

(One last disclaimer: this post is my personal opinion as someone who enjoys using Reddit and does not reflect on my employer. This post refers to research done independently as a grad student.)


Tagging and Searching: Search Retrieval Effectiveness of Folksonomies on the World Wide Web

Wednesday, October 31st, 2007

To complete my MS in Information Architecture and Knowledge Management at Kent State I did some research on folksonomies and how they can support information retrieval. I compared social bookmarking systems with search engines and directories. I’m hoping to see the results published in an academic journal. In the meantime, you can see a pre-publication copy of my results:

Tagging and searching [pdf, 989K]


Notes on “Vocabulary as a central concept in Information Science” and additional readings

Thursday, March 18th, 2004

Vocabulary as a Central Concept in Information Science, Michael Buckland (1999)

The role of classification in knowledge representation and discovery, BH Kwasnik - Library Trends, 1999

 

One good point in the Buckland article was that vocabulary can differ between those who are doing the cataloging, the authors, and the searchers, even if everyone is within the same field. I’ve read some about these differences before, but they almost always seem to take the form of novice searcher vocabulary vs. expert author vocabulary or natural searcher vocabulary vs. structured system vocabulary. Those are probably the clearest ways to look at these distinctions. To tell you the truth, looking at subtle differences between five different vocabularies does not seem like that much fun to me.

This article gets back to some of the same points we’ve already discussed in class when talking about synonym rings and taxonomies. Even though the author comes at it from a vocabulary point of view, he’s saying the same things everyone else is. If your users want to search for “Vietnam War” but your system uses “Vietnam Conflict” and does not point the user in the right direction, no purpose has been served. You can be as correct and specific in your phrasing as you want, but that’s no guarantee you’ll have a usable system.
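To make the “Vietnam War” example concrete, here is a rough sketch of a synonym ring in Python. The ring contents, the `SYNONYM_RINGS` name, and the `expand_query` function are all made up for illustration; a real system would load its rings from a maintained vocabulary rather than a hard-coded list.

```python
# Each ring is a set of terms treated as equivalent for searching.
# Hypothetical data: only the one ring from the example above.
SYNONYM_RINGS = [
    {"vietnam war", "vietnam conflict"},
]


def expand_query(term):
    """Return every term in the same ring as `term` (lowercased, sorted).

    A term that belongs to no ring is returned on its own.
    """
    normalized = term.strip().lower()
    for ring in SYNONYM_RINGS:
        if normalized in ring:
            return sorted(ring)
    return [normalized]


expanded = expand_query("Vietnam War")
unknown = expand_query("Greeting Cards")
```

The searcher types whichever phrase comes naturally, and the system quietly searches on all of them, so nobody has to guess which term the catalogers picked.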

The Kwasnik reading was really good at pointing out the strengths and weaknesses of hierarchies, trees, and other organization schemes. In doing the AG assignment I ran into the “Lack of complete and comprehensive knowledge” barrier quite often. That’s one of the biggest problems with not just hierarchies, but any project like this where we have some knowledge of the domain (everyone has seen greeting cards) but not of the entire body of AG’s product line or even a representative subset. I wouldn’t want to construct a taxonomy of content objects before people started entering data; I would have it be built as the database grew, with specific people in charge of keeping it consistent.


Knowledge Organization System for a Greeting Card Company’s Design Studio Archives

Thursday, March 18th, 2004

Note: this was a project for a graduate course in Knowledge Organization Systems

Introduction

The goal of this project is to create a Knowledge Organization System (KOS) for a Greeting Card Company Studio archive so that designers are able to find source artwork and previous designs. This is no small task: Greeting Card Company has been in operation for nearly 100 years and has at least partial archives from the entire period, and today the company employs hundreds of designers and produces thousands of products. There is no question that without an inclusive, accurate, and easy-to-use archive, designers are unable to build on each other’s ideas and a great deal of work is being duplicated. Also, intellectual property needs to be properly managed, and licensed artwork needs to be tracked and protected from accidental misuse.

Currently, all archives are stored in protective containers in the Studio, shelved by year. In addition, a vast number of digital files have been compiled on the Studio’s servers and on CD and tape backups. This project does not address the physical process of collection and digitization, but instead offers a road map to how items will be classified as they are entered into the system. This KOS also provides a framework for the database and the ultimate user interface.

Below is an analysis of the users and groups, followed by a description of the overall structure of the KOS. After that is a description of each facet, followed by pick lists, synonym rings, and taxonomies for each where applicable.

 

Users

In this analysis three distinct user groups were identified: Archivists, Designers, and Management/Administration. Archivists include the company’s current information professionals as well as the interns and temp workers who will be doing the digitization and data entry under their supervision. The KOS has been set up under the assumption that most data entry personnel will be able to properly classify perhaps 80 to 90 percent of all items within each facet, forwarding the rest to more skilled information professionals. The professionals include skilled librarians, art historians, and other researchers who should be adequately prepared to train data entry personnel and classify more difficult items.

The designer group includes artists and graphic designers of varying skill and experience. Nearly all, however, have completed at least a two-year program and the majority have completed a four-year college degree. Taxonomies were developed with this level of expertise in mind. Designers were surveyed, and a wide range of thinking about art objects and designs was found. The facets below were designed to cover virtually every way in which a designer might want to look for a piece.

Management and administration also have specific needs. It is for them primarily that the Designer entity described below as well as most facets dealing with licensing and sales have been created.

 

Organization

The archive needs to be broken down into four different logical entities: Art Elements (such as clip art, photographs, sculptures, etc.), Products (such as individual greeting cards, e-cards, etc.), Digital Files, and Designers. Each entity will have a number of associated facets which roughly correspond to the fields in the database and will allow multiple methods of search and organization.

The entity relationships will be defined in the database so that searches will cascade upward. For example, someone searching for art elements will be able to find those done by a specific AG department, because Art Elements are related to Products, which are related to Designers, who have the Department/Team facet. All of this is relatively simple to do with SQL and can be hidden in the interface to make searching easier.
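As a rough illustration of the cascading search, here is a sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are assumptions made up for this example and are not taken from the actual KOS; it also simplifies each relationship to one-to-many, where a real archive would probably need link tables for many-to-many relationships.

```python
import sqlite3

# In-memory database with a hypothetical, simplified schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE designers (
    id INTEGER PRIMARY KEY, name TEXT, department TEXT
);
CREATE TABLE products (
    id INTEGER PRIMARY KEY, title TEXT,
    designer_id INTEGER REFERENCES designers(id)
);
CREATE TABLE art_elements (
    id INTEGER PRIMARY KEY, description TEXT,
    product_id INTEGER REFERENCES products(id)
);
INSERT INTO designers VALUES (1, 'Designer A', 'Humor');
INSERT INTO designers VALUES (2, 'Designer B', 'Seasonal');
INSERT INTO products VALUES (10, 'Birthday card', 1);
INSERT INTO products VALUES (11, 'Holiday card', 2);
INSERT INTO art_elements VALUES (100, 'Balloon clip art', 10);
INSERT INTO art_elements VALUES (101, 'Snowflake photograph', 11);
""")


def art_elements_for_department(connection, department):
    """Find art elements by walking Art Element -> Product -> Designer."""
    rows = connection.execute(
        """
        SELECT a.description
        FROM art_elements AS a
        JOIN products AS p ON p.id = a.product_id
        JOIN designers AS d ON d.id = p.designer_id
        WHERE d.department = ?
        ORDER BY a.description
        """,
        (department,),
    ).fetchall()
    return [row[0] for row in rows]


humor_art = art_elements_for_department(conn, "Humor")
```

The user only ever picks a department from the interface; the two joins that carry the search up from Art Elements to Designers stay out of sight.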

Each facet has an associated type, whether that be a simple constraint on an open text field, a pick list, or a taxonomy. Where lists and taxonomies have been developed the list’s page number is noted as well.
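The three facet types could be modeled along these lines. This is only a sketch: the year constraint, the media pick list, and the occasion taxonomy are hypothetical stand-ins, not the lists from the actual KOS.

```python
import re

# Pick list: a flat set of allowed values (hypothetical entries).
MEDIA_PICK_LIST = {"clip art", "photograph", "sculpture"}

# Taxonomy: each term points to its parent; the root points to None.
# Hypothetical entries for illustration only.
OCCASION_TAXONOMY = {
    "occasions": None,
    "birthday": "occasions",
    "wedding": "occasions",
}


def valid_year(value):
    """Open text field with a simple constraint: exactly four digits."""
    return re.fullmatch(r"\d{4}", value) is not None


def valid_pick(value, pick_list):
    """Pick-list facet: the value must be one of the listed terms."""
    return value in pick_list


def taxonomy_path(term, taxonomy):
    """Taxonomy facet: return the path from the root down to `term`."""
    path = []
    while term is not None:
        if term not in taxonomy:
            raise KeyError(f"unknown taxonomy term: {term}")
        path.append(term)
        term = taxonomy[term]
    return list(reversed(path))


birthday_path = taxonomy_path("birthday", OCCASION_TAXONOMY)
```

The point of typing the facets is that data entry personnel get the lightest possible check on free text, a closed choice for pick lists, and a full path of broader terms for anything classified in a taxonomy.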

View the KOS, including the entities and their facets, pick lists, and taxonomies [pdf]


Notes on “A Taxonomy Primer,” “Ten Taxonomy Myths,” and additional readings

Thursday, March 4th, 2004

A Taxonomy Primer, Warner, Amy J. (2002)

Ten Taxonomy Myths, Montague Institute (2002)

The Intellectual Foundation of Information Organization By Elaine Svenonius (2002)

 

The Taxonomy Primer was pretty straightforward, but the Myths were more interesting. I especially liked myths 1 and 2, because I think when most people think taxonomy they think of a single, giant, all-encompassing tree that everything fits into exactly. It can be very useful to have a number of taxonomies for the same information, and there are some great examples on the web, where a site may be organized by product type but then also by region or customer group, allowing browsing from each perspective.
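Here is a small sketch of what several taxonomies over the same information can look like in practice. The catalog entries, field names, and the `browse_by` function are invented for illustration.

```python
from collections import defaultdict

# Hypothetical catalog: each item carries a value for several schemes.
catalog = [
    {"name": "Router X", "product_type": "Networking", "region": "Europe"},
    {"name": "Router Y", "product_type": "Networking", "region": "Asia"},
    {"name": "Laptop Z", "product_type": "Computers", "region": "Europe"},
]


def browse_by(items, scheme):
    """Group the same items under whichever scheme the user picks."""
    groups = defaultdict(list)
    for item in items:
        groups[item[scheme]].append(item["name"])
    return dict(groups)


by_type = browse_by(catalog, "product_type")
by_region = browse_by(catalog, "region")
```

Nothing about the items changes between the two views; only the organizing scheme does, which is exactly why a single all-encompassing tree isn't required.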

One image I found particularly enlightening was in the Svenonius article, where taxonomies were described as “elaborate Victorian edifices” and contrasted with “jerrybuilt systems [that] could meet the needs of most users most of the time.” This is an excellent description of where library people and web people seem to have a disconnect. Coming at things more from the web side myself, I often think of grand schemes to classify everything and put everything into neatly labeled boxes, like Dewey or the Library of Congress Classification schemes, as too big, too elaborate, and too old. I think this is why many of the people who first started organizing information on web sites and the like don’t look to library science for inspiration, despite the wealth that is there. Most of the web people have only worked with systems that are small enough to be informal, personal enough to be idiosyncratic, or targeted enough to simply model how current users talk about the information already. In other words, jerrybuilt.

Later in the chapter, though, the writer states that organizing information is different from organizing anything else, and is in particular not to be done with “routine application of the database modeling techniques” used in business. While I agree that organizing information would be substantially different from organizing employees, the rationale given (something to do with works and differences in editions of them) lends itself really well to more-or-less common relational database structures. I think there are important issues, but too often the issues I see brought up are superficial.
