Tag Archives: search-engines


Is This A Scam? Find out with a Google Custom Search Engine

In my Google Blog article about avoiding get-rich-quick scams, I recommended doing a web search to see what other people are saying about any site you’re unsure about. The internet is a big place – chances are if it’s a scam, someone else has already fallen for it and they’re already complaining on their blog or in a forum somewhere.

The only problem with doing a general web search is that not every site on the web is guaranteed to have good information. Some forums are more useful than others, and in the worst cases scammers and spammers spend lots of time trying to get their stuff in the index too.

So, I’ve created something to make it a little easier: a Google Custom Search Engine called Is This A Scam?

Wondering about a home business proposition? Drop a query here. Does your uncle keep falling for pyramid schemes? Send him this link and make him promise to search before he writes the next check.

Custom Search Engines are very useful and are incredibly easy to create. You can create one for your site, or one covering many sites under a certain topic, and you can even make money via AdSense For Search.

This particular search engine works well because I combed the web looking for high-quality sources of information about scams, fraud, snake oil, and consumer protection. The list covers well over 100 sites, including forums, blogs, news media, government agencies, and non-profit organizations. I’ll post the list here when I get a chance.

If you’d like to volunteer to help out with this effort, contact me. By the way, this isn’t an official Google product or service, just me in my free time using Google’s great CSE system, so the standard disclaimer applies.

Got bad results? No results? Have you seen a page in the results that has no business being there? Let me know in the comments below.

New search engine – Cuil search

Reid posted a review already, but I thought I’d add my two cents about this new search engine, Cuil.

First off, it’s great to see more companies making a serious go at web search. I don’t speak for my employer (standard disclaimers apply), but I personally am always happy to see new attempts at information retrieval on the web. More competition can only make things better for users. Heck, I’ve even cooked up a bit of a search system based on my research into IR with tagging systems and folksonomies myself, though it’s too much of a toy to release to the public.

Second, it’s a bit underwhelming to see a ton of press coverage of a new search engine, load up the site and do a simple vanity search, only to see this:

Problems with Cuil search

I know I’m not exactly the most famous person in the world, but I do have a website. Really this is just the result of scaling problems – too many people hitting this brand new service at the same time. I can’t complain too much since if I ever released my little search system, it would fail at 4 concurrent users or so. But I also don’t think I could get the amount of press that they’ve managed to get either.

Third, I don’t know much about their architecture and algorithms, but I thought this line from their About Us page was kind of interesting:

The Internet has grown exponentially in the last fifteen years but search engines have not kept up—until now. Cuil searches more pages on the Web than anyone else—three times as many as Google and ten times as many as Microsoft.

Do they really think the main problem of web search is too few items in the index?

If you want to read more, Read/Write Web has a good review.

Scientific proof that Reddit should add a tagging system

First, a disclaimer: the title of this post is obviously exaggerated. Proof is an awfully big word to throw around, and although I employed pretty good experimental design practices and statistical checks, I can’t really prove that Reddit should do this or that. But I can show that what they are doing now is not working, at least when it comes to search.

So, I got an email the other day letting me know that my article, Tagging and Searching: Search Retrieval Effectiveness of Folksonomies on the World Wide Web, is being published in the July 2008 issue of Information Processing and Management (here’s the official DOI link to the article). In the study I compared search performance between traditional search engines (like Google), subject directories (like Open Directory), and social bookmarking systems (like Reddit) and their folksonomies.

What’s a folksonomy? The word is a play on the term taxonomy – a taxonomy is a system of organizing and categorizing things, like the Dewey Decimal System. Taxonomies usually follow very strict rules and are controlled by experts. A folksonomy is a system of organization built by large numbers of regular users, who add things to the collection, evaluate them, and usually tag them with keywords.
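To make the idea a bit more concrete, here is a minimal sketch of how a tag-based collection could be searched. This is purely illustrative: the `TagIndex` class, its methods, and the example URLs are hypothetical names made up for this post, not how Del.icio.us or any other real system is implemented.

```python
from collections import defaultdict


class TagIndex:
    """Toy folksonomy: users attach free-form tags to URLs (hypothetical design)."""

    def __init__(self):
        # Maps each lowercased tag to the set of URLs carrying that tag.
        self._by_tag = defaultdict(set)

    def add(self, url, tags):
        """Record that some user tagged `url` with each keyword in `tags`."""
        for tag in tags:
            self._by_tag[tag.lower()].add(url)

    def search(self, *tags):
        """Return the URLs that carry every one of the given tags."""
        if not tags:
            return set()
        # .get avoids creating empty entries for tags nobody has used.
        matches = [self._by_tag.get(tag.lower(), set()) for tag in tags]
        return set.intersection(*matches)


index = TagIndex()
index.add("http://example.com/scam-forum", ["Scam", "forum"])
index.add("http://example.com/search-blog", ["search", "blog"])
index.add("http://example.com/scam-search", ["scam", "search"])

print(index.search("scam"))
print(index.search("scam", "search"))
```

There is no controlled vocabulary here: whatever keywords users type become the categories, which is the main difference from an expert-built taxonomy.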

[Figure: IR system precision at result cutoffs 1 through 20]

In my study, the social bookmarking systems with tagging systems did surprisingly well – Del.icio.us was more precise than Open Directory, and at a cutoff of 20 results its precision was fairly close to that of the search engines.

Reddit, however, did not fare so well. It consistently had the lowest precision, meaning that searches returned very few relevant results. There could be many reasons for this, but the biggest difference between Reddit and the others is the lack of tags.
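If "precision at a cutoff" is unfamiliar, here is a minimal sketch of the textbook calculation: look at the top k results and count what fraction of them are relevant. The function name and the sample data are made up for illustration, and this is the standard definition rather than the exact procedure or numbers from the study.

```python
def precision_at_k(ranked_results, relevant, k):
    """Fraction of the top-k results that are relevant.

    Divides by k even if fewer than k results came back, so a system
    that returns too few results is penalized. That is one common
    convention, assumed here for illustration.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_results[:k]
    hits = sum(1 for result in top_k if result in relevant)
    return hits / k


# Hypothetical ranked results for one query, plus the set judged relevant.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "e"}

print(precision_at_k(ranked, relevant, 2))  # 1 relevant out of the top 2
print(precision_at_k(ranked, relevant, 5))  # 3 relevant out of the top 5
```

A system with low precision at every cutoff, as Reddit had in the study, is one where this fraction stays small no matter how far down the result list you look.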

Now, it’s possible that the folks at Reddit have no interest in search, or information retrieval in general. I think Reddit is very effective at bringing out new and interesting links on a daily basis and encouraging commentary (just my opinion, no stats to back that up). But I think it’s a big missed opportunity not to add tagging and see where it leads.

(One last disclaimer: this post is my personal opinion as someone who enjoys using Reddit and does not reflect on my employer. This post refers to research done independently as a grad student.)