Posts Tagged ‘search’

How my site disappeared from Google search

Wednesday, February 24th, 2010

Seen my personal blog lately? Probably not, if you were searching via Google. Major sections of my site have been disappearing from the search index over the past three weeks. My homepage, my blog and many of the most recent articles on it no longer show up in result pages. I’m no Matt Cutts, but I get a fair number of people coming to my site when searching for info about Google search, avoiding scams, and how to name their baby. All that traffic has been slipping away.

You can probably imagine how you would feel if this was happening to you. Does Google hate me? Was my site hacked? What do I do, and how much will it cost to get this fixed?

I will answer all of those questions, starting with the first:

My site is falling out of the index, does Google hate me?

Probably not. My situation is actually pretty illustrative – I’m pretty sure Google doesn’t hate me and isn’t unfairly slapping my site down because, well, I work at Google.

That’s right, Google was kicking pages from one of its own employees out of search results. I’m sure I’m not the first. Google doesn’t treat my site any differently than anyone else’s. BTW, standard disclaimers apply to this post.

So I knew there was probably a logical reason for the dropped pages, which brings me to the next question:

(more…)

How to get Google search results for academic research

Tuesday, January 12th, 2010

A few years ago, before I was a Googler, I was a grad student doing research on information retrieval. I wanted to compare the results of Google and other search engines with folksonomies from social bookmarking sites. It sounds pretty simple – Google does lots of internal search quality studies, so it’s not too surprising that outside researchers would want to execute lots of queries and use the results in their data.

The way I did it was… not optimal, to say the least. I wrote a bunch of PHP code, spaced out participant sessions, etc. to make sure I could get results back. Google tries to make sure that spammers aren’t scraping search results to generate webspam, so any kind of scraping with cURL, Beautiful Soup, etc. can result in a big pile of failure.

The way I did it wasn’t the right way or the easy way, so when I got the job I made a mental note to ask around for the best way to get search results. Then I forgot all about it until an email exchange with Gary Warner of CyberCrime & Doing Time fame.

It turns out Google has a great University research program and API. You have to apply for registration and let us know who you are, what school you’re affiliated with, and what you plan to study. Assuming everything checks out you’ll get access to a pretty nice API. There’s some example Python code but you could just as easily use PHP, Java, or whatever to consume the XML responses.

And that research I was doing? I recently noticed that my paper has been cited 7 or 8 times, according to Google Scholar. I used to joke that I had written the least influential paper in the history of academic publishing, but I guess I can’t claim the title anymore. Scopus only shows 4 citations so I will remain humble anyway.

Is This A Scam? Find out with a Google Custom Search Engine

Monday, July 20th, 2009

In my Google Blog article about avoiding get-rich-quick scams, I recommended doing a web search to see what other people are saying about any site you’re unsure about. The internet is a big place – chances are if it’s a scam, someone else has already fallen for it and they’re already complaining on their blog or in a forum somewhere.

The only problem with doing a general web search is that not every site on the web is guaranteed to have good information. Some forums are more useful than others, and in the worst cases scammers and spammers spend lots of time trying to get their stuff in the index too.

So, I’ve created something to make it a little easier: a Google Custom Search Engine called Is This A Scam?

Wondering about a home business proposition? Drop a query here. Does your uncle keep falling for pyramid schemes? Send him this link and make him promise to search before he writes the next check.

Custom Search Engines are very useful and are incredibly easy to create. You can create one for your site, or one covering many sites under a certain topic, and you can even make money via AdSense For Search.

This particular search engine works well because I combed the web looking for high-quality sources of information about scams, fraud, snake oil, and consumer protection. The list covers well over 100 sites, including forums, blogs, news media, government agencies, and non-profit organizations. I’ll post the list here when I get a chance.

If you’d like to volunteer to help out with this effort, contact me. By the way, this isn’t an official Google product or service, just me in my free time using Google’s great CSE system, so the standard disclaimer applies.

Got bad results? No results? Have you seen a page in the results that has no business being there? Let me know in the comments below.

How to link to an individual question in Google Moderator

Saturday, March 28th, 2009

The Obama administration has just finished “Open for Questions”, where the President answered questions suggested and voted on by the general public over the web. This is pretty cool – political openness, interaction, and democracy via the web. It’s also interesting to me because the site uses Google Moderator, a product we use at work all the time.

What’s not quite so cool is that Moderator apparently doesn’t play well with the rest of the web. I’m not sure why it was designed this way (and if I did know, I probably couldn’t tell you anyway). The design is the exact opposite of unobtrusive javascript. That’s fine for highly interactive web apps but it would be nice to see the mostly text content in Moderator made searchable just like any other collection of web pages.

(more…)

The pain of Dialog: Flat file vs. relational databases

Monday, October 6th, 2003

After my first exposure to the Dialog structured search system I wanted to put down some thoughts about relational databases.  There may very well be reasons why flat-file text databases are better for systems like Dialog or OhioLink, but I really don’t think they are the ones I’ve heard mentioned in class.

The first major point made in class was that in a flat file database like those in Dialog, the creators could do something like this for a record with multiple authors:

TI = Title of this article
AU = Smith, Bob B
AU = Jones, Joseph H
AU = Fakename, Robert P
etc…

Whereas in a relational database the table would have to have fields like this:

Table Article
——————
Title
Author1
Author2
Author3
etc…

Although I have seen databases designed exactly as described, that design defeats the entire point of having a “relational” database: relationships.  A better design would be to break Articles and Authors into two separate tables, since they are two separate entities.  Because they have a many-to-many relationship (any number of authors can write any number of articles), a link table would be made as well:

Table Article
——————
Article_id
Article_title

Table Author
——————
Author_id
Author_name

Table Article_Author
——————
Article_author_id
Article_id
Author_id

This is a better approach than the flat file database as well, because it means an author only needs to be entered once, and that the author record only exists in one place.  If a user is typing up hundreds of citations a day, it is likely they will misspell an author name once in a while.  With the relational database, they would be picking from a dropdown or using some other method to select the author record that already exists.  Also, it allows for changes to be made easily.  Imagine if a prolific author has adopted a stage name and gained notoriety: now the author record can be changed only once and the change will be reflected every time an article record, joined to the author table, is called up.
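Here is a rough sketch of the three-table design, using Python's built-in sqlite3 module instead of MySQL purely so the example runs on its own.  The table and column names come from the listing above; the second article and the helper functions authors_of and rename_author are made up for illustration.

```python
import sqlite3

# In-memory SQLite database (an assumption for the demo; the post has MySQL in mind).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Article (
        Article_id    INTEGER PRIMARY KEY,
        Article_title TEXT NOT NULL
    );
    CREATE TABLE Author (
        Author_id   INTEGER PRIMARY KEY,
        Author_name TEXT NOT NULL
    );
    CREATE TABLE Article_Author (
        Article_author_id INTEGER PRIMARY KEY,
        Article_id INTEGER NOT NULL REFERENCES Article(Article_id),
        Author_id  INTEGER NOT NULL REFERENCES Author(Author_id)
    );
""")

# Each author is entered exactly once, no matter how many articles they wrote.
conn.executemany(
    "INSERT INTO Author (Author_id, Author_name) VALUES (?, ?)",
    [(1, "Smith, Bob B"), (2, "Jones, Joseph H"), (3, "Fakename, Robert P")],
)
conn.executemany(
    "INSERT INTO Article (Article_id, Article_title) VALUES (?, ?)",
    [(1, "Title of this article"), (2, "A second article")],  # second row is hypothetical
)
# The link table carries the many-to-many relationship.
conn.executemany(
    "INSERT INTO Article_Author (Article_id, Author_id) VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 3)],
)


def authors_of(article_id):
    """Return the author names for one article by joining through the link table."""
    rows = conn.execute(
        """
        SELECT Author.Author_name
        FROM Author
        JOIN Article_Author ON Article_Author.Author_id = Author.Author_id
        WHERE Article_Author.Article_id = ?
        ORDER BY Author.Author_id
        """,
        (article_id,),
    )
    return [row[0] for row in rows]


def rename_author(author_id, new_name):
    """Change the single author record; every joined article picks up the change."""
    conn.execute(
        "UPDATE Author SET Author_name = ? WHERE Author_id = ?",
        (new_name, author_id),
    )
```

Calling rename_author(3, "Stagename, Rob") (an invented stage name) touches one row, after which authors_of(1) and authors_of(2) both return the new name.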

But what if some users will still search for the author’s old name?  There are a number of approaches the database designer could take, for example creating a new Author_aliases table that links to the correct record in the Author table, etc.  Also, the tables above are highly simplified.  It is doubtful the author table would have a field for name–most likely it would have fields for first, middle, and last name and any other pertinent information as well.  That, and proper construction of the interface, would eliminate such silliness as having to type Lastname, First in one place and Lastname First Initial in others.
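One way the alias idea might look, again sketched with Python's built-in sqlite3 module so it is self-contained.  The column names in Author_aliases, the sample names, and the find_author helper are all assumptions for illustration, not a description of any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Author (
        Author_id   INTEGER PRIMARY KEY,
        Author_name TEXT NOT NULL
    );
    -- Hypothetical alias table: old or alternate names pointing at the one real record.
    CREATE TABLE Author_aliases (
        Alias_id   INTEGER PRIMARY KEY,
        Author_id  INTEGER NOT NULL REFERENCES Author(Author_id),
        Alias_name TEXT NOT NULL
    );
""")

# The author record holds the current name; the old name survives only as an alias.
conn.execute("INSERT INTO Author (Author_id, Author_name) VALUES (1, 'Newname, Robert P')")
conn.execute(
    "INSERT INTO Author_aliases (Author_id, Alias_name) VALUES (1, 'Fakename, Robert P')"
)


def find_author(name):
    """Look up an author by current name or by any alias.

    Returns (Author_id, current Author_name), or None if nothing matches.
    """
    return conn.execute(
        """
        SELECT Author_id, Author_name FROM Author WHERE Author_name = :name
        UNION
        SELECT Author.Author_id, Author.Author_name
        FROM Author_aliases
        JOIN Author ON Author.Author_id = Author_aliases.Author_id
        WHERE Author_aliases.Alias_name = :name
        """,
        {"name": name},
    ).fetchone()
```

Searching for the old name or the new one lands on the same Author_id, so any article joined to that record turns up either way.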

There are a number of good tutorials on this subject online, for example:
http://www.phpbuilder.com/columns/barry20000731.php3?page=1
http://www.serverwatch.com/tutorials/article.php/1549781
http://builder.com.com/5100-6388-1050841.html

The second issue brought up in class was the need for unlimited field size.  I have also seen relational databases where the designer only allowed 5 characters for a field that after a year really needed 10, but again this is poor design.  Relational databases, at least for the last ten years or so, have been able to handle more or less unlimited field sizes.  MySQL, which is available for free, is a good example.  (http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#Column_types) For numerical data, the bigint column type has a range of -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 or 0 to 18,446,744,073,709,551,615 unsigned, and the varchar type supports up to 255 characters.  If you need more room than that, the longtext type supports up to 4,294,967,295 characters.  War and Peace in ASCII, from Project Gutenberg, is just over 3 million characters.  How often do you need to store more than 1,000 copies of War and Peace in one column of one record?  Expensive commercial databases like Oracle no doubt have even more impressive figures.
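The arithmetic behind those figures is easy to check.  This is a back-of-the-envelope sketch in Python: the limits are the ones quoted above, and the size of War and Peace is rounded down to an even 3 million characters, which is an assumption (the real file is "just over" that, so the true count of copies is slightly lower).

```python
# Limits quoted in the post, written as powers of two.
BIGINT_SIGNED_MIN = -(2**63)
BIGINT_SIGNED_MAX = 2**63 - 1
BIGINT_UNSIGNED_MAX = 2**64 - 1
LONGTEXT_MAX_CHARS = 2**32 - 1  # assumes one byte per character, as with plain ASCII

# Rounded size of War and Peace in ASCII (assumption: exactly 3 million characters).
WAR_AND_PEACE_CHARS = 3_000_000

# Whole copies that fit in a single longtext column.
copies_per_column = LONGTEXT_MAX_CHARS // WAR_AND_PEACE_CHARS

print(f"bigint signed range: {BIGINT_SIGNED_MIN:,} to {BIGINT_SIGNED_MAX:,}")
print(f"bigint unsigned max: {BIGINT_UNSIGNED_MAX:,}")
print(f"longtext max:        {LONGTEXT_MAX_CHARS:,}")
print(f"copies of War and Peace per column: {copies_per_column:,}")
```

With the rounded figure that works out to a bit over 1,400 copies, which is where "more than 1,000" comes from.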

There are reasons why most of the largest web sites with the most traffic and largest amount of information use relational database backends, and not CSV files or even XML for storage and retrieval.  XML is great at moving, exchanging, and marking up information for display.  But from what I’ve read most people seem to agree it’s not great for storage of anything of real size, or anything that needs to be accessed very often and very quickly.

The more I use Dialog the less impressed I am by it.  It strikes me very much as a tool that was amazing in its day, but severely limited by available hardware.  Now that hardware is ridiculously fast and cheap, its limitations are purely artificial and indistinguishable from clunky design.  I know many people who swear by command-line interfaces, and I know there are studies showing CLI to be more “efficient” for the most expert of users, but there has to be a reason why 99% of the world uses Windows or MacOS (or Gnome or KDE even if they run Linux).  If it takes a year for most users to reach expert status and reap the efficiency benefits, but users can master a 20 percent less efficient GUI in a week, which is better?  And I say that from the point of view of someone who used DOS for years and is comfortable coding in notepad.  And it’s not as if creating a GUI means you have to abandon the CLI completely–both can happily coexist.

There are some major structural problems.  The fact that Author name entry isn’t standard across Dialog is nonsensical.  I understand where some fields in chemistry databases will differ from fields in business databases, but nearly everything will have an author, and all that do should conform to a standard.  The advantages of controlled vocabulary dwindle when nothing is well controlled.  Descriptors differ from one database to another, may or may not be updated, etc.  A well-designed relational database would help to eliminate these sorts of problems.

It would not be hard to make a system like Dialog with a relational database.  Give a decent programmer or DBA complete access to Dialog’s data and a year full-time, and I bet they could come up with something.  The biggest problem would be trying to reconcile all the weirdness of the individual databases, like truncating hyphenated names and such.  Designing the tables and fields in MySQL for ERIC and a couple of others would be a fun little project that would take less than a week.

I wonder – has there been any effort to bring Dialog into the current decade, or even the 1990s?  How many people still actually use it, with so many library and journal catalogs going online?  I know Medline is available elsewhere.  I asked a friend of mine majoring in LS at Pittsburgh and she said one professor showed it to them in one class, but no one ever actually used it.  I understand the difference between being able to search fields vs the web, controlled vocab, etc., but surely there’s a less aggravating system out there that includes these features?

I mean, just the whole bluesheet thing…  searching with the Find on this Page feature of your browser?  You should never rely on your visitors to have a specific browser feature, and search boxes aren’t too hard to do.  The Dialog Database Catalog is a series of randomly chopped up PDFs?  And none of this is integrated into any of their telnet-workalike interfaces?