Monthly Archives: January 2010

Three Ways Sites Can Track Visitors Without Cookies

There’s an old joke about the Internet that’s important for two reasons. First the joke:

On the internet, nobody knows you're a dog

It’s important because it illustrates a key cultural and technological underpinning of the Internet: anonymity. The second reason it’s important is that it’s so old, printed in the New Yorker in 1993, which is basically old testament times in Internet years. So for decades, the web has allowed people to browse without telling or proving who they are. Though many sites would love if you created an account and logged in, the vast majority are perfectly happy to serve up pages to you without even knowing if you’re a person or a dog.

But there are many reasons to want to track a user from page to page or from site to site, and there are various ways to do it. The most common way involves cookies. Web developers need a way to create user sessions or else things users like (shopping carts, preferences, the ability to update your profile picture) are impossible to implement.

Cookies are pretty well understood, and users can turn them off or clear them out if they really want. Google Chrome, for example, has “Incognito Mode” which allows you to surf without saving cookies, history, etc. from session to session. Even with cookies off, though, maintaining a user session within a particular site by passing around a session id isn’t too hard. It’s trivial to do in PHP for example.

Most users are pretty comfortable with this state of affairs – Facebook knows who I am because I logged in, but I trust them. Amazon knows who I am but that’s cool because I’m shopping. Some other site doesn’t know who I am, but it knows that I’m the same person who clicked on the widget to change the language a couple minutes ago.

People start getting uncomfortable when you start tracking them across sites. People become even more uncomfortable when they no longer have control over their anonymity. Three recent techniques violate both of those comfort zones in limited ways.

1. The EFF’s Panopticlick project.

Follow the link above and click the “test me” button. Is your browser silently betraying you? This is a very clever hack based on the fact that browsers almost always send some information to web servers in http headers (the user agent, what type of content the browser is willing to accept, etc.). People have been misusing user agent headers to try to get Javascript working in multiple browsers for years. Panopticlick also checks for available plugins and fonts. Adding all this data up there’s enough variability from one browser to the next that you can apparently reliably identify individuals. The EFF has a great post on the information theory behind the project.

This doesn’t mean sites will know who you are, but they could use this information to know that you visited web page A, B and C whether or not you want them too. An ad network could use this info to track you across many sites. An unscrupulous site could sell this info, giving your browsing history away for cash, and if you log into a site that has personally-identifying info about you (email, shipping addresses, etc.) the history could potentially be tied back to a person.

Next post, I’ll talk about another way to track users without cookies and a way for a site to tell if you’ve visited other sites in the past. I’ll also tell you why you shouldn’t panic, though I admit a better writer would have told you that first.

Important post on the Google blog about Google’s future in China

If you haven’t heard, there’s big news on the Google Blog about malicious attacks and Google’s future in China. Please take a minute to read the post.

I wanted to add three things:

  1. I work for Google fighting abuse, but I’m not involved in this so I can’t tell you anything more than what you see on the blog. If I was involved, then I definitely couldn’t tell you more. Standard disclaimers apply.
  2. I am very proud to work for a company with such a commitment to openness and free speech.
  3. I’ve worked with some folks from our Beijing office, and in my experience they are smart, capable people committed to serving users and helping people get the information they are searching for. I hope everything works out for them.

 

How to get Google search results for academic research

A few years ago, before I was a Googler, I was a grad student doing research on information retrieval. I wanted to compare the results of Google and other search engines with folksonomies form social bookmarking sites. It sounds pretty simple – Google does lots of internal search quality studies, so it’s not too surprising that outside researchers would want to execute lots of queries and use the results in their data.

The way I did it was… not optimal, to say the least. I wrote a bunch of PHP code, spaced out participant sessions, etc. to make sure I could get results back. Google tries to make sure that spammers aren’t scraping search results to generate webspam, so any kind of scraping with cURL, Beautiful Soup, etc. can result in a big pile of failure.

The way I did it wasn’t the right way or the easy way, so when I got the job I made a mental note to ask around for the best way to get search results. Then I forgot all about it until an email exchange with Gary Warner of CyberCrime & Doing Time fame.

It turns out Google has a great University research program and API. You have to apply for registration and let us know who you are, what school you’re affiliated with, and what you plan to study. Assuming everyting checks out you’ll get access to a pretty nice API. There’s a some example Python code but you could just as easily use PHP, Java, or whatever to consume the XML responses.

And that research I was doing? I recently noticed that my paper has been cited 7 or 8 times, according to Google Scholar. I used to joke that I had written the least influential paper in the history of academic publishing, but I guess I can’t claim the title anymore. Scopus only shows 4 citations so I will remain humble anyway.