Friday, October 17, 2008

Troubles with Tracking

[ The article below was originally published on WebMonkey in 1998, but Lycos has moved WebMonkey to a wiki and hasn't moved all of the old articles ;^

Note that it assumes that web content is made up of static pages. This is becoming less and less the case as interactivity and personalization is enabled. Industry players, such as the Internet Advertising Bureau, are now focusing on metrics for this new paradigm.
]

Troubles with Tracking

My last two articles discussed tracking: The first covered what you can track, and the second dealt with how you can track over time. In this article, I'm going to show you what you can't do by thoroughly demoralizing you with some of the limitations of your available information.

No, I'm not a sadist, but it's best that you know what problems you'll be facing, as well as some possible work-arounds. So now, when your boss or a customer asks you why you can't give them exact information, you can point them to this article.

Counting Pageviews

The number of pageviews you count is not the actual number of pageviews of your site. "How can this be?" you ask. "I'm simply counting records in my Web server's access log." Well, the fact is a lot of requests never make it to your access log.

First, browsers - at least Netscape and Internet Explorer - have caches. If a person requests a page from your site and soon requests it again, the browser may not go back to your server to request the page a second time. Instead, it may simply retrieve it from its cache. And you would never know. You can try using "expires" or "no cache" tags to stop browsers from caching your pages, but you can never be sure if your tags are read or not.

Second, let's say that a user's browser doesn't retrieve your page from its cache but actually re-requests the page from your server. Many ISPs use proxy servers, and proxy servers cache pages just like browsers. If a person using an ISP with a proxy server makes a request, the proxy server first checks its cache. If the page is there, it serves that page to the person, instead of going to your server. And you would never know.

Again, you can try using the tags I've described above, but there's no Proxy Server Police making sure proxy servers respect your tags.

Another tracking obstacle are bots, or spiders. These software programs scour the Web, either cataloging pages for search engines or looking for information for their owners.

Do you care if your pageview counts include hits from bots? If you do care, then you'd better find a way to ignore these hits. You can create a list of IP addresses to ignore, but with new bots born every day, the list will always be one step - or 100 steps - behind. Similarly, you can use the requester's user-agent string, but there's nothing keeping developers from sending any old string they please. Lastly, you can take a daily count of the hits and just ignore repeat hits from the same IP address if their total number passes some threshold. Then you run the risk of accidentally ignoring hits from an ISP that uses a proxy server and sends its own IP address - instead of a different IP address for each user.

With no perfect solutions, it's up to you to decide which method you can learn to live with.

Counting Visitors

So, if you're not able to accurately record every single request, of course you can't get a full count of your site's visitors. And that's not your only problem.

I discussed some tracking issues before. One problem I didn't discuss concerns cookies and new visitors. Let's say that you want to count the number of visitors you had yesterday, and you use the methodology we discussed previously.

When a person visits your site for the first time, they don't yet have a cookie, and their request will arrive without one. Your Web server promptly sends the visitor a new cookie along with the requested page. Now, say the visitor then requests a second page from your site. And this time the visitor's request does come with a cookie, so the record of the visitor's hit will have a cookie.

When you use your Perl script (or whatever) to count visitors, you first count membernames, if you allow people to authenticate. For hits that don't have membernames, you count cookies. Lastly, for hits that don't have membernames or cookies, you count remote IP addresses.

But this process double-counts new visitors. A visitor's first hit won't have a cookie or membername, and so its IP address will be counted. The same visitor's following hits will be counted either with the count of membernames or cookies.

At Wired Digital, we handle this by logging the times a cookie is sent, yet we don't receive a cookie. Every night, we look for hits that contain a cookie sent. For each one, we check for other hits with a received cookie equal to that sent cookie. If we find any, we move the cookie-sent value into the cookie-received field before we load the hit into our data warehouse. When we use our counting methodology, this person will be counted just once.

Note that we don't simply merge the cookies sent with the cookies received. Doing this would multi-count people who have disabled cookies.

OK, let's say you have more than one domain out of which you serve hits: for example, uvw.com and xyz.com. You can count the number of visitors who go to uvw.com, and you can count the number who visited xyz.com, but the total number will almost certainly not equal the sum of the two.

Why can't you get this number? Let's say a visitor comes to uvw.com. The visitor doesn't have a cookie, so your Web server sends one. Let's say the visitor then goes to xyz.com. The visitor's browser won't send the uvw.com cookie to the xyz.com Web server. That's verboten (see Marc's article, "That's the Way the Cookie Crumbles"). Therefore, xyz.com's Web server sends yet another cookie to the visitor, making a total of two different cookies for one visitor. And never the twain shall meet.

How do you get around this problem? As a tracking guy, I do my best to push for one primary domain. For example, uvw.common.com and xyz.common.com. This allows you to use one set of cookies.

If you can't make that happen, you've got some work ahead of you. I'm afraid I can't go into our methodology at Wired Digital (if I told you, I'd have to kill you.... yada, yada, yada), but there are ways to get around this limitation. I'll have to leave this as a take-home exercise.

Bots can also wreak havoc in this situation. If one or more bots hit you, your visitor numbers won't be affected much. But if you calculate pageviews per visitor and you ignore bots, your numbers may be skewed.

Tracking Browsers and Platforms

A browser can send your Web server any user-agent string it wants, so whatever reporting you do based on these numbers is a matter of trust. Given that the vast majority of people use Netscape or Internet Explorer, you can feel pretty confident about these numbers.

Of course, if one browser cache is better than another's, the number of pageviews you see from the former will be lower than the latter. I probably shouldn't have mentioned that: You know there's a marketing wiz at one of these companies who is asking the development team right now to turn off the browser's caching capability.

Calculating Visits/Sessions

Marketers and advertisers love the concept of the visit, i.e., how long a person stays at a site before moving on. Yet this number is impossible to determine using HTTP.

Let's say I request a page from HotBot at noon. Then I request another page from HotBot at 12:19 p.m. How long was my HotBot visit? You can never know for sure. It's possible that I stared at the first HotBot page for the full 19 minutes. But I may just as easily have opened another browser window and read Wired News for the duration of those 19 minutes. Then again, I may have walked to 7-Eleven for a Big Gulp.

Yet your customers demand this information. So, what do you tell them?

Well, you turn to the Internet Advertising Bureau [this link is now dead], which defines a visit as "a series of page requests by a visitor without 30 consecutive minutes of inactivity."

When people ask about the length of your users' visits, go ahead and tell them, based on the IAB's definition. If you feel like wasting a little time, tell them how the numbers are meaningless until your face turns blue.

Counting Referrals

If a visitor clicks on a link or a banner to get to your site, the visitor's browser will send the URL of the site he or she just left, along with the request. This URL is called the "referer."

In a successful attempt to make our lives more difficult, Netscape and Microsoft coded their browsers to handle the passing of referral information differently. Specifically, if you click on a link that takes you to a page that features frames, your Netscape browser will send the original page as the referer to the frame-set page, as well as the pages that make up each individual frame. Internet Explorer will send the original page as the referer to the outer (frame set) page, which in turn sends the URL of the outer page as the referer to the individual frames.

Check it out for yourself, and see what a difference a browser makes.

This example is made up of the following files:
  • referer.html


<html>
<body>
<a href="container.html">Click to display the frameset</a>
</body>
</html>


  • container.html


<html>
<head>
<title>Example frameset</title>
</head>
<frameset cols="50%,50%">
<frame name="left" src="env.cgi">
<frame name="right" src="env.cgi">
</frameset>
</html>


  • env.cgi


#!/usr/local/bin/perl
print <<END;
Content-type: text/html

HTTP_REFERER = $ENV{HTTP_REFERER}
END


What does this mean? Basically, if your site features frames and you want to track your referrals to a specific frame, you will have to handle each browser differently.

Are you thoroughly frustrated? If not, I admire your bright-and-sunny outlook - you should look into becoming an air-traffic controller; you'd be perfect. Otherwise, I want to remind you that even if every single piece of the tracking puzzle is a nightmare of confusion, you can assemble a picture of your site traffic. It won't be perfect - far from it - but it will provide you with enough information to get an idea of how you're doing and how you can build a better site.

No comments: