DiverseIT: Long Distance Data Tracking (i.e. longitudinal web analytics)

[
The article below was originally published on WebMonkey in 1998, but Lycos has moved WebMonkey to a wiki and hasn't moved all of the old articles ;^(

Note that it assumes that web content is made up of static pages. This is becoming less and less the case as interactivity and personalization are enabled. Industry players, such as the Internet Advertising Bureau, are now focusing on metrics for this new paradigm.
]

Long Distance Tracking

In my last article, I introduced the types of tracking information you can get from your Web server. In that article I concentrated mostly on what you can do with a single day's worth of data. Now I'm going to show you what long-range data tracking can do for you.

Some questions can only be answered by looking at your data over an extended period of time:

How fast is my number of pageviews increasing? How many pageviews should I expect by the end of the year?

Which areas of my site are experiencing the fastest pageview growth? The slowest?

How is the relative browser share changing over time?

How often do people visit my site?

Of the people who first came to my site via my ad banner on xyz.com, how many pages have they subsequently viewed?

And I'm sure that once you look at the types of information available (discussed in my previous article), you'll come up with all sorts of questions that need long-range answers.

If you're interested in answering these questions, then multi-day tracking is for you. And if you're thinking of tracking, then it's time to seriously consider a database.
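To make the first question in the list above concrete, here is a minimal sketch of one way to estimate growth and project it forward. It assumes steady compound month-over-month growth, which real traffic rarely follows exactly, and the monthly totals and function names are made up for illustration.

```python
def monthly_growth_rate(totals):
    """Average compound month-over-month growth rate across the series."""
    periods = len(totals) - 1
    return (totals[-1] / totals[0]) ** (1 / periods) - 1


def project(last_total, rate, months_ahead):
    """Extend the last observed total forward at a constant growth rate."""
    return last_total * (1 + rate) ** months_ahead


# Hypothetical pageview totals for January, February, and March.
monthly_pageviews = [100_000, 110_000, 121_000]

rate = monthly_growth_rate(monthly_pageviews)
# December is nine months after March.
december = project(monthly_pageviews[-1], rate, months_ahead=9)

print(f"{rate:.1%} per month, about {december:,.0f} pageviews in December")
```

With only a few months of data, a projection like this is a rough guide at best, and it gets more useful as the history gets longer, which is the case for keeping long-range data in the first place.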

Getting Down to Database-ics

You could create from-scratch programs to retrieve the information you want out of your hit logs. Of course you could also spend your life banging your head against a wall. But neither option is really in your best interest. And the more hits you get per day, the more you'll find good reasons to store your hits in a database:

If you design your database correctly, your queries will return the information you want many times faster than programs that retrieve data from log files. And the more data you have, the more you'll notice the difference in performance.

If you only store the hits that interest you (versus every single li'l ol' image request), you can significantly reduce the amount of space your data requires.

Most people use SQL (Structured Query Language) to retrieve data from databases. SQL is a small, concise language with very few commands and syntax elements to learn. Plus, the command structures are simple and well defined, so good programmers can create an SQL query much more quickly than they could code a program to do the same thing. And the resulting SQL query would be less prone to errors and easier to understand.

If you don't want to code SQL, you can use a database access tool (e.g., MS Access or Excel, Crystal Reports, or BusinessObjects) to retrieve information. Many of these tools are extremely easy to use, with a graphical, drag-and-drop interface.

You could also create your own program using one of a smorgasbord of application development tools that make creating a data-retrieving program relatively simple. Of course it's nice to know that, with most database products, you aren't prevented from writing your applications in your favorite 3GL. Many provide ODBC access as well as proprietary APIs. For example, at Wired Digital we've written our reporting application in Perl, using both Sybase's CTlib and the DBI package for database access.
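As a rough sketch of the reasons above, the following Python example keeps only the page requests from a few made-up Common Log Format lines, loads them into a database, and answers a pageviews-per-day question with a single SQL query. It uses the standard-library sqlite3 module purely for illustration (the article does not use it), and the table name, column names, image suffix list, and sample log lines are all hypothetical.

```python
import re
import sqlite3
from datetime import datetime

# Hypothetical sample lines in Common Log Format.
SAMPLE_LOG = [
    '10.0.0.1 - - [01/Mar/1998:10:00:00 -0800] "GET /index.html HTTP/1.0" 200 5120',
    '10.0.0.1 - - [01/Mar/1998:10:00:01 -0800] "GET /img/logo.gif HTTP/1.0" 200 900',
    '10.0.0.2 - - [01/Mar/1998:11:30:00 -0800] "GET /news/today.html HTTP/1.0" 200 7300',
    '10.0.0.3 - - [02/Mar/1998:09:15:00 -0800] "GET /index.html HTTP/1.0" 200 5120',
    '10.0.0.3 - - [02/Mar/1998:09:15:02 -0800] "GET /img/banner.jpg HTTP/1.0" 200 15000',
]

LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) \S+$')
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")


def parse_pageviews(lines):
    """Yield (host, day, path) for successful non-image requests only."""
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        host, stamp, _method, path, status = match.groups()
        if status != "200" or path.lower().endswith(IMAGE_SUFFIXES):
            continue  # drop errors and image requests to save space
        day = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z").date().isoformat()
        yield host, day, path


def load_hits(conn, lines):
    """Create the hits table if needed and insert the filtered pageviews."""
    conn.execute("CREATE TABLE IF NOT EXISTS hits (host TEXT, day TEXT, path TEXT)")
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", parse_pageviews(lines))
    conn.commit()


def pageviews_per_day(conn):
    """One short SQL query replaces a hand-written log-scanning program."""
    return conn.execute(
        "SELECT day, COUNT(*) FROM hits GROUP BY day ORDER BY day"
    ).fetchall()


conn = sqlite3.connect(":memory:")
load_hits(conn, SAMPLE_LOG)
print(pageviews_per_day(conn))  # prints [('1998-03-01', 2), ('1998-03-02', 1)]
```

Two of the five sample lines are image requests, so only three rows are stored. The same GROUP BY pattern extends to questions about browser share or site areas once those columns are loaded.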

On the other hand, some distinct reasons exist NOT to store your data in a database:

You actually have to implement and maintain the code for loading your data into the database.

Most databases require some resources for administration.

Most database products cost money. [Many viable open source database products have matured since I first wrote this article. See, for example, MySQL, PostgreSQL, Ingres, Firebird...]

You will have to learn SQL, or whatever language the database product you select implements.

Databases are inherently more fragile than flat files. You will have to spend more time making sure you have a good "backup and restore" plan.

Still interested in a database? Now you have to choose: 1) whether to load your hits directly into a database from your Web server, and 2) which database product to load your hits into. Note that these decisions aren't independent: it may be difficult, if not impossible, to load hits into some databases, and some databases may not allow data inserts while queries are being run against them.

The Direct Route

Loading your data directly from your Web server into a database can add all sorts of complexity to your life. If you choose this route, you have to decide whether you can live with lost data. If you can, you may skip the next few paragraphs. Otherwise, read on.

For reasons I won't go into here, higher-end database products use database managers that handle all accesses to the database. Since database managers are software programs, they can fail. So if you have your Web server load its data directly into one of these databases, and the database manager crashes, you may lose this information.

Some Web servers allow you to write code (this is easiest if you have the server's source code) that stores the Web server's information in a log file if the database manager crashes. Of course, in this case you will also have to design a backup process that loads the logged information into your database once it is running again.
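The fallback idea described above might look something like the following sketch. It is not taken from any particular Web server: the HitStore class, its method names, the spool file format (one JSON object per line), and the use of sqlite3 are all assumptions made for illustration, and the "outage" is simulated by a missing table.

```python
import json
import os
import sqlite3
import tempfile


class HitStore:
    """Writes hits to a database, spooling to a flat file when inserts fail."""

    def __init__(self, conn, spool_path):
        self.conn = conn
        self.spool_path = spool_path

    def _insert(self, hit):
        self.conn.execute(
            "INSERT INTO hits (host, path) VALUES (?, ?)", (hit["host"], hit["path"])
        )
        self.conn.commit()

    def record(self, hit):
        """Try the database first; on a database error, append to the spool."""
        try:
            self._insert(hit)
            return "database"
        except sqlite3.Error:
            with open(self.spool_path, "a", encoding="utf-8") as spool:
                spool.write(json.dumps(hit) + "\n")
            return "spool"

    def replay_spool(self):
        """Load spooled hits once the database is back; return how many."""
        if not os.path.exists(self.spool_path):
            return 0
        with open(self.spool_path, encoding="utf-8") as spool:
            hits = [json.loads(line) for line in spool if line.strip()]
        for hit in hits:
            self._insert(hit)
        os.remove(self.spool_path)
        return len(hits)


spool_path = os.path.join(tempfile.mkdtemp(), "hits.spool")
conn = sqlite3.connect(":memory:")
store = HitStore(conn, spool_path)

# Simulated outage: the hits table does not exist yet, so the insert fails.
first = store.record({"host": "10.0.0.1", "path": "/index.html"})

# "Recovery": create the table, replay the spool, then record normally.
conn.execute("CREATE TABLE hits (host TEXT, path TEXT)")
replayed = store.replay_spool()
second = store.record({"host": "10.0.0.2", "path": "/news.html"})
total = conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]

print(first, replayed, second, total)  # prints: spool 1 database 2
```

A production version would need more care than this sketch shows. For example, if the database fails again halfway through a replay, the spool file is left in place and the hits already inserted would be loaded twice on the next attempt.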

Pick a Database Management System

Here is a partial list of the database products available to you:

Company	Product	Comments
IBM	DB2	Never count IBM out.
Informix	Dynamic Server	Recent company financial problems, but a top-notch RDBMS. [acquired by IBM after publication of this article]
	mSQL	Shareware! Created by David J. Hughes at Bond University, Australia.
Microsoft	Access	Low-end, user-friendly RDBMS.
Microsoft	SQL Server	Mid-range RDBMS. Microsoft's tenacity continues to improve this product. [I would no longer call this "mid-range". It can now compete with the top-end db's]
NCR	Teradata	The Ferrari Testarossa of data warehousing engines ... at Testarossa prices. For very large databases. [spun out of NCR after publication of this article. http://www.teradata.com/]
Oracle	Oracle	The leading RDBMS.
Red Brick Systems	Red Brick	RDBMS designed specifically for data warehousing. This is what we use at Wired Digital. [acquired by Informix (which was then acquired by IBM) after publication of this article]
Sybase	Adaptive Server	Number 2 in RDBMS market. We use this at Wired Digital for non-data warehouse applications. [No longer #2, but still a viable competitor]

[As I've noted above, there are many mature open source database options now available. I recommend you check them out.]

After selecting a database product, you have to design the structure where your data will live. Luckily, your job will be easier than most database designers' because, in the case of Web tracking, there aren't that many different types of information to store.

Here are some goals to shoot for when you design your database:

minimize load times
minimize query times
minimize administration and maintenance
minimize database size

To achieve these goals, all sorts of decisions need to be made. For example, the time it takes to load your data will depend on how much data you want to load, whether you use "lookup" tables, whether your database is stored on a RAID system, and so on.

Also, these goals sometimes conflict. For example, to minimize query time, you may have to create and maintain summary tables. But if you do this, administration and maintenance time increases, and the size of your database grows. And as you make these database decisions, don't forget that people who look at your data will, at some point, want to audit and compare it with the data in your Web server log files.
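The summary-table trade-off can be sketched as follows. This is only an illustration using Python's standard-library sqlite3 module: the table names, column names, and sample rows are hypothetical, and the detail table here stands in for whatever you load from your log files. The small audit function reflects the point that someone will eventually want to check the summarized numbers against the detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hits (day TEXT, section TEXT, path TEXT);
    CREATE TABLE daily_section_summary (
        day TEXT, section TEXT, pageviews INTEGER,
        PRIMARY KEY (day, section)
    );
""")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?)",
    [
        ("1998-03-01", "news", "/news/a.html"),
        ("1998-03-01", "news", "/news/b.html"),
        ("1998-03-01", "sports", "/sports/a.html"),
        ("1998-03-02", "news", "/news/c.html"),
    ],
)


def refresh_summary(conn, day):
    """Rebuild the summary rows for one day from the detail table."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM daily_section_summary WHERE day = ?", (day,))
        conn.execute(
            "INSERT INTO daily_section_summary "
            "SELECT day, section, COUNT(*) FROM hits "
            "WHERE day = ? GROUP BY day, section",
            (day,),
        )


def audit_day(conn, day):
    """Return True when the summary total matches the detail count."""
    summary = conn.execute(
        "SELECT COALESCE(SUM(pageviews), 0) FROM daily_section_summary WHERE day = ?",
        (day,),
    ).fetchone()[0]
    detail = conn.execute(
        "SELECT COUNT(*) FROM hits WHERE day = ?", (day,)
    ).fetchone()[0]
    return summary == detail


for day in ("1998-03-01", "1998-03-02"):
    refresh_summary(conn, day)

rows = conn.execute(
    "SELECT day, section, pageviews FROM daily_section_summary ORDER BY day, section"
).fetchall()
print(rows)
print(audit_day(conn, "1998-03-01"))
```

Reports can now read a handful of summary rows instead of scanning every hit, at the cost of a second table to store and a refresh step to run after each load.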

Finally, if you have experience designing data warehouses, do a clean boot of your brain. This will be unlike any other data warehouse you have designed. For example, a merchandiser like Wal-Mart knows what products it sells and at which stores it sells them. For each product, it knows what category it belongs to, who manufactures it, and what it costs. For each store it knows which geographic region it's in, what country it's in, and its size. All of these "dimensions" are limited in the number of values they can have: when a merchandiser loads sales data into its data warehouse, it doesn't have to deal with unknown entities.

Your tracking data warehouse application, however, will constantly deal with unknowns. You don't know what domains visitors will be coming from, where referrals will be coming from, or what browsers those visitors will be using. And when your users enter information into forms, you may not know what values they'll be entering (especially if your forms contain text fields). And there's no telling how many values these "dimensions" will have.
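One common way to cope with dimension values you cannot know in advance is to add each new value to a "lookup" table the first time it appears, and store only its ID with the hit. The sketch below shows this for browsers, assuming SQLite via Python's sqlite3 module. The table names, column names, and user-agent strings are made up, and other database products would need their own equivalent of SQLite's INSERT OR IGNORE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE browser (
        browser_id INTEGER PRIMARY KEY,
        user_agent TEXT NOT NULL UNIQUE
    );
    CREATE TABLE hits (
        path TEXT NOT NULL,
        browser_id INTEGER NOT NULL REFERENCES browser (browser_id)
    );
""")


def browser_id_for(conn, user_agent):
    """Return the lookup ID for a user agent, adding it if never seen before."""
    conn.execute(
        "INSERT OR IGNORE INTO browser (user_agent) VALUES (?)", (user_agent,)
    )
    row = conn.execute(
        "SELECT browser_id FROM browser WHERE user_agent = ?", (user_agent,)
    ).fetchone()
    return row[0]


def record_hit(conn, path, user_agent):
    """Store the hit with a small integer ID instead of the full string."""
    conn.execute(
        "INSERT INTO hits (path, browser_id) VALUES (?, ?)",
        (path, browser_id_for(conn, user_agent)),
    )


# Hypothetical hits; the loader has no fixed list of browsers to check against.
sample_hits = [
    ("/index.html", "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"),
    ("/news.html", "Mozilla/4.05 [en] (X11; I; Linux 2.0.32 i586)"),
    ("/index.html", "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"),
]
for path, agent in sample_hits:
    record_hit(conn, path, agent)
conn.commit()

browser_count = conn.execute("SELECT COUNT(*) FROM browser").fetchone()[0]
hit_count = conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
print(browser_count, hit_count)  # prints: 2 3
```

The lookup table grows with every new value, so for free-text form fields or referrer URLs it can become large, which is one reason these dimensions behave so differently from a merchandiser's fixed list of products and stores.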

So pick your tools wisely, and get tracking.


Thursday, October 09, 2008
