Wednesday, November 05, 2008

Syncplicity leaving beta

Syncplicity is a great service. They provide hosted disk space, access via the web, and a Windows program that automatically synchronizes the versions of the files you select on your computer with the files stored on their servers. In fact, if you have 2 computers (for example one at home and one at work), Syncplicity can synchronize your files between all three places. But wait - there's more! You can also share the files you want to share by providing friends with url's to them.

You can get a free account that gives you 5GB of space on their servers; you can get 50GB for $10/month or $100/year; or you can get 100GB for $20/month or $200/year.

(Oh, and if you let me send you an invitation, I'll get another 1GB)

This is a great painless backup solution. Try it!

Friday, October 17, 2008

Troubles with Tracking

[ The article below was originally published on WebMonkey in 1998, but Lycos has moved WebMonkey to a wiki and hasn't moved all of the old articles ;^(

Note that it assumes that web content is made up of static pages. This is becoming less and less the case as interactivity and personalization are enabled. Industry players, such as the Internet Advertising Bureau, are now focusing on metrics for this new paradigm.
]

Troubles with Tracking

My last two articles discussed tracking: The first covered what you can track, and the second dealt with how you can track over time. In this article, I'm going to show you what you can't do by thoroughly demoralizing you with some of the limitations of your available information.

No, I'm not a sadist, but it's best that you know what problems you'll be facing, as well as some possible work-arounds. So now, when your boss or a customer asks you why you can't give them exact information, you can point them to this article.

Counting Pageviews

The number of pageviews you count is not the actual number of pageviews of your site. "How can this be?" you ask. "I'm simply counting records in my Web server's access log." Well, the fact is a lot of requests never make it to your access log.

First, browsers - at least Netscape and Internet Explorer - have caches. If a person requests a page from your site and soon requests it again, the browser may not go back to your server to request the page a second time. Instead, it may simply retrieve it from its cache. And you would never know. You can try using "expires" or "no cache" tags to stop browsers from caching your pages, but you can never be sure if your tags are read or not.

Second, let's say that a user's browser doesn't retrieve your page from its cache but actually re-requests the page from your server. Many ISPs use proxy servers, and proxy servers cache pages just like browsers. If a person using an ISP with a proxy server makes a request, the proxy server first checks its cache. If the page is there, it serves that page to the person, instead of going to your server. And you would never know.

Again, you can try using the tags I've described above, but there's no Proxy Server Police making sure proxy servers respect your tags.

Another tracking obstacle is bots, or spiders. These software programs scour the Web, either cataloging pages for search engines or looking for information for their owners.

Do you care if your pageview counts include hits from bots? If you do care, then you'd better find a way to ignore these hits. You can create a list of IP addresses to ignore, but with new bots born every day, the list will always be one step - or 100 steps - behind. Similarly, you can use the requester's user-agent string, but there's nothing keeping developers from sending any old string they please. Lastly, you can take a daily count of the hits and just ignore repeat hits from the same IP address if their total number passes some threshold. Then you run the risk of accidentally ignoring hits from an ISP that uses a proxy server and sends its own IP address - instead of a different IP address for each user.

With no perfect solutions, it's up to you to decide which method you can learn to live with.
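
To make the bot discussion concrete, here is a minimal sketch of the last two approaches - a user-agent check plus a daily per-IP threshold. It assumes a combined-format access log on standard input, and the substrings and the threshold are only placeholders you would tune (or replace with a maintained list).

#!/usr/local/bin/perl
# Rough bot filter (illustrative only): drop hits whose user-agent looks like a
# known bot, then drop IP addresses that exceed an arbitrary daily threshold.
use strict;

my %hits_by_ip;
my @kept;

while (my $line = <STDIN>) {
    # combined log format: ip - - [date] "request" status bytes "referer" "user-agent"
    my ($ip, $agent) = $line =~ /^(\S+).*"([^"]*)"\s*$/ or next;
    next if $agent =~ /crawler|spider|robot/i;    # example substrings, not a real list
    $hits_by_ip{$ip}++;
    push @kept, [$ip, $line];
}

my $threshold = 2000;                             # arbitrary cutoff; tune for your traffic
for my $hit (@kept) {
    my ($ip, $line) = @$hit;
    next if $hits_by_ip{$ip} > $threshold;        # may also discard busy proxy servers!
    print $line;
}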

Counting Visitors

So, if you're not able to accurately record every single request, of course you can't get a full count of your site's visitors. And that's not your only problem.

I discussed some tracking issues before. One problem I didn't discuss concerns cookies and new visitors. Let's say that you want to count the number of visitors you had yesterday, and you use the methodology we discussed previously.

When a person visits your site for the first time, they don't yet have a cookie, and their request will arrive without one. Your Web server promptly sends the visitor a new cookie along with the requested page. Now, say the visitor then requests a second page from your site. And this time the visitor's request does come with a cookie, so the record of the visitor's hit will have a cookie.

When you use your Perl script (or whatever) to count visitors, you first count membernames, if you allow people to authenticate. For hits that don't have membernames, you count cookies. Lastly, for hits that don't have membernames or cookies, you count remote IP addresses.

But this process double-counts new visitors. A visitor's first hit won't have a cookie or membername, so its IP address will be counted. The same visitor's subsequent hits will be counted under membernames or cookies.

At Wired Digital, we handle this by logging both the cookie we receive with each hit and any cookie we send back. Every night, we look for hits where a cookie was sent but none was received. For each one, we check for other hits whose received cookie equals that sent cookie. If we find any, we move the cookie-sent value into the cookie-received field before we load the hit into our data warehouse. With our counting methodology, this person is then counted just once.

Note that we don't simply merge the cookies sent with the cookies received. Doing this would multi-count people who have disabled cookies.
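
Here is a minimal sketch of that nightly fix-up, under some assumptions of mine: each hit is a tab-separated line whose last two fields are the cookie received from the visitor and the cookie the server sent back, with "-" marking an empty field.

#!/usr/local/bin/perl
# Merge the "cookie sent" value into the "cookie received" field, but only for
# sent cookies that we later saw come back from a browser.
use strict;

my @hits;
my %seen_received;                 # cookie values that did come back from a browser

while (<STDIN>) {
    chomp;
    my @f = split /\t/;
    my ($received, $sent) = @f[-2, -1];
    $seen_received{$received} = 1 if $received ne '-';
    push @hits, \@f;
}

for my $f (@hits) {
    my ($received, $sent) = @$f[-2, -1];
    # Merging blindly would multi-count people who have cookies disabled, so we
    # only merge when the sent cookie was later echoed back.
    if ($received eq '-' && $sent ne '-' && $seen_received{$sent}) {
        $f->[-2] = $sent;
    }
    print join("\t", @$f), "\n";   # load this cleaned-up record into the warehouse
}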

OK, let's say you have more than one domain out of which you serve hits: for example, uvw.com and xyz.com. You can count the number of visitors who go to uvw.com, and you can count the number who visited xyz.com, but the total number will almost certainly not equal the sum of the two.

Why can't you get this number? Let's say a visitor comes to uvw.com. The visitor doesn't have a cookie, so your Web server sends one. Let's say the visitor then goes to xyz.com. The visitor's browser won't send the uvw.com cookie to the xyz.com Web server. That's verboten (see Marc's article, "That's the Way the Cookie Crumbles"). Therefore, xyz.com's Web server sends yet another cookie to the visitor, making a total of two different cookies for one visitor. And never the twain shall meet.

How do you get around this problem? As a tracking guy, I do my best to push for one primary domain: for example, uvw.common.com and xyz.common.com. Cookies can then be set for the shared common.com domain, so one set of cookies works across both sites.

If you can't make that happen, you've got some work ahead of you. I'm afraid I can't go into our methodology at Wired Digital (if I told you, I'd have to kill you.... yada, yada, yada), but there are ways to get around this limitation. I'll have to leave this as a take-home exercise.

Bots can also wreak havoc in this situation. If one or more bots hit you, your visitor numbers won't be affected much. But if you calculate pageviews per visitor and don't filter out the bots' hits, your numbers may be skewed.

Tracking Browsers and Platforms

A browser can send your Web server any user-agent string it wants, so whatever reporting you do based on these numbers is a matter of trust. Given that the vast majority of people use Netscape or Internet Explorer, you can feel pretty confident about these numbers.

Of course, if one browser cache is better than another's, the number of pageviews you see from the former will be lower than the latter. I probably shouldn't have mentioned that: You know there's a marketing wiz at one of these companies who is asking the development team right now to turn off the browser's caching capability.

Calculating Visits/Sessions

Marketers and advertisers love the concept of the visit, i.e., how long a person stays at a site before moving on. Yet this number is impossible to determine using HTTP.

Let's say I request a page from HotBot at noon. Then I request another page from HotBot at 12:19 p.m. How long was my HotBot visit? You can never know for sure. It's possible that I stared at the first HotBot page for the full 19 minutes. But I may just as easily have opened another browser window and read Wired News for the duration of those 19 minutes. Then again, I may have walked to 7-Eleven for a Big Gulp.

Yet your customers demand this information. So, what do you tell them?

Well, you turn to the Internet Advertising Bureau [this link is now dead], which defines a visit as "a series of page requests by a visitor without 30 consecutive minutes of inactivity."

When people ask about the length of your users' visits, go ahead and tell them, based on the IAB's definition. If you feel like wasting a little time, tell them how the numbers are meaningless until your face turns blue.
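
If you want to apply the IAB-style definition yourself, a sketch like the following is enough. It assumes each pageview has already been reduced to a "visitor_id <tab> epoch_seconds" line and that the input is sorted by visitor and then by time.

#!/usr/local/bin/perl
# Minimal visit counter using the "30 consecutive minutes of inactivity" rule.
use strict;

my $gap = 30 * 60;          # 30 minutes, in seconds
my ($visits, $last_id, $last_time) = (0, '', 0);

while (<STDIN>) {
    chomp;
    my ($id, $time) = split /\t/;
    # a new visitor, or more than 30 idle minutes, starts a new visit
    $visits++ if $id ne $last_id || $time - $last_time > $gap;
    ($last_id, $last_time) = ($id, $time);
}

print "visits: $visits\n";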

Counting Referrals

If a visitor clicks on a link or a banner to get to your site, the visitor's browser will send the URL of the site he or she just left, along with the request. This URL is called the "referer."

In a successful attempt to make our lives more difficult, Netscape and Microsoft coded their browsers to handle the passing of referral information differently. Specifically, if you click on a link that takes you to a page that features frames, your Netscape browser will send the original page as the referer for the frameset page as well as for the pages that make up each individual frame. Internet Explorer will send the original page as the referer for the outer (frameset) page, but for the requests for the individual frames it sends the URL of the outer page as the referer.

Check it out for yourself, and see what a difference a browser makes.

This example is made up of the following files:
  • referer.html


<html>
<body>
<a href="container.html">Click to display the frameset</a>
</body>
</html>


  • container.html


<html>
<head>
<title>Example frameset</title>
</head>
<frameset cols="50%,50%">
<frame name="left" src="env.cgi">
<frame name="right" src="env.cgi">
</frameset>
</html>


  • env.cgi


#!/usr/local/bin/perl
# Echo the referer the browser sent with this request
print <<END;
Content-type: text/html

HTTP_REFERER = $ENV{HTTP_REFERER}
END


What does this mean? Basically, if your site features frames and you want to track your referrals to a specific frame, you will have to handle each browser differently.

Are you thoroughly frustrated? If not, I admire your bright-and-sunny outlook - you should look into becoming an air-traffic controller; you'd be perfect. Otherwise, I want to remind you that even if every single piece of the tracking puzzle is a nightmare of confusion, you can assemble a picture of your site traffic. It won't be perfect - far from it - but it will provide you with enough information to get an idea of how you're doing and how you can build a better site.

Thursday, October 09, 2008

Long Distance Data Tracking (i.e. longitudinal web analytics)

[
The article below was originally published on WebMonkey in 1998, but Lycos has moved WebMonkey to a wiki and hasn't moved all of the old articles ;^(

Note that it assumes that web content is made up of static pages. This is becoming less and less the case as interactivity and personalization are enabled. Industry players, such as the Internet Advertising Bureau, are now focusing on metrics for this new paradigm.
]

Long Distance Tracking

In my last article, I introduced the types of tracking information you can get from your Web server. In that article I concentrated mostly on what you can do with a single day's worth of data. Now I'm going to show you what long-range data tracking can do for you.

Some questions can only be answered by looking at your data over an extended period of time:
  • How fast is my number of pageviews increasing? How many pageviews should I expect by the end of the year?

  • Which areas of my site are experiencing the fastest pageview growth? The slowest?

  • How is the relative browser share changing over time?

  • How often do people visit my site?

  • Of the people who first came to my site via my ad banner on xyz.com, how many pages have they subsequently viewed?
And I'm sure that once you look at the types of information available (discussed in my previous article), you'll come up with all sorts of questions that need long-range answers.

If you're interested in answering these questions, then multi-day tracking is for you. And if you're thinking of tracking, then it's time to seriously consider a database.

Getting Down to Database-ics

You could create from-scratch programs to retrieve the information you want out of your hit logs. Of course you could also spend your life banging your head against a wall. But neither option is really in your best interest. And the more hits you get per day, the more you'll find good reasons to store your hits in a database:
  • If you design your database correctly, your queries will return the information you want many times faster than programs that retrieve data from log files. And the more data you have, the more you'll notice the difference in performance.

  • If you only store the hits that interest you (versus every single li'l ol' image request), you can significantly reduce the amount of space your data requires.

  • Most people use SQL (Structured Query Language) to retrieve data from databases. SQL is a small, concise language with very few commands and syntax elements to learn. Plus, the command structures are simple and well defined, so good programmers can create an SQL query much more quickly than they could code a program to do the same thing. And the resulting SQL query would be less prone to errors and easier to understand (see the short sketch after this list).

  • If you don't want to code SQL, you can use a database access tool (e.g., MS Access or Excel, Crystal Reports, or BusinessObjects) to retrieve information. Many of these tools are extremely easy to use, with a graphical, drag-and-drop interface.

  • You could also create your own program using one of a smorgasbord of application development tools that make creating a data-retrieving program relatively simple. Of course it's nice to know that, with most database products, you aren't prevented from writing your applications in your favorite 3GL. Many provide ODBC access as well as proprietary APIs. For example, at Wired Digital we've written our reporting application in Perl, using both Sybase's CTlib and the DBI package for database access.
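
As promised above, here is a taste of what the SQL route buys you: one day's pageviews per section in one short query. The table and column names are made up for illustration, the date is arbitrary, and DBD::Sybase is just one possible driver.

#!/usr/local/bin/perl
# Pageviews per section for a single day, pulled from a hypothetical "hits" table.
use strict;
use DBI;

my $dbh = DBI->connect('dbi:Sybase:server=TRACKING', 'user', 'password',
                       { RaiseError => 1 });

my $rows = $dbh->selectall_arrayref(q{
    SELECT section, COUNT(*) AS pageviews
    FROM   hits
    WHERE  hit_date = '1998-06-01'
      AND  is_pageview = 1
    GROUP BY section
    ORDER BY pageviews DESC
});

printf "%-20s %10d\n", @$_ for @$rows;
$dbh->disconnect;
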
On the other hand, some distinct reasons exist NOT to store your data in a database:
  • You actually have to implement and maintain the code for loading your data into the database.

  • Most databases require some resources for administration.

  • Most database products cost money. [Many viable open source database products have matured since I first wrote this article. See, for example, MySQL, PostgreSQL, Ingres, Firebird...]

  • You will have to learn SQL, or whatever language the database product you select implements.

  • Databases are inherently more fragile than flat files. You will have to spend more time making sure you have a good "backup and restore" plan.
Still interested in a database? Now you have to choose: 1) whether to load your hits directly into a database from your Web server, and 2) which database product to load your hits into. Note that these decisions aren't independent - it may be difficult, if not impossible, to load hits into some databases, and some databases may not allow data inserts while queries are being run against them.

The Direct Route

Loading your data directly from your Web server into a database can add all sorts of complexity to your life. If you choose this route, you have to decide whether you can live with lost data. If you can, you may skip the next few paragraphs. Otherwise, read on.

For reasons I won't go into here, higher-end database products use database managers that handle all accesses to the database. Since database managers are software programs, they can fail. So if you have your Web server load its data directly into one of these databases, and the database manager crashes, you may lose this information.

Some Web servers allow you to write code that stores the Web server's information in a log file if the database manager crashes (especially if you have the source code). Of course, in this case you will also have to design a backup process that gets information into your database for those times when your database goes down.

Pick a Database Management System

Here is a partial list of the database products available to you:

Company            Product          Comments
IBM                DB2              Never count IBM out.
Informix           Dynamic Server   Recent company financial problems, but a top-notch RDBMS. [acquired by IBM after publication of this article]
                   MSQL             Shareware! Created by David J. Hughes at Bond University, Australia.
Microsoft          Access           Low-end, user-friendly RDBMS.
Microsoft          SQL Server       Mid-range RDBMS. Microsoft's tenacity continues to improve this product. [I would no longer call this "mid-range". It can now compete with the top-end db's]
NCR                Teradata         The Ferrari Testarossa of data warehousing engines ... at Testarossa prices. For very large databases. [spun out of NCR after publication of this article. http://www.teradata.com/]
Oracle             Oracle           The leading RDBMS.
Red Brick Systems  Red Brick        RDBMS designed specifically for data warehousing. This is what we use at Wired Digital. [acquired by Informix (which was then acquired by IBM) after publication of this article]
Sybase             Adaptive Server  Number 2 in RDBMS market. We use this at Wired Digital for non-data warehouse applications. [No longer #2, but still a viable competitor]


[As I've noted above, there are many mature open source database options now available. I recommend you check them out]

After selecting a database product, you have to design the structure where your data will live. Luckily, your job will be easier than most database designers' because, in the case of Web tracking, there aren't that many different types of information to store.

Here are some goals to shoot for when you design your database:

  • minimize load times

  • minimize query times

  • minimize administration and maintenance

  • minimize database size
To achieve these goals, all sorts of decisions need to be made. For example, the time it takes to load your data will depend on how much data you want to load, whether you use "lookup" tables, whether your database is stored on a RAID system, and so on.

Also, these goals sometimes conflict. For example, to minimize query time, you may have to create and maintain summary tables. But if you do this, administration and maintenance time increases, and the size of your database grows. And as you make these database decisions, don't forget that people who look at your data will, at some point, want to audit and compare it with the data in your Web server log files.
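
To make this concrete, here is one possible starting point for the central fact table. Every name in it is invented for illustration; a real design would add the lookup and summary tables discussed above.

#!/usr/local/bin/perl
# A hypothetical "hits" fact table. cookie_id holds the "machine ID" cookie,
# and section would be derived from url_path at load time.
use strict;
use DBI;

my $dbh = DBI->connect('dbi:Sybase:server=TRACKING', 'user', 'password',
                       { RaiseError => 1 });

$dbh->do(q{
    CREATE TABLE hits (
        hit_time     DATETIME     NOT NULL,
        url_path     VARCHAR(255) NOT NULL,
        section      VARCHAR(32)  NULL,
        status_code  SMALLINT     NOT NULL,
        is_pageview  TINYINT      NOT NULL,
        membername   VARCHAR(32)  NULL,
        cookie_id    VARCHAR(32)  NULL,
        remote_ip    VARCHAR(15)  NULL,
        referer      VARCHAR(255) NULL,
        user_agent   VARCHAR(255) NULL
    )
});

$dbh->disconnect;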

Finally, if you have experience designing data warehouses, do a clean boot of your brain. This will be unlike any other data warehouse you have designed. For example, a merchandiser like Wal-Mart knows what products it sells and at which stores it sells them. For each product, it knows what category it belongs to, who manufactures it, and what it costs. For each store it knows which geographic region it's in, what country it's in, and its size. All of these "dimensions" are limited in the number of values they can have: when a merchandiser loads sales data into its data warehouse, it doesn't have to deal with unknown entities.

Your tracking data warehouse application, however, will constantly deal with unknowns. You don't know what domains visitors will be coming from, where referrals will be coming from, or what browsers those visitors will be using. And when your users enter information into forms, you may not know what values they'll be entering (especially if your forms contain text fields). And there's no telling how many values these "dimensions" will have.

So pick your tools wisely, and get tracking.

Thursday, October 02, 2008

Tracking Your Web Visitors

[
The article below was originally published on WebMonkey in 1998, but Lycos has moved WebMonkey to a wiki and hasn't moved all of the old articles ;^(

Note that it assumes that web content is made up of static pages. This is becoming less and less the case as interactivity and personalization are enabled. Industry players, such as the Internet Advertising Bureau, are now focusing on metrics for this new paradigm.
]

Don't Forget About Tracking

So you've created the ultimate Web site, and now you're sitting back watching your hit counter go wild. You may ask yourself, "I wonder how many pageviews my help page is getting?" or, "I wonder how many people are visiting my site?"

Unfortunately, when most people start building a Web site, they don't consider that they might someday want to track its traffic. It takes enough time just to design the site and create the content. Outlining what information they want to track is just more work that already overworked staffs tend to let slide.

But when it comes down to it, we all quickly become bean counters on the Web. Once a site is up and running, we want to know how many people are looking at our pages and how many pages each of those people is looking at. That's usually when a lot of Web developers discover that had they spent more time thinking about setting up their site, they'd be able to track how it's being used much more easily.

If you're in this situation right now, you've come to the right place. And if you haven't made your site public yet, you're lucky - you still have time to think about reporting before your design is set in stone. Don't miss out on this chance!

What Information is Available?

Before you can decide what type of analysis you want to do, you need to know what information is available. Unfortunately, there's not much tracking data you can collect, and what you can get is unreliable. But don't despair - you can still gain useful knowledge from what does exist.

Your Web servers can record information about every request they get. The information available to you for each request includes:
  • The date and time of the request
  • The name and path of the file requested
  • The status code your server returned
  • The visitor's IP address (and, via DNS, host name)
  • The visitor's member name, if your site uses authentication
  • Any cookies the visitor's browser sent (and any cookies your server set)
  • The referring URL ("referer")
  • The visitor's user-agent string (browser, version, and platform)

Inaccurate, But Not Useless

As I mentioned before, the information you have available is inaccurate but not completely unreliable. Although this data is inexact, you can still use it to gain a better understanding of how people use your site.

To start things off, let's take the 10,000-foot view of everything available and then drop slowly toward the details. So, first let's talk about hits and pageviews. (If you didn't know already - there is a difference. A hit is any request for a file your server receives. That includes images, sound files, and anything else that may appear on a page. A pageview is a little more accurate because it counts a page as a whole - not all its parts.)

As you probably already know, it's quite easy to find out how many hits you're getting with a simple hit counter, but for more precise analysis, you're going to have to store the information about the hits you get. An easy way to do this is simply to save the information in your Web server log files and periodically load database tables with that data or to write the information directly to database tables.

(For those database-savvy readers: to load database tables periodically, you can write your own loader using a 3GL and ODBC or RDBMS-dependent APIs, you can use data-loading tools from the RDBMS vendor - such as Sybase's BCP - or you can use a third-party data-loading product.)

If you load your data directly into a database, you will either need a Web server with the capability already implemented (such as Microsoft's IIS), or you will need the source code for the server. Another option is to use a third-party API, like Apache's DBILogger.

Once you do that, you can gather information about how many failed hits you're getting - just count the number of hits with a status code in the 400s. And if you're curious, you can drill down farther by grouping by each status code separately.
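
For example, a failed-hit census can be run straight off the log before you even have a database. This sketch assumes a common- or combined-format access log on standard input.

#!/usr/local/bin/perl
# Count hits with a 4xx status, plus a per-code breakdown.
use strict;

my $failed = 0;
my %by_code;

while (<STDIN>) {
    # the status code is the field right after the quoted request string
    my ($status) = /"\s+(\d{3})\s/;
    next unless defined $status && $status >= 400 && $status <= 499;
    $failed++;
    $by_code{$status}++;
}

print "failed hits: $failed\n";
print "  $_: $by_code{$_}\n" for sort keys %by_code;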

Pageviews

On the whole, though, counting hits isn't as informative as counting pageviews. And the results aren't comparable to those of other sites (see the Internet Advertising Bureau's industry-standard metrics [this link is dead and I can't find the old document. The IAB is now focused on metrics for web 2.0]).

To count pageviews, you need to devise some method of differentiating hits that are pageviews from those that are not. Here are some of the factors we take into account when doing this at Wired Digital:
  • Name of the file served

  • Type of the file served (HTML, GIF, WAV, and so on)

  • Web server's response code (for instance, we never count failed requests - those with a status code in the 400s)

  • Visitor's host (we don't count pageviews generated by Wired employees)
Once you've determined which hits are pageviews and which are not, you can count the number of pageviews your site gets. But you'll probably want to drill down in your data eventually to determine how many pageviews each of your pages gets individually. Furthermore, if you split your site into channels or sections - we separate our content into HotBot, HotWired, Wired News, and Suck - you may want to determine how many pageviews each area gets. This is where standards for site design can help.

Here at Wired Digital, we've put into place a standard stating that the file path determines where hits to a given file will be reported. For example, a pageview to http://www.webmonkey.com/webmonkey/98/13/index0a.html is counted as a pageview for Webmonkey, whereas a pageview to http://hotwired.lycos.com/synapse/98/12/index3a.html is counted as a pageview for Synapse (because Jon Katz is a Synapse columnist).

If this standard is in place at all levels of your site, you can summarize and drill down through your pageviews at will. Of course, there are some problems with this method. You may want to count a pageview in one section part of the time and in another section at other times. There are ways (that I won't go into now), however, to get around these problems. We've found over the years that this method works best - at least for us.
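
Here is a rough sketch of what such a path-based rule can look like. The section names echo the examples above, but the patterns, the host filter, and the file-type list are all placeholders.

#!/usr/local/bin/perl
# Classify a hit: return a section name if it counts as a pageview, or nothing.
# Assumes the path, status code, and remote host are already parsed out of a hit.
use strict;

my @sections = (
    [ qr{^/webmonkey/} => 'Webmonkey' ],
    [ qr{^/synapse/}   => 'Synapse'   ],
    [ qr{^/}           => 'Other'     ],
);

sub classify {
    my ($path, $status, $host) = @_;
    return if $status >= 400;                  # never count failed requests
    return if $path =~ /\.(gif|jpe?g|wav)$/i;  # images and sounds aren't pageviews
    return if $host =~ /\.wired\.com$/i;       # example: skip our own employees
    for my $s (@sections) {
        return $s->[1] if $path =~ $s->[0];
    }
    return;
}

# e.g. classify('/webmonkey/98/13/index0a.html', 200, 'dialup42.big-isp.net')
# returns 'Webmonkey'.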

Looking Deeper Into Pageviews

Once you've cut your teeth on some programs designed to retrieve the types of information I've just explained, you should be able to use your knowledge to code programs to give you the following:
  • Pageviews by time bucket: You can look at how pageviews change every five minutes for a day. This will tell you when people are accessing your site. If you also group pageviews by your visitors' root domains, you can determine whether people visit your site before work hours, during work, or after work.

  • Pageviews by logged-in visitors vs. pageviews by visitors who haven't logged in: What percentage of your pageviews come from logged-in visitors? This information can help you determine whether allowing people to log in is worthwhile. You can also get some indication of how your site might perform if you required visitors to log in.

  • Pageviews by referrer: When your visitors come to one of your pages via a link or banner, where do they come from? This information can help you determine your visitors' interests (you'll know what other sites they visit). And if you advertise, this information can help you decide where to put your advertising dollars. It can also help you decide more intelligently which sites you want to partner with - if you're considering such an endeavor.

  • Pageviews by visitor hardware platform, operating system, browser, and/or browser version: What percentage of your pageviews come from visitors using Macs? Using PCs? From visitors using Netscape? Internet Explorer? It will take a bit of work to cull this information out of the user agent string, but it can be done. Oh, and since browsers are continually being created and updated, and therefore the number of possible values in the user agent string continues to grow larger, you'll have to keep up to date on whatever method you use to parse this information.

  • Pageviews by visitors' host: How many of your pageviews come from visitors using AOL? Earthlink?
Note that you may want to mix and match these various dimensions. For example, how do your referrals change over time? Does the relative percentage of Netscape users vs. Internet Explorer users change over the course of the day? Does one area of your site seem to interest Unix users more than other areas?
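
As one example of slicing along the time dimension, this sketch counts pageviews per five-minute bucket. It assumes one epoch timestamp per pageview on standard input; bucketing by other dimensions works the same way with a richer hash key.

#!/usr/local/bin/perl
# Pageviews per five-minute bucket.
use strict;
use POSIX qw(strftime);

my %bucket;
while (<STDIN>) {
    chomp;
    my $slot = int($_ / 300) * 300;           # round down to a 5-minute boundary
    $bucket{$slot}++;
}

for my $slot (sort { $a <=> $b } keys %bucket) {
    printf "%s  %d\n", strftime('%H:%M', localtime $slot), $bucket{$slot};
}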

How To Count Unique Visitors

Now let's talk about visitor information. Look at the bulleted paragraphs above and replace the word "pageviews" with the word "visitors." Interesting, huh? Unfortunately, counting visitors is more difficult than counting pageviews.

First off, let's get one thing out in the open: There is absolutely no way to count visitors reliably. Until Big Brother ties people to their computers and those computers scan their retinas or fingerprints to supply you with this information, you'll never be sure who's visiting your site.

Basically, there are three types of information you can utilize to track visitors: their IP addresses, their member names (if your site uses membership), and their cookies.

The most readily available piece of information is the visitor's IP address. To count visitors, you simply count the number of unique IP addresses in your logs. Unfortunately, easiest isn't always best. This method is the most inaccurate one available to you. Most people connecting to the Net get a different IP address every time they connect.

That's because ISPs and organizations like AOL assign addresses dynamically in order to use the limited block of IP addresses given to them more efficiently. When an AOL customer connects, AOL assigns them an IP address. And when they disconnect, AOL makes that IP address available to another customer.

For example, Sue connects via AOL at 8 a.m. and is given the IP address 152.163.199.42, visits your site, and disconnects. At 10 a.m., Bob connects via AOL and is assigned the same IP address. He visits your site and then disconnects. Later, as you're tallying the unique IP addresses in your logs, you'll unknowingly count Sue and Bob as one visitor.

This method becomes increasingly inaccurate if you're examining data over longer time periods. We only use this information in our calculations at Wired Digital as a last resort, and then only when we're looking at a single day's worth of data.

If you allow people to log in to your site through membership, you have another piece of information available to you. If you require people to log in, visitor tracking becomes much easier. And if you require people to enter their passwords each time they log in, you're in tracking heaven. As we all know, though, there's a downside to making people log in - namely that a lot of people don't like the process and won't come to your site if you require it.

If you do force people to log in, however, you can count the number of unique member names and easily determine how many people visit your site. If you don't force people to log in, but do give them the option to do so, you can count the number of unique member names; then, for those hits without member names attached, you can count the number of unique IP addresses instead.

Lastly, you can add cookies to your arsenal. Define a cookie that will have a unique value for every visitor. Let's call it a machine ID (I'll explain this later). If a person visits you without providing you with a machine ID (either because she hasn't visited your site before or because she's set her browser not to accept cookies), calculate a new value and send a cookie along with the page she requested.
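
A minimal sketch of handing out such a cookie follows. The cookie name, the way the value is generated, and the expiration date are all invented for illustration.

#!/usr/local/bin/perl
# Check for an existing "machine ID" cookie and mint a new value only if there isn't one.
use strict;

my ($machine_id) = ($ENV{HTTP_COOKIE} || '') =~ /\bMACHINE_ID=([^;]+)/;
my $header = "Content-type: text/html\n";

unless ($machine_id) {
    # cheap unique-ish value: time, process id, and a random number
    $machine_id = join '.', time(), $$, int(rand 1_000_000);
    $header .= "Set-Cookie: MACHINE_ID=$machine_id; path=/; "
             . "expires=Thu, 31-Dec-2009 00:00:00 GMT\n";
}

print $header, "\n";
print "<html><body>Your machine ID is $machine_id</body></html>\n";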

So now you can count the number of unique machine IDs in your log. But there are still a couple of issues that we need to discuss. First, as I've already mentioned, many people turn off their cookies, so you can't rely on cookies alone to count your visitors. At Wired Digital, we use a combination of cookies, member names, and IP addresses to count visitors, with the caveat that, as I said earlier, we don't use IP addresses when counting more than a single day's traffic.

Second, the cookie specification allows browsers to delete old cookies. And even if this option wasn't specified, a user's hard disk can always fill up. Either way, the cookies you send to a visitor may be removed at some point. So it's possible that a person who visits your site at 8 a.m. will no longer have your cookie when they return at 9 a.m.

Third, when your Web server sends a cookie to a visitor, it's stored on the visitor's machine - so if a person visits your site from home in the morning using her desktop machine and visits again from work using another PC, you'll log two different cookies. Which is why I've called the cookie a "machine ID": it's tied to the machine, not the visitor.

Which brings us to issue number four: Multiple people may use the same machine, in which case you'll see only one cookie for all of them.

Fifth, various proxy servers may handle cookies differently. It's possible that a given proxy server won't deliver cookies to the user's machine. Or it might not deliver the correct cookie to the user's machine (it might even deliver some other cookie from its cache). Or it might not send the user's cookie back to your Web server. Unfortunately, proxy servers are still young. There is no formal and complete standard for how they're supposed to work, and there's no certification service to ensure that they'll do what they're supposed to do.

So with all these issues to consider, here's what we do at Wired Digital:
  • If we want to count visitors for one day, we count member names.

  • For hits that don't have member names, we count cookies.

  • For hits that have neither member names nor cookies, we count IP addresses.
And if we want to count visitors over multiple days, we only use cookies. We do some statistical analysis in an attempt to determine how much of an undercount results - but in the end, all these calculations are only estimates.
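
A single-day count using that priority order might look like the sketch below. It assumes each hit has been reduced to a "membername <tab> cookie <tab> ip" line, with "-" marking a missing value.

#!/usr/local/bin/perl
# One-day unique visitor estimate: membername first, then cookie, then IP address.
use strict;

my %visitors;
while (<STDIN>) {
    chomp;
    my ($member, $cookie, $ip) = split /\t/;
    my $key = $member ne '-' ? "m:$member"
            : $cookie ne '-' ? "c:$cookie"
            :                  "i:$ip";
    $visitors{$key}++;
}

printf "unique visitors (estimate): %d\n", scalar keys %visitors;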

There's one more issue we need to discuss. Do you want to track the information you have over multiple days? Or is one day's worth enough? If one day's data will suffice, you can get away with simple programs that process your log files. If you prefer to process multiple days' information, however, you'll want to store it all in a database.

Wednesday, October 01, 2008

Online Privacy: What Do They Know About Me?

[I first published this article several years ago. I have updated it with current information]

Several years ago I wrote a set of articles for WebMonkey discussing the information a web site can gather about visitors; how to gather, store, and use that information; and limitations of the gathered information. Those articles were geared toward web site owners who wanted to know how their web sites were being browsed.

Conversations over the years -- and particularly several recent conversations -- have convinced me of the need for an article discussing this topic as it applies to you, the Web user. Some people I’ve talked with have thought web sites could automatically get any information they want about them when they visit their sites. Other people thought they could be completely anonymous. Most people did not have the knowledge of underlying technologies and businesses necessary to understand the full reality. In this article I hope to provide some of that information.

Privacy vs. Security

Before beginning the discussion, I want to differentiate privacy from security. I’m sure you can come up with your own definitions of these terms, and you can find a variety of others elsewhere. For the purpose of this article, I define privacy as having others know only those things about you that you want them to know, whereas security means ensuring that the information you have and/or provide to someone is inaccessible to unauthorized people. While security is very important (and may be worthy of a future article), this article only covers privacy.

What Information Is Available?

Independent of the Internet, the first thing you should know is that there is almost assuredly a lot of information about you stored in commercial databases and available for sale. Types of information about you that may be available include:
  • Home address (available from the U.S. Postal Service)
  • Credit records (if you use credit cards)
  • Home ownership history
  • Purchase history
  • History of having children
  • Magazine subscription history
  • Anything you may have supplied in response to surveys and on registration forms
  • Legal records
There are a variety of companies that gather and compile databases containing information about individuals. As mentioned above, the U.S. Postal Service maintains a database of consumers’ current addresses. Experian, Trans Union, and Equifax maintain large databases containing consumer information used for credit reporting. These companies, as well as many others, sell or “rent” consumer information to organizations that want to know more about you. Though old, an article in the Washington Post is an informative read.

SWIPE provides a page describing how you can get your personal records from several organizations.

So what do these companies do with their databases? They provide their clients with information about consumers who their clients would find of interest. For example, an automotive magazine might want the names of people who buy certain types of cars so that it can send offers to them. Database companies also enable clients to learn more about their customers by matching their database records with the information clients have about their customers. So, for example, you may provide an automotive magazine with only your name and address, but by using a database company’s services, the magazine publisher can determine your credit worthiness or your history of auto purchases.

What does this have to do with the Web?

The salient point here is that if a web site is able to gather one or a few key pieces of information about you (such as name and address, or social security number, or credit card number), it can gain a lot of information about you.

But what if you haven’t provided any information about you to the web site? What can the web site owner learn about you? To discuss this, we must start with some basics.

The Basics

When you open your browser, click on a link, or type a url (web page address) and click “go”, your browser sends a request to a web server for the page you want. Along with the url requested, your browser sends other information to the web server:
  • Your ip address. An ip address is a set of 4 numbers separated by periods. An ip address is assigned to your computer when you connect to a network. Your computer’s ip address is different than everyone else’s on the Internet. But it’s not quite as informative as you’d think. You’ll learn why in the discussion below.

  • Browser information (usually type and version), and often the operating system you are using.

  • If you click on a link, the url of the page you were at when you clicked on the link. This is called the “referer” (yes, that is the official spelling, even though it is incorrect).

  • Cookies that might exist for that web site (more on this below).

Anonymizer.com will show you what information your browser sends.
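
If you run your own web server, you can see the same thing with a tiny CGI in the spirit of the env.cgi example in the tracking articles above - everything it prints arrives with an ordinary page request, with no help from you.

#!/usr/local/bin/perl
# Echo back the request information the browser sent along with this page request.
print "Content-type: text/plain\n\n";
print "Your IP address:   $ENV{REMOTE_ADDR}\n";
print "Your browser:      $ENV{HTTP_USER_AGENT}\n";
print "Referring page:    $ENV{HTTP_REFERER}\n";
print "Cookies you sent:  $ENV{HTTP_COOKIE}\n";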

It’s important to state that your browser does NOT send your name, email address, or other information to web sites - with a caveat about cookies (which, again, we will discuss further below).

IP Address

First let’s talk about the ip address. I stated that ip addresses are not as informative as you would think because your ip address may not always be the same. Every time you connect to your ISP (AOL, Earthlink,...) using a modem, you are assigned a different ip address. If you have a broadband connection to the Internet (cable, dsl...), your ISP may assign your computer a different ip address when you re-connect. And the same may be true of your computer at work. Every time you restart your computer at work, your company’s network may assign you a new ip address.

So, bottom line, your computer’s ip address is not a good vehicle for enabling web site operators to identify you.

With that said, your ip address can be used to determine 1) what ISP you use and 2) where you are (in rough terms - not down to your exact address, but sometimes down to the city level).

This Wired News article discusses ip geolocation capabilities.

Cookies

Your web browser allows web sites to place bits of information on your computer. And it allows web sites to retrieve these bits of information from your computer. For example, abc.com could drop a cookie on your computer containing the date and time you visited their site. The next time you visit abc.com, your browser will pass this information back to the site. So now abc.com knows when you last visited their site.

Web sites use cookies for a variety of purposes. Some examples include:
  • When you see a checkbox on a web site’s logon page that enables you to log onto that web site without providing your id and password every time, there’s a good chance that the web site is storing your id and password in a cookie.

  • Web sites may also drop “session” cookies on your computer when you visit them for reporting purposes. The session cookie exists until you close your browser or until a specified amount of time has passed since you last requested a page from the site (usually 20 or 30 minutes), and the web site uses it to review how long visitors stay, how many pages they look at, and how they traverse through their sites.

  • Web sites may store information that makes personalization and form-filling easier. For example, sites that greet you with “Hi, Bill” very probably have your name stored in a cookie.

Now an important point must be made about cookies: cookies that one web site drops on your computer cannot be retrieved by another web site. So if you give your name to abc.com, and it drops a cookie on your computer, the web site xyz.com cannot get at that cookie.

So my privacy is assured, right?

Wrong! Forgetting about the Web for a second, let’s not forget that web site operators can sell your information. Legally - or illegally.

But back to the Web. A bit more on the basics. When you request a web page, your browser actually ends up making multiple requests. Every picture and graphic you see on the page is the result of a separate request. And different parts of a page can result from separate requests. So, even though you request the page from abc.com, some requests may have actually gone to xyz.com. Even worse, abc.com may place identifying data into the requests you make from xyz.com. So you may have never provided xyz.com any information about you, but because you provided abc.com information about you and you requested a page from abc.com that resulted in requests to xyz.com, xyz.com now has information about you!

And note that this isn't a theoretical scenario. Thousands of web sites don't put up the advertisements you see on their sites - they allow companies like AOL, DoubleClick (now part of Google), 24/7 Realmedia, Atlas DMT, ValueClick, and others to control the advertising space on their sites. So, for example, when you go to the Wall Street Journal Online, the page you request will call up ads from DoubleClick. Now imagine that DoubleClick serves ads for thousands of web sites. If DoubleClick drops a cookie onto your pc when you visit the Wall Street Journal Online, and then you visit New York Times on the Web (which also contracts DoubleClick to serve ads on its site), DoubleClick now knows that a single individual visited both sites. And if you've provided personal information to one of these sites, and it passes identifying information to DoubleClick, it's feasible that DoubleClick can provide the other site with that identifying information (note that I'm not saying DoubleClick actually does provide this service, nor that its customers provide it with identifying information - I'm just saying it is feasible).

A Quick Discussion about Email

Email can be sent to you in either plain text or HTML format (meaning formatted like a web page). If your email software is configured to allow the display of graphics and to allow JavaScript and/or VBScript, emails to you can be tracked: the message can reference a tiny image whose URL is unique to you, so the sender's server learns the moment your software displays it. Emailers will be able to determine if and when you read their emails.

Also, unless you encrypt the emails you send, they can easily be read as they travel over the Internet, just as postcards can be easily read during their travels to their destinations.

Sounds Hopeless - What Can You Do?

So even if you don’t provide personal information to abc.com, it might be able to get that information from some other organization. What can you do?

First, you must decide how important it is for you to control information about you. Because the more you try to protect your privacy, the less useful you will find the Web. Given that you want to maintain some control, you can take the following steps (in order of increasing inconvenience to you):
  • Opt out of as many lists as you can. Start with the companies listed on SWIPE’s site.

  • Browse the Web using privacy software such as Tor or services such as Anonymizer, Anonymise.com, or MisterPrivacy.com.

  • Configure your browser (and email software) to turn off image loading. Images are often advertisements. If you turn off image loading, many advertisements will not be requested. Note that doing this does not preclude your browser from sending information to web sites via JavaScript.

  • Configure your browser to disallow pop-up windows. Since many pop-up windows are displayed for the purpose of displaying ads, this will serve to block requests for those ads.

  • Configure your browser (and email software) to turn off JavaScript and VBScript. This handles the issue described above. But it also means you will lose some functionality at many web sites.

  • Configure your browser to turn off cookies. Note that when you do this, many sites will no longer be able to log you in automatically, and many other sites won’t allow you to visit at all.

  • Encrypt your emails. You may need special software to do this, and your email recipients may have to have special software to decrypt them.

  • Don’t give out information about you in the first place. Note that this will preclude you from shopping online and from being able to visit many sites that require registration (of course you can provide untrue information in the latter case, but for legal reasons I can’t recommend that).

  • When you shop offline, use cash instead of credit cards, debit cards, or checks.

  • Move into the wilderness or buy an island and live off the land.

Bottom Line

While your browser doesn't directly send personal information to web sites (beyond what they have already saved in cookies on your computer), your privacy is far from assured as you surf the Web.

Friday, September 26, 2008

IT and organizational strategy

Yesterday I attended the InformationWeek500 Virtual Event. In one presentation, Rob Preston, Editor-in-Chief, discussed survey findings regarding the status of CIO's in their organizations. Predictably, the survey reveals that IT is seen as a cost center rather than a provider of strategic capabilities, and CIO's are regarded less as equal members of the executive team than as managers of utility services.

The problem with this common belief is that it is self-fulfilling. IT is not included in strategy determination, and it is not funded in such a way as to be able to provide strategic capabilities, thereby proving the original belief. This is harmful to organizations, as they are missing out on opportunities to differentiate themselves, their products and services from their competition's; to make significant improvements in their processes and productivity; and to provide increased value to their customers and shareholders. And missing these opportunities is undoubtedly financially harming these organizations.

I asked Rob how we could educate CXO's on a larger scale than individually. He stated that he sees no other way, and that if your organization doesn't see the value of IT, you should move to another one. Though frustrated by his answer, I know I can't expect an easy answer to this problem, given its prevalence and longevity.

However, I don't think we should simply throw up our hands and ignore the issue. All this would do is serve to let this less-than-optimal situation continue unabated. So here's my thought: we need to step out of our IT thinking, and try to think like marketers. We need to stop complaining amongst ourselves and start educating our business partners. But in my experience, simply providing these people with a vision of the value we could provide doesn't convince. Rather, I think we need to start compiling real-world examples of the value IT can provide - as well as real-world examples of the downside of ignoring IT. Here's a start:
  • Harrah's Entertainment has measurably increased sales and customer satisfaction by gathering and using detailed information about its customers and their transactions.
  • Dell became a leading pc vendor by using IT to create a super-efficient and flexible supply chain. It is now using IT to host a web site allowing its customers to have a say in the products it offers.
  • Amazon.com was able to become one of the world's largest book sellers and a major retailer because of the IT infrastructure it put in place to take orders, suggest related products, and quickly fulfill those orders. Now it uses that infrastructure to enter new markets, such as providing ecommerce hosting and offering computing services.
  • Wal-Mart first connected its stores together in the 1960's(!), implemented its own satellite network in 1987, and continued to use IT to track inventory, stock shelves, and develop an efficient supply chain. Wal-Mart is currently a significant influence in the adoption of rfid for tracking product.
  • GM has leveraged IT to "cut the delivery time on new vehicles from 70 days to 30, and saved the company millions per year in crash testing by moving to digital simulation."
While it's possible that rattling off examples such as these may have an effect on our business partners, wouldn't it be great if the big IT consulting companies such as IBM, HP, and EDS would create marketing campaigns highlighting the importance of IT?

Monday, August 25, 2008

tracking the location of your laptop

The University of Washington has published free, open source software, called Adeona, that you can use to track the location of your laptop. It periodically communicates the laptop's location (its ip address) to a distributed set of servers (OpenDHT). With a Mac OS X laptop, you can also have it periodically capture and send pictures to those servers.

When you install the program, you will be provided with a complex, unique identifier. You must save that id in a separate location (e.g., a thumb drive). You will need it if your laptop is stolen.

If your laptop is stolen, you would then install Adeona on another computer and run a program to retrieve the location of your laptop. You would then call the authorities to help you get the laptop back.

Monday, July 28, 2008

Office Depot Good Deed

It's not often that you hear of a corporation going out of its way to be a good citizen, so I figure when I see one that does, I should publicize it. Office Depot just sent me a rebate check - but not one they owed me. They paid me the rebate I was owed from the manufacturer. Granted I bought the product at Office Depot, but they could easily have just hoped that I'd forget about it. Instead they made good on the manufacturer's promise. Here's the letter they sent me:

Dear Office Depot Customer,

Enclosed is your rebate payment in connection with your purchase of [...]
from Office Depot [...] We understand that the receipt of your rebate may have
taken longer than anticipated, and we regret any inconvenience that this delay
may have caused you. As you may know, this particular rebate was not
issued by Office Depot; rather it was issued by a vendor, from whom we acquired
the merchandise you bought at Office Depot. This vendor was directly
responsible for funding the rebates, but failed to do so.

In light of that failure and because Office Depot values you and your
expectations as an Office Depot customer, Office Depot has determined to satisfy
the rebates itself. Please accept your rebate with both our apologies for
the delay in getting it to you and our thanks for your loyalty and
business.

Thank you again for your understanding.

Sincerely,
John Lostroscio,
Vice President, Merchandising.
Good job, Office Depot. You've scored a point with me, and hopefully with the people who read this blog.

Sunday, July 27, 2008

Highly Recommended Programs and Services

There are a lot of interesting and compelling programs and services - more than any single person can keep up with. But out of this multitude, there are a limited number that I think everyone should seriously think about using. This is a table of contents to a list of articles about the software programs and services that I highly recommend.

Keeping online backups with Syncplicity

For a while I've been wanting to write a set of articles discussing the programs and services for Windows users that I most highly recommend. There are a lot of interesting and compelling programs and services - more than any single person can keep up with. But out of this multitude, there are a limited number that I think everyone should seriously think about using. So I'm finally getting around to it. This is the first of those articles.

Here's the issue: your pc can be stolen, it can die, it can burn up in a fire, you can take a hammer and destroy it in frustration ;^) And unless you have backups of your important files, you are up the proverbial creek.

What's the most common response to this risk? Copy your files to a usb drive, write them to a CD/DVD, or write them to an external hard drive.

But there are two problems with this solution. One, you have to remember to do this (or actually do it when your backup program prompts you). And two, unless you store the backup somewhere other than where you live (e.g., a safe deposit box), a fire or theft leaves you up the same creek. Oh, and a third problem - this method is an ongoing pain to do.

So here's my strong recommendation: use Syncplicity to automatically keep up-to-date copies of your important files on Syncplicity's disk storage.

When you sign up with Syncplicity, you will be provided with 2GB of storage space. If you pay $10/month, or $100/year, you will be provided 40GB of storage space. And you can get another 50GB by paying another $10/month or $100/year.

And - most importantly for the purpose I'm discussing - Syncplicity makes available a software program you can use to automatically keep your important files up-to-date on Syncplicity's disk storage. Use it!

Syncplicity has a couple other nice features:
  • You can access your files from anywhere you have an Internet connection.
  • You can share any of your folders with other people.
  • You can keep files synchronized between multiple pc's. I have several files I want both at home and at work, and this service allows me to edit the files without worrying that they will get out of sync.
Now you may be concerned about security - can other people on the Internet (or, for that matter, Syncplicity employees) view these files? In the spectrum of reckless abandon to total paranoia, I tend to lean toward paranoia. Syncplicity states that your files are strongly encrypted on their servers. Even so, I encrypt my important files on my pc before I let Syncplicity at them. It adds an extra step with some of my files, but that's what I get for leaning toward the paranoid side of the spectrum. Note: I'll talk about encryption utilities in a coming article.

Bottom line: until now, keeping backups of your important files has always been a painful process, so most people don't do it. Syncplicity makes this a no-brainer. Just do it!

Wednesday, July 23, 2008

OSCON 2008

I've been going to these for several years now, and it's great to see that each conference is better than the last - the sessions are more interesting, and the expo hall has more - and more interesting - vendors.

Evan Henshaw-Plath and Kellan Elliott-McCrea gave a very thought-provoking - and entertaining - presentation offering a way to provide subscription updates when RSS polling becomes untenable. Kellan gave an eye-opening example: on 7/21/08, FriendFeed requested RSS updates from Flickr almost 3 million times! That alone got a laugh from the audience. But then he said that those requests covered only 46,000 Flickr users, and only 6,700 of them had even logged onto Flickr in the past 24 hours! And that's before going one step further - only a subset of those 6,700 actually uploaded new pictures. It's so head-slapping that I have to state it once again: 3 million update requests for fewer than 6,700 updates!

So is there a better way? Evan and Kellan's idea is to use the Jabber XMPP protocol. The client (in this case FriendFeed) opens a connection with the server (in this case Flickr) and tells the server what updates it wants (in this case the pages of those 46,000 users). The server then notifies the client whenever one of those pages is updated. Much more efficient than polling RSS. Check out their slides - they are entertaining as well as informative.
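To make the difference concrete, here's a minimal sketch in Python - my own illustration, not anything FriendFeed or Flickr actually runs - contrasting a consumer that polls every feed on a schedule with a hub that pushes a notification only when something actually changes:

    import time

    # Polling (RSS-style): one request per user per cycle, whether or not
    # anything changed - this is where the ~3 million requests come from.
    def poll_feeds(fetch_feed, user_ids, interval_seconds=60, cycles=1):
        for _ in range(cycles):
            for uid in user_ids:
                feed = fetch_feed(uid)  # usually returns the same entries as last time
            time.sleep(interval_seconds)

    # Push (the XMPP/publish-subscribe idea): subscribe once, get notified
    # only when a subscribed user actually has an update.
    class UpdateHub:
        def __init__(self):
            self.subscribers = {}  # user_id -> list of callbacks

        def subscribe(self, user_id, callback):
            self.subscribers.setdefault(user_id, []).append(callback)

        def publish(self, user_id, entry):
            for cb in self.subscribers.get(user_id, []):
                cb(user_id, entry)  # one message per real update

    hub = UpdateHub()
    hub.subscribe("some_flickr_user", lambda uid, entry: print(uid, "posted", entry))
    hub.publish("some_flickr_user", "new_photo.jpg")  # fires exactly once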

Dave O'Flynn gave an interesting presentation about Atlassian's attempt to define a generic set of APIs for authentication and authorization. A couple of months ago Atlassian realized that maintaining a separate solution for each product was becoming more and more burdensome. Rather than creating a solution for themselves alone, they decided that a public, open specification would be good for the whole industry. With a generic API, the back-end implementation could be swapped out without breaking the client applications. Several people in the audience were skeptical, mentioning that alternatives already exist, such as OpenID, Higgins, and OAuth. Dave's response was that the current options are either incomplete or too difficult to use, and that, under the covers, implementations could use those technologies if desired - the APIs would simply hide a lot of the complexity. It's a good idea that, if it comes to fruition, could make application development within organizations easier by leading them to develop common authentication/authorization utilities.
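To illustrate the kind of thing I understood Dave to be describing - and this is purely my own hypothetical sketch, not Atlassian's proposed spec - the point is that applications code against a small, stable interface while the back end (a database, LDAP, OpenID, whatever) can be swapped without touching the callers:

    # Hypothetical generic authentication/authorization interfaces.
    class Authenticator:
        def authenticate(self, username, credential):
            raise NotImplementedError

    class Authorizer:
        def is_permitted(self, username, action, resource):
            raise NotImplementedError

    # One possible back end; an LDAP- or OpenID-backed implementation could be
    # dropped in without the calling application changing at all.
    class InMemoryAuth(Authenticator, Authorizer):
        def __init__(self, users, grants):
            self.users = users    # {username: password}
            self.grants = grants  # set of (username, action, resource) tuples

        def authenticate(self, username, credential):
            return self.users.get(username) == credential

        def is_permitted(self, username, action, resource):
            return (username, action, resource) in self.grants

    auth = InMemoryAuth({"alice": "s3cret"}, {("alice", "edit", "wiki:home")})
    print(auth.authenticate("alice", "s3cret"))             # True
    print(auth.is_permitted("alice", "edit", "wiki:home"))  # True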

John Ferraiolo gave a hopeful presentation on the OpenAjax Alliance, and its work to create a standard for Ajax frameworks and widgets (allowing them to coexist and interoperate) and a framework that enables widgets from different domains to communicate with each other on a single page. It's a great idea - as he said, right now we're building a tower of Babel. I've already seen how frustrating it can be to not be able to use multiple Ajax frameworks simultaneously, so I applaud their work. And I look forward to seeing how valuable multi-domain mashups become.

On the Expo floor, I talked with Yahoo about Zimbra. Yahoo is providing a valuable service in offering a compelling alternative to Exchange and Outlook for the enterprise space (competition is always good). My initial concern was about committing to this given Microsoft's interest in acquiring Yahoo. I'm sure Microsoft would love nothing more than to kill this product. It is open source, so Microsoft can't kill it outright; but if it acquires Yahoo, it could kill Yahoo's commercial licensing and support. The Yahoo representative I spoke to believes that Zimbra customers would be so up in arms that they would force Microsoft to continue supporting it. For example, Comcast is licensing Zimbra for its millions of customers. Angering customers so large would be a risky move, but Microsoft doesn't seem to mind making risky moves sometimes. Even so, my belief is that, even if Microsoft were to acquire Yahoo and kill Zimbra support, another organization would immediately offer Zimbra support. I have one recommendation for Yahoo - if a Microsoft acquisition ever gets to the point of seeming imminent, Yahoo should open source the closed-source, value-added functionality of its commercial offering.

I had also hoped to learn more about Yahoo's plan to make IndexTools freely available. The Yahooites I spoke with today didn't know anything about this. Hopefully I'll be able to find out more tomorrow.

Wednesday, July 16, 2008

Lunch 2.0 at Souk today

Went to Lunch 2.0 at Souk in downtown Portland today. As usual, met interesting people:
  • Jim Helms, the blogger behind Today's Best Tools. A medic/soldier turned blogger.
  • Lea, blogger behind A.R. and Proud and Camp Naughty, and half of the team that published the widget treasurelicious. She blogged about a greater variety of subjects until a posting about Oprah's use of the word "vajayjay" resulted in a surprising number of comments. So now she's devoting more blogspace to sex education.
  • Dawn Foster (aka famous GeekyGirlDawn), social media consultant and a driving force behind the Portland tech community.
And if you need flexible office space, check out Souk.

Taking a step back, once again I'm impressed with the Portland tech community. It seems that Portland has a natural attraction for friendly, creative, open-minded people interested in sharing their knowledge and opinions and building interesting tools and technology together.

Sunday, June 15, 2008

agile implementation versus releases

I've noticed this conversation coming up more frequently lately - business customers and even IT managers stating their desire to have IT development groups schedule software releases every so many months. They state this as a preference over what I call "implementation as completed" (similar to agile's short development cycles, but more flexible). Note that I am talking about development and enhancement of internal applications, not software for sale.

I've stated my disagreement with this direction. I've won some people over, but not others. It seems so clear to me, so either I'm not communicating my argument very well, or it's a more difficult concept to grasp than I realize. Maybe putting my argument in writing will help. So here goes:

I have 3 arguments for implementing functionality as it is completed:

  1. You have more flexibility in adapting to changing priorities. The people I'm talking about want to have an up-front process for selecting functionality to be included in the next release; and then have that functionality made available on the release date. The problem with this is that it locks you into that specification. Sure, you can change the scope after the work has begun, but that requires negotiations, reestimation of the work and time required, and possibly re-work. In contrast, my preferred method is to maintain a prioritized list of desired functionality, and to work on these requirements in prioritized order. If priorities change, you either lose only the time put into the current task, or you only have to wait for the current task to complete before the new highest-priority task is begun.

  2. Using the release (or "big-bang") method, the greatest-needed functionality is made available at the same time as the least-needed functionality.

  3. My last argument is similar to the previous one, yet subtly different. Using the big-bang method, you get all of your functionality on the release date. Even if all of the functions have the same level of need, you still end up losing value. I've created a very short and simple spreadsheet to exemplify this. A screenshot of the spreadsheet is below. For the sake of this example, let's say you want to have 6 functions implemented, and each provides a value to you of $10/month. Let's also say that the 6 functions each require a month of work. Using the big-bang method, you get no value until the release at the end of the 6 months. In contrast, the spreadsheet below shows that, by the end of the 6th month, you've received $50 of value from the implementation of function 1 at the end of the 1st month, $40 of value from the implementation of function 2 at the end of the 2nd month, and so on. So whereas in the first 6 months you get $0 of value using the big-bang method, you get $150 of value using the implementation as completed method.

    Lastly, note that the example assumes that each function provides the same level of value. In reality, the implementation as completed method compares even more favorably because the greatest-value functions will be implemented first, providing an even greater value difference. For example, if the first function below had a value of $20/month, you would receive $100 of value from that function alone during the time you would otherwise be waiting for the release.
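For anyone who would rather see the arithmetic than a spreadsheet, here's the same example as a few lines of Python (using the $10/month, one-function-per-month numbers from above):

    functions = 6         # each takes 1 month to build
    value_per_month = 10  # dollars of value per function per month, once live
    horizon = 6           # months

    # Big bang: nothing is live until the release at the end of month 6.
    big_bang_value = 0

    # Implementation as completed: function n goes live at the end of month n,
    # then earns value for each remaining month in the horizon.
    incremental_value = sum(value_per_month * (horizon - n)
                            for n in range(1, functions + 1))

    print(big_bang_value)     # 0
    print(incremental_value)  # 150 (= 50 + 40 + 30 + 20 + 10 + 0)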


If you have arguments for the big-bang method that you believe override my arguments above, I'd love to hear them!

Tuesday, May 27, 2008

Freeze Rows and/or Columns in Excel

This may not be sexy, but I find it very helpful, and it seems few people know about it.

You can freeze rows and/or columns in Excel. This is great for keeping column headings and/or the first few columns of data on the screen as you scroll through the data.

To do this, move to the cell just below the last row you want to remain on the screen and just to the right of the last column you want to remain on the screen. For example, if you want the first row and the first column to remain on the screen, move to cell B2. Next, select "Window" > "Freeze Panes" from the menu.

If you want to unfreeze, select "Window" > "Unfreeze Panes" from the menu.

Wednesday, April 30, 2008

Disqus

I'm intrigued by this service. It looks like you can easily add a comment section to any page; you don't have to install or maintain any software, or maintain the comments repository - it's all hosted on Disqus's servers. All you have to do is add a JavaScript snippet to your pages.

On the other hand, I have to wonder how long they're going to be around - so far they have no business model. Seems to me they ought to be able to charge for this service, especially for larger web sites.

Vyser RoamAbout - browser widgets

Do you ever find yourself looking at a web page containing an address that you want to map, but there's no map on the page and no link to have it mapped? So you open another page to display Google Maps, MSN Maps, or Mapquest, go back to the original page, copy the first line of the address, go to the map page, paste the text into the search field, go back to the original page, copy the second line of the address, go back to the map page, paste the rest of the text into the search field, and then click the button to display the address on the map - whew!

Or you are reading about a company, and you decide you want to look up the company's stock price or get other information about it. So you copy the company's name, go to one of the myriad business sites (e.g., Yahoo Finance, MSN Money, Google Finance, SmartMoney, etc.), paste the name of the company into the search box, and then click the button to display information about the company.

How would you like to display an address on a map just by selecting it and clicking a button? Or display information about a company just by selecting it and clicking a button? You can do these things and more with Vyser RoamAbout (though admittedly - for now - only if you are using Firefox).

RoamAbout puts an icon in the lower right corner of your browser window. You click on it to open or close a sliding icon bar displaying icons for the functions it provides. The picture below shows what your browser window looks like after you select an address and click on the "map" icon:

The picture below shows what your browser window looks like when you select a company name and click on the "stock quotes" icon:




Friday, April 18, 2008

Innotech - Day 2

Only got to the morning sessions on Thursday - a family emergency kept me away from some afternoon sessions that looked like they'd be good...

Agile Project Experiences
So many discussions and articles about the superiority of agile methods lack the depth and details I am looking for - and this panel discussion did nothing to satisfy my thirst. Am I the only one feeling frustrated? Like other discussions I've had and presentations I've seen, these panelists talked about how they could do things using agile methods that they couldn't do otherwise, but they didn't specify what those things were; they said their velocity increased, but they admitted that they couldn't measure it; and they said the quality of their solutions increased (in terms of bug rates), but they didn't have hard numbers. Regarding that last issue, Arlo stated that they pretty much just don't find bugs any more, but I'm more than a little skeptical about that.

My second issue is that agile proponents always compare these methodologies with the strict waterfall methodology. But I haven't seen a strict waterfall project in over 20 years. Since those early projects (at GE), every project I've seen has had constant communication with customers and other stakeholders, iterative prototyping or development, functionality prioritized and built in order of importance or risk, and/or intra-project negotiations over change requests, scope changes, etc. So it seems to me the real question is about the benefits of a strict agile methodology (which is what I always hear proffered) versus the more amorphous, flexible development progression that seems common in real life.

And now my last issue: so far I haven't heard a satisfactory answer for how, using agile methods, to provide management with the information they want. At this session the panelists said management will be happy when they see the increased velocity and the list of functions completed. Is my experience with management unusual? I've never been involved with management that would accept that. In my experience management wants to be able to plan future work and the resources needed for it (i.e., project portfolio management). That means they want estimates for the length, value (e.g., ROI), and resource requirements of every project. And they want projects managed to those estimates.

Maybe my issues exist because of the space I work in - corporate, internal development. I can see how agile methodologies could work well in commercial or open source software development. For example, Microsoft provides general guidance as to the functionality they will include in their next operating system versions, but as development progresses, they drop capabilities to help reach their desired ship date. And these methods seem even more natural for open source software development. For example, I can see how Linux development is best done using agile methods.

There's so much more I could write. But I've rambled enough.

Wiki Then and Now
Ward Cunningham gave a two-part presentation. The first was a historical perspective of his invention of the wiki and his involvement in the development of agile methodologies. Fascinating and impressive.

The second half was a discussion about aboutus.org, the company founded by Ray King and where Cunningham now works. This company troubles me. They scan domain registrations and create a wiki page for each domain. And they allow anyone to edit those pages. In his presentation Cunningham said "you've been drug out [into the public], and we're here to help you." But they're the ones who have done the "drugging"! I may ruffle some feathers, but I just don't see value in their site for companies:
  • To be listed, you have to have a domain. But if you have a domain, you most likely already have a web site, so you don't need AboutUs to advertise your existence.
  • AboutUs might advertise the value of allowing your customers to comment on your company's services. But if you want, you can provide this capability on your own web site. And you'd have control over the publication of those comments too.
  • People who search for your company may choose AboutUs' links rather than yours - do you see any value in that?
And companies' customers may not see much value either. AboutUs might argue that customers can see comments - both positive and negative. But smart companies will lock their AboutUs pages (Cunningham said you can do that, though their web site contradicts this) - or have them deleted (Cunningham said they would do that for companies who request it, though, again, the site says otherwise).

Given all this, I highly recommend that, at minimum, companies actively monitor their page on AboutUs.

Wednesday, April 16, 2008

Innotech

Went to Innotech today.  Glad I did.

The Secrets to Predictable Innovation

The first session I attended was presented by Anthony Ulwick, the author of "What Customers Want" and founder of Strategyn.  Great session.  There's an unending outpouring of articles and books talking about the importance of innovation, but not nearly as much discussion of how to innovate.

Ulwick defines innovation as "the process of devising a product or service concept that satisfies unmet customer needs", and goes on to stress that successful innovation requires an understanding of customer needs.  In his presentation he talked about what a customer need is: not specific tools or solutions, which change over time; but improvements to help them with the things they want to get done - their "jobs" - that don't change.  For example, people don't need microwave ovens.  Rather, they want to be able to cook their food faster.  The solution (currently microwave oven) may change, but the need (cook food faster) doesn't.

He discussed how to gain this understanding.  You do not ask people what they want.  Rather, find out what they want to get done, how they do it, and what their pain points are.  Watch them do their jobs.  Interview them about their jobs.

When you have this understanding, break down the job into subprocesses; and for each subprocess ask your customers how important it is and how satisfied they are with it.  Your best opportunities are the subprocesses that are most important to customers and with which they are least satisfied.
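As a toy illustration of that last step - my own simplification, not necessarily Ulwick's exact scoring - you can turn the two ratings into a ranked list where "important and unsatisfied" floats to the top (the subprocess names and numbers are made up):

    # (importance, satisfaction) on a 1-10 scale, gathered from customer interviews
    subprocesses = {
        "heat the food evenly":       (9, 4),
        "clean up afterwards":        (7, 3),
        "know when the food is done": (8, 8),
    }

    # Simple opportunity score: high importance plus a large satisfaction gap.
    def opportunity(importance, satisfaction):
        return importance + max(importance - satisfaction, 0)

    ranked = sorted(subprocesses.items(),
                    key=lambda item: opportunity(*item[1]),
                    reverse=True)

    for name, (imp, sat) in ranked:
        print(opportunity(imp, sat), name)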

I like his process for creating a framework.  His process doesn't result in a set of requirements that can be used to create a product or service (more on this in the next paragraph), but it does create a context for ongoing analysis of your market.  It also helps reveal adjacent markets.  For example, rather than wanting to cook food faster, maybe the customer really wants to get food to the table quicker.  In this case, you can also look at the process of getting the food from the cooking device to the table as an area ripe for improvement.

So why doesn't his process result in product/service requirements?  Because it doesn't quantify the changes needed to create a successful solution.  For example, your customers may want to cook their food faster.  But if the product you are able to create doesn't provide enough increase in cooking speed to tip the scale against the perceived costs (e.g., money, size, attractiveness...), your product won't succeed.  So having knowledge of customers' general needs is a great starting point, but you need to determine the specifications that will result in success in the market.

Accessing Innovation in Oregon

Presentation by Dana Bostrom, Director Innovation & Industry Alliances, Portland State University (does this department have a web site?); Chuck Williams, Associate Director, Office of Technology Transfer, University of Oregon; and Rick Fisch, Managing Director, Northwest Food Processors Innovation Productivity Center.

Bostrom and Williams talked about the "Oregon Innovation Portal", a web site they are creating to communicate information about the research and IP being generated at Oregon universities.  Sounds like a good start, but I'm disappointed that they are not planning on developing a two-way conversation on the site.

The Changing Landscape of Venture Capital Investing

Christopher Logan, Entrepreneur & Strategic Advisor, and Randall Lucas, Associate at Voyager Capital, put on a good show, with Logan providing the entrepreneur's viewpoint and Lucas the VC's.

Logan started by saying he's noticed a rise in the number of professional angel investors and angel investment groups. He cautioned the audience to make sure that working with such investors would be the right fit for them. In contrast to traditional angel investors (friends and family), these people will want you to issue preferred stock (possibly sooner than you want), and they will expect you to put in place corporate governance structures that require time and overhead. He also cautioned the audience with regard to the micro vc's, who may want a disproportionate percentage of equity (up to 10%) for very little capital infusion.

Lucas compared these groups with traditional vc's, stating that vc's behave predictably. Voyager, for example, reserves an average of $8 million per deal (though they don't inject all of that at once); and you know that they are going to want preferred stock. He also made the point that the pedigree of the financial backer can affect the hype around the startup's future valuation and ipo.

An audience member asked if startups have to have IP to get vc funding. Lucas said they want to see sustainable advantage. It could be IP, a monopoly on knowledge in a specific space... Logan added that the rise of open source is changing thoughts on the value of IP.

Logan talked about the importance of having the right investors. When he was CEO of Driveway, the company received $68 million of vc funding. The dot-com boom ended and things got tough, but he thought he had a business model that could ensure survival. But the investors didn't want to be tied up long-term - they wanted a quick exit. So he was replaced.

Lucas stated that vc's like Voyager don't invest in R&D. If you want funding, you need a couple of credible people (management team), a product or service you can demonstrate, and possibly organizations who are customers or who will say that they will be customers if your business shows viability. He stated that vc's want to minimize 3 risks: technical, market, and execution.

Tonight's debate between Clinton and Obama

I hate to talk about politics in the U.S. because it's almost impossible to do so without the outcome being anger, hurt feelings, and harmed relationships.  But I'm watching the Clinton/Obama debate right now, and I'm amazed at how insipid it is and how inane Charles Gibson's and George Stephanopoulos' questions are.  One of these people (thank God I mean Clinton or Obama, and not Gibson or Stephanopoulos) may become the President of the U.S., dealing with:
  • the wars in Iraq and Afghanistan,
  • our country's $9 trillion debt,
  • global warming,
  • our country's shattered reputation in the world,
  • the rising price of oil and a possible permanent decline in oil production as global demand increases,
  • the rising price of foods,
  • rising medical costs,
  • declining inflation-adjusted incomes,
  • the mortgage/home foreclosure/credit crisis,
  • etc.
and all they are being asked about is whether they will choose the other for veep, their relationship with people who have said questionable or detestable things, and whether they love the flag!  Given the real issues we have, are these the questions Americans really want to focus on?

And I can't help but compare this debate with the Republican debates.  I don't recall the moderators incessantly focusing on things that trivialize the candidates and the issues.  Am I misremembering?

Ah, they are finally being asked some substantive questions - but even these are asked in a manner that doesn't allow for intelligent discussion.  They were asked if they would promise not to raise taxes.  This question is not meant to allow for discussion.  What if they want to redistribute taxes?  That raises taxes for some, but lowers them for others.  What if we have another Katrina, or something even more costly?  No one can predict the future, and the candidates shouldn't be forced into such a corner.

I have written to ABC news, letting them know of my disappointment in the handling of this debate.  I hope you will too.

Monday, April 14, 2008

Web Analytics Wednesday in Portland

Just a quick note - Web Analytics Wednesday, a get-together for web analytics professionals, will be held this THURSDAY at WebTrends in Portland. WAW on a Thursday - isn't that just like Portland?

Friday, April 11, 2008

Lunch 2.0 at eROI

I went to Lunch 2.0 held at eROI on Wednesday.

I didn't see all of eROI's new office space, but what I did see reminded me of HotWired - old building, warehouse-type space, wood floors, bright-colored couches. And the area they are in reminded me of the South of Market area in San Francisco, where HotWired was located. Though there are some parts of my HotWired experience I don't wish to re-live (e.g., the dismantling of the web site, the 3 failed IPO attempts...), I do miss the environment.

About 70 people met for lunch. Of the people I met (and the ones I already knew), it hit me that a surprisingly large number of them are working on ways to improve (or replace) our educational system. And that ties in with Rick Turoczy's recent call to action to make a difference in education. So I wonder if what I'm seeing is an indication of something big going on - and, if so, is it just in Portland, or is it more widespread?

Wednesday, April 09, 2008

Startupalooza Continued

Sorry for taking so long to continue my write-up on Startupalooza. Anyway...

Bill Lynch and Matt Tucker, the co-founders of Jive Software, spoke next. Their goal is to enable organizations to store their knowledge in Jive's collaboration environment, Clearspace, rather than in emails.

Jive was founded in 2001. They now have 2000 customers. Lynch and Tucker built the open source Jive Forums in college in Iowa, and subsequently moved to San Francisco. In late 2000 Sun Microsystems approached them (how nice!), asked them to enhance the software, and pushed them to incorporate. They moved their business to New York, but found it was too expensive to do business there. In 2004 they moved to Portland and started over on staffing. Clearspace came out in 2004, at which time they had 35 employees. They now have 140.

They were funded by Sequoia Capital; and, surprisingly, they say Sequoia has never pushed them to outsource their development or move their development staff to a low-cost country (e.g., India or China). Hard to believe, given what I have heard in the past few years about VC's and boards of directors.

When asked, they said they don't see any competition coming from open source - their customers view Microsoft and IBM as their competitors. They showed a surprising lack of concern about open source. With their short track record of quick success and growth, their attitude is understandable - but I think it's unrealistic. There are plenty of good, viable open source wiki and collaboration projects; and if they continue to attract developer interest, they can only become more formidable.

I'm also surprised they didn't mention Atlassian's Confluence as a competitor. It's a good product, and it is actively maintained and enhanced. I know less about MindTouch, but I wonder if it will gain traction.

Following this presentation was a panel discussion about working independently. Sarah Gilbert, Justin Kistner, and Rick Turoczy were the panelists, and Adam Duvander moderated.

Following this, there were presentations of OpenID, ExpressionEngine (very pretty CMS - I'd like to research it further to see how deep the functionality is), Unthirsty (funny guys, and I'm impressed with the number of bars they've been able to compile), Earth Class Mail (they handle your mail for you - could be valuable to large organizations), Lunarr (web app that lets you add notes to the "back side" of web pages - interesting concept, slick interface, but I don't know how valuable it is), Sidecar (a gadget you add to your site to enable IM with visitors, provide them with information, and allow them to provide you with feedback - definitely valuable), MyStrands, Fyreball (allows you to post content on their site and share it with your friends - I'd love to hear what new value this provides), Toonlet (enables you to become a creator and publisher of online comic strips - what a great way to waste time - seriously!), and I Want Sandy (an innovative development of an automated assistant, though I'm nagged enough by my mobile phone, email, desktop gadgets, RSS feeds...).

All in all, it was a great day - I'm glad I went, and I look forward to the next one.