AJR Features
From AJR, July/August 1997

Web of Confusion   

Determining the number of people who visit an individual Web site has proven to be an elusive goal.

By Scott Kirsner
Scott Kirsner is based in Boston.     



DOESN'T ANYONE ON THE INTERNET know how to count?

Given the reputation for number-crunching that has surrounded computers since they came on the scene, you might expect that tallying the number of users who visit a given site on the World Wide Web would be a simple task. It's not.

With more than 500 daily newspapers in this country operating Web sites, according to the Newspaper Association of America, and with 400 more expected to launch by the end of the year, tracking online readership has emerged as a critical issue. Without solid information about how many readers visit the site and who those readers are, it's tough to effectively allocate resources to an online effort and even tougher to attract advertisers.

Yet a number of barriers, stemming from semantic nebulousness and technological limitations, prevent newspapers from accurately answering some key questions about their online readerships: Exactly how many readers are out there? How often do they visit my site? How long do they stay during a typical visit? And who are those readers?

To add to the confusion of dealing with an evolving technology and vague definitions of terms like ``page requests" and ``visits," newspapers also are being forced to decide which approach they will take to measuring online readership. They may opt to use internal software tools, like Intersé's Market Focus or net.Genesis' net.Analysis, which attempt to analyze a site's server log. They may elect to be audited by organizations like I/PRO or Audit Bureau Verification Services (part of the Audit Bureau of Circulations), which try to validate the statistics recorded on the server log. They may choose to implement software that monitors traffic as it happens at the network level, like Accrue Insight. Or they may decide to subscribe to research from PC Meter, which follows users' clicks around the Web with software installed on their home PCs. Typically, newspapers are being forced to figure out which combination of these various measurement approaches they want to adopt.

``We need strong and precise measurement so we know how to reach people with this new medium," says Jack Fuller, president of the Tribune Co. and a 1986 Pulitzer Prize winner. ``The most perfect piece of journalism which fails to reach people is a failure--it's not good journalism."

IT'S ALMOST AS IF THE COUNTY FAIR opened without first setting up the turnstiles. Most newspapers gave little thought to measurement issues before launching their Web sites. ``Stats were really an afterthought," says Dan Peak, Webmaster at the Kansas City Star. ``We knew we were going to have to track users, somehow, sort of. But it wasn't a priority at the beginning."

Even sites that did plan to track usage from the start, like the New York Times, ran into problems. Since every visitor entering the Times site is required to register by getting a password and providing demographic information, the site must create a new database record for each new user. On opening day, the computer managing the database crashed due to the huge number of visitors. The paper was forced to temporarily disable its sophisticated registration system.

``You don't necessarily think about measurement as the first thing you want to do," says James Conaghan of the Newspaper Association of America. ``You want to go on a shakedown cruise first, and then deal with measurement issues as they come up, or as advertisers demand certain numbers."

After the shakedown cruise, most newspapers and magazines face the challenge of dealing with the huge quantity of data generated by thousands of users visiting their site. ``You quickly realize you're dealing with tons of data that are very hard to analyze," says Grady Seale, who helped the Boston Globe launch its site in 1995.

Where do all these data come from? Every single request that a user makes of a Web site generates an entry line in the server's log file. Visiting the front page of the New York Times site, for example, generates five lines in the log file: The first request is for the HTML page itself, the second is for the large image that contains the day's headlines and a color photo, and the third, fourth and fifth requests are for advertising graphics at the bottom of the page. Each of these requests (or hits) writes a line to the server log that contains information about the name of the file requested, the size of the file, the date and time it was requested, the identifying number of the computer that requested it (called the Internet protocol address, or simply IP address) and a number of other parameters.
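For readers who want to see the mechanics, here is a minimal sketch in Python of how one of those log lines is put together and taken apart. It assumes the ``common log format" that most Web servers of the period write by default; the sample entry and the parsing code are invented for illustration and are not drawn from the Times' actual logs or software.

    import re

    # Common log format: host identity user [date] "request" status bytes-sent
    CLF = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    )

    # One invented hit: a reader's computer asking for the front page.
    sample = '204.62.131.5 - - [01/Jul/1997:08:15:22 -0400] "GET /index.html HTTP/1.0" 200 14312'

    hit = CLF.match(sample).groupdict()
    # Every hit records who asked (the IP address), when, for which file,
    # whether the server succeeded (status 200) and how many bytes went back.
    print(hit["ip"], hit["when"], hit["request"], hit["size"])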

But collecting all of this information means the server logs grow large and unwieldy very quickly. ``People are gagging on the volume of data," says Alec Dann, formerly with WashingtonPost.com. ``The amount of disk space involved is just unbelievable." An average day's log file for the Washington Post's site weighs in at over 500 megabytes and takes more than three hours to process.

ONCE THE QUESTION OF HOW TO STORE and manage the massive amount of data has been settled, newspapers must decide how they intend to slice and dice information to extract what's useful. Some papers write their own analysis programs. Others purchase commercial Web site analysis packages.

Developing custom software saves the upfront costs of buying an off-the-shelf package, but it can consume large amounts of staff time. ``It probably took us a month, and it's still not done," says the Kansas City Star's Peak, who proceeds to reel off a list of pending improvements.

The benefit of tailoring software to your own site, though, is like that of buying a custom-made suit rather than one off the rack. ``What I was finding is that most of the programs out there just weren't giving me the stats I wanted," says Peak. He says that commercial programs miss many of the day-to-day nuances that accompany the news-oriented content of his site.

Other Webmasters, though, have grown frustrated with developing their own tools. Lemont Southworth of the Los Angeles Times is now evaluating commercial packages after relying on homegrown software for more than a year. ``Ultimately, I don't care to be in the business of writing software," he says. ``I want to run a Web site."

Commercial packages can cost from $500 to upwards of $15,000, the starting price of Accrue Insight, a cutting-edge analysis tool that lets Webmasters examine traffic patterns as they're happening. Web sites will spend $47 million on analysis software this year, according to the International Data Corporation, a figure that will rise to $100 million a year by 2001.


Ideally, all of this data massaging would produce a clear understanding of how many individual readers visit the online newspaper, who they are and what they're interested in. The reality is that not a single newspaper Web site knows exactly how many readers (as opposed to the total number of hits a site receives) visit on a given day.

This number, often referred to as ``unique visitors," may be the holy grail of Web measurement. It would help newspaper executives compare the reach and influence of their electronic publications to that of their print publications--and their Web competitors. Tracking visitors and their usage habits also would help newspapers differentiate themselves from other Web sites when selling advertising. Industry observers like the NAA's Conaghan believe that newspaper sites, while attracting smaller audiences than search engines like Yahoo! or technology sites like Netscape, are much better at retaining loyal users who visit frequently and spend more time at the site. Sharon Katz, associate media director at ModemMedia in Westport, Connecticut, one of the country's largest interactive marketing agencies, says Web sites that attract dedicated, repeat visitors are better for her clients. ``This is a one-to-one medium, and we're buying relationships," Katz says.

So the incentive for being able to count unique visitors and gather information about who they are, how often they visit and how long they stay is strong. Sites that capture this information and can present it in a credible way will be better positioned to compete for advertising revenue. Right now, though, companies like ModemMedia that buy ad space on the Web must settle for less-than-ideal information about the audience.

Once a newspaper site has boiled down its server logs, it is left with a set of statistics that may include hits, page requests, visits and visit length. And each of these statistics is poorly defined and subject to technological limitations.

The problem with hits: They include every request made of a Web server, including HTML pages, images, sound and video files, and requests made to programs that search an archive or cast a vote in a survey. So not only is it hard to compare one site to another using hits, it's also hard to evaluate a single site's performance over time, as new graphics are added. ``Hits are really a fictitious number," says the L.A. Times' Southworth, ``since they can be easily manipulated."

With hits dethroned, most sites have begun to count page requests. Page requests (also called page impressions and page views) attempt to tally only the number of actual pages seen by a user, without including images or other components of a page.
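As a rough illustration of the distinction (and not any paper's actual counting rules), a log analyzer might treat a hit as a page request only when the file asked for looks like an HTML document. The file extensions in this sketch are an assumption; real packages use longer lists and more elaborate rules.

    # Assumed rule of thumb: only HTML documents count as pages.
    PAGE_ENDINGS = (".html", ".htm", "/")

    def is_page_request(request_field):
        """True if a log entry's request looks like a page rather than an embedded file."""
        parts = request_field.split()
        if len(parts) != 3:
            return False
        method, path, _protocol = parts
        path = path.split("?")[0]               # ignore any query string
        return method == "GET" and path.lower().endswith(PAGE_ENDINGS)

    hits = [
        'GET /travel/index.html HTTP/1.0',      # counted: an HTML page
        'GET /images/headlines.gif HTTP/1.0',   # not counted: the headline graphic
        'GET /ads/banner1.gif HTTP/1.0',        # not counted: an advertising graphic
    ]
    print(sum(is_page_request(h) for h in hits))   # prints 1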

Page requests are useful because they allow newspapers to actually gather information about how many pages the average reader looks at on a given day and determine which stories draw the most reader interest. The Tribune Co.'s Fuller says he is excited about using such information to better understand and serve his readers, noting, ``It's very expensive to get that kind of information about the ink-on-paper product."

Unfortunately, a number of problems surround page requests. Often, newspaper sites use ``frames" to divide a page into multiple regions--one area for a table of contents, one for an ad banner, one for the masthead and one for the content. Such a page is actually constructed of five separate HTML documents--the master page that defines the frames plus the four regions--and is counted in the server log as multiple page requests.

Common sense would indicate that the example above should only be counted once, since the user has really only asked for one page, not five. Some organizations, like Audit Bureau Verification Services (ABVS), concur. But others, like I/PRO, count the main page and each of the separate frame regions as distinct pages.

And what happens when a user asks for a page--clicking from the front page of the Internet Tribune to the travel section, say--but then gets impatient waiting for the new page to appear and hits the browser's ``stop" button? The server log records a request for a page that the user never actually sees. Web sites are forced to count interrupted requests just as they would count successful ones, because there's no way to tell the difference. Unless, that is, the site has purchased Accrue's pricey Insight software, which keeps tabs on how often users reach for the ``stop" button. But as some sites begin to use Accrue's more accurate technology, and others do not, comparison between various sites becomes even more problematic: Who is still counting interrupted requests, and who is not?

The latest wave of so-called ``push" technologies (see ``When Push Comes to News," May), which enable a user to specify his interests and then receive automatic updates daily or hourly, raises other questions. A push technology like Netscape's InBox Direct delivers Web pages to a user's electronic mailbox. But what if a user never even opens those messages, finding it easier to delete them than to unsubscribe? The newspaper counts these ``push" deliveries as page requests.

But perhaps the biggest barrier to accurately sizing the Web audience is a technique known as proxy caching. This technique, used by online services like America Online and also by many corporate networks and Internet service providers, aims at making Web content more accessible. By making automated copies of popular Web sites, an Internet provider can save its users time and frustration since they no longer have to venture out into the wilds of the Web to visit the sites they want. But when America Online caches a newspaper's site, it becomes impossible for the newspaper to track how many users see the copy of the site.

``Caching is a whole nightmare of its own," says the L.A. Times' Southworth, who is looking into a number of ways to prevent online services from storing copies of his site. He says that when the Hollywood Online site, which also is owned by the Los Angeles Times, implemented a technique to thwart caching, it saw a 10 to 15 percent increase in traffic to its server.
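The article does not say which technique Hollywood Online used, but one common approach, sketched below with standard HTTP header names, is for a site to mark its pages as not to be cached. Whether a given online service actually honors those headers is another matter.

    def no_cache_headers():
        """Response headers asking proxies and online services not to store a copy of the page."""
        return {
            "Cache-Control": "no-cache",                   # HTTP/1.1 caches
            "Pragma": "no-cache",                          # older HTTP/1.0 proxies
            "Expires": "Thu, 01 Jan 1970 00:00:00 GMT",    # a date already in the past
        }

    for name, value in no_cache_headers().items():
        print(f"{name}: {value}")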

STEPPING INTO THE MIDDLE OF THIS measurement morass are auditing companies that aim to validate a site's usage statistics. The three major players in the field--San Francisco-based I/PRO, a 1995 start-up, and two of the established ink-on-paper auditors, BPA Interactive and ABVS, the Audit Bureau of Circulations' new for-profit arm--all take similar approaches.

Each service gets a copy of the site's server log, examines the data for tampering and irregularities, produces a set of statistics on the site's usage (typically focused on the number of page requests) and gives those numbers a stamp of approval. Auditing fees range from $400 to over $1,500 a month, based on a site's traffic.

The key drawback with auditing services is that they rely on a site's server logs, and thus are subject to all the stickiness associated with trying to get good intelligence from raw data. Most audits, for example, note that their certified tally of page requests does not include material that has been cached by America Online or another service. Most have trouble calculating page requests for sites designed with frames. And not one of the auditors attempts to translate hits and page requests into a number of actual readers who have visited the site--the much sought-after tally of ``unique visitors."

Unlike its competitors, I/PRO does try to determine the number and length of visits to a site. But several uncontrollable factors can skew these numbers. In an environment where computers are shared, like a cyber-café or a university computer lab, one user may start a visit to a newspaper Web site, click around for a while, then leave. When another user sits down, sees something interesting on the screen and begins to interact with the same site, there's no way to distinguish when one visit ended and another began.

The industry does appear to be settling on a 30-minute ``time out" period, which means that if the computer was inactive for 30 minutes before the second user sat down, then the second visit would be counted separately from the first. Unfortunately, this also means that if a user is idle for more than 30 minutes--talking on the phone or working with another application--before he returns to his Web browser and continues using the same site, his actions will be counted as a new visit.
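Here is a minimal sketch of how such a rule might be applied to the log, assuming the requests from a single IP address have already been pulled out and timestamped. The 30-minute cutoff comes from the emerging industry practice described above; everything else is invented for illustration.

    from datetime import datetime, timedelta

    TIMEOUT = timedelta(minutes=30)   # the emerging industry "time out" period

    def count_visits(request_times):
        """Count visits from one IP address: a gap of 30 minutes or more starts a new visit."""
        times = sorted(request_times)
        if not times:
            return 0
        visits = 1
        for earlier, later in zip(times, times[1:]):
            if later - earlier >= TIMEOUT:
                visits += 1
        return visits

    # A reader clicks around at 9:00 and 9:05, takes a long phone call, and returns at 10:15.
    times = [datetime(1997, 7, 1, 9, 0), datetime(1997, 7, 1, 9, 5), datetime(1997, 7, 1, 10, 15)]
    print(count_visits(times))   # prints 2 -- one person, counted as two visits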

While a number of newspaper sites have experimented with audits to see how their own internal numbers stand up to third-party analysis, few have signed long-term deals. Stephen Luciani, general manager of the New York Times on the Web, has purchased I/PRO audits occasionally for purposes of comparison, but, he says, ``The advertisers haven't found a need to question us." The day when advertisers demand third-party audits may not be far off, however, according to Sharon Katz of ModemMedia. ``The understanding is only going to last so long before people say, `This is real money I'm spending, and I want some accountability,' " she says.

IN ADDITION TO PROVIDING ADVERTISERS with accountability, online papers are beginning to recognize the importance of collecting demographic information about their Web readers. The newspapers that have already instituted some form of registration, including the Los Angeles Times, the Wall Street Journal and the New York Times, have uniformly decided to ask about users' age, sex, location and income before doling out passwords. And a number of other papers, including the Chicago Tribune, the Washington Post and the entire Knight-Ridder chain, are considering registration as a way to more accurately count readers and build a demographic database of who those readers are.

But registration has its limitations. The New York Times on the Web, for example, has set up over 1.2 million accounts since its January 1996 launch and sees 2,000 to 6,000 new registrations every day, according to Luciani. But how many of those 1.2 million accounts were opened by people who forgot their password and were forced to re-register?

If the Times potentially overcounts its absentminded visitors, the Wall Street Journal's Interactive Edition may undercount its. Since subscriptions to the online Journal cost either $29 or $49 a year, depending on whether a user subscribes to the print version, forgotten passwords aren't as much of a problem as shared passwords. Reports of entire companies piggybacking on a single subscription to the online Journal are widespread. So the Journal's count of more than 100,000 subscriptions as of May actually understates the number of unique visitors.

As other papers evaluate registration as one option for more accurately tracking readers on the Web, the big question is: Will it drive off current and potential readers? ``My gut feeling is that four out of 10 users are turned away by registration," says the New York Times' Luciani.

No matter what type of measurement system a newspaper employs, be it server log analysis, auditing or registration, it remains impossible to compare one newspaper's online readership to another's until definitive industry standards emerge. That's where research companies like PC Meter come into the picture.

Similar to the way the Nielsen ratings work, PC Meter, a subsidiary of consumer research giant NPD Group, has installed software on the computers in 10,000 volunteer U.S. households to monitor the usage of each machine--whether it's Junior playing a CD-ROM game, Mom surfing the Web, or Pop using Quicken to manage the family's finances. As a result, PC Meter's research yields information that no single site could gather--like the average amount of time people spend surfing the Web per session (30 minutes). PC Meter can even track which sites users have bookmarked--something analysis software and auditors can't currently do--and calculate how often a user visits bookmarked versus non-bookmarked sites.

The drawback to PC Meter's research, which can range in price from a few thousand dollars for an individual report to tens of thousands of dollars for an annual subscription, is that it is based entirely on household usage. While the company plans to begin measuring activity at businesses, some in the industry raise another question: Is a sample size of 10,000 households large enough to capture usage trends at the millions of existing Web sites?
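A back-of-the-envelope calculation, with audience shares invented for illustration rather than drawn from PC Meter's data, suggests why the question matters: a 10,000-household panel says a great deal about sites nearly everyone visits and very little about sites that reach only a sliver of the audience.

    PANEL = 10_000   # PC Meter's household panel

    # Assumed shares of households visiting a given site in a month (illustrative only).
    for share in (0.20, 0.01, 0.001):
        homes = PANEL * share
        print(f"a site reaching {share:.1%} of households shows up in about {homes:.0f} panel homes")

    # A site reaching one household in a thousand appears in roughly 10 panel homes,
    # too few to say much about who its readers are or how often they return.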

There is a consensus in the industry that, as the Web becomes more important to bottom-line revenue, usage tracking will improve in accuracy and sophistication. ``The industry has done a whole lot over the last 12 months to develop standards," says Tim Reed of I/PRO. ``There's a lot of interest and energy and momentum right now."

Jack Fuller of the Tribune Co. and others point out that despite all the inconsistencies surrounding Web measurement today, one fact is hard to dispute: Newspapers, for the first time in years, actually have a new audience to measure. ``You see this online audience ramping up more quickly than anything we've seen in newspapers over almost my whole career," the 50-year-old Fuller says. ``It's exciting, because it's the first time that I've sensed the possibilities of dramatically increasing the audience for good journalism. That's one of the things that makes it so thrilling."

###