Analysing Web Usage Logs

Introduction

The web server keeps a record of every file that is requested by a browser, including requests for non-existent files. This information is stored in log files, one for each day (midnight to midnight). Each line in a log file represents a request for a single file. Note that a single web page (.HTM, .HTML, or .ASP file) may refer to many other files, typically graphical images (.GIF, .JPG, etc.). Each file requested (i.e. each line of the log file) is referred to as a single hit. Counting hits can therefore overestimate how busy our site is, since individual pages request different numbers of supporting files, whereas we are interested only in the number of complete pages viewed.

It is important to think in terms of page requests (how many pages were looked at) rather than hits (how many files were requested) to get an accurate picture of how busy the site is.
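
As a simplified, hypothetical illustration (the fields actually recorded depend on how the server is configured), a request for a single page might appear in the log file as several lines:

2008-07-22 10:15:01 GET /index.htm 200
2008-07-22 10:15:01 GET /images/logo.gif 200
2008-07-22 10:15:02 GET /images/photo.jpg 200

These three lines count as three hits, but only one page request.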

Daily Report

This report provides a day-by-day plot of the total number of pages requested. Look here for unusually busy or quiet days. A high number of requests might indicate a hack attempt or misbehaving bot or downloader, whereas a low number suggests that the server or the network was not performing properly.

Status Code

This report indicates the distribution of the various outcomes of all hits. We want the values to be at least consistent, and there should be few codes in the 400 and 500 ranges, since these indicate errors.

Processing Time Report

This report indicates the distribution of time taken for each page to be delivered to the host (visitor). Again we want consistency here, and preferably the majority of requests to have taken only a short time. Note that large pages will naturally take a long time to deliver. If there is a sudden change in the distribution, or a large number of pages are taking a long time, then something is amiss.

Request Report

This report shows which pages were requested. It's not so useful for troubleshooting (except when a misbehaving visitor is particularly interested in a selection of files), but it is useful for identifying popular and unpopular files, so that we can maximise the popularity of our pages.

By monitoring failed attempts, we can discover obsolete content and broken links.

File Type Report

This is not the most useful report, but it does give an overview of the distribution of different file types being requested. The values should be fairly consistent, so look for sudden changes to detect problems.

Host Report

This lists the host names or IP numbers of the computers used to visit our web site. Note that this can be misleading; sometimes the machine is actually a proxy server through which multiple different people may visit our site. This report can also be used to spot misbehaving machines, since they will have a disproportionately high number of hits (although, again, the machine might be a proxy server).

Referrer Report

This report lists the pages that brought visitors to our site by linking to it. It is useful to identify these sites in order to increase the popularity of our site. It should also be monitored for log-spammers: programs that generate hits on our web site with a faked referrer, usually a marketing site, so that a link to that site appears in our published usage statistics, making it look as if we endorse it.
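
If we later decided to drop such fake referrals from the reports, Analog's referrer exclusion command could be used; the patterns below are purely hypothetical and would need to match the spammers actually appearing in our logs:

REFEXCLUDE http://*casino*
REFEXCLUDE http://*poker*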

Browser Summary and Browser Report

These reports list the browser software used by our visitors. The Browser Report is a detailed breakdown, whilst the Browser Summary covers just the most common browsers. The report is particularly useful for identifying robots, whilst the summary is most useful for ensuring that our web pages are compatible with our visitors' software, since browsers vary in their interpretation of web pages.

Operating System Report

This report lists the operating systems running on the computers that made requests to our web site. It is not critical, but it does provide an additional means of seeing what software our visitors are running, so that we can ensure our web pages are compatible with our visitors' systems.

Automating the Analysis

To run the analysis automatically, we use a batch file. The batch file calls Analog and specifies the correct log files to process. To make the processing faster, we tell Analog to look only at the log files within the dates we are interested in. Our analysis covers the previous ten days, so we look at all log files from the current and previous months (since the last ten days may include days in the previous month). We could be more sophisticated and specify only the log files from the last ten days, but this would require specifying each day individually, meaning that every log file would have to be named in the parameter list.

To specify the log file parameters, we obtain the date in yymmdd format and place it into an environment variable. This is achieved using two utility programs I wrote for the purpose (ncecho and ymd). The environment variable is created in a separate, temporary batch file. Once Analog has all the log files it needs, it is easy to restrict the analysis period to the previous ten days, using the following commands in the configuration file:


# Analyse from ten days before today (offsets are years, months, days)...
FROM -00-00-10
# ...up to and including yesterday
TO -00-00-01

We run a variety of different analyses, each with its own configuration file. The individual configuration files override the settings in the default configuration file (analog.cfg). The main part of the batch file is shown below:


@rem Create NOW: the date string used in this month's log file names
@ncecho "@set NOW="> tmp_set.bat
@ymd -d0 >> tmp_set.bat
@call tmp_set.bat
@del tmp_set.bat

@rem Create THN: the equivalent date string for the previous month
@ncecho "@set THN="> tmp_set.bat
@ymd -m-1 -d0 >> tmp_set.bat
@call tmp_set.bat
@del tmp_set.bat

@rem Run Analog over last month's and this month's logs with the recent_tech settings
analog.exe w3svc1\ex%THN%*.log w3svc1\ex%NOW%*.log +grecent_tech.cfg

This runs Analog so that it generates a report based on the default configuration file, but with specific settings taken from the recent_tech.cfg file; "+g" tells Analog to read recent_tech.cfg in addition to the default configuration file (analog.cfg). The default configuration file is useful for setting options that are the same across the separate reports.

There are extra considerations when using multiple configuration files. If the same option has different values in the specific and the default configuration files, the two settings may interact, or Analog may simply use whichever it read last. For instance, if the default configuration file specifies FILEINCLUDE *, then setting FILEEXCLUDE *.jpg in the specific configuration file will have no effect.
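
As a sketch of this interaction (the fragments below are purely illustrative):

# In analog.cfg, the default configuration file:
FILEINCLUDE *

# In the specific configuration file:
FILEEXCLUDE *.jpg

With both in force, the blanket inclusion wins and the *.jpg exclusion has no effect, as described above.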

Configure analog.cfg

analog.cfg is the configuration file for Analog. It tells Analog which files to include in the analysis, which analyses to perform, and many other things. We need to change many of these settings from the original `factory' settings to suit our own web site. A commented sample Analog configuration file, based on our own, is available here. Analog's web site is here.

In this file, we can include some other minor configuration files. First, the SearchQuery.txt file is a comprehensive list of search engines that tells Analog how to interpret hits coming from those engines. This file can be downloaded from ???- WHERE -???. In addition, we can include a file containing the names of known robots, so that Analog can identify them.
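
Assuming these files are saved alongside analog.cfg (the robots file name here is just an example), they might be pulled in with Analog's CONFIGFILE command:

CONFIGFILE SearchQuery.txt
CONFIGFILE robots.cfg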

The specific configuration files need contain only those options that are different from the default values.

Manual Analyses

Sometimes, there will be unusual activity in the standard analyses that one wishes to investigate. For instance, there might be an unusually high number of hits on a single day or from a single host (visitor). In these cases, one can perform a special analysis to determine the cause of the unusual activity. For instance, by analysing only the hits for the busy day, one can discover where the hits were coming from (the visitor) and what pages they were requesting. Similarly, one can determine when the busy visitor arrived and what they were looking at.

As an example, we investigate activity from a single host with the following command:


analog.exe +gsingle_host.cfg

The configuration file contains the settings below. We exclude all log files and then include only the ones we require, namely those covering the dates on which the host hit our web site (refer to Analog's documentation on inclusions and exclusions for an explanation). We then exclude all hosts apart from the one we're interested in. We also need to override some settings made in the default configuration file; alternatively, we could ignore the default configuration file altogether with the -G command-line switch.


# Title to use in the report
HOSTNAME "Single Host"

OUTFILE singlehost.htm
ERRFILE errors.txt

# Discard any log files named so far, then include only those covering
# the dates on which the host visited
LOGFILE none
LOGFILE w3svc1/ex0807*.log

# Exclude all hosts apart from the one under investigation
HOSTEXCLUDE *
HOSTINCLUDE 127.0.0.1

# Re-include items that the default configuration filters out
BROWINCLUDE *
FILEINCLUDE *

# Reports to generate, and options for the Request Report
BROWSERREP ON
DAILYREP ON
FAILURE ON
FAILREF ON
HOST ON
MONTHLY ON
REFERRER ON
REQCOLS NRr
REQFLOOR 1r
REQUEST ON

We are most interested in what people are looking at on our web pages; unfortunately, many of the hits are generated by automated software. Consequently, we should remove these hits from the analyses. In order to do this, we need some means of identifying automatic hits. The easiest way is to use the user-agent field (Analog refers to the value of this field as the Browser), which for automated software usually contains a word such as `bot', `crawler' or `spider', so we can filter out hits where the browser field contains strings like these.
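
A sketch of such a filter, using Analog's browser exclusion command (the patterns are only examples and would need tuning against the user-agents that actually appear in our logs):

BROWEXCLUDE *bot*
BROWEXCLUDE *crawler*
BROWEXCLUDE *spider*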

Another way to identify automatic hits is to check the requests for robots.txt (FILEEXCLUDE *, FILEINCLUDE robots.txt), since most of the hits for this file come from automated software; this is not always true, however, so be careful not to assume that they are all automatic. Yet another method is to look for unusually consistent behaviour (e.g. a sequence of hits from the same host separated by a very uniform time delay), or a large number of hits from the same host. Basically, look for any behaviour that seems unnatural for a human.
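
For example, a hypothetical configuration fragment for the robots.txt check, which restricts the analysis to requests for that file (the path should be written as it appears in the logs):

FILEEXCLUDE *
FILEINCLUDE /robots.txt
HOST ON
REQUEST ON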

qdns.exe

The server is configured to perform a reverse DNS lookup to find the name belonging to the IP number of each visitor. However, this slows the server down, so if speed were important, we could use the default option of storing only the numerical IP address of the visitor. To convert these numbers into their associated names, open an MS-DOS prompt and run the qdns program by entering the command resolve.bat.

QDNS will search through the log files, extract all the IP numbers, and copy them to a separate file. It will then find the names for all the numbers (some may not have names) and store these in the results file, next to their numbers. This file can then be used by Analog to summarise the origins of the users looking at our web site.
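
As a sketch, assuming the results file is called dnscache.txt and is in the format Analog expects for its DNS cache (the actual file name depends on how qdns is set up), the configuration file would point Analog at it like this:

DNS READ
DNSFILE dnscache.txt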


Copyright © Neil Carter

Content last updated: 2008-07-22