
Introduction

Every ISP needs a log file analysis program. One of the best is the Webalizer, an open source product originally written in Perl and rewritten in C soon thereafter. It can parse both Common Log Format and Combined Log Format files at a blistering pace. One benchmark states that "on a 200 MHz Pentium machine, over 10,000 records can be processed in one second, with a 40 megabyte file taking roughly 15 seconds (over 150,000 records)." When one of my consulting clients approached me looking for an open source log file analysis program to replace their proprietary one, Webalizer was my top recommendation.

Getting Started:

Installing webalizer can be a daunting task because one quickly runs into dependency problems. Webalizer requires the gd graphics library, which in turn relies upon the jpeg, png, and zlib libraries. Further complicating the installation, most of these libraries eschew the standard ./configure, make, make install sequence that is typical of open source installations. I will, therefore, take a few moments to sketch out the installation process.
I recommend downloading and extracting every one of these libraries into a common directory from which each can be built and installed. It is particularly important that the png and zlib libraries be extracted into the same top level directory before you build and install them. The order of installation is also important.
First, download the jpeg library from http://www.ijg.org/, unzip and untar the files and cd into the jpeg-6b directory.
./configure
make
make install
#in jpeg-6b, install-lib is the target that installs libjpeg and its headers
make install-lib
cd ..
Second, download zlib from http://www.info-zip.org/pub/infozip/zlib/.
gunzip zlib*.tar.gz
tar -xvpf zlib-1.1.3.tar
mv zlib-1.1.3 zlib
cd zlib
./configure
make
make test
make install
cd ..
Third, download libpng from http://www.libpng.org/.
gunzip libpng*.tar.gz
tar -xvpf libpng-1.0.10.tar
mv libpng-1.0.10 libpng
cd libpng
The png library has a number of makefiles in the scripts directory that are specific to different architectures. In my case:
cp scripts/makefile.linux makefile
make
make test
make install
cd ..
You may, optionally, choose to download and install the freetype font library. It is often difficult to connect to the main sourceforge site, but a list of mirrors can be found at: http://www.freetype.org/download.html.
I have found that freetype-2.0.1.tar.gz compiled without any difficulty. After extracting freetype-2.0.1.tar.gz:
cd freetype*
make setup
make
make install
cd ..
Now we are ready to compile the gd graphics library. You may download gd-1.8.4.tar.gz from http://www.boutell.com/gd/.
cd gd-1.8.4
#edit the Makefile if you want Xpm or TrueType support
make
Some people have found that compiling PHP with gd support may not work correctly unless previous versions of gd.h are removed. Because of this, and because of other difficulties I have had with older copies of gd.h in my include path, I recommend that you locate and archive any previous versions of gd.h before installing:
make install
cd ..
Finally, you may download webalizer from http://www.mrunix.net/webalizer/.
cd webalizer-2.01-06
./configure
make
make install
If you installed webalizer from a package rather than compiling it from source, be sure that you archive /etc/webalizer.conf. Building from source will not create this file, but some RPMs may install a sample webalizer.conf. The short explanation is that webalizer searches for webalizer.conf by default, and it may interfere with the script that we are going to write a little later.
mv /etc/webalizer.conf /etc/webalizer.conf.working07292001

Configuring Apache:

Most ISPs that use Apache have set up their domains as virtual hosts in their Apache configuration file (httpd.conf). By default, Apache logs everything to access_log and error_log (and perhaps access_log-ssl and error_log-ssl if you use mod_ssl). In order to properly analyze log files for your ISP's clients, it is necessary to add a CustomLog directive within each VirtualHost block, creating a separate log file for each domain. Take my domain for example:

<VirtualHost 64.240.90.200:80>
ServerAdmin webmaster@hamptonandassociates.net
DocumentRoot /www/handassoc/public_html
ServerName hamptonandassociates.net
ServerAlias www.hamptonandassociates.net
</VirtualHost>
Between the VirtualHost tags that define my web-site, I wanted to add the following line:
CustomLog /var/log/httpd/www.hamptonandassociates.net-access_log combined
Since reading Darrel Brogdon's article "Using PHP as a Shell Scripting Language", I have been using PHP for most scripting tasks. Fortunately, my client had been fairly consistent in how he had constructed his ServerName and ServerAlias directives for each virtual host. If you are a Perl wonk or a sed fanatic, you could probably do this with fewer lines of code, but I think my PHP solution is easier to read: Script 1.
Briefly, this script opens the Apache configuration file and reads each line. If the line is a comment, it simply writes the line to the output file. If the line is not a comment, it looks for the first VirtualHost directive (which is usually toward the bottom of most Apache configuration files). Once it has reached that directive, it starts to look for a ServerName directive. When it finds one, it writes the CustomLog directive, with the appropriate log file name, on a new line directly after the ServerName line.
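The PHP source of Script 1 is not reproduced here. As a minimal sketch of the same logic, written in Python purely for illustration, the function below copies every line through, tracks whether it is inside a VirtualHost block, and emits a CustomLog line after each ServerName it finds there. The function name add_custom_logs, the log_dir default, and the rule of prefixing "www." to the host name (inferred from the example log file name above) are assumptions, not details taken from the original script.

```python
import re

# Apache directive names are case-insensitive, so match them that way.
VHOST_OPEN = re.compile(r"^\s*<VirtualHost\b", re.IGNORECASE)
VHOST_CLOSE = re.compile(r"^\s*</VirtualHost>", re.IGNORECASE)
SERVER_NAME = re.compile(r"^\s*ServerName\s+(\S+)", re.IGNORECASE)


def add_custom_logs(lines, log_dir="/var/log/httpd"):
    """Return a copy of `lines` with a CustomLog line after each
    ServerName that appears inside a VirtualHost block.

    `lines` is an iterable of config lines without trailing newlines.
    """
    out = []
    in_vhost = False
    for line in lines:
        out.append(line)  # every original line is written through unchanged
        if line.strip().startswith("#"):
            continue  # comments are copied but never inspected
        if VHOST_OPEN.match(line):
            in_vhost = True
        elif VHOST_CLOSE.match(line):
            in_vhost = False
        elif in_vhost:
            match = SERVER_NAME.match(line)
            if match:
                host = match.group(1)
                # Assumed naming rule: log files carry the www. form of the name.
                if not host.startswith("www."):
                    host = "www." + host
                out.append(f"CustomLog {log_dir}/{host}-access_log combined")
    return out


if __name__ == "__main__":
    sample = [
        "<VirtualHost 64.240.90.200:80>",
        "ServerName hamptonandassociates.net",
        "</VirtualHost>",
    ]
    print("\n".join(add_custom_logs(sample)))
```

To use something like this on a real file, one would read httpd.conf into a list of lines, write the result to a separate output file, and compare the two before replacing anything.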
After running this script and ensuring that the output is satisfactory, you can replace your current httpd.conf file. I also like to switch HostnameLookups to On, but this can cause trouble with some high volume sites because it tells Apache to do a DNS lookup for every client that connects to the web server. You can compile webalizer to do DNS lookups for you when it runs, but this requires that you download and compile the Berkeley DB package available at http://www.sleepycat.com/. See the DNS.README file that comes with webalizer for more details.
Don't forget to restart Apache after you finish editing httpd.conf.

Creating Webalizer's Output Directories:

Next, I needed to create directories under each domain's document root to store the output from webalizer. On my client's computer the document roots were organized like: /www/some_domain/public_html.
I wanted to add a directory for each domain within this tree like:
/www/some_domain/public_html/usage.
Here is the script I wrote to do this: Script 2.
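Script 2 itself is not shown here. The following is a rough Python sketch of the same idea, assuming the /www/some_domain/public_html layout described above: it walks every public_html directory one level below the web root and creates a usage directory inside it if one does not already exist. The function name create_usage_dirs and its parameters are hypothetical.

```python
from pathlib import Path


def create_usage_dirs(www_root="/www", subdir="usage"):
    """Create <www_root>/<domain>/public_html/<subdir> for every domain.

    Returns the list of directories that were newly created.
    """
    created = []
    for docroot in sorted(Path(www_root).glob("*/public_html")):
        if not docroot.is_dir():
            continue  # skip stray files that happen to be named public_html
        usage = docroot / subdir
        if not usage.exists():
            # The final permissions are still subject to the process umask.
            usage.mkdir(mode=0o755)
            created.append(usage)
    return created


if __name__ == "__main__":
    for path in create_usage_dirs():
        print("created", path)
```

Because existing directories are left alone, a script like this can be run again safely whenever new domains are added.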

Configuring Webalizer:

Next, I created a file that I called webalizer.min. This file contained the configuration directives that would be common to all of the domains.
The README file in the webalizer directory contains a wealth of information about each of these options. Basically, the PageType directives tell webalizer to count requests for files with the .htm, .html, .cgi, .php, .php3, and .pl extensions as pages, which is what its 'visits' figure is based on. The CountryGraph directive produces a nice pie chart broken down by the country from which traffic to your site originates, but it is useless unless HostnameLookups is on in your Apache configuration. GroupReferrer enables webalizer to group together the results from major domains. In my case, I had certain primary search engines for which I thought it would be instructive to group the results.
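The contents of webalizer.min are not reproduced here. As a rough illustration only, a shared file covering the directives just described might look something like the fragment below. The directive names are webalizer's own; the values, and in particular the search engines chosen for grouping, are assumptions.

```text
# webalizer.min: settings shared by every domain (illustrative values only)

# Count these extensions as pages; htm* covers both .htm and .html
PageType        htm*
PageType        cgi
PageType        php
PageType        php3
PageType        pl

# Only meaningful when host names are resolved
CountryGraph    yes

# Group referrals from major search engines (example entries)
GroupReferrer   yahoo.com/      Yahoo!
GroupReferrer   google.com/     Google
```

Anything specific to a single domain, such as its host name, log file, and output directory, is deliberately left out of a shared file like this and supplied separately for each run.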
Finally, I created a cron script to run webalizer. Script 3. After running the script once, I inspected the output to ensure that it was functioning properly by pointing a browser at http://some_domain/usage/.
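Script 3 is likewise not shown. A cron script of this kind might be sketched as follows, again in Python rather than the original PHP: it reads each virtual host's ServerName and DocumentRoot from httpd.conf and runs webalizer once per domain with the shared configuration file. The function names, the paths /root/webalizer.min and /etc/httpd/conf/httpd.conf, and the "www." naming rule for log files are assumptions; the -c (configuration file), -n (host name), and -o (output directory) options are webalizer's own.

```python
import re
import subprocess

DIRECTIVE = re.compile(r"^\s*(ServerName|DocumentRoot)\s+(\S+)", re.IGNORECASE)


def parse_vhosts(lines):
    """Return (server_name, document_root) pairs, one per VirtualHost block."""
    vhosts = []
    current = None  # directives collected for the block being read, if any
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("#"):
            continue
        lowered = stripped.lower()
        if lowered.startswith("<virtualhost"):
            current = {}
        elif lowered.startswith("</virtualhost"):
            if current is not None and {"servername", "documentroot"} <= current.keys():
                vhosts.append((current["servername"], current["documentroot"]))
            current = None
        elif current is not None:
            match = DIRECTIVE.match(line)
            if match:
                # Paths in httpd.conf are sometimes quoted; drop the quotes.
                current[match.group(1).lower()] = match.group(2).strip('"')
    return vhosts


def webalizer_command(server_name, document_root,
                      conf="/root/webalizer.min", log_dir="/var/log/httpd"):
    """Build the argument list for one webalizer run."""
    # Assumed naming rule: log files carry the www. form of the host name.
    host = server_name if server_name.startswith("www.") else "www." + server_name
    return [
        "webalizer",
        "-c", conf,                        # shared configuration file
        "-n", host,                        # host name shown in the report
        "-o", f"{document_root}/usage",    # per-domain output directory
        f"{log_dir}/{host}-access_log",    # per-domain log file
    ]


def run_all(httpd_conf="/etc/httpd/conf/httpd.conf"):
    """Run webalizer for every virtual host found in httpd_conf."""
    with open(httpd_conf) as handle:
        for name, docroot in parse_vhosts(handle):
            subprocess.run(webalizer_command(name, docroot), check=False)
```

Calling run_all() from a small wrapper is all a cron entry would need to invoke. Keeping the command construction in its own function makes it easy to print the commands and check them by hand before letting cron run them.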
I put the script on autopilot by adding the following line to my /etc/crontab, which runs it as root every 15 minutes:
0-59/15 * * * * root /root/webalizer.php > /dev/null

Conclusion:

Webalizer is a superior log file analysis tool that is fast and free. With a little intelligent planning and liberal use of PHP, any ISP can begin to offer statistics to the owners of each domain that they host.
Rodney Hampton is the founder of R.A. Hampton and Associates, an IT consulting firm in Oak Park, Michigan specializing in open source solutions for business.