Introduction
Every ISP needs a log file analysis program. One of the best is the
Webalizer, an open source product originally written in Perl
and rewritten in C soon thereafter. It can parse both Common Log Format and
Combined Log Format files at a blistering pace. One benchmark states
that on a 200 MHz Pentium machine, over 10,000 records can be processed
in one second, with a 40 megabyte file taking roughly 15 seconds (over
150,000 records). When one of my consulting clients approached me
looking for an open source log file analysis program to replace their
proprietary one, Webalizer was my top recommendation.
Getting Started:
Installing Webalizer can be a daunting task because one quickly runs
into dependency problems. Webalizer requires the gd graphics library
which in turn relies upon the jpeg, png, and zlib compression libraries.
Further complicating the installation is that most of these libraries
eschew the standard ./configure, make, make install syntax that is
typical for most open source installations. I will, therefore, take a
few moments to sketch out the installation process.
It is recommended that every one of these libraries be downloaded and
extracted into a common directory from which each can be built and
installed. It is particularly important for the png and zlib libraries
to be extracted in the same top level directory prior to building and
installing them. The order of installation is important.
First, download the jpeg library from http://www.ijg.org/, unzip and
untar the files, and cd into the jpeg-6b directory:
./configure
make
make install
cd ..
Second, download the zlib library, then extract and build it:
gunzip zlib*.tar.gz
tar -xvpf zlib-1.1.3.tar
mv zlib-1.1.3 zlib
cd zlib
./configure
make
make test
make install
cd ..
Third, download libpng from
http://www.libpng.org/.
gunzip libpng*.tar.gz
tar -xvpf libpng-1.0.10.tar
mv libpng-1.0.10 libpng
cd libpng
The png library has a number of makefiles in the scripts directory that
are specific to different architectures. In my case:
cp scripts/makefile.linux makefile
make
make test
make install
cd ..
You may optionally download and install the FreeType font library. It
is often difficult to connect to the main SourceForge site, but a list
of mirrors can be found at http://www.freetype.org/download.html.
I have found that freetype-2.0.1.tar.gz compiles without any difficulty.
After extracting freetype-2.0.1.tar.gz:
cd freetype*
make setup
make
make install
cd ..
Now we are ready to compile the gd graphics library. You may download
gd-1.8.4.tar.gz from
http://www.boutell.com/gd/.
cd gd-1.8.4
#edit the Makefile if you want Xpm or TrueType support
make
Some individuals have found that compiling PHP with gd support may not
work correctly unless previous versions of gd.h are removed. Because of
this and other difficulties I have had with older copies of gd.h being
in my include path, I recommend that you locate and archive any previous
versions of gd.h before typing 'make install'.
Finally, you may download webalizer from
http://www.mrunix.net/webalizer/.
cd webalizer-2.01-06
./configure
make
make install
If you installed from a package rather than compiling from source, be
sure to archive /etc/webalizer.conf. Building from source does not
create this file, but some RPMs install a sample webalizer.conf. In
short, webalizer searches for webalizer.conf by default, and a stray
copy may interfere with the script that we are going to write a little
later:
mv /etc/webalizer.conf /etc/webalizer.conf.working07292001
Configuring Apache:
Most ISPs that use Apache have set up
their domains as virtual hosts in their Apache configuration file
(httpd.conf). By default, Apache logs everything to access_log and
error_log (and perhaps access_log-ssl and error_log-ssl if you use
mod_ssl). To analyze log files properly for your ISP's clients,
it is necessary to add a CustomLog directive within each VirtualHost
block, creating a separate log file for each domain. Take my domain, for
example:
<VirtualHost 64.240.90.200:80>
ServerAdmin webmaster@hamptonandassociates.net
DocumentRoot /www/handassoc/public_html
ServerName hamptonandassociates.net
ServerAlias www.hamptonandassociates.net
</VirtualHost>
Between the VirtualHost tags that define my web-site, I wanted to add the
following line:
CustomLog /var/log/httpd/www.hamptonandassociates.net-access_log combined
Since reading Darrel Brogdon's article "Using PHP as a Shell Scripting
Language", I have been using PHP for most scripting tasks. Fortunately,
my client had been fairly consistent in how he had constructed his
ServerName and ServerAlias directives for each virtual host. If you are
a Perl wonk or a sed fanatic, you could probably do this with fewer
lines of code, but I think my PHP solution is easier to read:
Script 1.
Briefly, this script opens the Apache configuration file and reads each
line. If the line is a comment, it simply writes the line to the output
file. If the line is not a comment, it looks for the first VirtualHost
directive. Once it has reached the first VirtualHost directive (which is
usually toward the bottom of most Apache configuration files), it
starts to look for a ServerName directive. When it finds this
directive, it appends the custom logfile directive to the end of the
ServerName line with the appropriate log file name.
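As a rough illustration of the steps just described (and not the author's PHP script), the same pass over httpd.conf can be sketched as a shell function wrapped around awk. The function name add_customlog, the default log directory, and the rule of prefixing "www." to the host name (chosen to match the log file name shown earlier) are all assumptions. This sketch also stops adding lines once a closing VirtualHost tag is seen, which the description does not mention.

```shell
#!/bin/sh
# add_customlog (hypothetical name): copy an Apache config from stdin to
# stdout, adding a CustomLog line after each ServerName inside a VirtualHost.
# $1 is the log directory (assumed default: /var/log/httpd).
add_customlog() {
    awk -v logdir="${1:-/var/log/httpd}" '
        # Comment lines are written through unchanged.
        /^[ \t]*#/ { print; next }
        # Track whether we are inside a <VirtualHost> block.
        tolower($1) ~ /^<virtualhost/   { in_vhost = 1 }
        tolower($1) ~ /^<\/virtualhost/ { in_vhost = 0 }
        { print }
        # After a ServerName inside a virtual host, emit the CustomLog line.
        in_vhost && tolower($1) == "servername" {
            host = $2
            if (host !~ /^www\./) host = "www." host   # assumed naming rule
            printf "CustomLog %s/%s-access_log combined\n", logdir, host
        }
    '
}

# Example: add_customlog < httpd.conf > httpd.conf.new
```

Writing to a new file rather than editing in place leaves the original httpd.conf untouched until the output has been inspected.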
After running this script and ensuring that the output is satisfactory,
you can replace your current httpd.conf file. I also like to switch
HostnameLookups to On, but this can cause trouble on some high-volume
sites because it tells Apache to do a DNS lookup for every client that
connects to the web server.
You can compile webalizer to do DNS lookups for you when it runs, but
this requires that you download and compile the Berkeley DB package
available at http://www.sleepycat.com/. See the DNS.README file that
comes with webalizer for more details.
Don't forget to restart Apache after you finish editing httpd.conf.
Creating Webalizer's Output Directories:
Next, I needed to create directories under each domain's document root
to store the output from webalizer. On my client's computer, the
document roots were organized like this: /www/some_domain/public_html.
Here is the script I wrote to do this:
Script 2.
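A minimal shell sketch of this step, assuming the /www/some_domain/public_html layout described above and an output directory named "usage" (taken from the http://some_domain/usage/ address used to view the reports). The function name make_usage_dirs is hypothetical, and this is not the author's PHP script.

```shell
#!/bin/sh
# make_usage_dirs (hypothetical name): create a "usage" directory under
# every <base>/<domain>/public_html document root. $1 is the base
# directory (assumed default: /www).
make_usage_dirs() {
    base="${1:-/www}"
    for docroot in "$base"/*/public_html; do
        [ -d "$docroot" ] || continue    # skip if the glob matched nothing
        mkdir -p "$docroot/usage"        # -p: no error if it already exists
    done
}

# Example: make_usage_dirs /www
```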
Next, I created a file that I called
webalizer.min. This file contained the configuration directives that would be common
for all of the domains.
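A shared file along these lines might contain the directives discussed in this section. The sketch below writes one out from the shell; the function name write_webalizer_min is hypothetical, and the search engines listed under GroupReferrer are illustrative choices, not the author's list.

```shell
#!/bin/sh
# write_webalizer_min (hypothetical name): write the shared settings to the
# file named in $1. The search engines listed are illustrative choices only.
write_webalizer_min() {
    cat > "$1" <<'EOF'
# File extensions that webalizer should count as pages
PageType        htm
PageType        html
PageType        cgi
PageType        php
PageType        php3
PageType        pl

# Pie chart of traffic by country (needs resolved host names in the logs)
CountryGraph    yes

# Group referrers from selected search engines under one label
GroupReferrer   yahoo.com/      Yahoo!
GroupReferrer   google.com/     Google
GroupReferrer   altavista.com/  AltaVista
EOF
}

# Example: write_webalizer_min ./webalizer.min
```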
The README file in the webalizer directory contains a wealth of
information about each of these options. Basically, the PageType
directives tell webalizer to count requests for files with the .htm,
.html, .cgi, .php, .php3, and .pl extensions as pages rather than as
mere hits. The CountryGraph option produces a nice pie chart broken
down by the country from which traffic to your site originates, but it
is useless unless HostnameLookups is on in your Apache configuration.
GroupReferrer enables webalizer to group together the results from
major domains. In my case, I had certain primary search engines for
which I thought it would be instructive to group the results.
Finally, I created a cron script to run webalizer.
Script 3. After running the script
once, I inspected the output to ensure that it was functioning properly
by pointing a browser at http://some_domain/usage/.
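A wrapper of this kind can be sketched in shell as follows; it is not the author's script. It assumes the per-domain log names created earlier (host name followed by -access_log), a "usage" directory under each document root, and a shared settings file whose location is a guess. The function name and the "hostname docroot" input format are hypothetical. The -c, -n and -o options select the configuration file, the host name shown in the report, and the output directory.

```shell
#!/bin/sh
# Hypothetical cron wrapper. All three defaults below are assumptions.
WEBALIZER="${WEBALIZER:-webalizer}"     # set WEBALIZER=echo for a dry run
CONF="${CONF:-/etc/webalizer.min}"      # shared settings file
LOGDIR="${LOGDIR:-/var/log/httpd}"      # where the per-domain logs live

# Read "hostname docroot" pairs from stdin; run webalizer once per pair.
run_webalizer_all() {
    while read -r host docroot; do
        [ -n "$host" ] && [ -n "$docroot" ] || continue
        log="$LOGDIR/$host-access_log"
        [ -r "$log" ] || continue       # skip domains with no log yet
        # -c config file, -n host name for the report, -o output directory
        "$WEBALIZER" -c "$CONF" -n "$host" -o "$docroot/usage" "$log" < /dev/null
    done
}

# Example:
# echo "www.hamptonandassociates.net /www/handassoc/public_html" | run_webalizer_all
```

Setting WEBALIZER=echo prints each command line instead of running it, which is a cheap way to check the paths before the first real run.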
I put the script on autopilot by adding a line to my /etc/crontab like
the following:
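For illustration only: an /etc/crontab entry has five time fields, then a user name, then the command. The schedule and the script path below are assumptions, not the author's values.

```shell
# Hypothetical /etc/crontab entry: run the wrapper script nightly at 04:15 as root.
# Fields: minute hour day-of-month month day-of-week user command
CRON_LINE='15 4 * * * root /usr/local/bin/run_webalizer.sh'
printf '%s\n' "$CRON_LINE"    # as root, append with: >> /etc/crontab
```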
Webalizer is a superior log file analysis tool that is fast and free.
With a little intelligent planning and liberal use of PHP, any ISP can
begin to offer statistics to the owners of each domain that they host.