The Webalizer is web log analysis software that generates web pages of analysis from access and usage logs. It is one of the most commonly used web server administration tools. It was initiated by Bradford L. Barrett in 1997. Statistics commonly reported by The Webalizer include hits, visits, referrers, the visitors' countries, and the amount of data downloaded. These statistics can be viewed graphically and presented over different time frames, such as by hour, day, or month.
Overview
Website traffic analysis is produced by grouping and aggregating various data items captured by the web server in the form of log files while the website visitor is browsing the website. Some of the most commonly used website traffic analysis terms are listed below:
URL
A Uniform Resource Locator (URL) uniquely identifies the resource requested by the user's browser.
Hit
Each HTTP request submitted by the browser is counted as one hit. Note that HTTP requests may be submitted for non-existent content, in which case they will still be counted. For example, if an HTML page refers to five image files and one of them is missing, the web server will still log six HTTP requests: five marked as successful (one HTML file and four images) and one marked as failed (the missing image).
Page
A page is a successful HTTP request for a resource that constitutes the website's primary content. Pages are usually identified by a file extension (e.g. .html, .php, .asp) or by a missing extension, in which case the subject of the HTTP request is considered a directory and the default page for this directory is served.
File
Each successful HTTP request is counted as a file.
Visitor
A visitor is the actual person browsing the website. A typical website serves content to anonymous visitors and cannot associate requests with the actual person behind them. Visitor identification may be based on the IP address or on an HTTP cookie. The former approach is simple to implement, but results in all visitors browsing the website from behind the same firewall being counted as a single visitor. The latter approach requires special configuration of the web server (i.e. to log HTTP cookies) and is more expensive to implement. Note that neither approach identifies the actual person browsing the website, and neither provides 100% accuracy in determining that the same visitor has visited the website again.
Visit
A visit is a series of HTTP requests submitted by a visitor in which the time between consecutive requests does not exceed a limit configured by the webmaster, typically 30 minutes. For example, if a visitor requested page A, then page B 10 minutes later, and then page C 40 minutes after that, this visitor has generated two visits: one when pages A and B were requested and another when page C was requested.
Host
In general, a host is the visitor's machine running the browser. Hosts are often identified by IP addresses or domain names. Those web traffic analysis tools that use IP addresses to identify visitors use the words hosts, domain names and IP addresses interchangeably.
User Agent
User agent is a synonym for a web browser.
To illustrate the difference between hits, pages and files, consider a user requesting an HTML file that refers to five images, one of which is missing. In this case the web server will log six hits (one for the HTML file itself, four for the successfully retrieved images and one for the missing image), five files (the five successful HTTP requests) and one page (the HTML file).
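The counting rules in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not The Webalizer's own implementation: the `classify` function, the `PAGE_EXTENSIONS` tuple, the sample paths, and the choice to treat any 2xx status as "successful" are all hypothetical.

```python
from collections import Counter

# Assumption for this sketch: which extensions count as pages.
PAGE_EXTENSIONS = (".htm", ".html", ".php", ".asp", ".cgi")

def classify(requests):
    """Count hits, files and pages for (url_path, status_code) pairs."""
    counts = Counter(hits=0, files=0, pages=0)
    for path, status in requests:
        counts["hits"] += 1                      # every request is a hit
        if 200 <= status < 300:                  # assumption: 2xx means success
            counts["files"] += 1                 # every successful request is a file
            last_segment = path.rsplit("/", 1)[-1]
            # A page has a page extension, or no extension at all (a directory).
            if "." not in last_segment or path.lower().endswith(PAGE_EXTENSIONS):
                counts["pages"] += 1
    return dict(counts)

# One HTML file referring to five images, one of which is missing.
requests = [
    ("/index.html", 200),
    ("/img/a.png", 200), ("/img/b.png", 200),
    ("/img/c.png", 200), ("/img/d.png", 200),
    ("/img/missing.png", 404),
]
print(classify(requests))  # {'hits': 6, 'files': 5, 'pages': 1}
```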
Log file types
The Webalizer analyzes web server log files, extracting such items as clients' IP addresses, URL paths, processing times, user agents and referrers, and grouping them in order to produce HTML reports.
Web servers log HTTP traffic using different file formats. The most popular formats are the Common Log Format (CLF), the Apache Custom Log Format and the W3C Extended Log File Format. An example of a CLF log line is shown below.
192.168.1.20 - - [26/Dec/2006:03:09:16 -0500] "GET / HTTP/1.1" 200 1774
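A minimal sketch of how a CLF line might be split into its fields is shown below. The regular expression, the `parse_clf` function and the sample request path `/index.html` are assumptions made for illustration; The Webalizer itself is not written this way.

```python
import re
from datetime import datetime

# host ident user [timestamp] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Month names are parsed with the default (English) locale.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" in the size column means no bytes were sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# Hypothetical sample line for illustration.
sample = '192.168.1.20 - - [26/Dec/2006:03:09:16 -0500] "GET /index.html HTTP/1.1" 200 1774'
record = parse_clf(sample)
print(record["host"], record["status"], record["size"])
```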
Apache Custom Log Format can be customized to log most HTTP parameters, including request processing time and the size of the request itself. The format of a custom log is controlled by the format line. A typical Apache log format configuration is shown below.
LogFormat "%a %l \"%u\" %t %m \"%U\" \"%q\" %p %>s %b %D \"%{Referer}i\" \"%{User-Agent}i\"" my_custom_log
CustomLog logs/access_log my_custom_log
Microsoft's Internet Information Services (IIS) web server logs HTTP traffic in W3C Extended Log File Format. Similarly to Apache Custom Log format, IIS logs may be configured to capture such extended parameters as request processing time. W3C extended logs may be recognized by the presence of one or more format lines, such as the one shown below.
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-bytes cs-bytes time-taken
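Because the format line names every column, a W3C extended log can be read by pairing each data line with the most recent `#Fields:` directive. The sketch below assumes space-separated values with no embedded spaces; the `parse_w3c` function and the shortened sample log are hypothetical.

```python
def parse_w3c(lines):
    """Turn W3C extended log lines into dicts keyed by the #Fields names."""
    fields = []
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # A new format line replaces the current column names.
            fields = line[len("#Fields:"):].split()
        elif line.startswith("#"):
            continue                      # other directives (#Version, #Date, ...)
        elif fields:
            values = line.split()
            if len(values) == len(fields):
                records.append(dict(zip(fields, values)))
    return records

# Hypothetical log excerpt with a shortened field list.
sample_log = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status",
    "2006-12-26 03:09:16 192.168.1.20 GET /index.html 200",
]
rows = parse_w3c(sample_log)
print(rows[0]["cs-uri-stem"], rows[0]["sc-status"])
```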
The original version of The Webalizer can process CLF log files, as well as HTTP proxy log files produced by Squid servers. Other log file formats are usually converted to CLF in order to be analyzed. Some of the forks listed in the External Links section below are capable of processing IIS and Apache log files without having to convert them to CLF first.
Command line
The Webalizer is a command line application and is launched from the OS shell prompt. A typical command is shown below.
webalizer -p -F clf -n web.ictea.com -o reports logfiles/access_log
This command instructs The Webalizer to analyze the log file logfiles/access_log, run in incremental mode (-p), interpret the log as a CLF log file (-F), use the domain name web.ictea.com for report links (-n) and write the output to the reports subdirectory of the current directory (-o).
Use the -h option to see the complete list of command line options.
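When The Webalizer is run from a scheduled script rather than typed by hand, the same command can be assembled programmatically. The sketch below only builds and prints the argument list using the options described above; the `build_webalizer_command` helper is hypothetical, and actually launching the program is left as a comment because it requires The Webalizer to be installed.

```python
import shlex

def build_webalizer_command(log_file, output_dir, hostname,
                            log_type="clf", incremental=True):
    """Assemble the argument list for the command shown above."""
    args = ["webalizer"]
    if incremental:
        args.append("-p")                 # incremental mode
    args += ["-F", log_type,              # log file format
             "-n", hostname,              # domain name used in report links
             "-o", output_dir,            # output directory
             log_file]
    return args

cmd = build_webalizer_command("logfiles/access_log", "reports", "web.ictea.com")
print(shlex.join(cmd))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```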
Configuration
Besides the command line options, The Webalizer may be configured through parameters in a configuration file. By default, The Webalizer reads the file webalizer.conf and interprets each line as a processing instruction. Alternatively, a user-specified file may be provided using the -c option.
For example, a webmaster who would like to ignore all requests made from a particular group of hosts can use the IgnoreSite parameter to discard all log records whose IP address matches the specified pattern:
IgnoreSite 192.168.0.*
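The effect of such a pattern can be approximated as a filter over parsed log records. This sketch uses Python's shell-style wildcard matching as a stand-in; The Webalizer's own matching rules may differ, and the `filter_ignored` function and the record layout are assumptions.

```python
from fnmatch import fnmatchcase

def filter_ignored(records, ignore_patterns):
    """Drop records whose host matches any of the wildcard patterns."""
    return [
        record for record in records
        if not any(fnmatchcase(record["host"], pattern)
                   for pattern in ignore_patterns)
    ]

# Hypothetical parsed records; only the host field matters here.
records = [
    {"host": "192.168.0.15", "url": "/index.html"},
    {"host": "192.168.1.20", "url": "/index.html"},
    {"host": "10.0.0.7", "url": "/about.html"},
]
kept = filter_ignored(records, ["192.168.0.*"])
print([record["host"] for record in kept])  # ['192.168.1.20', '10.0.0.7']
```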
There are over one hundred available configuration parameters, which make The Webalizer a highly configurable web traffic analysis application. For a complete list of configuration parameters please refer to the README file shipped with every source or binary distribution.
Reports
By default, The Webalizer produces two kinds of reports - a yearly summary report and a detailed monthly report, one for each analyzed month.
The yearly summary report provides such information as the number of hits, file and page requests, hosts and visits, as well as daily averages of these counters for each month. The report is accompanied by a yearly summary graph.
Each of the monthly reports is generated as a single HTML page containing the following reports:
- a monthly summary report, listing the overall number of hits, file and page requests, visits, hosts, etc.
- a daily report, grouping these counters for each day of the month
- an aggregated hourly report, grouping counters for the same hour of each day together
- a URL report, grouping collected information by URL
- a host report, grouping by IP address
- website entry and exit URL reports, showing the most common first and last URLs of a visit
- a referrer report, grouping the referring third-party URLs leading to the analyzed website
- a search string report, grouping items by search terms used in such search engines as Google
- a user agent report, grouping by browser type
- a country report, grouping by the host's country of origin
Each of the standard HTML reports described above lists only top entries for each item (e.g. top 20 URLs). The actual number of lines for each of the reports is controlled by configuration. The Webalizer may also be configured to produce a separate report for each of the items, which will list every single item, such as all website visitors, all requested URLs, etc.
In addition to HTML reports, The Webalizer may be configured to produce tab-delimited dump files, which list all of the report data in plain-text files. Dump files may be imported into spreadsheet applications or databases for further analysis.
Internationalization
HTML reports may be produced in over 30 languages, including Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hungarian, Icelandic, Indonesian, Italian, Japanese, Korean, Latvian, Malay, Norwegian, Polish, Portuguese, Portuguese (Brazil), Romanian, Russian, Serbian, Simplified Chinese, Slovak, Slovene, Spanish, Swedish, Turkish and Ukrainian.
Generating reports in an alternate language requires a separate webalizer binary compiled specifically for that language.
Criticism
- Generated statistics do not differentiate between human visitors and robots. As a result, all reported metrics are higher than those due to people alone. Many webmasters claim that The Webalizer produces highly unrealistic visit figures, sometimes 200 to 900% higher than the data produced by JavaScript-based web statistics services such as Google Analytics or StatCounter.
- Reported hits are too high for download managers that use segmented downloads, because each 206 "Partial Content" response is reported as one hit.
- No query string analysis. Dynamically generated pages cannot be listed separately (e.g. PHP pages with arguments).
Glossary
Main Headings
Hits represent the total number of requests made to the server during the given time period (month, day, hour, etc.).
Files represent the total number of hits (requests) that actually resulted in something being sent back to the user. Not all hits will send data, such as 404-Not Found requests and requests for pages that are already in the browser's cache.
Tip: By looking at the difference between hits and files, you can get a rough indication of repeat visitors, as the greater the difference between the two, the more people are requesting pages they already have cached (have viewed already).
Sites is the number of unique IP addresses/hostnames that made requests to the server. Care should be taken when using this metric for anything other than that. Many users can appear to come from a single site, and they can also appear to come from many IP addresses so it should be used simply as a rough gauge as to the number of visitors to your server.
Visits occur when some remote site makes a request for a page on your server for the first time. As long as the same site keeps making requests within a given timeout period, they will all be considered part of the same Visit. If the site makes a request to your server, and the length of time since the last request is greater than the specified timeout period (default is 30 minutes), a new Visit is started and counted, and the sequence repeats. Since only pages will trigger a visit, remote sites that link to graphic and other non-page URLs will not be counted in the visit totals, reducing the number of false visits.
Pages are those URLs that would be considered the actual page being requested, and not all of the individual items that make it up (such as graphics and audio clips). Some people call this metric page views or page impressions. By default, a page is any URL that has an extension of .htm, .html or .cgi.
A KByte (KB) is 1024 bytes (1 Kilobyte). Used to show the amount of data that was transferred between the server and the remote machine, based on the data found in the server log.
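The timeout rule described under Visits above can be sketched as follows. This is a simplified illustration, not The Webalizer's code: it assumes the input contains page requests only, identifies a site by its host string, and uses hypothetical names (`count_visits`, `VISIT_TIMEOUT`) and timestamps in seconds.

```python
VISIT_TIMEOUT = 30 * 60   # default timeout of 30 minutes, in seconds

def count_visits(page_requests, timeout=VISIT_TIMEOUT):
    """Count visits from (host, timestamp_in_seconds) page requests."""
    last_seen = {}        # host -> time of its most recent page request
    visits = 0
    for host, timestamp in sorted(page_requests, key=lambda r: r[1]):
        previous = last_seen.get(host)
        # First request from this host, or the gap exceeded the timeout.
        if previous is None or timestamp - previous > timeout:
            visits += 1
        last_seen[host] = timestamp
    return visits

# One site requests a page, another 10 minutes later,
# and a third 40 minutes after that.
requests = [
    ("192.168.1.20", 0),
    ("192.168.1.20", 10 * 60),
    ("192.168.1.20", 50 * 60),
]
print(count_visits(requests))  # 2
```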
Common Definitions
A Site is a remote machine that makes requests to your server, and is based on the remote machine's IP Address/Hostname.
URL - Uniform Resource Locator. All requests made to a web server need to request something. A URL is that something, and represents an object somewhere on your server that is accessible to the remote user, or results in an error (i.e. 404 - Not found). URLs can be of any type (HTML, Audio, Graphics, etc.).
Referrers are those URLs that lead a user to your site or caused the browser to request something from your server. The vast majority of requests are made from your own URLs, since most HTML pages contain links to other objects such as graphics files. If one of your HTML pages contains links to 10 graphic images, then each request for the HTML page will produce 10 more hits with the referrer specified as the URL of your own HTML page.
Search Strings are obtained from examining the referrer string and looking for known patterns from various search engines. The search engines and the patterns to look for can be specified by the user within a configuration file. The default will catch most of the major ones.
Note: Only available if that information is contained in the server logs.
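As an illustration of how search strings can be recovered from a referrer, the sketch below looks up a query parameter based on the referring host. The engine table, the `search_string` function and the sample URLs are assumptions for this example and do not reflect The Webalizer's default configuration.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical table: host fragment -> query parameter holding the search terms.
SEARCH_ENGINES = {"google.": "q", "bing.": "q", "yahoo.": "p"}

def search_string(referrer):
    """Return the lower-cased search terms in a referrer URL, or None."""
    parts = urlsplit(referrer)
    host = parts.netloc.lower()
    for engine, parameter in SEARCH_ENGINES.items():
        if engine in host:
            values = parse_qs(parts.query).get(parameter)
            if values:
                return values[0].lower()
    return None

print(search_string("http://www.google.com/search?q=Web+Log+Analysis&hl=en"))
print(search_string("http://example.com/page.html"))
```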
User Agents are a fancy name for browsers. Netscape, Opera, Konqueror, etc. are all User Agents, and each reports itself in a unique way to your server. Keep in mind, however, that many browsers allow the user to change its reported name, so you might see some obvious fake names in the listing.
Note: Only available if that information is contained in the server logs.
Entry/Exit pages are those pages that were the first requested in a visit (Entry), and the last requested (Exit). These pages are calculated using the Visits logic above. When a visit is first triggered, the requested page is counted as an Entry page, and whatever the last requested URL was, is counted as an Exit page.
Countries are determined based on the top level domain of the requesting site. This is somewhat questionable, however, as there is no longer strong enforcement of domains as there was in the past. A .COM domain may reside in the US, or somewhere else. An .IL domain may actually be in Israel; however, it may also be located in the US or elsewhere. The most common domains seen are .COM (US Commercial), .NET (Network), .ORG (Non-profit Organization) and .EDU (Educational). A large percentage may also be shown as Unresolved/Unknown, as a fairly large percentage of dialup and other customer access points do not resolve to a name and are left as an IP address.
Response Codes are defined as part of the HTTP/1.1 protocol (RFC 2068; Chapter 10). These codes are generated by the web server and indicate the completion status of each request made to it.