[ Prof. Harnad ] [ Dr. Carr ] [ Dr. Jiao ] [ S. Hitchcock ] [ T. Brody ] [ E-Prints UK Mirror ]
[ Previous ] [ Home ] [ Next ]

Written by: Ian Hickman

Processing the raw web logs.

We have complete web logs from the UK E-Print mirror for 8 months from 24th August 1999 to 9th May 2000.
To answer the question of how the site is navigated, these web logs needed to be analysed. Some questions that immediately arise are:

  • What does a web log look like?
  • What about unwanted accesses by programs rather than humans?
  • What about access errors?

What does a web log look like?

This is an example of a web log:
byrne.ecs.soton.ac.uk - - [23/Oct/1999:23:27:00 +0100] "GET /find/astro-ph HTTP/1.0" 200 3588 "-" "Mozilla/4.5 [en] (X11; U; SunOS 5.7 sun4u)"

byrne.ecs.soton.ac.uk - - [23/Oct/1999:23:28:43 +0100] "POST /find HTTP/1.0" 200 6040 "http://xxx.soton.ac.uk/find/astro-ph" "Mozilla/4.5 [en] (X11; U; SunOS 5.7 sun4u)"

byrne.ecs.soton.ac.uk - - [23/Oct/1999:23:29:14 +0100] "POST /find HTTP/1.0" 200 6033 "http://xxx.soton.ac.uk/find/astro-ph" "Mozilla/4.5 [en] (X11; U; SunOS 5.7 sun4u)"

callaghan.ecs.soton.co.uk - - [23/Oct/1999:23:33:54 +0100] "GET / HTTP/1.0" 304 0 "-" "Mozilla/4.7 [en] (X11; I; Linux 2.0.35 i686)"

callaghan.ecs.soton.co.uk - - [23/Oct/1999:23:33:55 +0100] "GET /uk.gif HTTP/1.0" 304 - "http://xxx.soton.ac.uk/" "Mozilla/4.7 [en] (X11; I; Linux 2.0.35 i686)"

The general format is:
"Client's ID" "-" "-" "Date" "HTTP Request" "HTTP Response Code" "Amount of Data Transferred (in bytes)" "Referring Page" "Client's Details"

The second and third fields are dashes throughout the entire log.
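The format above can be split into named fields with a regular expression. The following is a minimal sketch rather than the script used for this analysis; the field names (client, ident, user, and so on) and the function name parse_line are assumptions chosen for illustration.

```python
import re
from datetime import datetime

# One log line in the layout described above. Field names are illustrative.
LOG_PATTERN = re.compile(
    r'(?P<client>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A dash in the size field means no data was sent; treat it as 0 bytes.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    # Example date: 23/Oct/1999:23:28:43 +0100
    entry["time"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    return entry
```

Lines that do not fit the pattern return None, so malformed entries can be counted or skipped instead of stopping the run.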

Removing Unwanted Data from the site logs

Unwanted accesses (e.g. spiders and robots) needed to be removed from the logs. This was done by a script that removes any line that contains the words "cache", "Cache", "spider" or "Spider", or that does not contain the word "Mozilla". This should catch the majority of spider accesses, as well as the uploads from the main site to the mirror. The files output by the script are in the same format as the input files. The logs originally had 2,059,223 entries; after running the script they had 1,333,395, so 725,828 entries (about 35%) were removed.
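The filtering rule described above can be sketched as follows. This is an illustration of the stated rule, not the original script; the names UNWANTED_WORDS, is_wanted and filter_log are hypothetical.

```python
# Words that mark a line as an unwanted access, exactly as listed in the text.
UNWANTED_WORDS = ("cache", "Cache", "spider", "Spider")


def is_wanted(line):
    """Keep a line only if it has no unwanted word and mentions "Mozilla"."""
    if any(word in line for word in UNWANTED_WORDS):
        return False
    return "Mozilla" in line


def filter_log(lines):
    """Return the wanted lines unchanged, so the output keeps the input format."""
    return [line for line in lines if is_wanted(line)]
```

Because the match is a plain substring test on the whole line, a robot whose client details include "Mozilla" and none of the four words would still pass, which fits the claim that the rule catches the majority of such accesses rather than all of them.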

Sorting the Logs by User

After the unwanted data had been removed, the logs could be sorted by user. This makes it easier to track individual users, instead of having accesses by many different users interleaved within a short time.

Sorting Example

Original                          Sorted
User A gets page 1 at time 1      User A gets page 1 at time 1
User B gets page 1 at time 2      User A gets page 2 at time 4
User B gets page 2 at time 3      User A gets page 3 at time 6
User A gets page 2 at time 4      User B gets page 1 at time 2
User B gets page 3 at time 5      User B gets page 2 at time 3
User A gets page 3 at time 6      User B gets page 3 at time 5
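One way to get this ordering is a stable sort on the client ID. The sketch below assumes that the first field of each line identifies the user and that the input lines are already in time order, as they are in a web log; the function names are hypothetical and this is not necessarily how the original sort was done.

```python
def client_of(line):
    """Return the first space-separated field of a log line (the client ID)."""
    return line.split(" ", 1)[0]


def sort_by_user(lines):
    """Group lines by client while keeping each client's lines in time order.

    sorted() is stable, so lines with the same client ID stay in their
    original relative order, which is chronological for a web log.
    """
    return sorted(lines, key=client_of)
```

Using a stable sort avoids parsing the date field at all; if the input were not already chronological, the key would need to include the parsed time as well.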
