[ Prof. Harnad ]
[ Dr. Carr ]
[ Dr. Jiao ]
[ S. Hitchcock ]
[ T. Brody ]
[ E-Prints UK Mirror ]
[ Previous ]
[ Home ]
[ Next ]
Written by: Ian Hickman
Processing the raw web logs.
We have complete web logs from the UK E-Print mirror for 8 months from 24th August 1999 to
9th May 2000. To answer the question of how the site is navigated, these web logs needed to be analysed. Some questions that
immediately arise are:
- What does a web log look like?
- What about unwanted accesses by programs rather than humans?
- What about access errors?
What does a web log look like?
This is an example of a web log:
byrne.ecs.soton.ac.uk - - [23/Oct/1999:23:27:00 +0100] "GET /find/astro-ph HTTP/1.0" 200 3588 "-" "Mozilla/4.5 [en] (X11; U; SunOS 5.7 sun4u)"
byrne.ecs.soton.ac.uk - - [23/Oct/1999:23:28:43 +0100] "POST /find HTTP/1.0" 200 6040 "http://xxx.soton.ac.uk/find/astro-ph" "Mozilla/4.5 [en] (X11; U; SunOS 5.7 sun4u)"
byrne.ecs.soton.ac.uk - - [23/Oct/1999:23:29:14 +0100] "POST /find HTTP/1.0" 200 6033 "http://xxx.soton.ac.uk/find/astro-ph" "Mozilla/4.5 [en] (X11; U; SunOS 5.7 sun4u)"
callaghan.ecs.soton.co.uk - - [23/Oct/1999:23:33:54 +0100] "GET / HTTP/1.0" 304 0 "-" "Mozilla/4.7 [en] (X11; I; Linux 2.0.35 i686)"
callaghan.ecs.soton.co.uk - - [23/Oct/1999:23:33:55 +0100] "GET /uk.gif HTTP/1.0" 304 - "http://xxx.soton.ac.uk/" "Mozilla/4.7 [en] (X11; I; Linux 2.0.35 i686)"
The general format is:
"Client ID" "-" "-" "Date" "HTTP Request" "HTTP Response Code" "Amount of Data Transferred (in bytes)" "Referring Page" "Client Details"
The second and third fields are dashes throughout the entire log.
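The field layout above can be sketched as a small parser. This is only an illustration, not the script used in the project: the names LOG_PATTERN and parse_log_line are hypothetical, and the pattern assumes every entry follows the nine-field layout exactly.

```python
import re
from datetime import datetime

# Hypothetical pattern for the nine fields described above.
LOG_PATTERN = re.compile(
    r'(?P<client>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" '
    r'"(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of the nine fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # Month names such as "Oct" are parsed with the default (English) locale.
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A dash in the size field means no data was transferred.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

sample = ('byrne.ecs.soton.ac.uk - - [23/Oct/1999:23:27:00 +0100] '
          '"GET /find/astro-ph HTTP/1.0" 200 3588 "-" '
          '"Mozilla/4.5 [en] (X11; U; SunOS 5.7 sun4u)"')
entry = parse_log_line(sample)
print(entry["client"], entry["status"], entry["size"])
```

Lines that do not fit the layout return None, so malformed entries can be counted or skipped rather than stopping the run.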
Removing Unwanted Data from the site logs
Unwanted accesses (e.g. spiders and robots) needed to be removed from the logs. This was done by a script that removes any line that
contains the words "cache", "Cache", "spider" or "Spider", or that does not contain the word "Mozilla". This should catch the
majority of spider accesses and the uploads from the main site to the mirror. The files output by this script are in the same format
as the input files. Originally the logs had 2059223 entries; after running the script they had 1333395, a reduction of 725828
entries (about 35%).
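A minimal sketch of this filtering rule follows. It is not the original script: the names BLOCKED_WORDS, is_wanted and filter_log are hypothetical, and the second and third sample lines are invented purely to show lines that would be dropped.

```python
# Words that mark a line as unwanted, as listed in the text above.
BLOCKED_WORDS = ("cache", "Cache", "spider", "Spider")

def is_wanted(line):
    """Keep a line only if it mentions "Mozilla" and none of the blocked words."""
    if "Mozilla" not in line:
        return False
    return not any(word in line for word in BLOCKED_WORDS)

def filter_log(lines):
    """Yield the wanted lines unchanged, so the output keeps the input format."""
    for line in lines:
        if is_wanted(line):
            yield line

sample_lines = [
    # Real entry from the example log: kept.
    'byrne.ecs.soton.ac.uk - - [23/Oct/1999:23:27:00 +0100] '
    '"GET /find/astro-ph HTTP/1.0" 200 3588 "-" '
    '"Mozilla/4.5 [en] (X11; U; SunOS 5.7 sun4u)"',
    # Invented entry with no "Mozilla" in it: dropped.
    'crawler.example.org - - [23/Oct/1999:23:27:05 +0100] '
    '"GET /robots.txt HTTP/1.0" 200 120 "-" "ExampleBot/1.0"',
    # Invented entry that mentions "cache": dropped.
    'proxy.example.org - - [23/Oct/1999:23:27:09 +0100] '
    '"GET / HTTP/1.0" 200 900 "-" "Mozilla/4.0 (compatible; web cache)"',
]
kept = list(filter_log(sample_lines))
print(len(kept))
```

Plain substring tests keep the rule identical to the one described, including its limits: a robot that reports a "Mozilla" client string and avoids the blocked words would still pass.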
Sorting the Logs by User
After the unwanted data had been removed, the logs could be sorted by user. This makes it easier to track individual users, because
each user's accesses are grouped together in time order instead of being interleaved with accesses by other users made at around
the same time.
Sorting Example
Original                     | Sorted
User A gets page 1 at time 1 | User A gets page 1 at time 1
User B gets page 1 at time 2 | User A gets page 2 at time 4
User B gets page 2 at time 3 | User A gets page 3 at time 6
User A gets page 2 at time 4 | User B gets page 1 at time 2
User B gets page 3 at time 5 | User B gets page 2 at time 3
User A gets page 3 at time 6 | User B gets page 3 at time 5
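The sorting example can be reproduced with a short sketch. It assumes each access has already been reduced to a (user, page, time) tuple; the variable names are hypothetical and this is not the script used on the real logs.

```python
# The six accesses from the example, in their original (time) order.
accesses = [
    ("User A", "page 1", 1),
    ("User B", "page 1", 2),
    ("User B", "page 2", 3),
    ("User A", "page 2", 4),
    ("User B", "page 3", 5),
    ("User A", "page 3", 6),
]

# Sort by user first, then by time, so each user's accesses stay in order.
by_user = sorted(accesses, key=lambda access: (access[0], access[2]))

for user, page, time in by_user:
    print(f"{user} gets {page} at time {time}")
```

Sorting on the pair (user, time) gives the grouping shown in the example even if the input were not already in time order.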