The Data File: An Actual Weblog

The weblog itself is simply a text file: many lines of raw data, which together contain all the valuable information we want to sort. The first few lines of the file look as follows:

202.154.114.253 - - [31/Jul/2005:22:46:43 -0400] "GET 
    / HTTP/1.1" 304 - "-" "Mozilla/4.0 (compatible; M
    SIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
202.154.114.253 - - [31/Jul/2005:22:46:43 -0400] "GET 
    /purplehaze.css HTTP/1.1" 304 - "http://www.fried
    space.com/" "Mozilla/4.0 (compatible; MSIE 6.0; W
    indows NT 5.1; SV1; .NET CLR 1.1.4322)"
66.249.71.3 - - [31/Jul/2005:23:23:05 -0400] "GET /ro
    bots.txt HTTP/1.0" 200 26 "-" "Googlebot/2.1 (+ht
    tp://www.google.com/bot.html)"
66.249.71.3 - - [31/Jul/2005:23:23:05 -0400] "GET / H
    TTP/1.0" 200 4423 "-" "Googlebot/2.1 (+http://www
    .google.com/bot.html)"

Note that I have split the actual lines up so that they don't run off the right-hand side of your web browser. In the original file, this data is just four very long lines. C, on the other hand, doesn't mind long lines: we just have to allocate sufficient space and it will handle them just fine.
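To make the point about allocating sufficient space concrete, here is a minimal sketch of reading one long line at a time into a fixed-size buffer. The name read_log_line and the MAX_LINE limit of 1024 characters are assumptions chosen for illustration; they are not taken from the program itself, and a real log may need a larger limit.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LINE 1024   /* assumed upper bound on the length of one log line */

/* Reads one line from fp into buf, which must hold MAX_LINE chars.
 * Returns  1 if a whole line was read (trailing newline removed),
 *          0 at end of file or on a read error,
 *         -1 if the line was too long; the excess is discarded. */
int read_log_line(FILE *fp, char *buf)
{
    size_t len;
    int c;
    int discarded = 0;

    if (fgets(buf, MAX_LINE, fp) == NULL)
        return 0;

    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';    /* the whole line fitted in the buffer */
        return 1;
    }

    /* No newline was stored: either this is the last line of the file,
     * or the line was cut short. Skip ahead to the start of the next line. */
    while ((c = fgetc(fp)) != EOF && c != '\n')
        discarded++;

    return discarded > 0 ? -1 : 1;
}
```

A caller would declare char buf[MAX_LINE] and call read_log_line in a loop until it returns 0, treating a return of -1 as a sign that MAX_LINE should be increased.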

You will see that each line is formatted in precisely the same way. First there is an IP address, then two dashes, then a date and time in square brackets, then an HTTP command in inverted commas along with the HTTP version. Next is an HTTP code (which indicates whether the page was found or not found, and whether it has changed since the last access). This is followed by a number stating the amount of data sent (or a dash when no data was sent), followed by the address, in inverted commas, of the page which referred the visitor to the site, and finally a string identifying the browser (or robot) that was used to view the site.
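Because every line follows this fixed layout, the fields can be pulled out with a single sscanf call. The following is only a sketch under stated assumptions: the LogEntry structure, the parse_log_line function and the buffer sizes are hypothetical names and values of my own, and the format string assumes the request in inverted commas is never empty.

```c
#include <stdio.h>

/* Hypothetical record holding the fields discussed in the text. */
typedef struct {
    char ip[46];     /* visitor's address as text */
    char date[32];   /* e.g. "31/Jul/2005:22:46:43 -0400" */
    int  status;     /* HTTP code, e.g. 200, 304 or 404 */
} LogEntry;

/* Extracts the IP address, date and HTTP code from one log line.
 * Returns 1 if all three fields were found, 0 otherwise. */
int parse_log_line(const char *line, LogEntry *entry)
{
    int found = sscanf(line,
                       "%45s %*s %*s"      /* IP address, then two dashes */
                       " [%31[^]]]"        /* date and time in brackets */
                       " \"%*[^\"]\""      /* HTTP command in inverted commas, skipped */
                       " %d",              /* HTTP code */
                       entry->ip, entry->date, &entry->status);
    return found == 3;
}
```

The fields marked with %* are read but not stored, which is why a successful parse reports exactly three converted values.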

Although one could potentially be interested in any of these fields, the three that will be of interest to the hypothetical webmaster using our program will be the date, IP address and HTTP code.

The IP address usually changes when a different individual visits the site (although the same visitor can sometimes come back later with a different IP address). When the same individual returns, it is nice to see all their accesses together. It is also useful to see all accesses grouped by HTTP return code. If a webmaster wants to see how many pages were requested from the site but not found, then this is a good place to look.

Since the file is already sorted by date, there is no point having our program sort according to this field. We will, however, design it to sort lines of data according to the other two fields just mentioned.
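As a rough illustration of sorting on either field, the sketch below uses the standard library's qsort with one comparison function per field. This is an assumption made for the example, not necessarily the sorting method our program will end up using, and the LogEntry structure, its seq member and the function names are hypothetical. IP addresses are compared as plain text, which groups identical addresses together but does not put them in numeric order.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: the two sort keys plus the line's position in the file. */
typedef struct {
    char   ip[46];   /* visitor's address as text */
    int    status;   /* HTTP code */
    size_t seq;      /* original line number, i.e. date order */
} LogEntry;

/* Falls back to file order so that equal keys stay sorted by date. */
static int compare_seq(const LogEntry *x, const LogEntry *y)
{
    return (x->seq > y->seq) - (x->seq < y->seq);
}

static int compare_by_ip(const void *a, const void *b)
{
    const LogEntry *x = a, *y = b;
    int order = strcmp(x->ip, y->ip);
    return order != 0 ? order : compare_seq(x, y);
}

static int compare_by_status(const void *a, const void *b)
{
    const LogEntry *x = a, *y = b;
    int order = (x->status > y->status) - (x->status < y->status);
    return order != 0 ? order : compare_seq(x, y);
}

/* Sorts by HTTP code when by_status is non-zero, otherwise by IP address. */
void sort_entries(LogEntry *entries, size_t count, int by_status)
{
    qsort(entries, count, sizeof entries[0],
          by_status ? compare_by_status : compare_by_ip);
}
```

qsort makes no promise about the relative order of equal elements, which is why each comparison falls back to seq: accesses from the same IP address, or with the same HTTP code, then remain in the date order they had in the file.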