Clark was developed in conjunction with the University of Michigan Engineering Library, the University of Michigan School of Information and Library Studies, and the Internet Public Library. It may be used free of charge, and you may feel free to modify it for your personal use. However, you may not redistribute it. Please see the licensing agreement for detailed information.
For Unix users: place the file in the same directory as your access log files. Set the file to executable (chmod u+x clark.pl). Next, you need to find where your Perl interpreter is. Type which perl at the command prompt; Unix will spit back a pathname at you (something along the lines of /usr/bin/perl). Use your favorite text editor to open the clark.pl file and change the first line to #!pathname, where pathname is what you determined above.
For non-Unix users: check with your Perl documentation for how to set up a script to run.
Finally, make sure that you have plenty of space available for the output file; the output file will take up approximately half the space of your input file.
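The space check above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the 0.5 ratio is simply the "approximately half" rule of thumb stated above, not a guarantee.

```python
import os
import shutil


def estimated_output_bytes(input_bytes, ratio=0.5):
    """Estimate the transaction log size from the access log size.

    The default ratio is the rough "half the input" figure; treat it
    as an approximation.
    """
    return int(input_bytes * ratio)


def has_room_for_output(input_path, output_dir="."):
    """Return True if output_dir has more free space than the estimate."""
    needed = estimated_output_bytes(os.path.getsize(input_path))
    free = shutil.disk_usage(output_dir).free
    return free > needed
```

For example, has_room_for_output("july01.1995") would compare half the size of that access log against the free space in the current directory.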
compsun4% clark.pl
Input Filename: july01.1995
Output Filename: july01.out
Reading july01.1995
Read 9126 lines of data
Gap size, in minutes [30]: 30
Would you like to Exclude hosts, Only Include certain hosts, or neither (e/i/n)[ n]? n
Processing data -- please be patient
Writing transaction log to july01.out
9126 requests were processed
9126 met the given criteria
874 transactions were logged, using a gap size of 30 minutes
compsun4%

Explanations:
The first two prompts ask you for file names. The Input Filename is the access log file generated by your HTTP server; the Output Filename is where you want the transaction log generated by Clark to be put. Warning: The current version of Clark does not check to see if your input file is valid; it will go along merrily parsing the heck out of garbage if you let it. However, if the output filename already exists, Clark will ask whether it is okay to overwrite it.
Clark will then take a few seconds to read in your access log file.
Next, Clark asks you for the Gap Size. Clark uses the Gap Size to determine when to start a new transaction. For example, if the gap size is set to 30 minutes, then once 30 minutes have elapsed since a request by a particular host, Clark assumes that any new request from that same host is the beginning of a new transaction. Preliminary tests have shown that changing the Gap Size from 30 to 15 minutes can change the number of transactions registered by as much as 10%. The default Gap Size is 30 minutes.
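To make the Gap Size rule concrete, here is a minimal Python sketch of gap-based splitting. It is not Clark's actual code: the function name and the (host, timestamp) input shape are hypothetical, and treating a gap of exactly the Gap Size as the start of a new transaction is an assumption.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_into_transactions(events, gap_minutes=30):
    """Group (host, timestamp) pairs into per-host transactions.

    A new transaction starts when at least gap_minutes have passed
    since the same host's previous request (boundary is an assumption).
    """
    gap = timedelta(minutes=gap_minutes)

    # Collect each host's request times.
    by_host = defaultdict(list)
    for host, when in events:
        by_host[host].append(when)

    transactions = []
    for host, times in by_host.items():
        times.sort()
        current = [times[0]]
        for when in times[1:]:
            if when - current[-1] >= gap:
                # Gap reached: close this transaction, start a new one.
                transactions.append((host, current))
                current = [when]
            else:
                current.append(when)
        transactions.append((host, current))
    return transactions
```

Under this rule, a smaller Gap Size can only keep the transaction count the same or raise it, which is consistent with the count changing when the gap drops from 30 to 15 minutes.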
Clark then asks if you would like to exclude hosts or only include certain hosts. To exclude certain hosts from analysis, enter e then enter the hostname and/or IP address of the machine you wish to exclude from analysis; enter a null line to terminate entry. Clark will thereafter ignore any events from the host(s) you specified. (You would want to exclude hosts if, for example, you have certain machines that are for staff or development use that you do not want to be used in your TLA.) To include only certain hosts in the transaction log, enter i then enter the hostname(s) and/or IP address(es) similarly. Clark will thereafter ignore any events from any hosts that you did not specify. To include all events in the transaction log, enter n or just enter a null line.
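The three choices (e, i, n) amount to a simple filter on each event's host. A rough Python sketch follows; the function name is hypothetical, and exact, case-insensitive matching of hostnames is an assumption, since the matching rules Clark uses are not described here.

```python
def keep_event(host, mode="n", hosts=()):
    """Decide whether an event from host should be analysed.

    mode "e": drop events from the listed hosts.
    mode "i": keep only events from the listed hosts.
    mode "n" (or anything else): keep every event.
    Matching is exact and case-insensitive (an assumption).
    """
    listed = host.lower() in {h.lower() for h in hosts}
    if mode == "e":
        return not listed
    if mode == "i":
        return listed
    return True
```

For instance, keep_event("staff1.example.edu", "e", ["staff1.example.edu"]) is False, so that staff machine (a made-up name) would be left out of the analysis.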
Now Clark will set about processing the transaction log. This can take a long time, and processing time grows faster than the size of your access log. For example, on a Sun Sparc20, an access log of about 10,000 events takes about 40 minutes of processor time, while an access log of about 1,000 events takes only about 2 minutes; actual running time will depend on the load on your processor. Times on other computers will vary; it is recommended that you do not use a personal computer to process access logs of more than 1,000 entries.
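The two timings quoted above can be turned into a rough rule of thumb. The sketch below assumes, purely for illustration, that processor time grows as a power of the number of events; two data points cannot confirm that model, and the figures apply only to the machine mentioned.

```python
import math

# The two data points quoted above: (events, processor minutes).
small_events, small_minutes = 1_000, 2
large_events, large_minutes = 10_000, 40

# If time grows like events**k, then
# k = log(time ratio) / log(size ratio).
k = math.log(large_minutes / small_minutes) / math.log(large_events / small_events)


def estimate_minutes(events):
    """Extrapolate processor minutes from the 1,000-event data point."""
    return small_minutes * (events / small_events) ** k
```

Here k comes out at about 1.3: ten times as many events cost about twenty times as much processor time.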
When Clark finishes, it writes the transaction log to the specified file, then gives some brief summary statistics: how many events were registered, how many events met your criteria, and how many transactions were logged.
*303
gk-east.usps.gov
-
-
12
01/Jul/1995
01/Jul/1995
11:21:02
11:28:06
424
1
"GET / HTTP/1.0" 200 655
"GET /images/rad.logo.gif HTTP/1.0" 200 17592
"GET /images/newmarbledirectory.gif HTTP/1.0" 200 39245
"GET /cgi-bin/sils_imagemap/images/newmarbledirectory.map?221,89 HTTP/1.0" 302 0
"GET /ref/ HTTP/1.0" 200 1304
"GET /images/ipl.logo.small.gif HTTP/1.0" 200 963
"GET /images/refpict.gif HTTP/1.0" 200 53248
"GET /cgi-bin/sils_imagemap/images/refpict1.map?193,22 HTTP/1.0" 302 0
"GET /ref/RR/GEN/ HTTP/1.0" 200 1503
"GET /ref/RR/GEN/Dict-rr.html HTTP/1.0" 200 9667
"GET /ref/RR/GEN/Enc-rr.html HTTP/1.0" 200 3793
"GET /ref/RR/GEN/Atlas-rr.html HTTP/1.0" 200 4070

Explanations:
Line 1: Sequential transaction number, preceded by an asterisk (*)
Line 2: Host name or IP address
Line 3: identd information (- if none)
Line 4: username, if authenticated (- if not)
Line 5: Number of events in transaction
Line 6: Start Date
Line 7: End Date
Line 8: Start Time
Line 9: End Time
Line 10: Transaction length, in seconds
Line 11: Check digit for image loading: 1 if .gif files were requested, 0 if not
Lines 12 - end: Events from that host in the transaction, in chronological order.
A blank line separates transactions.
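If you want to load the transaction log into another program, the layout described above is easy to parse. The Python sketch below assumes exactly that layout (one field per line, blank line between transactions); the function name and dictionary keys are hypothetical and are not part of Clark.

```python
import re


def parse_transactions(text):
    """Parse transaction log text into a list of dictionaries."""
    transactions = []
    # Transactions are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) < 11:
            continue  # incomplete record; skip it
        transactions.append({
            "number": int(lines[0].lstrip("*")),  # Line 1, without the *
            "host": lines[1],
            "ident": lines[2],
            "user": lines[3],
            "event_count": int(lines[4]),
            "start_date": lines[5],
            "end_date": lines[6],
            "start_time": lines[7],
            "end_time": lines[8],
            "seconds": int(lines[9]),
            "images": lines[10] == "1",  # the image-loading digit
            "events": lines[11:],  # Lines 12 - end
        })
    return transactions
```

A quick sanity check on any parsed record is that len(record["events"]) equals record["event_count"].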
So go and do good research!
Copyright 1995 David S. Carter, All rights reserved