CPSC 641: Performance Issues in High Speed Networks

Professor Carey Williamson

January 2009

Assignment 1: Web Site Data Analysis (15 marks)

Due Date: Tuesday, January 27, 2009 (3:00pm)

The purpose of this assignment is to gain experience with data analysis, statistical methods, graph plotting, and interpretation of results. You will explore an empirical data set gleaned from the WWW2007 conference Web site, and use your data analysis skills to visualize and understand several of the structural properties of this data set.

Web Site Data Set

The 16th International World Wide Web Conference (WWW2007) was a major international conference that brought about 1,000 Web researchers and developers to Banff for 5 glorious days in May 2007. The University of Calgary was the host institution for the conference, with Professor Williamson as one of the General Chairs for the event.

The Web site for the WWW2007 conference (http://www2007.org) is hosted in the CPSC department at the University of Calgary. The site remains online as a permanent record of the event in the IW3C2 conference series. Your task in this assignment is to analyze the content hosted on the Web site.

The file www2007data.txt contains the output of the Unix command "ls -lR", run in the home directory of the WWW2007 Web site (/home/projects/www2007). The output shows, for each file and directory, its name, permissions, size, modification date, and so on. Using data analysis tools of your own choosing (e.g., grep, awk, perl, gnuplot, Excel, MATLAB, C, C++, Java, Python), process this empirical data set to answer as many of the following questions as you can.
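As a starting point, the following is a minimal Python sketch of one way to read "ls -lR" output. It assumes the common long-listing layout (permissions, link count, owner, group, size, month, day, time-or-year, name) and that each directory section begins with a header line ending in a colon. The function name parse_ls_lr and the sample lines in SAMPLE are hypothetical and are not taken from www2007data.txt; check the actual file for variations before relying on this.

```python
def parse_ls_lr(lines):
    """Yield (directory, name, size, (month, day, time_or_year)) per regular file.

    Assumes the usual `ls -lR` layout; entries that are not regular files
    (directories, symbolic links, and so on) are skipped.
    """
    current_dir = "."
    for raw in lines:
        line = raw.rstrip("\n")
        # Skip blank separator lines and the per-directory "total N" lines.
        if not line or line.startswith("total "):
            continue
        # Heuristic: a short line ending in ":" is a directory header.
        if line.endswith(":") and len(line.split()) < 9:
            current_dir = line[:-1]
            continue
        # Split into at most 9 fields so file names containing spaces survive.
        fields = line.split(None, 8)
        if len(fields) < 9 or not fields[0].startswith("-"):
            continue  # not a regular file entry
        _perms, _links, _owner, _group, size, month, day, time_or_year, name = fields
        yield current_dir, name, int(size), (month, day, time_or_year)


# Hypothetical sample input, for illustration only.
SAMPLE = """\
.:
total 8
-rw-r--r-- 1 www www 1024 May  8  2007 index.html
drwxr-xr-x 2 www www 4096 May  8  2007 papers

./papers:
total 4
-rw-r--r-- 1 www www 0 Jan 15 10:30 empty.txt
-rw-r--r-- 1 www www 250000 Apr 30  2007 paper 127.pdf
"""

records = list(parse_ls_lr(SAMPLE.splitlines()))
```

Keeping the directory alongside each record makes it straightforward to select files from a particular subdirectory later, and keeping the raw date fields defers the decision about how to interpret them.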

Data Analysis Questions

  1. (1 mark) How many different regular files (not directories) are stored on the site? What is the aggregate size of these files (in bytes)?
  2. (2 marks) What is the largest file on the site? How big is it? How many empty files (0 bytes) are there? What is the smallest non-empty file on the site? How big is it?
  3. (2 marks) What is the mean file size on the site? What is the standard deviation of file size? What is the median file size (50th percentile value)? What is the mode (most frequently occurring value) of the file size distribution?
  4. (2 marks) Plot a graph showing the file size distribution. Make one graph for the empirical probability density function (pdf), and a separate one for the cumulative distribution function (CDF). Use a graph style (e.g., lines, boxes, histogram, scatterplot) and axis scaling (e.g., linear, logarithmic, log-linear, log-log) of your own choosing to convey the distribution effectively. Comment on your observations.
  5. (2 marks) With a bit of effort, you should be able to analyze the file type distribution. On a Unix system, file types can be determined heuristically based on the (optional) suffix in the file name (e.g., foo.html, paper127.pdf, painful.doc). Produce a table showing the top 10 known file types on the site, in sorted order from most prevalent to least prevalent. Within this table, show the number of files of each type, the percentage of files of each type, the number of bytes for each file type, and the percentage of bytes for each file type. If necessary, use a catch-all category "Unknown" for any file types that are not easily discernible from the file name suffix. In the table, add a category "Other" for those files not accounted for among the top 10 file types, so that the percentages in the table sum properly to 100%. Comment on your observations.
  6. (2 marks) Plot a graph showing the file size distribution for the PDF versions of the papers and posters in the conference proceedings (i.e., from the subdirectories ./papers and ./posters). Plot a CDF graph with two lines (one for papers, one for posters). Use a graph style and axis scaling of your own choosing to convey the distributions effectively. Comment on your observations.
  7. (2 marks) With some clever programming effort, you should be able to calculate (or estimate) the age of each file on the Web site (i.e., the number of days since it was last modified). For example, most of the files are about 700 days old. What is the oldest file on the Web site? How old is it? What is the newest file on the Web site? How old is it? What are the mean, median, and mode for the file age distribution?
  8. (2 marks) Plot a CDF graph showing the file age distribution. Use a graph style and axis scaling of your own choosing to convey the distribution effectively. Comment on your observations.
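The quantities asked for above (summary statistics, an empirical CDF, and file ages) can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference solution: the function names are hypothetical, the population standard deviation is used on the assumption that the listing covers the whole site, and file ages are measured against a listing date that you must supply, since the date on which "ls -lR" was run is not stated here. The age helper also assumes the usual "ls -l" convention that recent files show HH:MM in place of a year.

```python
import statistics
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def summarize(values):
    """Return basic summary statistics for a non-empty list of numbers."""
    return {
        "count": len(values),
        "total": sum(values),
        "mean": statistics.mean(values),
        # Population standard deviation (assumption: data covers the whole site).
        "stdev": statistics.pstdev(values),
        "median": statistics.median(values),
        # With several equally common values, Python 3.8+ returns the first seen.
        "mode": statistics.mode(values),
    }


def empirical_cdf(values):
    """Return sorted (x, F(x)) pairs, where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, v in enumerate(ordered, start=1):
        if points and points[-1][0] == v:
            points[-1] = (v, i / n)  # repeated value: keep the highest fraction
        else:
            points.append((v, i / n))
    return points


def file_age_days(month, day, time_or_year, listing_date):
    """Days between an `ls -l` modification date and listing_date."""
    if ":" not in time_or_year:
        modified = date(int(time_or_year), MONTHS[month], int(day))
        return (listing_date - modified).days
    # No year shown: take the latest matching date not after listing_date.
    for year in (listing_date.year, listing_date.year - 1):
        try:
            modified = date(year, MONTHS[month], int(day))
        except ValueError:  # e.g. Feb 29 in a non-leap year
            continue
        if modified <= listing_date:
            return (listing_date - modified).days
    raise ValueError("cannot infer year for %s %s" % (month, day))
```

The (x, F(x)) pairs returned by empirical_cdf can be written to a two-column text file and plotted with any of the tools mentioned earlier; a logarithmic x-axis is often worth trying when values span several orders of magnitude.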

Assignment Submission

When you are finished, please submit your assignment solution in hardcopy form to your instructor, on or before the stated deadline.