CPSC 641: Performance Issues in High Speed Networks

Professor Carey Williamson

Winter 2020

Assignment 2: Network Traffic Analysis (30 marks)

Due Date: Thursday, February 13, 2020 (11:59pm)

The purpose of this assignment is to gain experience with data analysis, statistical methods, graph plotting, and interpretation of results. You will analyze several empirical datasets, applying your data analysis skills to explore and understand some of the structural properties of the data.

Background

One assumption that is often made in analytical modeling work is that network-related events (e.g., user sessions, TCP connections, HTTP requests, packet arrivals) occur according to a so-called "Poisson arrival process". Informally, this means that events occur "randomly" (i.e., at random time instants that are impossible to predict, even if the average arrival rate is known). More formally, it means that the inter-arrival times between events are exponentially distributed and independent. If this is the case, then the counts of the number of events that occur within any chosen fixed-size time interval (e.g., 1 minute, 1 hour, 1 day) should follow the Poisson distribution (i.e., a discrete distribution for which the mean and variance are equal; the histogram for such a distribution usually has a nice humpy shape with a pronounced tail on the right).
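The mean-equals-variance property of the counts can be checked numerically. The following is a minimal sketch, not a prescribed method: it generates a synthetic Poisson arrival process (the rate of 5 events per time unit, the window width of 1, and the helper names `counts_per_interval` and `index_of_dispersion` are all illustrative assumptions), counts the events in fixed-size windows, and computes the ratio of the variance to the mean of those counts, which should be close to 1.

```python
import random
import statistics

def counts_per_interval(arrival_times, width, start=0.0):
    """Count events in consecutive windows [start + k*width, start + (k+1)*width).

    The final, partially covered window is dropped so it does not bias the counts.
    """
    last = max(arrival_times)
    n_full = int((last - start) // width)
    counts = [0] * n_full
    for t in arrival_times:
        k = int((t - start) // width)
        if 0 <= k < n_full:
            counts[k] += 1
    return counts

def index_of_dispersion(counts):
    """Variance-to-mean ratio of the counts; about 1 for Poisson counts."""
    return statistics.variance(counts) / statistics.fmean(counts)

# Synthetic data (assumption): exponential inter-arrival times at rate 5.
rng = random.Random(641)
rate = 5.0
t = 0.0
arrivals = []
for _ in range(20000):
    t += rng.expovariate(rate)
    arrivals.append(t)

counts = counts_per_interval(arrivals, width=1.0)
mean_count = statistics.fmean(counts)
dispersion = index_of_dispersion(counts)
print(f"mean count per window = {mean_count:.2f}, variance/mean = {dispersion:.2f}")
```

A ratio well above 1 suggests bursty arrivals, while a ratio well below 1 suggests arrivals that are more regular than Poisson.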

In a Poisson arrival process, there is a well-defined average rate for the arrivals, such as K events per time unit. As a result, the mean inter-arrival time is 1/K time units between events. Furthermore, the distribution of the inter-arrival times is exponential. The exponential distribution is the only continuous distribution with the "memoryless" property, which makes it easier to analyze mathematically. Recall that an exponential distribution has only positive values (i.e., strictly greater than zero), with no upper limit (i.e., potentially infinite). Nonetheless, the histogram of such a distribution typically shows a lot of small values at or below the mean, and a gracefully declining probability (i.e., exponential decay) of observing values much larger than the mean. In particular, the Coefficient of Variation (CoV, which is the ratio of the standard deviation to the mean) for the exponential distribution is exactly 1.
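These properties can be illustrated on synthetic data. The sketch below is only an illustration under stated assumptions (the rate K = 4 and the thresholds s = 0.2 and t = 0.3 are arbitrary choices, and the helper names are hypothetical): it draws exponential inter-arrival times, then checks that the mean is near 1/K, that the CoV is near 1, and that the memoryless property P(X > s + t | X > s) = P(X > t) holds approximately.

```python
import random
import statistics

def coefficient_of_variation(values):
    """Ratio of the sample standard deviation to the mean."""
    return statistics.stdev(values) / statistics.fmean(values)

def survival(values, x):
    """Empirical estimate of P(X > x)."""
    return sum(1 for v in values if v > x) / len(values)

# Synthetic data (assumption): K = 4 events per time unit.
rng = random.Random(2020)
K = 4.0
iats = [rng.expovariate(K) for _ in range(50000)]

mean_iat = statistics.fmean(iats)       # expected to be near 1/K = 0.25
cov = coefficient_of_variation(iats)    # expected to be near 1

# Memoryless check: having already waited s, the chance of waiting
# a further t should match the chance of waiting t from scratch.
s, t = 0.2, 0.3
conditional = survival(iats, s + t) / survival(iats, s)
unconditional = survival(iats, t)

print(f"mean = {mean_iat:.3f}, CoV = {cov:.3f}")
print(f"P(X > s+t | X > s) = {conditional:.3f}, P(X > t) = {unconditional:.3f}")
```

A CoV near 1 is necessary but not sufficient for exponentiality, so it is best treated as a quick screening statistic rather than a test.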

Your Task

Your task in this assignment is to analyze some empirical network traffic datasets and determine which (if any) are consistent with a Poisson arrival process. Note that checking for this property involves two separate tasks: (1) checking if the inter-arrival times are exponential; and (2) checking if the inter-arrival times are independent. The first task (exponentiality) can be done in a variety of ways (e.g., statistical methods, graphical methods, goodness-of-fit tests, QQ plots, Anderson-Darling, KS test, Chi-Square test), as you deem appropriate. The second task (independence) can also be done in several ways (e.g., statistical, graphical, Pearson correlation coefficient, autocorrelation, spectral analysis), but technically isn't required if the exponentiality test has already failed. (See Appendix A of the 1994 ACM SIGCOMM paper by Paxson and Floyd on "The Failure of Poisson Modeling" for a detailed discussion of these tests.)
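As one possible starting point for the two checks, here is a minimal sketch using only the Python standard library. It is not the required approach, and the function names and synthetic inputs are assumptions made for illustration. The first function computes the KS distance between the empirical CDF of the inter-arrival times and an exponential CDF whose rate is estimated as 1/mean; the second computes the lag-k sample autocorrelation. Rough large-sample yardsticks are 1.36/sqrt(n) for the KS distance at the 5% level (this value assumes a fully specified distribution, so it is conservative when the rate is estimated from the same data) and about 2/sqrt(n) for the autocorrelation of independent data.

```python
import math
import random
import statistics

def ks_distance_exponential(iats):
    """Largest gap between the empirical CDF and an exponential CDF
    whose rate is estimated from the data as 1 / mean."""
    xs = sorted(iats)
    n = len(xs)
    rate = 1.0 / statistics.fmean(xs)
    d = 0.0
    for i, x in enumerate(xs):
        model = 1.0 - math.exp(-rate * x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - model, model - i / n)
    return d

def autocorrelation(values, lag=1):
    """Sample autocorrelation of the series at the given lag."""
    n = len(values)
    mean = statistics.fmean(values)
    denom = sum((v - mean) ** 2 for v in values)
    numer = sum((values[i] - mean) * (values[i + lag] - mean)
                for i in range(n - lag))
    return numer / denom

# Synthetic inputs (assumptions, for illustration only).
rng = random.Random(1994)
n = 5000
exp_iats = [rng.expovariate(2.0) for _ in range(n)]      # exponential, independent
uniform_iats = [rng.random() for _ in range(n)]          # independent, not exponential
smoothed_iats = [exp_iats[i] + exp_iats[i + 1]           # correlated by construction
                 for i in range(n - 1)]

d_exp = ks_distance_exponential(exp_iats)
d_uniform = ks_distance_exponential(uniform_iats)
r1_exp = autocorrelation(exp_iats, lag=1)
r1_smoothed = autocorrelation(smoothed_iats, lag=1)

print(f"KS distance: exponential = {d_exp:.4f}, uniform = {d_uniform:.4f}")
print(f"lag-1 autocorrelation: independent = {r1_exp:.4f}, smoothed = {r1_smoothed:.4f}")
```

Statistical packages such as SciPy provide ready-made goodness-of-fit tests that report p-values; the hand-rolled version here is only meant to make the mechanics visible.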

Please process the following six empirical datasets to determine which (if any) exhibit Poisson-like structure in their network arrival process:

  1. papers: paper submissions to an ACM journal (N=484)
  2. logouts: students logging out of D2L on the evening of March 1, 2017 (N=1,192)
  3. emails: emails received regarding the ACM SOSP 2019 conference (N=1,781)
  4. logins: user logins to csx.cpsc.ucalgary.ca (so far) in January 2020 (N=3,696)
  5. ones: HTTP requests to IP address 111.111.111.111 on April 1, 2019 (N=4,673)
  6. packets: the timestamps of Ethernet frames on a 10 Mbps LAN in October 1989 (N=1,000,000)

Note that these datasets are all different sizes, and in different formats, just to give you some extra practice in your data analysis skills. Also note that some datasets may have a few imperfections or anomalies within them, so please watch out for these, and find a reasonable way to handle them. If you have any questions about the data formats, or need tips on how to process them, just let me know. Have fun!

Data Analysis Tasks

For each of the empirical datasets, conduct the following analysis steps to help answer the questions indicated:

Produce a table to summarize your results. Use one row for each dataset, and use the columns to summarize the main features of each dataset (e.g., number of data points, duration, min/median/max inter-arrival time (IAT), mean and standard deviation of IAT, CoV, exponentiality, independence, and whether it is a Poisson arrival process or not). See below for a crude example of a suggested table format.

File  NumObs  Duration  Min    Median  Max    Mean   StdDev  CoV   Exponential?  Independent?  Poisson?
foo1  120     3.2 hrs   2.6    8.0     75.4   20.4   12.5    0.6   No            N/A           No
foo2  500     7.1 yrs   106    819     6475   436    450     1.03  Yes           Yes           Yes
foo3  1,200   60 min    0.002  0.032   0.124  0.05   0.05    1.0   Yes           No            No
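The numeric columns of such a table can be computed directly from a list of event timestamps. The sketch below is one hedged way to do it: the function name `summarize`, the dictionary keys, and the tiny demo input are assumptions for illustration, and it assumes the timestamps have already been parsed into numbers in a common time unit. Sorting the timestamps is also an assumption about how to treat out-of-order records; other anomalies need case-by-case judgment.

```python
import statistics

def summarize(name, timestamps):
    """Build the numeric part of one results-table row from event timestamps."""
    ts = sorted(timestamps)  # assumption: out-of-order records are simply re-sorted
    iats = [later - earlier for earlier, later in zip(ts, ts[1:])]
    mean = statistics.fmean(iats)
    std = statistics.stdev(iats)  # sample standard deviation
    return {
        "File": name,
        "NumObs": len(ts),
        "Duration": ts[-1] - ts[0],
        "Min": min(iats),
        "Median": statistics.median(iats),
        "Max": max(iats),
        "Mean": mean,
        "StdDev": std,
        "CoV": std / mean,
    }

# Tiny made-up input, used only to show the output shape.
row = summarize("demo", [0.0, 1.0, 3.0, 6.0, 10.0])
for key, value in row.items():
    print(f"{key:>8}: {value}")
```

The Exponential?, Independent?, and Poisson? columns are left out on purpose, since they depend on whichever tests and thresholds you choose.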

Optional Bonus (3 marks)

Augment your results table with one additional empirical dataset of your own personal choice. Make sure that it has at least 1,000 data points, but not more than 10,000. Say what the dataset is, and how it was collected. Then complete your results table with your observations about this empirical dataset. State whether it follows a Poisson arrival process or not, and give some (brief) logical explanation as to why or why not.

Assignment Submission

When you are finished, please submit your assignment solution in hardcopy form to your instructor, on or before the stated deadline. Please include your summary table showing results for all six datasets, and any relevant parts of your writeup. However, to save paper, you only need to include the pdf/CDF graphs for two of your six datasets, with one of them being a good example of a Poisson arrival process, and one not. Thus you should make sure that you find at least one example of each type among the six datasets above. If you do the bonus, please include those pdf/CDF graphs as well. Thanks!