Assignment 2: Network Traffic Analysis (30 marks)
Due Date: Thursday, February 13, 2020 (11:59pm)
The purpose of this assignment is to gain experience with data analysis, statistical methods, graph plotting, and interpretation of results. You will analyze several empirical datasets, applying your data analysis skills to explore and understand some of the structural properties of the data.
Background
One assumption that is often made in analytical modeling work is that network-related events (e.g., user sessions, TCP connections, HTTP requests, packet arrivals) occur according to a so-called "Poisson arrival process". Informally, this means that events occur "randomly" (i.e., at random time instants that are impossible to predict, even if the average arrival rate is known). More formally, it means that the inter-arrival times between events are exponentially distributed and independent. If this is the case, then the number of events that occur within any chosen fixed-size time interval (e.g., 1 minute, 1 hour, 1 day) should follow the Poisson distribution (i.e., a discrete distribution for which the mean and variance are equal; the histogram for such a distribution usually has a single hump with a pronounced tail on the right).
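As a rough illustration of the "mean and variance are equal" property, the sketch below simulates a Poisson arrival process, counts events in fixed-size windows, and reports the variance-to-mean ratio of the counts (often called the index of dispersion), which should be close to 1. The rate, window size, seed, and function names are all assumptions chosen for the example and are not part of the assignment.

```python
import random
import statistics

def counts_per_window(timestamps, window):
    """Count events in consecutive full windows, starting at the first timestamp."""
    start = timestamps[0]
    n_windows = int((timestamps[-1] - start) // window)  # ignore the partial last window
    counts = [0] * n_windows
    for t in timestamps:
        idx = int((t - start) // window)
        if idx < n_windows:
            counts[idx] += 1
    return counts

def dispersion_index(counts):
    """Variance-to-mean ratio of the counts; about 1 for Poisson-distributed counts."""
    return statistics.pvariance(counts) / statistics.mean(counts)

# Simulated arrivals: independent exponential gaps with an assumed rate of 5 per time unit.
rng = random.Random(1)
rate = 5.0
t, timestamps = 0.0, []
for _ in range(50_000):
    t += rng.expovariate(rate)
    timestamps.append(t)

counts = counts_per_window(timestamps, window=1.0)
print("mean count per window:", round(statistics.mean(counts), 2))
print("variance / mean:", round(dispersion_index(counts), 2))
```

For real traces, a ratio far above 1 would suggest bursty arrivals, although this check alone does not establish or rule out a Poisson process.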
In a Poisson arrival process, there is a well-defined average rate for the arrivals, such as K events per time unit. As a result, the mean inter-arrival time is 1/K time units between events. Furthermore, the distribution of the inter-arrival times is exponential. The exponential distribution is the only continuous distribution with the "memoryless" property, which makes it easier to analyze mathematically. Recall that an exponential distribution has only positive values (i.e., strictly greater than zero), with no upper limit (i.e., potentially infinite). Nonetheless, the histogram of such a distribution typically shows a lot of small values at or below the mean, and a gracefully declining probability (i.e., exponential decay) of observing values much larger than the mean. In particular, the Coefficient of Variation (CoV, which is the ratio of the standard deviation to the mean) for the exponential distribution is exactly 1.
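The properties above can be checked numerically. The following sketch draws exponential samples with an assumed rate K = 4 and confirms that the sample mean is close to 1/K, that the CoV is close to 1, and that roughly 63% of the values (1 - 1/e) fall at or below the mean. The sample size and seed are arbitrary choices for illustration.

```python
import math
import random
import statistics

def coefficient_of_variation(values):
    """Ratio of the standard deviation to the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

K = 4.0  # assumed arrival rate (events per time unit)
rng = random.Random(42)
samples = [rng.expovariate(K) for _ in range(50_000)]

mean_iat = statistics.mean(samples)
cov = coefficient_of_variation(samples)
frac_below_mean = sum(x <= mean_iat for x in samples) / len(samples)

print("mean iat:", round(mean_iat, 4), "(expected about", 1 / K, ")")
print("CoV:", round(cov, 3), "(expected about 1)")
print("fraction at or below mean:", round(frac_below_mean, 3),
      "(expected about", round(1 - math.exp(-1), 3), ")")
```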
Your Task
Your task in this assignment is to analyze some empirical network traffic datasets and determine which (if any) are consistent with a Poisson arrival process. Note that checking for this property involves two separate tasks: (1) checking if the inter-arrival times are exponential; and (2) checking if the inter-arrival times are independent. The first task (exponentiality) can be done in a variety of ways (e.g., summary statistics, graphical methods such as QQ plots, or goodness-of-fit tests such as Anderson-Darling, Kolmogorov-Smirnov (KS), or Chi-Square), as you deem appropriate. The second task (independence) can also be done in several ways (e.g., Pearson correlation coefficient, autocorrelation, spectral analysis, or graphical methods), but technically isn't required if the exponentiality test has already failed. (See Appendix A of the 1994 ACM SIGCOMM paper by Paxson and Floyd, "Wide-Area Traffic: The Failure of Poisson Modeling", for a detailed discussion of these tests.)
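One possible way to carry out the two checks, shown only as a sketch and not as the required method, is a KS-style distance against an exponential CDF with the same mean, plus a lag-1 autocorrelation of the inter-arrival times. The function names, sample data, and seed are hypothetical. The thresholds used here (1.36/sqrt(n) for the KS distance at roughly the 5% level, and 1.96/sqrt(n) for the autocorrelation) are the usual large-sample approximations; note that the KS threshold is conservative when the mean is estimated from the same data.

```python
import math
import random

def ks_distance_exponential(iats):
    """Largest gap between the empirical CDF and an exponential CDF with the same mean."""
    xs = sorted(iats)
    n = len(xs)
    mean = sum(xs) / n
    d = 0.0
    for i, x in enumerate(xs):
        f = 1.0 - math.exp(-x / mean)  # fitted exponential CDF at x
        d = max(d, f - i / n, (i + 1) / n - f)
    return d

def lag1_autocorrelation(iats):
    """Sample correlation between each inter-arrival time and the next one."""
    n = len(iats)
    mean = sum(iats) / n
    var = sum((x - mean) ** 2 for x in iats)
    cov = sum((iats[i] - mean) * (iats[i + 1] - mean) for i in range(n - 1))
    return cov / var

# Synthetic data for illustration only: one exponential sample, one uniform sample.
rng = random.Random(7)
n = 2000
exp_iats = [rng.expovariate(2.0) for _ in range(n)]
unif_iats = [rng.uniform(0.0, 1.0) for _ in range(n)]

ks_threshold = 1.36 / math.sqrt(n)
acf_threshold = 1.96 / math.sqrt(n)
for name, data in (("exponential", exp_iats), ("uniform", unif_iats)):
    d = ks_distance_exponential(data)
    r1 = lag1_autocorrelation(data)
    print(f"{name}: D={d:.3f} (threshold {ks_threshold:.3f}), "
          f"lag-1 r={r1:.3f} (threshold +/-{acf_threshold:.3f})")
```

A single lag is only a partial check of independence; examining several lags, as discussed in the Paxson and Floyd appendix, gives a fuller picture.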
Please process the following six empirical datasets to determine which (if any) exhibit Poisson-like structure in their network arrival process:
- papers: paper submissions to an ACM journal (N=484)
- logouts: students logging out of D2L on the evening of March 1, 2017 (N=1,192)
- emails: emails received regarding the ACM SOSP 2019 conference (N=1,781)
- logins: user logins to csx.cpsc.ucalgary.ca (so far) in January 2020 (N=3,696)
- ones: HTTP requests to IP address 111.111.111.111 on April 1, 2019 (N=4,673)
- packets: the timestamps of Ethernet frames on a 10 Mbps LAN in October 1989 (N=1,000,000)
Note that these datasets are all different sizes, and in different formats, just to give you some extra practice in your data analysis skills. Also note that some datasets may have a few imperfections or anomalies within them, so please watch out for these, and find a reasonable way to handle them. If you have any questions about the data formats, or need tips on how to process them, just let me know. Have fun!
Data Analysis Tasks
For each of the empirical datasets, conduct the following analysis steps to help answer the questions indicated:
- Meta-Data Processing: (1 mark) Familiarize yourself with the dataset, and record basic information. What is the name of the file? What is the format (e.g., number of columns, what they mean, time unit, continuous or discrete, sorted or not) of the data? How many data points are there? What is the total time duration represented in the trace? What is the average event arrival rate? Determine what tools/techniques you are going to need to process this data, and obtain/build them as needed. Then convert the data into inter-arrival time (iat) form for analysis.
- Statistical Analysis: (1 mark) Compute and record some basic summary statistics about the iat data. What is the numerical range (e.g., min and max) of the data? What is a typical observed value (e.g., median or mode) for the data? What is the mean iat? What is the standard deviation? What is the CoV? Is it close to 1?
- Graphical Analysis: (1 mark) Calculate the empirical probability mass function (pmf) or probability density function (pdf) for the iat data, and draw a graph (i.e., histogram) to illustrate this distribution. Also generate and plot a Cumulative Distribution Function (CDF) of the inter-arrival time distribution. Comment on your observations.
- Testing: (2 marks) Test the inter-arrival time data in some way to see if it is consistent with an exponential distribution that has the same mean as the observed inter-arrival times. If necessary, test the independence of the empirical inter-arrival times as well. Finally, state whether the dataset is consistent with a Poisson arrival process or not. Comment on your observations.
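The first three steps above can be sketched as follows, assuming the timestamps have already been parsed into a list of numbers in a single time unit. The function names and the toy timestamps are hypothetical, and sorting the timestamps is just one possible way to handle out-of-order records; duplicate timestamps would still produce zero-length inter-arrival times that you may want to inspect.

```python
import statistics

def to_iats(timestamps):
    """Convert event timestamps into inter-arrival times (sorted first, as an assumption)."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]

def summarize(iats):
    """Basic summary statistics for the results table."""
    mean = statistics.mean(iats)
    stdev = statistics.stdev(iats)
    return {
        "n": len(iats), "min": min(iats), "median": statistics.median(iats),
        "max": max(iats), "mean": mean, "stdev": stdev, "cov": stdev / mean,
    }

def histogram(iats, n_bins):
    """Empirical pdf as (bin_start, density) pairs using equal-width bins."""
    lo, hi = min(iats), max(iats)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all values are equal
    counts = [0] * n_bins
    for x in iats:
        idx = min(int((x - lo) / width), n_bins - 1)  # put the max value in the last bin
        counts[idx] += 1
    n = len(iats)
    return [(lo + i * width, c / (n * width)) for i, c in enumerate(counts)]

def ecdf(iats):
    """Empirical CDF as (value, cumulative fraction) pairs."""
    xs = sorted(iats)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

# Toy timestamps for illustration only.
timestamps = [0.0, 1.0, 3.0, 6.0, 10.0]
iats = to_iats(timestamps)
duration = max(timestamps) - min(timestamps)
print("iats:", iats)
print("arrival rate:", len(iats) / duration)
print("summary:", summarize(iats))
print("pdf bins:", histogram(iats, n_bins=3))
print("cdf points:", ecdf(iats))
```

The (x, y) pairs returned by `histogram` and `ecdf` can be passed to whatever plotting tool you prefer.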
Produce a table to summarize your results. Use one row for each dataset, and use the columns to summarize the main features of each dataset (e.g., number of data points, duration, min/median/max iat, mean and standard deviation of iat, CoV, exponentiality, independence, and whether it is a Poisson arrival process or not). See below for a crude example of a suggested table format.
File | NumObs | Duration | Min | Median | Max | Mean | StdDev | CoV | Exponential? | Independent? | Poisson? |
---|---|---|---|---|---|---|---|---|---|---|---|
foo1 | 120 | 3.2 hrs | 2.6 | 8.0 | 75.4 | 20.4 | 12.5 | 0.6 | No | N/A | No |
foo2 | 500 | 7.1 yrs | 106 | 819 | 6475 | 436 | 450 | 1.03 | Yes | Yes | Yes |
foo3 | 1,200 | 60 min | 0.002 | 0.032 | 0.124 | 0.05 | 0.05 | 1.0 | Yes | No | No |
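If you generate the table programmatically, a small helper along these lines may be useful. It is only a sketch: the function names and the dictionary keys are hypothetical, and the logic simply mirrors the example rows above (independence is reported as N/A when exponentiality fails, and a dataset is labelled Poisson only when both checks pass).

```python
def verdict(exponential, independent=None):
    """Return the (Exponential?, Independent?, Poisson?) cells for one dataset."""
    if not exponential:
        return ("No", "N/A", "No")
    if independent is None:
        raise ValueError("independence must be tested when exponentiality holds")
    answer = "Yes" if independent else "No"
    return ("Yes", answer, answer)

def format_row(name, n_obs, duration, stats, exponential, independent=None):
    """Format one table row; stats holds min, median, max, mean, and stdev of the iats."""
    cells = [name, f"{n_obs:,}", duration]
    cells += [f"{stats[k]:g}" for k in ("min", "median", "max", "mean", "stdev")]
    cells.append(f"{stats['stdev'] / stats['mean']:.2f}")  # CoV
    cells += list(verdict(exponential, independent))
    return " | ".join(cells) + " |"

# Reproduces the foo3 example row (values taken from the table above).
foo3 = {"min": 0.002, "median": 0.032, "max": 0.124, "mean": 0.05, "stdev": 0.05}
print(format_row("foo3", 1200, "60 min", foo3, exponential=True, independent=False))
```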
Optional Bonus (3 marks)
Augment your results table with one additional empirical dataset of your own personal choice. Make sure that it has at least 1,000 data points, but not more than 10,000. Say what the dataset is, and how it was collected. Then complete your results table with your observations about this empirical dataset. State whether it follows a Poisson arrival process or not, and give some (brief) logical explanation as to why or why not.
Assignment Submission
When you are finished, please submit your assignment solution in hardcopy form to your instructor, on or before the stated deadline. Please include your summary table showing results for all six datasets, and any relevant parts of your writeup. However, to save paper, you only need to include the pdf/CDF graphs for two of your six datasets, with one of them being a good example of a Poisson arrival process, and one not. Thus you should make sure that you find at least one example of each type among the six datasets above. If you do the bonus, please include those pdf/CDF graphs as well. Thanks!