CPSC 641: Performance Issues in High Speed Networks

Professor Carey Williamson

Winter 2022

Assignment 1: Spotify Traffic Analysis (30 marks)

Due Date: Thursday, January 27, 2022 (4:00pm)

The purpose of this assignment is to gain experience with data analysis, statistical methods, graph plotting, and interpretation of results. You will analyze some empirical network traffic datasets, applying your data analysis skills to explore and understand some of the structural properties of the data.

Background

Spotify is a popular music streaming service that is heavily used by students, faculty, and staff at the University of Calgary. Users with accounts on Spotify can browse a catalog of over 50 million tracks, including songs, audiobooks, and podcasts. Media content is transferred using well-known Internet protocols, such as secure HTTP (HTTPS) and TCP/IP. Some of this content comes from Spotify's own servers, while the rest comes from Content Delivery Network (CDN) nodes operated by providers such as Akamai and Fastly.

Your Task

Your task in this assignment is to analyze empirical network traffic data to identify structural characteristics in the data. The data provided to you is connection-level data, with each row of the log summarizing the network activity from one TCP connection. The data format includes timestamp, client IP and port, server IP and port, protocol, duration, connection state, packets sent, bytes sent, packets received, and bytes received.
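
As a rough illustration of how one row of the log could be turned into a typed record, here is a minimal Python sketch. It assumes whitespace-separated fields in exactly the order listed above, with IP address and port in separate columns (12 fields in total); the real files may differ, so check one by eye first. The names `Conn` and `parse_row` are made up for this example.

```python
from typing import NamedTuple, Optional


class Conn(NamedTuple):
    """One TCP connection record (field order is an assumption)."""
    ts: float          # start time, epoch seconds
    client_ip: str
    client_port: int
    server_ip: str
    server_port: int
    proto: str
    duration: float    # seconds
    state: str
    pkts_sent: int
    bytes_sent: int
    pkts_recv: int
    bytes_recv: int


def parse_row(line: str) -> Optional[Conn]:
    """Return a Conn, or None if the row is malformed or incomplete."""
    f = line.split()
    if len(f) != 12:
        return None
    try:
        return Conn(float(f[0]), f[1], int(f[2]), f[3], int(f[4]), f[5],
                    float(f[6]), f[7], int(f[8]), int(f[9]),
                    int(f[10]), int(f[11]))
    except ValueError:
        # Non-numeric placeholder (for example "-") in a numeric column.
        return None
```

Returning `None` for bad rows lets the caller count and report how many records were skipped, rather than silently dropping them.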

The data provided to you represents all known Spotify traffic on the campus network on a single day, with about one million TCP connections. There are 24 one-hour gzipped data files available for you in the data directory, with one for each hour of the day. Note that some datasets may have some imperfections, so please watch out for these, and find a reasonable way to handle them. If you have any questions about the data, just let me know.

Please start small, by downloading one of the files, extracting it, studying its format, and then developing and testing your data analysis tools. Once you think your tools are working correctly, you can grab the rest of the data and start doing the full analysis. However, please watch out for any new problems that emerge at this scale. In the worst case, you might have to analyze one data file at a time, and then stitch your intermediate results together for the full empirical dataset.
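
One possible way to analyze one file at a time and then stitch the results together is sketched below in Python. It streams each gzipped file line by line (so nothing is extracted to disk or held fully in memory), builds a per-file count of connections per hour, and merges the partial counts. It assumes the timestamp is the first whitespace-separated field; the function names are hypothetical.

```python
import gzip
import math
from collections import Counter
from typing import Iterable, Iterator


def read_timestamps(path: str) -> Iterator[float]:
    """Yield the first-column timestamp of each usable row in a .gz file."""
    with gzip.open(path, "rt", errors="replace") as fh:
        for line in fh:
            fields = line.split()
            if not fields:
                continue              # blank line
            try:
                ts = float(fields[0])
            except ValueError:
                continue              # header, comment, or damaged row
            if math.isfinite(ts):
                yield ts


def count_per_hour(path: str) -> Counter:
    """Partial result for one file: hour index (epoch // 3600) -> count."""
    counts = Counter()
    for ts in read_timestamps(path):
        counts[int(ts // 3600)] += 1
    return counts


def merge_counts(partials: Iterable[Counter]) -> Counter:
    """Stitch per-file partial results into one overall result."""
    total = Counter()
    for part in partials:
        total.update(part)            # adds counts key by key
    return total
```

The same pattern (a small per-file summary, merged at the end) also works for per-client or per-server totals, since those are sums as well. It does not work directly for medians, which need all of the values or an approximation.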

Data Analysis Tasks

Using the (full) empirical dataset, do the following data analysis steps:

  1. Arrival Time Analysis (5 marks)
    Focus on the leftmost column of the data, which represents the time (in seconds, using Linux epoch timestamp format) at which each TCP connection started. After sorting this data into monotonically increasing order, run an analysis to count the number of TCP connections that arrive in each one-hour interval of the day. Draw a graph of this arrival pattern. Repeat your analysis at the one-minute granularity, and draw the corresponding graph. Comment on your results and observations.
  2. Connection Duration Analysis (5 marks)
    Focus on the column of the data that represents the duration (in seconds) for each TCP connection recorded in the logs. Calculate some basic statistics about the durations, such as minimum, median, mean, and maximum values, and perhaps the standard deviation as well. Do your best to draw a reasonable graph of the empirical pdf (probability density function) and CDF (cumulative distribution function) of this data. Comment on your results and observations.
  3. Transfer Size Analysis (5 marks)
    Focus on the column of the data that represents the number of bytes sent on each TCP connection. Calculate some basic statistics about the bytes sent, such as minimum, median, mean, and maximum values, and perhaps the standard deviation as well. Do your best to draw a reasonable graph of the empirical pdf and CDF of this data. Repeat the analysis for the number of bytes received on each TCP connection, adding these lines onto the foregoing graphs if you can. Comment on your results and observations.
  4. Client Analysis (5 marks)
    Focus on the column of the data that shows the client IP address for each TCP connection. Calculate the number of connections initiated by each client IP address. Draw a frequency-rank profile graph that shows the number of connections on the vertical axis, and the relative rank of clients (from most connections to fewest) on the horizontal axis. Repeat this type of analysis for the total number of bytes exchanged by each client (i.e., sent and received). Comment on your results and observations.
  5. Server Analysis (5 marks)
    Focus on the column of the data that shows the server IP address for each TCP connection. Calculate the number of connections involving each server IP address. Draw a frequency-rank profile graph that shows the number of connections on the vertical axis, and the relative rank of servers (from most connections to fewest) on the horizontal axis. Repeat this type of analysis for the total number of bytes exchanged by each server (i.e., sent and received). Comment on your results and observations.
  6. Your Own Analysis (5 marks)
    Choose any other interesting aspect of this empirical dataset (i.e., involving one or more columns, and one or more rows of data) not mentioned above, and analyze it to explore its structural patterns or trends. Produce a graph or table highlighting your data analysis results. Comment on your results and observations.
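
The building blocks behind steps 1 to 5 can be sketched in a few lines of Python. This is only one way to do it, using the standard library and assuming the relevant column has already been extracted into a list of numbers or strings; all function names here are hypothetical, and plotting is left to whatever tool you prefer.

```python
import statistics
from collections import Counter


def bin_counts(timestamps, width):
    """Arrivals per bin of `width` seconds (3600 = hourly, 60 = per minute).

    Returns sorted (bin_index, count) pairs; empty bins are omitted.
    """
    counts = Counter(int(t // width) for t in timestamps)
    return sorted(counts.items())


def summary(values):
    """Basic statistics (population standard deviation)."""
    return {
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.fmean(values),
        "max": max(values),
        "stdev": statistics.pstdev(values),
    }


def empirical_cdf(values):
    """Return (x, F(x)) points, where F(x) = fraction of samples <= x."""
    xs = sorted(values)
    n = len(xs)
    points = []
    for i, x in enumerate(xs, start=1):
        if i == n or xs[i] != x:      # emit only at the last copy of x
            points.append((x, i / n))
    return points


def frequency_rank(keys, weights=None):
    """Return (rank, total) pairs, rank 1 being the largest total.

    With weights=None each key occurrence counts as 1 (connections);
    otherwise weights[i] is added for keys[i] (for example, bytes).
    """
    totals = Counter()
    if weights is None:
        totals.update(keys)
    else:
        for key, w in zip(keys, weights):
            totals[key] += w
    ranked = sorted(totals.values(), reverse=True)
    return list(enumerate(ranked, start=1))
```

For example, `frequency_rank(client_ips)` gives the connection-count profile for step 4, and `frequency_rank(client_ips, total_bytes)` gives the byte-volume profile, where `client_ips` and `total_bytes` are parallel lists you have built from the log. Durations and byte counts in traffic data often span several orders of magnitude, so logarithmic axes are worth trying when you plot these.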

Optional Bonus (2 marks)

Use a packet sniffer, such as Wireshark, to collect a short packet trace (say, at most 5 minutes) of your own usage of Spotify from a laptop or mobile device. Calculate statistics or draw a graph to show one or two things that you can learn at the packet level that are difficult or impossible to discern using connection-level data like that given for this assignment. (You do not need to hand in your packet trace, but please keep it handy in case I ask to see it later.)
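
As one hedged example of a packet-level view, the Python sketch below computes packet inter-arrival times and bytes transferred per second. Neither can be recovered from a single connection summary row. It assumes you have already exported your trace into a list of (timestamp in seconds, packet length in bytes) pairs; the export step and both function names are assumptions of this example, not part of the assignment.

```python
from collections import Counter


def interarrival_times(timestamps):
    """Gaps (in seconds) between consecutive packets, in time order."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def bytes_per_second(packets):
    """Total bytes seen in each one-second interval.

    `packets` is an iterable of (timestamp, length) pairs with
    non-negative timestamps; seconds with no packets are omitted.
    """
    totals = Counter()
    for ts, length in packets:
        totals[int(ts)] += length
    return sorted(totals.items())
```

Plotting the per-second byte totals over time is one way to see whether the audio arrives in bursts or as a steady stream.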

Assignment Submission

When you are finished, please submit your assignment solution in hardcopy form to your instructor, on or before the stated deadline. If you do the bonus, please include those results as well (but not the packet trace). Thanks!