General Info

CPSC 599.44: Machine Learning - Assignments

The assignments for this course center on learning from a publicly available real-world data collection, namely the National Collision Database of Canada. The page offers the data from 1999 to 2014 and, separately, just the incidents of 2014. For the assignments, the 1999 to 2014 data is to be used (a file of 369.2 MB). The data is not really bilingual (as claimed), since it uses only numbers and made-up identifiers, which are explained in the data dictionary. Reading the comments (which include answers by Transport Canada) is also useful, and reveals some of the challenges of dealing with data that we did not collect ourselves. There are the following three assignments:

  • Each team has to write a midterm report in which they sketch what basic learning method they want to use on the data and what additional knowledge they want to build into this basic method.
  • Each team then has to implement the system described in its midterm report.
  • Each team member has to write an individual report on the system and the experiments the team and the student performed with it.

How the three assignments are weighted within the course grade is explained on the assessment page.

Midterm report on intended learning system

The goal of the learning system to be developed is to learn rules that describe (some of) the data. These rules do not need to be 100% accurate, and part of the experiments a team performs is to look at different levels of accuracy.
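One way to make "levels of accuracy" concrete is sketched below. It is only an illustration under assumptions: a rule is taken to be an if-then pair of attribute conditions, accuracy is taken to be the fraction of records matching the "if" part that also match the "then" part, and the attribute names and values are made up rather than taken from the data dictionary.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
# Assumed rule shape: (antecedent, consequent), each mapping attribute -> required value.
Rule = Tuple[Dict[str, str], Dict[str, str]]


def matches(record: Record, conditions: Dict[str, str]) -> bool:
    """True if the record has the required value for every listed attribute."""
    return all(record.get(attr) == value for attr, value in conditions.items())


def rule_accuracy(rule: Rule, records: List[Record]) -> float:
    """Fraction of records covered by the antecedent that also satisfy the consequent."""
    antecedent, consequent = rule
    covered = [r for r in records if matches(r, antecedent)]
    if not covered:
        return 0.0  # a rule that covers nothing is treated as useless here
    correct = sum(1 for r in covered if matches(r, consequent))
    return correct / len(covered)


def keep_accurate(rules: List[Rule], records: List[Record], min_accuracy: float) -> List[Rule]:
    """Keep only the rules that reach the requested accuracy level."""
    return [rule for rule in rules if rule_accuracy(rule, records) >= min_accuracy]


# Toy records with hypothetical attribute names and values.
records = [
    {"weather": "snow", "surface": "icy", "severity": "injury"},
    {"weather": "snow", "surface": "icy", "severity": "injury"},
    {"weather": "snow", "surface": "icy", "severity": "none"},
    {"weather": "clear", "surface": "dry", "severity": "none"},
]
rule = ({"weather": "snow"}, {"severity": "injury"})
```

Running `keep_accurate` with several values of `min_accuracy` then gives one rule set per accuracy level, which can be compared in the experiments.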

As stated in class, there are many methods that learn rules directly from data. In addition, most methods for learning other data structures can also be used to create rules, since these data structures can be transformed into rules. The first goal of the midterm report is to state which learning method the team has chosen and the reasoning behind this particular choice.

While many machine learning techniques are general enough to be applied to nearly all data, the learned knowledge is often not very interesting. People usually have additional knowledge about the main domain of the data (and about other domains touched by the data, like knowledge about work days and weekends) that should be used when performing any learning. The second goal of the midterm report is to suggest ways to include knowledge that is relevant to the database, but not directly contained in it, into the learning system. An obvious example for this database is visibility due to it being day or night, which is not provided, but can be deduced from the time of day, the day of the year and the reported weather conditions.
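The visibility example could be sketched roughly as follows. Everything specific here is an assumption for illustration: the field names `month`, `hour` and `weather`, the weather labels, and above all the daylight table, which holds placeholder hours rather than real sunrise and sunset times. The month is used as a coarse stand-in for the day of the year.

```python
from typing import Dict, Tuple

# Hypothetical placeholder table: month -> (first light hour, last light hour).
# Illustrative values only, NOT measured sunrise/sunset times.
PLACEHOLDER_DAYLIGHT: Dict[int, Tuple[int, int]] = {1: (9, 16), 6: (5, 21)}
DEFAULT_DAYLIGHT: Tuple[int, int] = (7, 19)  # used for months missing from the table

# Hypothetical weather labels that reduce visibility even in daylight.
POOR_WEATHER = {"fog", "snow", "rain"}


def visibility(month: int, hour: int, weather: str,
               daylight: Dict[int, Tuple[int, int]] = PLACEHOLDER_DAYLIGHT) -> str:
    """Derive a coarse visibility label that is not stored in the data itself."""
    first_light, last_light = daylight.get(month, DEFAULT_DAYLIGHT)
    is_day = first_light <= hour < last_light
    if not is_day:
        return "dark"
    return "reduced" if weather in POOR_WEATHER else "good"


def add_visibility(record: dict) -> dict:
    """Return a copy of the record with the derived 'visibility' attribute added."""
    enriched = dict(record)
    enriched["visibility"] = visibility(record["month"], record["hour"], record["weather"])
    return enriched
```

Applied as a pre-processing step, `add_visibility` gives the learner one extra attribute to form rules over. A real system would replace the placeholder table with proper daylight information.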

Such relevant knowledge can be part of pre-processing (even used to split the data into different parts that the learning will treat differently), it can be built into the base learning method, and it can be applied in post-processing steps. Each team should discuss at least three different kinds/ways of including relevant knowledge in their mining system. When the system is implemented, I expect experiments that show what effect including this additional knowledge had compared to the base learning method.

In addition to these three team-selected ways, the system to implement should also be able to read a list of already known rules about the data set and to avoid producing those rules again. How this can be done differs from learning method to learning method, and even for a particular learning method there are usually different ways to do it (ranging from a dumb filter after learning is finished to really influencing the learning method to concentrate on other rules). The report also has to explain how the team wants to do this and, as with the additional knowledge above, I am expecting experiments evaluating this feature.
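The simplest end of that range, the filter after learning is finished, might look like the sketch below. The rule representation and the one-line text format for known rules (`attr=value,attr=value => attr=value`) are hypothetical choices made for this illustration; the assignment does not prescribe a format.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Condition = Tuple[str, str]                    # (attribute, value)
Rule = Tuple[FrozenSet[Condition], Condition]  # (antecedent conditions, consequent)


def make_rule(antecedent: Dict[str, str], consequent: Condition) -> Rule:
    """Build a hashable rule; the order of antecedent conditions does not matter."""
    return (frozenset(antecedent.items()), consequent)


def parse_rule(line: str) -> Rule:
    """Parse one rule in the assumed format 'a=1,b=2 => c=3'."""
    left, right = line.split("=>")
    antecedent = dict(part.strip().split("=") for part in left.split(","))
    attr, value = right.strip().split("=")
    return make_rule(antecedent, (attr, value))


def remove_known(learned: Iterable[Rule], known: Iterable[Rule]) -> List[Rule]:
    """Post-processing filter: drop every learned rule that is already known."""
    known_set = set(known)
    return [rule for rule in learned if rule not in known_set]


# Toy example with made-up attribute names and values.
known = [parse_rule("weather=snow,surface=icy => severity=injury")]
learned = [
    make_rule({"surface": "icy", "weather": "snow"}, ("severity", "injury")),
    make_rule({"weather": "fog"}, ("severity", "injury")),
]
new_rules = remove_known(learned, known)  # only the fog rule remains
```

This filter only catches exact matches. Catching rules that are merely implied by a known rule, or steering the learner away from known rules during learning, needs method-specific work.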

Deadline: March 2, 2018, noon

Submission procedure
Each team has to send me the report as a PDF file by email.

Implemented system

After having received my comments on the midterm report, each team has to implement the system they suggested in that report (taking my comments into account). For the base method, a team is allowed to make use of code available from the Internet (for example, Weka), but that might make it more difficult to include the additional knowledge and to avoid the given rules, so careful thinking about this is required. Naturally, if such foreign code is used, this has to be indicated in the system code and in the final reports by the team members.

As mentioned above, it has to be possible to compare learning results of the base system with results from the system when using the additional knowledge.

Deadline: April 13, 2018, noon

Submission procedure
I want the system code emailed to me in a zip or tar file. The directory has to include a readme file telling me exactly how to compile the system and, naturally, all code necessary to run the system has to be included. The system has to run on the Linux compute servers of the Computer Science Department.

Individual report

Each team member has to write an individual report about the system and the experiments that the whole team and the individual member have performed with it. This report has to cover the following topics:

  • Any changes to the functionality of the system compared to what was covered in the team midterm report.
  • The role(s) the team member played in creating the system.
  • Experiments performed with the system. In addition to the experiments mentioned before on this page, this includes any evaluations of system parameters that the team member considers of interest.
  • The highlights of what the system found in the data.

Deadline: April 13, 2018, noon

Submission procedure
Each team member has to send me the report as a PDF file by email.


Last Change: 3/11/2017