CPSC 599.44: Machine Learning - Assignments
The assignments for this course center on learning from a publicly available real data set, namely the National Collision Database of Canada. The page offers the data from 1999 to 2017 as well as the incidents of several individual years. For the assignments, the 1999 to 2017 data is to be used. The data set is not really bilingual (as claimed), since it uses only numbers and made-up identifiers, which are explained in the data dictionary (making them information). Reading the comments (which include answers by Transport Canada) is also useful, and it reveals some of the challenges we face when dealing with data that we did not collect ourselves. We have the following three assignments:
How the three assignments are weighted within the course grade is explained on the assessment page.
Midterm report on intended learning system
The goal of the learning system to be developed is to learn rules that describe (some of) the data. These rules do not need to be 100% accurate, and part of the experiments a team will perform is to look at different levels of accuracy.
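To make "different levels of accuracy" concrete, one possible reading is to measure the confidence of each rule on the records and to keep only the rules above a chosen threshold. The following is a minimal sketch under that assumption. The rule representation and the function names are hypothetical, and any attribute names a team uses would come from the data dictionary.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
# Hypothetical rule format: (antecedent, consequent), each mapping
# an attribute name to the value it is required to have.
Rule = Tuple[Dict[str, str], Dict[str, str]]


def matches(record: Record, conditions: Dict[str, str]) -> bool:
    """True if the record has the required value for every listed attribute."""
    return all(record.get(attr) == value for attr, value in conditions.items())


def support_and_confidence(rule: Rule, records: List[Record]) -> Tuple[float, float]:
    """Return (support, confidence) of a rule over the given records."""
    antecedent, consequent = rule
    covered = [r for r in records if matches(r, antecedent)]
    correct = [r for r in covered if matches(r, consequent)]
    support = len(correct) / len(records) if records else 0.0
    confidence = len(correct) / len(covered) if covered else 0.0
    return support, confidence


def keep_accurate(rules: List[Rule], records: List[Record],
                  min_confidence: float) -> List[Rule]:
    """Keep only the rules whose confidence reaches the given threshold."""
    return [rule for rule in rules
            if support_and_confidence(rule, records)[1] >= min_confidence]
```

Calling `keep_accurate` with several thresholds on the same learned rules gives rule sets at different accuracy levels that can then be compared in the experiments.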
As stated in class, there are many methods to learn rules directly from examples, and most methods for learning other data (knowledge) structures can also be used to create rules, since these structures can be transformed into rules. The first goal of the midterm report is to state which learning method the team has chosen and to give the reasoning behind this particular choice.
While many machine learning techniques are general enough to be applied to nearly any set of examples, the learned knowledge is often not very interesting, since people usually have additional knowledge about the main domain of the data (and about other domains the data touches, such as work days and weekends) that should be used when trying to perform any learning. The second goal of the midterm report is to suggest ways to include knowledge that is relevant to the database, but not directly contained in it, into the learning system. An obvious example for this database is visibility due to it being day or night, which is not provided but can be deduced from the time of day, the day of year and the reported weather conditions.
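The visibility example could be turned into a derived attribute during pre-processing. The sketch below is only an illustration under several assumptions: the month is used as a rough stand-in for the day of year, the daylight windows are placeholder values (real sunrise and sunset times depend on latitude and daylight saving time), and the function and label names are hypothetical.

```python
# Placeholder daylight windows (first daylight hour, first dark hour) per month.
# These numbers are rough illustrative guesses, not astronomical data.
DAYLIGHT_HOURS = {
    1: (8, 17), 2: (8, 18), 3: (7, 19), 4: (6, 20),
    5: (6, 21), 6: (5, 22), 7: (5, 22), 8: (6, 21),
    9: (7, 20), 10: (8, 19), 11: (8, 17), 12: (9, 17),
}


def is_daylight(month: int, hour: int) -> bool:
    """True if the hour (0-23) falls inside the assumed daylight window."""
    start, end = DAYLIGHT_HOURS[month]
    return start <= hour < end


def visibility(month: int, hour: int, bad_weather: bool) -> str:
    """Combine daylight and weather into a coarse, hypothetical visibility label."""
    daylight = is_daylight(month, hour)
    if daylight and not bad_weather:
        return "good"
    if not daylight and bad_weather:
        return "poor"
    return "reduced"
```

Which reported weather codes count as `bad_weather` is a decision the team would have to make and justify from the data dictionary.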
Such relevant knowledge can be part of pre-processing (even used to split the database into different parts that the learning will treat differently), it can be built into the base learning method, and it can be applied in post-processing steps. Each team should discuss at least three different kinds/ways of including relevant knowledge into their mining system. When the system is implemented, I expect experiments that show how including this additional knowledge changes the results compared to the base learning method.
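As an illustration of the pre-processing option, the records could be split into work-day and weekend parts before learning, so that each part is mined separately. This is a minimal sketch. The field name `weekday` and its encoding (1 for Monday through 7 for Sunday) are assumptions made for the example and would have to be checked against the data dictionary.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List

Record = Dict[str, int]


def split_by(records: List[Record],
             key_fn: Callable[[Record], Hashable]) -> Dict[Hashable, List[Record]]:
    """Group records into separate lists according to key_fn."""
    parts = defaultdict(list)
    for record in records:
        parts[key_fn(record)].append(record)
    return dict(parts)


def day_type(record: Record) -> str:
    """Assumed encoding: 'weekday' is 1 (Monday) through 7 (Sunday)."""
    return "weekend" if record["weekday"] in (6, 7) else "workday"
```

The base learning method can then be run once per part, and the resulting rule sets can be compared with those learned from the unsplit data.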
In addition to these three different team-selected ways, the system to implement should also be able to read a list of already known rules about the data set and to avoid producing those rules again. How this can be done differs from learning method to learning method, and even for a particular learning method there are usually different ways to do this (ranging from a dumb filter after learning is finished to really influencing the learning method to concentrate on other rules). The report also has to explain how the team wants to do this and, as with the additional knowledge above, I am expecting experiments evaluating this feature.
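The simplest of these options, a filter applied after learning is finished, might look as follows. This is only a sketch. The text format of a rule (`attr=value, attr=value => attr=value`) and all function names are hypothetical, since no file format for the known rules is specified here.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Condition = Tuple[str, str]
# Antecedent as a frozenset so that the order of conditions does not matter.
Rule = Tuple[FrozenSet[Condition], Condition]


def parse_condition(text: str) -> Condition:
    """Parse 'attr=value' into an (attr, value) pair."""
    attr, _, value = text.partition("=")
    return attr.strip(), value.strip()


def parse_rule(line: str) -> Rule:
    """Parse a rule in the assumed format 'a=1, b=2 => c=3'."""
    left, _, right = line.partition("=>")
    antecedent = frozenset(parse_condition(part) for part in left.split(","))
    return antecedent, parse_condition(right)


def load_known_rules(lines: Iterable[str]) -> Set[Rule]:
    """Read known rules, one per line, skipping blank lines."""
    return {parse_rule(line) for line in lines if line.strip()}


def filter_known(learned: List[Rule], known: Set[Rule]) -> List[Rule]:
    """Drop every learned rule that is already in the known set."""
    return [rule for rule in learned if rule not in known]
```

A filter like this only removes exact matches. It does not catch rules that are merely more specific or more general versions of a known rule, which is one reason why influencing the learning method itself may be worth discussing in the report.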
Deadline: March 6, 2020, noon
After having received my comments on the midterm report, each team has to implement the system they suggested in that report (taking my comments into account). For the base method, a team is allowed to make use of code available from the Internet (for example, Weka), but that might make it more difficult to include the additional knowledge and to avoid the given rules, so careful thinking about this is required. Naturally, if such foreign code is used, this has to be indicated in the system code and in the final reports of the team members.
As mentioned above, it has to be possible to compare learning results of the base system with results from the system when using the additional knowledge.
Deadline: April 15, 2020, noon
Each team member has to write an individual report about the system and experiments that the whole team and the individual member have performed with it. This report has to cover the following topics:
Deadline: April 14, 2020, noon
Last Change: 29/10/2019