SENG 301 project

Groups

You are to work on this project in groups of 2 or 3 students. You are responsible for forming your own groups.

Over the course of the term you will submit five deliverables. See the course schedule for which deliverable is due when.

Background

The application you create for this project will be based on a data set from the MovieLens project.

The data is divided into 6 separate text files. A description and a download link for each of these files follows.

u.data

100000 ratings by 943 users on 1682 items (movies). Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered. This is a tab separated list of:

user id | item id | rating | timestamp.

The time stamps are unix seconds since 1/1/1970 UTC.
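As a rough illustration of the format above, one line of u.data could be parsed along the following lines. This is only a sketch: the class name RatingRecord and its field names are made up for this example, the sample values in the usage note are illustrative, and the 1 to 5 range check is based on the rating values listed under Feature 1.

```java
import java.time.Instant;

/** Hypothetical holder for one line of u.data; all names here are illustrative. */
final class RatingRecord {
    final int userId;
    final int itemId;
    final int rating;
    final Instant timestamp;

    private RatingRecord(int userId, int itemId, int rating, Instant timestamp) {
        this.userId = userId;
        this.itemId = itemId;
        this.rating = rating;
        this.timestamp = timestamp;
    }

    /** Parses "user id TAB item id TAB rating TAB timestamp". */
    static RatingRecord parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 4) {
            throw new IllegalArgumentException("expected 4 tab-separated fields: " + line);
        }
        // NumberFormatException is a subclass of IllegalArgumentException,
        // so callers can catch one exception type for every malformed line.
        int userId = Integer.parseInt(fields[0].trim());
        int itemId = Integer.parseInt(fields[1].trim());
        int rating = Integer.parseInt(fields[2].trim());
        if (rating < 1 || rating > 5) {
            throw new IllegalArgumentException("rating out of range 1..5: " + line);
        }
        // The timestamp is seconds since 1/1/1970 UTC.
        Instant timestamp = Instant.ofEpochSecond(Long.parseLong(fields[3].trim()));
        return new RatingRecord(userId, itemId, rating, timestamp);
    }
}
```

For example, RatingRecord.parse("196\t242\t3\t881250949") yields user 196, item 242 and rating 3. Throwing IllegalArgumentException for every kind of malformed line makes it easy to map data errors to a single exit code later.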

u.info

The number of users, items, and ratings in the data set.

u.item

Information about the items. This is a list of

movie id | movie title | release date | video release date |
IMDb URL | unknown | Action | Adventure | Animation |
Children's | Comedy | Crime | Documentary | Drama | Fantasy |
Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
Thriller | War | Western |

The last 19 fields are the genres: a 1 indicates the movie is of that genre, and a 0 indicates it is not. A movie can be in several genres at once. The movie ids are the ones used in the u.data data file.
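The genre flags could be decoded roughly as follows. This sketch assumes that the fields of a u.item line are separated by the | character, as the field list above suggests; check this against the actual file. The class name GenreFlags is hypothetical, and the genre names are hard-coded from the list above only to keep the example short (a real implementation would more likely read them from u.genre).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that turns the 19 genre flags of a u.item line into names. */
final class GenreFlags {
    // Order taken from the field list in this document.
    static final String[] GENRES = {
        "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
        "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
        "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"
    };
    static final int FIRST_GENRE_FIELD = 5; // after id, title, two dates and URL

    static List<String> genresOf(String line) {
        // Assumption: '|' is the delimiter. The limit -1 keeps empty fields
        // such as a missing video release date.
        String[] fields = line.split("\\|", -1);
        if (fields.length < FIRST_GENRE_FIELD + GENRES.length) {
            throw new IllegalArgumentException("too few fields: " + line);
        }
        List<String> genres = new ArrayList<>();
        for (int i = 0; i < GENRES.length; i++) {
            String flag = fields[FIRST_GENRE_FIELD + i].trim();
            if (flag.equals("1")) {
                genres.add(GENRES[i]);
            } else if (!flag.equals("0")) {
                throw new IllegalArgumentException("genre flag must be 0 or 1: " + line);
            }
        }
        return genres;
    }
}
```

A line whose fourth, fifth and sixth flags are 1 would therefore be reported as Animation, Children's and Comedy.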

u.genre

A list of the genres.

u.user

Demographic information about the users; this is a list of:

user id | age | gender | occupation | zip code

The user ids are the ones used in the u.data data set.

u.occupation

A list of the occupations.

Features

For this project you are to create a Java command line application called MovieExplore. The following features are described in terms of the command line (i.e., how the user calls the program) and a description of the expected output. (The actual values in the output will depend on the input data, so the examples below only illustrate the output format.)

The first argument is always the name of a directory containing the data files. (To make testing easier you may want to create several data directories each containing small amounts of data.)

The MovieExplore application parses the supplied command line arguments and, as necessary, the data files to provide the information requested on the command line. This information is written to standard output (i.e., to the console).


Feature 1: Basic statistics

When supplied the stats command line argument, MovieExplore reports overall statistics from the data set in the given directory. The stats include the number of users, the number of female users, the number of male users, the number of movies, the number of movies in each genre, the total number of ratings and the total number of 1, 2, 3, 4 and 5 ratings.

> java MovieExplore data stats
users:       943
  female:    273
  male:      670
movies:     1682
  Action:      X
  Adventure:   Y
  (etc)
ratings:  100000
  1:           X
  2:           Y
  (etc)

Feature 2: Most popular movies

When supplied the top command line argument, MovieExplore lists the 5 most popular movies in the data set.

For example:

> java MovieExplore data top
Pop  Title
-------------------
95   Toy Story (1995)
94   Mr. Holland's Opus (1995)
90   Apollo 13 (1995)
80   Sound of Music, The (1965)
75   Clerks (1994)

To rank the movies you will need to develop a "popularity" algorithm that takes into account all of the ratings for a movie. Popularity should be represented as a number between 0 and 100. You will also need to decide how to break ties.
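To make the requirement concrete, here is one possible shape for a popularity measure. It is only a sketch and not the expected solution: it uses a damped mean, in which every movie starts with a few imaginary neutral ratings so that a movie with a single 5 does not outrank a movie with hundreds of ratings. The class name and both constants are assumptions chosen for illustration, and tie-breaking is left out.

```java
/** Hypothetical popularity measure: a damped mean rescaled to 0..100. */
final class PopularitySketch {
    static final double PRIOR_MEAN = 3.0;  // assumed neutral rating
    static final int PRIOR_WEIGHT = 10;    // assumed number of imaginary ratings

    /** ratings holds every rating (1..5) a movie received. */
    static int score(int[] ratings) {
        double sum = PRIOR_MEAN * PRIOR_WEIGHT;
        for (int r : ratings) {
            sum += r;
        }
        // The damped mean stays within 1..5 as long as all ratings do.
        double dampedMean = sum / (PRIOR_WEIGHT + ratings.length);
        // Map the range 1..5 linearly onto 0..100.
        return (int) Math.round((dampedMean - 1.0) / 4.0 * 100.0);
    }
}
```

With these constants a movie with no ratings scores 50, a movie with a single 5 scores 55, and a movie with ten 5s scores 75. You are free to choose a different approach entirely.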

Feature 3: Most popular movies for genre

When supplied the top and --genre=... command line arguments, the 5 most popular movies in the given genre are listed. The genre can be specified as a number or a name (see the u.genre file).

> java MovieExplore data top --genre=Crime
> java MovieExplore data top --genre=6

The output format should be the same as for feature 2.

Feature 4: Most popular movies for release year

When supplied the top and --year=... command line arguments, the 5 most popular movies for the given release year are listed. Providing only the first three digits of the year selects a release decade instead.

> java MovieExplore data top --year=2003
> java MovieExplore data top --year=199

The output format should be the same as for feature 2.

Feature 5: Most popular movies by location

When supplied the top and --zip=... command line arguments, the 5 most popular movies based on ratings made by users in the given zip code are listed. The location considered can be made larger by providing only the leading digits of the zip code.

> java MovieExplore data top --zip=90703
> java MovieExplore data top --zip=90

The output format should be the same as for feature 2.

Feature 6: Most popular movies by any of above

The types of filters listed in features 3, 4 and 5 can be combined. In this case MovieExplore lists the 5 most popular movies that match all the supplied criteria. For example, to list the most popular movies of a given genre for a given year:

> java MovieExplore data top --year=2003 --genre=Action

The output format should be the same as for feature 2.
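One way to handle the --name=value options is to collect them into a map first and then apply each filter only if its option is present. The sketch below shows this for the year filter, using a prefix test so that 199 matches every year of the nineties. The class and method names are hypothetical, and validating the option names and values is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for --name=value and --name command line options. */
final class CommandLineOptions {

    /** Collects every argument starting with "--"; other arguments are skipped. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            if (!arg.startsWith("--")) {
                continue;
            }
            int eq = arg.indexOf('=');
            if (eq < 0) {
                options.put(arg.substring(2), "");          // flag such as --histogram
            } else {
                options.put(arg.substring(2, eq), arg.substring(eq + 1));
            }
        }
        return options;
    }

    /** True if no --year was given or the release year starts with its value. */
    static boolean matchesYear(String releaseYear, Map<String, String> options) {
        String wanted = options.get("year");
        return wanted == null || releaseYear.startsWith(wanted);
    }
}
```

A movie is then listed only if every such test returns true. The zip code filter can use the same prefix test on the zip codes of the users who made the ratings.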

Feature 7: Changing the number of movies listed

For features 2, 3, 4, 5 and 6 the number of movies returned is 5 by default (or fewer if fewer movies fit the criteria). If the --limit=N argument is added, N movies are returned instead. For example, the following returns the 10 most popular movies in the Crime genre:

> java MovieExplore data top --genre=Crime --limit=10

Feature 8: Histogram

When --histogram is supplied as an additional command line argument to the commands described in features 2, 3, 4, 5 and 6, a histogram accompanies the list of movies. The * character is used for the histogram, and the number of *'s for a given movie is based on the following formula:

log(total)/log(max_total) * 30

where total is the popularity of the given movie (computed by the algorithm used in the features above) and max_total is the largest popularity among the movies being listed. This is referred to as log normalizing. For example:

> java MovieExplore data top --limit=4 --histogram
Pop  Title
-------------------
95   Toy Story (1995)            ******************************
94   Mr. Holland's Opus (1995)   ******************************
90   Apollo 13 (1995)            ******************************
80   Sound of Music, The (1965)  *****************************
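The formula can be sketched in Java as follows. Two details are assumptions rather than part of the specification: the result is rounded to the nearest whole number (which reproduces the star counts in the example above, whereas truncating would not), and values of 1 or less produce no stars, because their logarithm is zero or negative. The class and method names are made up for this example.

```java
/** Hypothetical helper for the log-normalized histogram. */
final class HistogramSketch {
    static final int MAX_STARS = 30;

    /** Number of stars for a movie with popularity total, given the largest popularity listed. */
    static int starCount(double total, double maxTotal) {
        // Assumption: log(x) is zero or negative for x <= 1, so draw no stars.
        // This also avoids dividing by log(1) = 0.
        if (total <= 1 || maxTotal <= 1) {
            return 0;
        }
        // Assumption: round to the nearest whole star.
        return (int) Math.round(Math.log(total) / Math.log(maxTotal) * MAX_STARS);
    }

    /** The bar itself, e.g. "*****". Requires Java 11 or later for String.repeat. */
    static String bar(double total, double maxTotal) {
        return "*".repeat(starCount(total, maxTotal));
    }
}
```

With a largest popularity of 95, the popularities 95, 94, 90 and 80 give 30, 30, 30 and 29 stars.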

Feature 9: User details

When supplied the user ID command line option, MovieExplore lists information about the user with the given id. For example:

> java MovieExplore data user 335
User id:    335
Age:        45
Gender:     M
Occupation: executive
Zip:        33775
Ratings:   
  Total:    22
  Average:  3.54
Top:        5 - Boogie Nights (1997)
Bottom:     1 - Contact (1997)

The rating data is the total number of movies rated by the user and the average rating (out of 5). The user's top-rated movie and lowest-rated movie are listed (you can break ties arbitrarily). In the example above, user 335 gave a 5 to "Boogie Nights (1997)" and a 1 to "Contact (1997)".

Feature 10: Finding similar users

When supplied the similar-users ID command line argument, MovieExplore computes and lists the 5 users with the most similar taste to the user with the given id. The users should be listed starting with the most similar user.

> java MovieExplore data similar-users 335
ID  Age Gen Occupation Zip
----------------------------
192 42  M   educator   90840
24  21  F   artist     94533
(etc)

You need to decide how you will compute similarity, but you should consider "Euclidean distance" as a starting point.

Note: If, according to the algorithm you use, there are fewer than 5 similar users, the list you display can contain fewer than 5 users.
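As a starting point, a similarity based on Euclidean distance might look like the sketch below. Each user is represented as a map from movie id to rating, the distance is computed over the movies both users have rated, and it is converted to a score between 0 and 1 so that larger means more similar. The conversion 1 / (1 + distance), the score of 0 for users with no movies in common, and all names are assumptions made for this example.

```java
import java.util.Map;

/** Hypothetical similarity measure based on Euclidean distance. */
final class SimilaritySketch {

    /** a and b map movie id to rating; returns a value in 0..1, where 1 means identical taste. */
    static double similarity(Map<Integer, Integer> a, Map<Integer, Integer> b) {
        double sumOfSquares = 0.0;
        int shared = 0;
        for (Map.Entry<Integer, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other == null) {
                continue; // only movies rated by both users count
            }
            double diff = entry.getValue() - other;
            sumOfSquares += diff * diff;
            shared++;
        }
        if (shared == 0) {
            return 0.0; // assumption: nothing in common means not similar
        }
        return 1.0 / (1.0 + Math.sqrt(sumOfSquares));
    }
}
```

Two users who rated their shared movies identically score 1.0. If one user gave 5 and 1 to two movies and the other gave 2 and 5, the distance is 5 and the score is 1/6. Note that this simple version favours pairs with very few shared movies, which is one of the things you may want to improve.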

Feature 11: Movie recommendations

When supplied the movie-recommendations ID command line argument, MovieExplore computes and lists the 5 most highly recommended movies for the user with the given id. None of the movies already rated by the given user should be included in the list of recommendations. The following recommends movies for user 335:

> java MovieExplore data movie-recommendations 335
Richard III (1995)
Lion King, The (1994)
Nightmare Before Christmas, The (1993)
(etc)

The recommendations should be based on the similarity scores you compute for feature 10. You will need to research how best to do this.
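One common approach, shown here only as a sketch, is to predict a rating for each unseen movie as the similarity-weighted average of the ratings given by other users, and then list the movies with the highest predictions. The class name, the method signature and the tie-break by movie id are all assumptions made for this example; the similarity scores are taken as given.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical recommender using a similarity-weighted average of ratings. */
final class RecommenderSketch {

    /**
     * targetRatings: movie id to rating for the user receiving recommendations.
     * similarityByUser: other user id to similarity with the target user.
     * ratingsByUser: other user id to that user's ratings (movie id to rating).
     * Returns up to limit movie ids, best predicted rating first.
     */
    static List<Integer> recommend(Map<Integer, Integer> targetRatings,
                                   Map<Integer, Double> similarityByUser,
                                   Map<Integer, Map<Integer, Integer>> ratingsByUser,
                                   int limit) {
        Map<Integer, Double> weightedSum = new HashMap<>();
        Map<Integer, Double> weightTotal = new HashMap<>();
        for (Map.Entry<Integer, Map<Integer, Integer>> user : ratingsByUser.entrySet()) {
            double sim = similarityByUser.getOrDefault(user.getKey(), 0.0);
            if (sim <= 0.0) {
                continue; // users with no similarity contribute nothing
            }
            for (Map.Entry<Integer, Integer> rating : user.getValue().entrySet()) {
                int movie = rating.getKey();
                if (targetRatings.containsKey(movie)) {
                    continue; // never recommend a movie the user already rated
                }
                weightedSum.merge(movie, sim * rating.getValue(), Double::sum);
                weightTotal.merge(movie, sim, Double::sum);
            }
        }
        List<Integer> movies = new ArrayList<>(weightedSum.keySet());
        // Highest predicted rating first; ties broken by lower movie id (an assumption).
        movies.sort(Comparator
                .comparingDouble((Integer m) -> weightedSum.get(m) / weightTotal.get(m))
                .reversed()
                .thenComparing(Comparator.<Integer>naturalOrder()));
        return new ArrayList<>(movies.subList(0, Math.min(limit, movies.size())));
    }
}
```

For instance, suppose the target user has rated only movie 1, user 2 (similarity 1.0) rated movies 2 and 3 with 4 and 2, and user 3 (similarity 0.5) rated them 2 and 5. The predictions are 5.0 / 1.5 for movie 2 and 4.5 / 1.5 for movie 3, so movie 2 is listed first.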

Feature 12: Movie recommendations by year and/or genre

When supplied the movie-recommendations, --year=... and/or --genre=... command line arguments, MovieExplore works the same as for feature 11, except that recommended movies are limited to the specified year and/or genre. So the following returns recommended Crime movies for user 335:

> java MovieExplore data movie-recommendations --genre=Crime 335

While the following recommends movies for user 335 from the eighties:

> java MovieExplore data movie-recommendations --year=198 335

Error handling

For all of the features described above you will need to handle errors as follows:

Feature E1: Usage error

If the user supplies incorrect arguments, MovieExplore outputs a usage message describing all of the valid command line arguments and exits with a return code of 1.

Feature E2: Errors in input file

If the specified data directory/files cannot be found, or contain errors, MovieExplore outputs an appropriate error message and exits with a return code of 2.
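The two return codes could be wired up along the following lines. This is a sketch under several assumptions: the class name, the wording of the messages and the choice of standard error for them are illustrative, and only the number of arguments, the command name and the existence of the directory are checked. Keeping the logic in a method that returns the code, rather than calling System.exit directly, makes it easier to test.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical skeleton showing where the exit codes come from. */
final class ExitCodeSketch {
    static final int OK = 0;
    static final int USAGE_ERROR = 1; // Feature E1
    static final int DATA_ERROR = 2;  // Feature E2

    // Command names taken from the features described in this document.
    static final List<String> COMMANDS = Arrays.asList(
            "stats", "top", "user", "similar-users", "movie-recommendations");

    /** Returns the exit code; the real main method would call System.exit(run(args)). */
    static int run(String[] args) {
        if (args.length < 2 || !COMMANDS.contains(args[1])) {
            System.err.println("usage: java MovieExplore DATA_DIR COMMAND [ID] [OPTIONS]");
            return USAGE_ERROR;
        }
        if (!Files.isDirectory(Paths.get(args[0]))) {
            System.err.println("error: data directory not found: " + args[0]);
            return DATA_ERROR;
        }
        // Parsing the data files and running the command would go here;
        // malformed data should also lead to DATA_ERROR.
        return OK;
    }
}
```

A full usage message would also describe each command and option, as Feature E1 requires.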