Computer Science 217
Assignment #2

Due: see Evaluation below

Purpose

You'll be applying what you've learned about Python to perform some data processing on a text file. In particular, you'll be writing Python programs using if-statements, loops and different Python data types.

Your program will try to determine if its input is English or random, by computing a value called the phi-statistic.

Important Note

This is an individual assignment. What you submit must be your own work, although you may discuss the problem in general terms with other people. You should definitely not be showing other people your code, and generally speaking, it is not a good idea to talk about the assignment when you're sitting in front of the computer.
Sources of algorithms and code, if any, must be properly cited. Remember that plagiarism regulations apply to code too. You can put citations into comments in your Python code.
If you have any questions about what you can and can't safely do, feel free to email me.

Background

The phi-statistic is a measure of how closely the frequency distribution of the individual characters in text match that of English. In English text, the individual characters tend to occur with fairly consistent frequencies. For example, the letter ``E'' is most common, followed by ``T'', ``A'', etc. A collection of random characters will have frequencies that are all roughly the same.

To compute the phi-statistic of an English text requires the following steps:

Compute the number of occurrences of each character in the text.
Suppose that character i occurs fi times and that there are n distinct characters. Then
phi = f1 (f1 - 1) + f2 (f2 - 1) + ... + fn (fn - 1)

For example, the text

HKWZA RRPVQ BIVYS MPDMQ MBUDC

has phi-statistic equal to 18 (there are 6 characters that occur twice, one that occurs 3 times, and the rest occur once).

One application of the phi-statistic is to automatically recognize English text. This is done by comparing the expected or average value of phi for both English text and random text with the computed value. The expected value of phi for English text is

expectedEnglish = 0.0661 N (N-1)

where N is the number of characters in the text. For random text, it is

expectedRandom = 0.0385 N (N-1) .

The previous example has N=25, so we compute

expectedEnglish = 0.0661 * 25 * (24) = 39.66

and

expectedRandom = 0.0385 * 25 * (24) = 23.1 .

Thus, since the computed value of 18 is closer to 23.1 than to 39.66, we conclude that the text is most likely a random collection of characters (which, in fact, it is!).

Assignment

Compute the phi-statistic for a piece of text and decide whether it is English or random.

Input.

A series of lines of text. The last line will contain only the word END as a sentinel (to indicate the end of the data).

Processing.

Compute the phi-statistic corresponding to all the text contained in the lines of input. For this computation, you should ignore all non-alphabetic characters, and process each letter as its upper-case equivalent, i.e., treat ``A'' and ``a'' as the same character. Also, compute the expected values of phi for English and random text of the same length as the input, counting only the alphabetic characters.

Output.

Print out the character frequency list, the total number of characters, the value of the phi-statistic, and the expected values of English and random text with the same number of characters. Finally, output whether or not the text is more likely English or random characters.

Example.

The sample input file poem.txt contains a poem written in English. You can either enter each line of text manually into your program, or pipe the contents of the file directly into Python

	cat poem.txt | python as2.py

This causes the raw_input() commands in your program to read the lines of the file, instead of waiting for user input.

The phi-statistic for this text is 11992, and there are 444 alphabetic characters. The expected values of phi for English and random texts are 13001.3412. and 7572.6420, respectively. As the computed value is closer to the expected value for English, the conclusion is that it is English.

Your program's output might look like this:

Frequency list:
        A         35
        C         18
        B         9
        E         50
        D         18
        G         6
        F         15
        I         25
        H         31
        K         2
        M         17
        L         21
        O         35
        N         29
        Q         1
        P         6
        S         23
        R         22
        U         16
        T         43
        W         8
        V         4
        Y         10
444 characters total
Phi statistic is 11992.0
Expected English value is 13001.3412
Expected random value is 7572.642
The input is probably English

The file poem.vigenere contains an encrypted version of the same poem in which the frequency counts should be more evenly distributed (making the code harder to break). The phi-statistic for this text is 8156, and as this is closer to the expected value for random text, we conclude that it is likely statistically random text.

Two other same files are also available: HarrisonBergeron.txt and HarrisonBergeron.vigenere The first is an English short story and the second is an encrypted version. You computations should indicate that the first is English and the second likely is not.

Hints

Make a directory to keep your files for this assignment in.
Take notes as you go along, so you can remember what you've tried already and what did and didn't work.
Break the task down into small pieces.
Use an incremental approach, and try out things on the Python command line as you go, before putting them in your program.
Work through everything one last time prior to your demo to help catch any last-minute problems.
Your assignment solution may be tested with very large data files.
Some coding hints:
- To see if a character in a variable foo is alphabetic, use foo.isalpha().
- To convert a character in a variable foo to uppercase, use foo.upper().
- If you want to make your output prettier, the character '\t' is a tab character.
- The abs() function returns the absolute value.
- The character frequency list does not have to be sorted in any particular order.

Evaluation

You must do two things:

Hand in a printout of your solution to the assignment boxes by 10am Monday, 22 February 2010. Make sure it has your name and/or student ID number on it, and double-check to ensure you're using the correct assignment box! The assignment boxes are on the second floor of the Math Science building. Note that you may turn your printout into the box early.
Demonstrate your solution to your TA in tutorials during the week of February 22. To keep things fair, the solution you demonstrate must be the same as what you handed in on the printout, or you will not be given credit for the demo. Your TA may ask you questions about what the different parts of your solution do. Your TA may also supply different test inputs.

If you do not hand in a printout or if you do not demonstrate your assignment during tutorial time, or both, you cannot be given a grade above a zero on this assignment.

Your printout must show your Python program. Note that you must have a program in a .py file - you cannot turn in a solution that only uses the Python command line.

Your solution must be demonstrated using your account on the CPSC machines.

Tutorials during the week of February 22 are allocated for demos. Your TA is not obliged to see demos outside this time; they have their own schoolwork to do!

The TA has the right to assign a mark of zero for the entire assignment if you fail the demo.

Half of the marks for this assignment are for functionality (i.e., the demo) and half are for your solution (as shown on the printout). Note that, in keeping with the University's assessment criteria, simply having a working solution does not automatically mean that you get full marks. Solutions that show a greater degree of sophistication or that involve bonuses as described above may receive higher marks. Part of the solution marks will be assigned for documentation, like appropriate variable names and comments.