Due: see Evaluation below
You'll be applying what you've learned about Python to perform some data processing on a text file. In particular, you'll be writing Python programs using if-statements, loops and different Python data types.
Your program will try to determine if its input is English or random, by computing a value called the phi-statistic.
The phi-statistic is a measure of how closely the frequency distribution of the individual characters in a text matches that of English. In English text, the individual characters tend to occur with fairly consistent frequencies. For example, the letter "E" is most common, followed by "T", "A", etc. A collection of random characters will have frequencies that are all roughly the same.
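As a rough illustration of what a frequency distribution looks like in code, the sketch below tallies the alphabetic characters of a string, ignoring case. The function name count_letters is made up for this example and is not part of the assignment.

```python
def count_letters(text):
    """Return a dictionary mapping each uppercase letter to its count.

    Hypothetical helper: non-alphabetic characters are skipped and
    lowercase letters are folded into their uppercase form.
    """
    counts = {}
    for ch in text:
        if ch.isalpha():
            letter = ch.upper()
            # get() supplies 0 the first time a letter is seen
            counts[letter] = counts.get(letter, 0) + 1
    return counts
```

For English input, a few letters should dominate the resulting counts, while random input should give counts that are close to one another.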
Computing the phi-statistic of a text requires the following steps:
One application of the phi-statistic is to automatically recognize English text. This is done by comparing the expected or average value of phi for both English text and random text with the computed value. The expected value of phi for English text is
Compute the phi-statistic for a piece of text and decide whether it is English or random.
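The decision step might be sketched as follows. The two constants are assumptions: they are inferred from the sample figures in this handout (13001.3412 and 7572.6420 for 444 alphabetic characters, each divided by 444 * 443), not taken from a stated formula. The names expected_values and classify are made up for the example.

```python
# Assumed constants, inferred from the sample figures in this handout:
#   13001.3412 / (444 * 443) = 0.0661
#    7572.6420 / (444 * 443) = 0.0385
ENGLISH_FACTOR = 0.0661
RANDOM_FACTOR = 0.0385

def expected_values(n):
    """Return (expected English phi, expected random phi) for n letters."""
    pairs = n * (n - 1)
    return ENGLISH_FACTOR * pairs, RANDOM_FACTOR * pairs

def classify(phi, n):
    """Return "English" or "random", whichever expected value is closer to phi."""
    english, rand = expected_values(n)
    if abs(phi - english) < abs(phi - rand):
        return "English"
    return "random"
```

With these assumptions, classify(11992, 444) returns "English".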
cat poem.txt | python as2.py
This causes the raw_input() calls in your program to read the lines of the file, instead of waiting for user input.
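One possible way to collect all of the piped input is to call raw_input() repeatedly until it raises EOFError, which is what happens once the file has been read completely. The sketch below is only one approach. The names read_line and read_all_text are invented for the example, and the fallback to input() is there only so the sketch also runs under Python 3.

```python
try:
    read_line = raw_input   # Python 2, as used in this assignment
except NameError:
    read_line = input       # Python 3 fallback (an assumption of this sketch)

def read_all_text(reader=read_line):
    """Call reader() until it raises EOFError; return the lines joined by newlines."""
    lines = []
    while True:
        try:
            lines.append(reader())
        except EOFError:
            # End of the piped file: stop reading
            break
    return "\n".join(lines)

# Typical use inside as2.py:
#     text = read_all_text()
```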
The phi-statistic for this text is 11992, and there are 444 alphabetic characters. The expected values of phi for English and random texts are 13001.3412 and 7572.6420, respectively. As the computed value is closer to the expected value for English, the conclusion is that it is English.
Your program's output might look like this:
Frequency list:
A 35
C 18
B 9
E 50
D 18
G 6
F 15
I 25
H 31
K 2
M 17
L 21
O 35
N 29
Q 1
P 6
S 23
R 22
U 16
T 43
W 8
V 4
Y 10
444 characters total
Phi statistic is 11992.0
Expected English value is 13001.3412
Expected random value is 7572.642
The input is probably English
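The figures in this sample output can be checked by hand. The sketch below assumes the usual definition of the phi-statistic, the sum of f * (f - 1) over every letter frequency f. That definition is an assumption here, but it does reproduce the value 11992 from the frequency list above. The name phi_statistic is made up for the example.

```python
def phi_statistic(counts):
    """Sum f * (f - 1) over all letter frequencies f (assumed definition)."""
    total = 0
    for f in counts.values():
        total += f * (f - 1)
    return total

# Frequencies copied from the sample output above
sample_counts = {
    "A": 35, "B": 9, "C": 18, "D": 18, "E": 50, "F": 15, "G": 6,
    "H": 31, "I": 25, "K": 2, "L": 21, "M": 17, "N": 29, "O": 35,
    "P": 6, "Q": 1, "R": 22, "S": 23, "T": 43, "U": 16, "V": 4,
    "W": 8, "Y": 10,
}

n = sum(sample_counts.values())        # total number of letters
phi = phi_statistic(sample_counts)
```

Here n is 444 and phi is 11992, which agrees with the output shown.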
The file poem.vigenere contains an encrypted version of the same poem in which the frequency counts should be more evenly distributed (making the code harder to break). The phi-statistic for this text is 8156, and as this is closer to the expected value for random text, we conclude that it is likely statistically random text.
Two other sample files are also available: HarrisonBergeron.txt and HarrisonBergeron.vigenere. The first is an English short story and the second is an encrypted version. Your computations should indicate that the first is English and the second likely is not.
To test whether a string foo is alphabetic, use foo.isalpha().
To convert a string foo to uppercase, use foo.upper().
'\t' is a tab character.
The abs() function returns the absolute value.
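These hints can be tried out in a few lines. The variable names below are arbitrary, and the numbers are taken from the sample output earlier in this handout.

```python
foo = "e"

is_letter = foo.isalpha()        # True, because "e" is alphabetic
is_digit_letter = "3".isalpha()  # False, because "3" is not a letter
upper = foo.upper()              # "E"

# A letter and its count separated by a tab character
row = upper + "\t" + str(50)

# Distance between a computed phi and an expected value; never negative
gap = abs(11992 - 13001.3412)
```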
You must do two things:
Your printout must show your Python program. Note that you must have a program in a .py file - you cannot turn in a solution that only uses the Python command line.
Your solution must be demonstrated using your account on the CPSC machines.
Tutorials during the week of February 22 are allocated for demos. Your TA is not obliged to see demos outside this time; they have their own schoolwork to do!
The TA has the right to assign a mark of zero for the entire assignment if you fail the demo.
Half of the marks for this assignment are for functionality (i.e., the demo) and half are for your solution (as shown on the printout). Note that, in keeping with the University's assessment criteria, simply having a working solution does not automatically mean that you get full marks. Solutions that show a greater degree of sophistication or that involve bonuses as described above may receive higher marks. Part of the solution marks will be assigned for documentation, like appropriate variable names and comments.