Computer Science 217
Assignment #2
Due: see Evaluation below
You'll be applying what you've learned about Python to perform some
data processing on a text file. In particular, you'll be writing
Python programs using if-statements, loops and different Python data
types.
Your program will try to determine if its input is English or random,
by computing a value called the phi-statistic.
- This is an individual assignment. What you submit must be your own
work, although you may discuss the problem in general terms
with other people. You should definitely not be showing other people
your code, and generally speaking, it is not a good idea to talk about
the assignment when you're sitting in front of the computer.
- Sources of algorithms and code, if any, must be properly cited.
Remember that plagiarism regulations apply to code too.
You can put citations into comments in your Python code.
- If you have any questions about what you can and can't safely do,
feel free to email me.
The phi-statistic is a measure of how closely the frequency
distribution of the individual characters in text match that of
English. In English text, the individual characters tend to occur
with fairly consistent frequencies. For example, the letter ``E'' is
most common, followed by ``T'', ``A'', etc. A collection of random
characters will have frequencies that are all roughly the same.
To compute the phi-statistic of an English text requires the
following steps:
- Compute the number of occurrences of each character in the
text.
- Suppose that character i occurs fi times and that there
are n distinct characters. Then
phi = f1 (f1 - 1) + f2 (f2 - 1) + ... + fn (fn - 1)
For example, the text
HKWZA RRPVQ BIVYS MPDMQ MBUDC
has phi-statistic equal to 18 (there are 6 characters that
occur twice, one that occurs 3 times, and the rest occur once).
One application of the phi-statistic is to automatically recognize
English text. This is done by comparing the expected or average value
of phi for both English text and random text with the computed
value. The expected value of phi for English text is
expectedEnglish = 0.0661 N (N-1)
where N is the number of characters in the text. For random text,
it is
expectedRandom = 0.0385 N (N-1) .
The previous example has N=25, so we compute
expectedEnglish = 0.0661 * 25 * (24) = 39.66
and
expectedRandom = 0.0385 * 25 * (24) = 23.1 .
Thus, since the computed value of 18 is closer to 23.1 than to
39.66, we conclude that the text is most likely a random collection
of characters (which, in fact, it is!).
Compute the phi-statistic for a piece of text and decide whether it
is English or random.
- Input.
- A series of lines of text. The last line will
contain only the word END as a sentinel (to indicate the end
of the data).
- Processing.
- Compute the phi-statistic corresponding to all the
text contained in the lines of input. For this computation,
you should ignore all non-alphabetic characters, and process
each letter as its upper-case equivalent, i.e., treat ``A''
and ``a'' as the same character. Also, compute the expected
values of phi for English and random text of the same
length as the input, counting only the alphabetic characters.
- Output.
- Print out the character frequency list, the total
number of characters, the value of the phi-statistic, and
the expected values of English and random text with the same
number of characters. Finally, output whether or not the text
is more likely English or random characters.
- Example.
- The sample input file
poem.txt
contains a poem written in English. You can either enter each line
of text manually into your program, or pipe the contents of
the file directly into Python
cat poem.txt | python as2.py
This causes the raw_input() commands in your program
to read the lines of the file, instead of waiting for user
input.
The phi-statistic for this text is 11992, and there are
444 alphabetic characters. The expected values of phi for
English and random texts are 13001.3412. and 7572.6420,
respectively. As the computed value is closer to the expected
value for English, the conclusion is that it is English.
Your program's output might look like this:
Frequency list:
A 35
C 18
B 9
E 50
D 18
G 6
F 15
I 25
H 31
K 2
M 17
L 21
O 35
N 29
Q 1
P 6
S 23
R 22
U 16
T 43
W 8
V 4
Y 10
444 characters total
Phi statistic is 11992.0
Expected English value is 13001.3412
Expected random value is 7572.642
The input is probably English
The file
poem.vigenere
contains an encrypted version of the same poem in which the
frequency counts should be more evenly distributed (making the
code harder to break). The phi-statistic
for this text is 8156, and as this is closer to the expected
value for random text, we conclude that it is likely
statistically random text.
Two other same files are also available:
HarrisonBergeron.txt
and
HarrisonBergeron.vigenere
The first is an English short story and the second is an
encrypted version. You computations should indicate that the
first is English and the second likely is not.
- Make a directory to keep your files for this assignment in.
- Take notes as you go along, so you can remember what you've
tried already and what did and didn't work.
- Break the task down into small pieces.
- Use an incremental approach, and try out things on the Python
command line as you go, before putting them in your program.
- Work through everything one last time prior to your demo to
help catch any last-minute problems.
- Your assignment solution may be tested with very large data files.
- Some coding hints:
- To see if a character in a variable
foo
is alphabetic,
use foo.isalpha()
.
- To convert a character in a variable
foo
to uppercase,
use foo.upper()
.
- If you want to make your output prettier, the character
'\t'
is a tab character.
- The
abs()
function returns the absolute value.
- The character frequency list does not have to be sorted in
any particular order.
You must do two things:
- Hand in a printout of your solution to the assignment boxes by
10am Monday, 22 February 2010. Make sure it has your name
and/or student ID number on it, and double-check to ensure
you're using the correct assignment box! The assignment boxes are
on the second floor of the Math Science building. Note that
you may turn your printout into the box early.
- Demonstrate your solution to your TA in tutorials during the
week of February 22. To keep things fair,
the solution you demonstrate must
be the same as what you handed in on the printout, or you will
not be given credit for the demo. Your TA may ask you
questions about what the different parts of your
solution do. Your TA may also supply different test inputs.
If you do not hand in a printout or if you do not demonstrate your
assignment during tutorial time, or both, you cannot be given a grade
above a zero on this assignment.
Your printout must show your Python program.
Note that you must have a program in a
.py file - you cannot turn in a solution that only
uses the Python command line.
Your solution must be demonstrated using your account on the CPSC
machines.
Tutorials during the week of February 22 are allocated for demos.
Your TA is not obliged to see demos outside this time; they have their
own schoolwork to do!
The TA has the right to assign a mark of zero for the entire assignment
if you fail the demo.
Half of the marks for this assignment are for functionality (i.e., the
demo) and half are for your solution (as shown on the printout). Note
that, in keeping with the University's assessment criteria, simply
having a working solution does not automatically mean that
you get full marks. Solutions that show a greater degree of
sophistication or that involve bonuses as described above may receive
higher marks. Part of the solution marks will be assigned for
documentation, like appropriate variable names and comments.