Saturday, 22 May 2010

Python text document stats script

I spent the last three weeks desperately refreshing my knowledge of Python, the easy-peasy scripting language I once used in my youth to test CD players and hi-fi RS232 commands. I re-acquired Dive Into Python and Python For Dummies and ploughed through them.

Sadly, I didn't get the job I was learning it for, but the effort did rekindle my enthusiasm for learning programming languages. So I've acquired Learning Perl, and when that is done I'll be trying my hand at Ruby, just like the cool kids use.

I wrote me a wee program in Python that analyzes the text file of my 2001 novel Shag Times and comes up with various statistics for it.

I plan to re-write the same program in Perl as soon as I've finished reading Learning Perl, and then write the same program again in Ruby if that's even possible.
### Program to do the following
###  * open shagtimes.txt
###  * provide a word count
###  * count unique words        
###  * provide top ten most popular words
###  * provide all single occurrence words
###  * calculate average word length
###  * find longest word         

import textwrap

# Opening Shag Times and processing it a bit
book = open('shagtimes.txt')
book = book.read()
book = book.lower()
### Code to remove punctuation
stuff_to_replace_with_space = (".", ",", "?", "/", "=", "-", ";", ":")
stuff_to_remove = ("\'", "(", ")", "\"")
for item in stuff_to_replace_with_space:
    book = book.replace(item, " ")
for item in stuff_to_remove:
    book = book.replace(item, "")
print "======================================="

# Doing the word count
book = book.split()
wordcount = len(book)
print "The document contains %i words in total" % wordcount
print "======================================="

# Doing the unique count
uniques = []
for item in book:
    if item not in uniques:
        uniques.append(item)
uniquecount = len(uniques)
print "The document contains %i unique words" % uniquecount
print "======================================="
uniques.sort()

# Finding top ten popular words
print "The top ten most used words:-"
occurancelist = []
for item in uniques:
    occurances = book.count(item)
    occurancelist.append((occurances, item))
occurancelist.sort()
occurancelist.reverse()
for item in occurancelist[:10]:
    print item
print "======================================="

# Finding single use words
print "Words that were used only once:-"
singleuse = []
singleusecount = 0
for item in occurancelist:
    c, w = item
    if c == 1:
        singleuse.append(w)
        singleusecount+=1
# occurancelist is in descending order, so reverse it to list the words alphabetically
singles = ", ".join(reversed(singleuse))
singles = textwrap.wrap(singles, width=70)
for i in singles:
    print i
print "======================================="
print "A total of %i words were used only once" % singleusecount
print "======================================="

# Finding average word length
chartotal = 0.000
for item in book:
    chartotal = chartotal + len(item)
avechar = chartotal/wordcount
print "The average word length was %.3f letters long" % avechar
print "======================================="

# Finding longest word
longlength = 0
for item in uniques:
    if len(item) > longlength:
        longlength = len(item)
print "The longest word was %i letters long" % longlength
print "These words were that long:-"
for item in uniques:
    if len(item) == longlength:
        print item
print "======================================="
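
For comparison, here is a minimal sketch of the same statistics gathered in a single pass with collections.Counter and a regular expression. It is not the script above: the text_stats name, the top_n parameter and the tokenising rule are all assumptions made for illustration, and the rule treats every non-alphanumeric character as a word break, so its counts may differ slightly from the replace() approach. It avoids print so that it should run on Python 2.7 and Python 3 alike.

```python
import re
from collections import Counter


def text_stats(text, top_n=10):
    """Return word statistics for a string (hypothetical helper)."""
    # Assumption: drop apostrophes, then treat any run of letters or
    # digits as a word. The script above strips a fixed punctuation list.
    cleaned = text.lower().replace("'", "")
    words = re.findall(r"[a-z0-9]+", cleaned)

    # Counter tallies every word in one pass, instead of calling
    # list.count() once per unique word.
    counts = Counter(words)

    longest = max(len(w) for w in counts) if counts else 0
    average = sum(len(w) for w in words) / float(len(words)) if words else 0.0

    return {
        "total": len(words),
        "unique": len(counts),
        "top": counts.most_common(top_n),
        "singles": sorted(w for w, c in counts.items() if c == 1),
        "average_length": average,
        "longest": sorted(w for w in counts if len(w) == longest),
    }
```

Calling text_stats(open('shagtimes.txt').read()) would return a dictionary holding the total, the unique count, the top ten, the single-use words, the average length and the longest words, ready to be printed however you like.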

Can anyone recommend a good book for learning to program in Ruby?
