Single Letter Frequencies in English
Every time I read a paper that discusses the frequencies of single letters in English, I feel like I should sit down and calculate them myself from a sample of English text. Today, I finally did. Here are the probabilities and negative log probabilities of the letters in English, computed over the corpus of Shakespeare’s plays:
And, for those who care, here’s the code to generate the data from the plays, which I downloaded from Project Gutenberg:
# Start every letter's count at zero so that it can be incremented later.
def initialize_letter_counts(letter_counts)
  ('a'..'z').each do |chr|
    letter_counts[chr] = 0
  end
end
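# Add the count of each letter in the file (case-insensitive) to letter_counts.
def parse_file(filename, letter_counts)
  # The block form of File.open closes the file when the block finishes.
  # Binary mode ('rb') reads one byte at a time, so non-ASCII bytes are
  # skipped instead of raising an encoding error.
  File.open(filename, 'rb') do |f|
    f.each_char do |char|
      char = char.downcase
      if char.match(/[a-z]/)
        letter_counts[char] = letter_counts[char] + 1
      end
    end
  end
  nil
end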
directory = '/Users/johnmyleswhite/Princeton/Research/Letter Frequency'
Dir.chdir(directory)
letter_counts = {}
initialize_letter_counts(letter_counts)
Dir.new('Data').entries.each do |entry|
  if entry.match(/\.txt$/)
    entry = File.expand_path(entry, directory + '/Data')
    parse_file(entry, letter_counts)
  end
end
letter_counts.keys.sort.each do |key|
  puts "'#{key}',#{letter_counts[key]}"
end
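The script prints raw counts, while the numbers discussed at the top of the post are probabilities and negative log probabilities. Here is a minimal sketch of that last step. The helper names letter_probabilities and negative_log_probabilities are my own for illustration, the counts are toy values rather than anything from the Shakespeare corpus, and I am assuming the natural log, since the base is a matter of convention (swap in Math.log2 if you want bits).

```ruby
# Turn a hash of raw letter counts into a hash of probabilities.
def letter_probabilities(letter_counts)
  total = letter_counts.values.sum.to_f
  letter_counts.each_with_object({}) do |(letter, count), probs|
    probs[letter] = count / total
  end
end

# Turn a hash of probabilities into a hash of negative log probabilities.
# Assumption: natural log. Use Math.log2 instead to measure in bits.
def negative_log_probabilities(probabilities)
  probabilities.each_with_object({}) do |(letter, prob), result|
    result[letter] = -Math.log(prob)
  end
end

# Toy counts for illustration only; not taken from the Shakespeare corpus.
toy_counts = { 'a' => 2, 'b' => 1, 'c' => 1 }
toy_probs = letter_probabilities(toy_counts)
toy_neg_logs = negative_log_probabilities(toy_probs)

toy_probs.keys.sort.each do |letter|
  puts "'#{letter}',#{toy_probs[letter]},#{toy_neg_logs[letter]}"
end
```

To run this on the real data, pass the letter_counts hash built by the script in place of toy_counts.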