I’ve got a bet with my husband and you need to help us settle it. I was reading Jane Austen’s “Pride and Prejudice” and he said, “I bet you $20 that ‘pride’ shows up more than ‘prejudice’ in that book.” I think he’s wrong, but I have no idea how to figure it out. Can you help?
Holy cow, that’s really what you two bet on? Word frequency in classic literature? Wow. I mean, wow.
Ah, well, be that as it may, you’re in luck because the book you’re talking about appears as part of the brilliant Project Gutenberg library of out-of-copyright literature, so we can download a digital copy of the book for analysis. In fact, you can too, just click on download “Pride and Prejudice” and you’ll be looking at the text of the book in its entirety.
Analyzing it is a bit trickier, so I’ll do the heavy lifting for you: it’s darn hard to do on a Windows computer but quite straightforward if you can pop up a command window a la Linux, which I can do with the Terminal.app program included with Mac OS X (or on any Linux system, of course).
Armed with our command line, the first thing I need to do is break down the document into separate words and convert upper case to lower case. This is surprisingly easy to do on the command line:
The “cat” command displays the contents of a file, the first “tr” translates uppercase to lowercase, and the second “tr” converts all spaces to end-of-line returns. If you actually typed that command in, it would display the entire book on your screen, one word per line. ZOOOM!
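If you want to try that step yourself, here’s a minimal sketch of a pipeline that matches the description. The helper name split_words, the file name sample.txt and its two made-up lines are my own stand-ins so the example runs by itself; swap in your downloaded copy of the book.

```shell
# Hypothetical helper: print a text file as one lowercase word per line
split_words() {
  cat "$1" | tr '[:upper:]' '[:lower:]' | tr ' ' '\n'
}

# Tiny made-up stand-in for the downloaded book
printf 'Pride goes before\nthe PRIDE of prejudice\n' > sample.txt

split_words sample.txt
```

On the sample file this prints seven lines, starting with “pride” and ending with “prejudice”. Punctuation stays glued to the words, since only spaces are converted.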
To do something with that output, let’s feed it to the oddly-named “grep” command, which keeps only the lines that contain “pride”. Since there’s now one word per line, that leaves just the matching words from the book source:
> ' | grep pride
You could now count the matches by hand, but since we’re on the command line, we might as well make our lives easier and use “wc” to count the number of lines output. Put it all together and it looks like this:
' | grep pride | wc -l
So there you go. The word “pride” appears in the book Pride and Prejudice 49 times. How about “prejudice”?
' | sort | grep prejudice | wc -l
Okay, there’s the answer to your bet. “Pride” occurs 49 times in the book, while “prejudice” only appears 9 times. That means, sorry to report, that he’s right and you’re wrong.
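One wrinkle worth knowing: “grep pride” matches any line containing those letters, so a word like “prided” gets counted too. If you’d rather count exact words only, here’s a sketch of one way to do it. The helper name count_word and the sample text are assumptions of mine, not part of the commands above.

```shell
# Hypothetical helper: count exact occurrences of one word in a text file
# usage: count_word FILE WORD
count_word() {
  tr '[:upper:]' '[:lower:]' < "$1" |
    tr -cs '[:alpha:]' '\n' |   # every run of non-letters becomes one newline
    grep -c -x "$2"             # -x: the whole line must match, -c: print the count
}

# Made-up sample text, not taken from the book
printf 'Her pride, and his Pride; she prided herself.\nNo prejudice here.\n' > sample2.txt

count_word sample2.txt pride
count_word sample2.txt prejudice
```

For the sample text this prints 2 and then 1: “prided” is skipped, and the comma and semicolon no longer get in the way. Because it only counts exact matches, its totals on the full book could come out a little different from the ones above.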
Just for fun, since we have the book accessible, let’s pop over to my readability site and find out some more stats on Jane’s book…
readability grades:
    Kincaid: 10.6
    ARI: 12.2
    Coleman-Liau: 10.0
    Flesch Index: 65.4
    Fog Index: 13.9
    Lix: 45.3 = school year 8
    SMOG-Grading: 11.3
sentence info:
    547423 characters
    124991 words, average length 4.38 characters = 1.36 syllables
    4805 sentences, average length 26.0 words
    54% (2602) short sentences (at most 21 words)
    23% (1150) long sentences (at least 36 words)
    881 paragraphs, average length 5.5 sentences
    4% (200) questions
    66% (3173) passive sentences
    longest sent 179 wds at sent 2299; shortest sent 1 wds at sent 107
word usage:
    verb types:
        to be (5550) auxiliary (2913)
    types as % of total:
        conjunctions 6 (7985) pronouns 15 (18711) prepositions 12 (15203) nominalizations 2 (2220)
sentence beginnings:
    pronoun (1961) interrogative pronoun (208) article (282)
    subordinating conjunction (194) conjunction (385) preposition (275)
Wow, way more than you probably wanted to know, but notice that it’s rated at approximately a 10th grade reading level, with 4805 sentences that average 26 words apiece.
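If you’re curious where a couple of those grades come from, the Kincaid and Flesch numbers can be rebuilt from the averages in the report. This sketch assumes the standard published Flesch Reading Ease and Flesch-Kincaid formulas (I’m not claiming that’s exactly how the readability tool computes them), and the helper name flesch_scores is made up for the example.

```shell
# Hypothetical helper: recompute two readability grades from summary counts
# usage: flesch_scores WORDS SENTENCES SYLLABLES_PER_WORD
flesch_scores() {
  awk -v w="$1" -v s="$2" -v spw="$3" 'BEGIN {
    wps = w / s   # average words per sentence
    printf "Words per sentence: %.1f\n", wps
    # Flesch-Kincaid grade level (standard formula, assumed here)
    printf "Kincaid grade: %.1f\n", 0.39 * wps + 11.8 * spw - 15.59
    # Flesch Reading Ease (standard formula, assumed here)
    printf "Flesch index: %.1f\n", 206.835 - 1.015 * wps - 84.6 * spw
  }'
}

# Figures copied from the report: 124991 words, 4805 sentences, 1.36 syllables per word
flesch_scores 124991 4805 1.36
```

With those inputs it prints 26.0 words per sentence, a Kincaid grade of 10.6 and a Flesch index of 65.4, which lines up with the report.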
Good luck on your future gambles with your husband too!