Ask Dave Taylor

How do I analyze word use in a document or book?

December 5, 2008 / Dave Taylor / Linux Shell Script Programming / 8 Comments

I’ve got a bet with my husband and you need to help us settle it. I was reading Jane Austen’s “Pride and Prejudice” and he said “I bet you $20 that ‘pride’ shows up more than ‘prejudice’ in that book.” I think he’s wrong, but have no idea how to figure it out. Can you help?

Holy cow, that’s really what you two bet on? Word frequency in classic literature? Wow. I mean, wow.
Ah, well, be that as it may, you’re in luck: the book you’re talking about is part of the brilliant Project Gutenberg library of out-of-copyright literature, so we can download a digital copy for analysis. In fact, you can too. Just download “Pride and Prejudice” from Project Gutenberg and you’ll be looking at the text of the book in its entirety.
Analyzing it is a bit trickier, so I’ll do the heavy lifting for you. It’s darn hard to do on a Windows computer but quite straightforward if you can pop up a command line, which I can do with the Terminal.app program included with Mac OS X (or on any Linux system, of course).
Armed with our command line, the first thing I need to do is break down the document into separate words and convert upper case to lower case. This is surprisingly easy to do on the command line:

cat pride-prejudice.txt | tr '[:upper:]' '[:lower:]' | tr ' ' '\
> '

The “cat” command outputs the contents of the file, the first “tr” translates uppercase to lowercase, and the second “tr” converts every space to an end-of-line return, which is what the line break inside the quotes is for. The “> ” at the start of the second line is just the shell’s continuation prompt, not something you type. If you actually typed the command in, it would display the entire book on your screen, one word per line. ZOOOM!
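If you’d like to see what those two “tr” steps do without scrolling through an entire novel, here’s a minimal sketch that runs them on one sample sentence. The sentence and the helper name split_words are made up for illustration, and the sketch writes the line break as '\n' rather than pressing Return inside the quotes, on the assumption that your version of “tr” understands that escape (the ones on Linux and Mac OS X should).

```shell
# split_words is a hypothetical helper name, not a standard command.
split_words() {
  tr '[:upper:]' '[:lower:]' |  # fold everything to lowercase
    tr ' ' '\n'                 # turn each space into a newline
}

# Made-up sample text standing in for the book.
printf 'Pride goes before a fall\n' | split_words
```

That prints the five words of the sample sentence in lowercase, one per line.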
To do something with that output, let’s feed it to the oddly-named “grep” command to pull out just the lines that contain “pride”:

cat pride-prejudice.txt | tr '[:upper:]' '[:lower:]' | tr ' ' '\
> ' | grep pride

You could now count the matches by hand, but since we’re on the command line, we might as well make our lives easier and use “wc” to count the number of lines output. Put it all together and it looks like this:

$ cat pride-prejudice.txt | tr '[:upper:]' '[:lower:]' | tr ' ' '\
' | grep pride | wc -l

49

So there you go. The word “pride” appears in the book Pride and Prejudice 49 times. How about “prejudice”?

$ cat pride-prejudice.txt | tr '[:upper:]' '[:lower:]' | tr ' ' '\
' | grep prejudice | wc -l

9

Okay, there’s the answer to your bet. “Pride” occurs 49 times in the book, while “prejudice” only appears 9 times. That means, sorry to report, that he’s right and you’re wrong.
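One caveat before any money changes hands: “grep pride” counts every line that contains those letters, so a word like “prided” gets counted too, and the Project Gutenberg file likely includes some title and license text around the novel itself. If you want to count only exact, whole words, here’s a rough sketch. The name count_word is my own invention, and splitting on anything that isn’t a letter is an assumption about what should count as a word. I haven’t checked how much it changes the totals above.

```shell
# count_word is a hypothetical helper name, not a standard command.
# Usage: count_word WORD FILE
count_word() {
  tr '[:upper:]' '[:lower:]' < "$2" |  # fold everything to lowercase
    tr -cs '[:alpha:]' '\n' |          # each run of non-letters becomes one newline
    grep -c -x -F -- "$1"              # count lines that are exactly WORD
}
```

You’d run it once per word, giving the word first and then the file name, for example count_word pride pride-prejudice.txt and then again with prejudice.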
Just for fun, since we have the book accessible, let’s pop over to my readability site and find out some more stats on Jane’s book…

readability grades:
  Kincaid: 10.6
  ARI: 12.2
  Coleman-Liau: 10.0
  Flesch Index: 65.4
  Fog Index: 13.9
  Lix: 45.3 = school year 8
  SMOG-Grading: 11.3
sentence info:
  547423 characters
  124991 words, average length 4.38 characters = 1.36 syllables
  4805 sentences, average length 26.0 words
  54% (2602) short sentences (at most 21 words)
  23% (1150) long sentences (at least 36 words)
  881 paragraphs, average length 5.5 sentences
  4% (200) questions
  66% (3173) passive sentences
  longest sent 179 wds at sent 2299; shortest sent 1 wds at sent 107
word usage:
  verb types:
    to be (5550) auxiliary (2913)
  types as % of total:
    conjunctions 6(7985) pronouns 15(18711) prepositions 12(15203)
    nominalizations 2(2220)
sentence beginnings:
  pronoun (1961) interrogative pronoun (208) article (282)
  subordinating conjunction (194) conjunction (385) preposition (275)

Wow, way more than you probably wanted to know, but notice that it’s rated as approximately 10th grade reading, with 4805 sentences that average 26 words per sentence.
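If you’re curious where numbers like the word count and average word length come from, here’s a rough command-line sketch that computes both. The name word_stats is made up, and it treats any run of letters as a word, which is almost certainly not how the readability tool counts, so don’t expect the results to match the report exactly.

```shell
# word_stats is a hypothetical helper name, not a standard command.
# Usage: word_stats FILE
word_stats() {
  tr -cs '[:alpha:]' '\n' < "$1" |  # one run of letters per line
    awk 'length($0) > 0 { words++; chars += length($0) }  # skip empty lines
         END { if (words) printf "%d words, average length %.2f characters\n", words, chars / words }'
}
```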
Good luck on your future gambles with your husband too!


8 comments on “How do I analyze word use in a document or book?”

  1. Antriksh Pany says:
    March 5, 2010 at 8:00 pm

    Another way could be:
    $ grep -o -i pride pride-prejudice.txt | wc -l
    Also, I think it might be better to pass '-w' to grep too, just in case other words contain the string “pride”.

  2. Sujay Kumar says:
    December 3, 2009 at 8:54 am

    Hi Dave,
    Hope this will be an easy way.
    $ cat pride | tr ' ' '\n' | grep -i -c pride

  3. Howard Hong says:
    October 31, 2009 at 1:32 pm

    Oops sorry, that previous command should be
    $ cat pride-prejudice.txt | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | grep -e pride -e prejudice | sort | uniq -c

  4. Howard Hong says:
    October 30, 2009 at 3:40 pm

    Hi Dave,
    Here’s a slight optimization. Instead of running the commands twice, you can use the grep -e option and the uniq command and get both results in a one-liner.
    $ cat pride-prejudice.txt | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | grep -e pride -e prejudice | uniq -c

  5. John Rocha says:
    December 19, 2008 at 12:59 am

    Hi Dave
    I’m an applied linguist as well as a photographer.
    If you want to find all the words in a text (usually pure txt) then you can use a concordancer. Press the button and you’ll get a list of all the words and lots of other info.
    Here’s a chunk from Wikipedia. Lots of concordancers are free like AntConc
    Hope this helps
    Cheers John
    Concordancers are also used in corpus linguistics to retrieve alphabetically or otherwise sorted lists of linguistic data from the corpus in question, which the corpus linguist then analyzes. Some concordancers used in corpus linguistics are AntConc (Freeware), ApSIC Xbench, WordSmith, MonoConc, GlossaNet, and CorpusEye.

  6. Dave Taylor says:
    December 9, 2008 at 4:06 pm

    Hi Dan. I bet you’ll read this answer! 🙂
    It is true that each major release of Mac OS X has included a different version of NetBSD / Darwin / whatever-you-want-to-call-it, the underlying Unix operating system.
    You need to be careful with what you type in too. Yes '\r' will work, but '\ >' won’t work: you need a single character and you’re showing two, one that’s a protected space (“\ ”) and one that’s the right angle bracket. However, this will work:
    tr 'x' '
    '
    where you literally open the quoted passage then press the Return key on your keyboard, then close the quote.

  7. Dan says:
    December 7, 2008 at 10:52 am

    Dave, I think you are easily duped into doing programming homework when it is presented to you as a “bet.” 🙂
    I tried to follow along on my own command line and had some difficulty. On my MacOSX the character for carriage return in the tr command is '\r' instead of '\ > '. It took me several minutes of squinting at the man files to figure this out.
    I am wondering why little things like this are different from one computer to the next.

  8. Glen Turpin says:
    December 5, 2008 at 9:52 am

    Dave, this is a great example of the power of the command line, but mere mortals can achieve the same result without having to learn unix-fu.
    Just open up the document in a word processor and find/replace “pride” and “prejudice” with other words. The find/replace dialog box in most word processing applications will give you the option to ignore case and will tell you how many instances have been changed.

© 2022 by Dave Taylor. "Ask Dave Taylor®" is a registered trademark of Intuitive Systems, LLC.