How can I count letters in a text file?

Hey, I want a Perl script that reads a file and sends me the number of occurrences of each letter of the alphabet in that file... Could you please help me?

No, I can't. In fact, I don't really like being "Dave's Homework Helper Site" because the point of homework is for you to practice the requisite skills so you can get better at whatever you're studying. (Why yes, I am a teacher.)

On the other hand... :-) This is an interesting little puzzle, so what I will do is show how to write a quick, short Bourne Shell script for Linux (or Mac OS X if you crack open your Terminal.app program) that can do what you seek.

The key idea is that if you could transform the input to be one-character-per-line, it'd be unreadable for humans, but really, really easy for a computer program to sort and tally. How do you do that? With one of what I call the unsung heroes of the Unix command line, the "fold" command.

Generally, people use fold to wrap overly long lines in text files (it's great for processing info prior to printing it, for example) but as with all great Unix command line utilities, it has parameters that let you change its behavior. And that's just what we'll do. Try this yourself:

    $ date | fold -w4
    Wed
    Nov
    26 0
    9:56
    :48
    MST
    2008

That's with width=4. Turn it into "-w1" and each and every character is on its own line. (I won't reproduce it here because it's crazy long and you get the idea anyway, I hope!)

Now that each character is on its own line, it's simple to run the output through "sort" so that identical characters end up next to each other. Tallying matching lines turns out to be a feature of another of the unsung heroes, "uniq". Check its man page and you'll see an option, "-c", that precedes each output line with a count of the number of times it occurred.
That's what we want, the "-c" flag. Now, to put them together:

    $ date | fold -w1 | sort | uniq -c
       5
       4 0
       2 1
       3 2
       1 5
       1 6
       1 8
       2 :
       1 M
       1 N
       1 S
       1 T
       1 W
       1 d
       1 e
       1 o
       1 v

Uh oh, there's a problem: we really want to count upper case and lower case together, so we'll need to slip one more transform into the pipeline, and we'll do it with mnemonic set names, which you might not have seen before:

    tr '[:upper:]' '[:lower:]'

Put this into the pipeline and give it something more interesting to chew on (the "bash" man page) and here's what we get. The first two counts belong to characters you can't see in the output:

    $ cat sample-input.txt | fold -w1 | \
        tr '[:upper:]' '[:lower:]' | sort | uniq -c
      13
     570
       3 !
       3 $
      19 '
       1 (
       1 )
      33 ,
       4 -
      19 .
       4 0
       1 1
       4 2
       1 3
       1 4
       2 5
       1 7
       2 8
       8 9
      10 :
       5 ?
     184 a
      43 b
      77 c
     102 d
     304 e
      49 f
      42 g
     105 h
     164 i
       1 j
      25 k
     122 l
      67 m
     165 n
     257 o
      55 p
       3 q
     141 r
     173 s
     259 t
      95 u
      40 v
      40 w
       3 x
      93 y
       3 z

This is a bit hard to read, so let's add one more step, a "grep" at the end, so we can isolate just the vowels and keep our output readable:

    $ cat sample-input.txt | fold -w1 | tr '[:upper:]' '[:lower:]' | \
        sort | uniq -c | grep -E '(a|e|i|o|u)'
     184 a
     304 e
     164 i
     257 o
      95 u

There ya go. You can easily drop this sequence of Unix / Linux commands into a shell script and use whatever file or files are specified as the input.

Notice here that, as I'd expect, "E" is the most common vowel, followed by "O", "A", "I" and, way down the list, "U". In fact, when I feed a much larger corpus of work to this script, here's what I end up with: a (8,621), e (12,419), i (7,497), o (9,195) and u (3,660). Again, "E" is the most common, followed by "O" and fairly closely by "A".

If you ever play hangman, it's very useful to know letter frequencies!
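One possible way to package the pipeline as a script is sketched below. The function name count_letters is made up for illustration, and the sketch assumes you only care about letters, so it adds a grep that drops everything else and a final "sort -rn" that lists the most frequent letters first. Neither of those extra steps is part of the pipeline shown above.

```shell
#!/bin/sh
# count_letters FILE...   (hypothetical name, not from the article)
# Prints "count letter" lines, most frequent letter first.
# With no FILE arguments it reads standard input instead.
count_letters() {
    cat -- "$@" |                       # concatenate all the input files
        fold -w1 |                      # one character per line
        tr '[:upper:]' '[:lower:]' |    # count "E" and "e" together
        grep '[[:alpha:]]' |            # assumption: keep letters only
        sort | uniq -c |                # tally identical lines
        sort -rn                        # biggest counts first
}
```

To use it as a standalone script, you could add a last line that calls count_letters "$@" and then run it with something like "sh countletters.sh sample-input.txt" (the script name is just an example).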
Categorized: Shell Script Programming (Article 8630, written by Dave Taylor)
Tagged: games, hangman, perl, shell script programming
Reader Comments To Date: 1
This is a great article. I was looking for this information so that I could write a script that would help me cheat at hangman. This was exactly the info I needed.
I used it as a part of a script that can, based on user input, figure out the most likely words in a game of hangman (by parsing /usr/share/dict/words), and tell you what the best letter to guess next would be.
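A hangman helper along those lines might look roughly like the sketch below. This is not the commenter's actual script, which isn't shown here. The function name best_guess, its three arguments, and the use of "." for unknown letters are all assumptions made for illustration. It also takes a shortcut: it counts every occurrence of a letter across the matching words, rather than counting each letter once per word.

```shell
#!/bin/sh
# best_guess PATTERN GUESSED WORDLIST   (all names are hypothetical)
#   PATTERN  - the board, with "." for each unknown letter, e.g. "h.ll."
#   GUESSED  - letters already tried, e.g. "hl" (may be empty)
#   WORDLIST - one word per line, e.g. /usr/share/dict/words
best_guess() {
    pattern=$1
    guessed=$2
    wordlist=$3
    grep -ix -- "$pattern" "$wordlist" |   # whole-line, case-insensitive match
        tr A-Z a-z |
        fold -w1 |                         # one character per line
        grep '[a-z]' |                     # drop apostrophes and the like
        grep -v "[${guessed:-0}]" |        # drop letters already guessed
        sort | uniq -c | sort -rn |        # tally, most frequent first
        head -n 1 | awk '{print $2}'       # print just the top letter
}
```

Calling best_guess 'h.ll.' hl /usr/share/dict/words would print the letter that appears most often in the remaining candidate words, assuming that word list exists on your system.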
However, I noticed a few inefficiencies in what you wrote, especially a useless use of cat.
"cat sample-input.txt | fold -w1" can be rewritten as "<sample-input.txt fold -w1" which gets rid of an unnecessary process.
Also, unless you're using non-English text, "tr '[:upper:]' '[:lower:]'" can be more quickly written as "tr A-Z a-z". I think it looks a little nicer, although it doesn't have any practical benefit, and as I said, would mostly just work for English text, since it wouldn't include accented characters, or non-Latin letters.
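Putting both of those suggestions together, the vowel-counting pipeline from the article could be written as in this sketch. The wrapper function count_vowels is a made-up name for illustration, and as noted above, the "tr A-Z a-z" shorthand assumes plain English text.

```shell
#!/bin/sh
# count_vowels FILE   (hypothetical wrapper around the article's pipeline)
count_vowels() {
    # The shell opens FILE for fold directly, so no cat process is needed.
    <"$1" fold -w1 |
        tr A-Z a-z |                    # English-only shortcut for the class names
        sort | uniq -c |
        grep -E '(a|e|i|o|u)'
}
```

For example, count_vowels sample-input.txt would produce the same five vowel lines as the version in the article.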