Industry guru Dave Taylor offers free tech support on a wide variety of technical and business topics, including HTML, Apple iPhone, online advertising, Cascading Style Sheets, Web design, management, Unix, Linux, search engine optimization, online dating, Mac OS X, shell script programming and Microsoft Windows.

How can I count letters in a text file?

Hey I want a Perl script that reads a file and sends me the number of occurrences of the alphabets in that file... Could you please help me?


Dave's Answer:

No, I can't. In fact, I don't really like being "Dave's Homework Helper Site" because the point of homework is for you to practice the requisite skills so you can get better at whatever you're studying. (Why yes, I am a teacher.)

On the other hand... :-)

This is an interesting little puzzle, so what I will do is show how to write a quick, short Bourne Shell script for Linux (or Mac OS X, if you crack open your Terminal.app program) that can do what you seek.

The key idea is that if you could transform the input to be one-character-per-line, it'd be unreadable for humans, but would make it really, really easy to sort and tally for a computer program.

How do you do that? With one of what I call the unsung heroes of the Unix command line, the "fold" command. Generally, people use fold to wrap overly long lines in text files (it's great for processing info prior to printing it, for example) but as with all great Unix command line utilities, it has parameters that let you change its behavior. And that's just what we'll do.

Try this yourself:

$ date | fold -w4
Wed 
Nov 
26 0
9:56
:48 
MST 
2008

That's with width=4. Turn it into "-w1" and each and every character is on its own line. (I won't reproduce it here because it's crazy long and you get the idea anyway, I hope!)
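If you'd like to see the one-character-per-line effect without the crazy long output, here's a minimal sketch on a three-letter word. The function name split_chars is my own invention for illustration; the only real work is done by "fold -w1".

```shell
# split_chars is a hypothetical helper name, not a standard command.
# It reads standard input and writes one character per output line.
split_chars() {
    fold -w1
}

# The word "cab" comes out as three lines: c, a, b.
printf 'cab\n' | split_chars
```

This assumes plain ASCII text. How fold treats multi-byte characters varies between implementations, so test it on your own system before trusting it with accented letters.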

Now that each character is on its own line, it's simple to put the output in alphabetical order with "sort". Sorting matters because the next step only tallies identical lines that sit right next to each other. That tallying turns out to be a feature of another of the unsung heroes, "uniq". Check its man page and you'll see:

-c    Precede each output line with the count of the number of times the line occurred in the input, followed by a single space.

That's what we want, the "-c" flag. Now, to put them together:

$ date | fold -w1 | sort | uniq -c
   5  
   4 0
   2 1
   3 2
   1 5
   1 6
   1 8
   2 :
   1 M
   1 N
   1 S
   1 T
   1 W
   1 d
   1 e
   1 o
   1 v
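Why bother with "sort" in the middle? Because "uniq" only merges duplicate lines that are adjacent. Here's a small sketch that shows the difference on three lines of input. The function name tally is just a label I'm using for illustration.

```shell
# Without sorting, the two "b" lines are not adjacent, so uniq reports
# three separate lines, each with a count of 1.
printf 'b\na\nb\n' | uniq -c

# tally is a hypothetical helper name. Sorting first groups identical
# lines together, so uniq -c can count each distinct line once.
tally() {
    sort | uniq -c
}

# Now "a" is reported once and "b" twice.
printf 'b\na\nb\n' | tally
```

The tally of the date output above is trustworthy for the same reason: every character passed through "sort" before "uniq -c" saw it.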

Uh oh, there's a problem: we really want to count upper case and lower case together, so we'll need to slip one more transform into the pipeline. We'll do it with "tr" and its mnemonic character class names, which you might not have seen before:

tr '[:upper:]' '[:lower:]'

Put this into the pipeline and give it something more interesting to chew on (the "bash" man page) and here's what we get:

$ cat sample-input.txt | fold -w1 | \
     tr '[:upper:]' '[:lower:]' | sort | uniq -c
  13 
 570  
   3 !
   3 $
  19 '
   1 (
   1 )
  33 ,
   4 -
  19 .
   4 0
   1 1
   4 2
   1 3
   1 4
   2 5
   1 7
   2 8
   8 9
  10 :
   5 ?
 184 a
  43 b
  77 c
 102 d
 304 e
  49 f
  42 g
 105 h
 164 i
   1 j
  25 k
 122 l
  67 m
 165 n
 257 o
  55 p
   3 q
 141 r
 173 s
 259 t
  95 u
  40 v
  40 w
   3 x
  93 y
   3 z

This is a bit hard to read, so let's add a "grep" at the end to isolate just the vowels and keep the output readable:

$ cat sample-input.txt | fold -w1 | tr '[:upper:]' '[:lower:]' | \
     sort | uniq -c | grep -E '(a|e|i|o|u)'
 184 a
 304 e
 164 i
 257 o
  95 u

There ya go. You can easily drop this sequence of Unix / Linux commands into a shell script and use whatever file or files are specified as the input.
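Here's one way that script might look. Treat it as a sketch, not the definitive version: the function name count_letters is made up, and I've swapped the vowel filter for a "grep" on the [[:alpha:]] character class so that every letter is tallied and the digits and punctuation are dropped. That swap is my assumption about what the original question wanted.

```shell
#!/bin/sh
# count_letters is a hypothetical name for a wrapper around the pipeline.
# It tallies letters from the named files, or from standard input when
# called with no arguments.
count_letters() {
    # cat joins the named files, or passes standard input through.
    cat "$@" |
        fold -w1 |                      # one character per line
        tr '[:upper:]' '[:lower:]' |    # count "E" and "e" together
        grep '[[:alpha:]]' |            # assumption: keep letters only
        sort |                          # group identical letters
        uniq -c                         # tally each group
}

# Example run on a short string instead of a file.
printf 'Hello, World\n' | count_letters
```

To use it on files, call the function with file names instead, as in count_letters chapter1.txt chapter2.txt (both file names are placeholders).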

Notice that, as I expected, "E" is the most common vowel, followed by "O", "A", "I" and, way down the list, "U". In fact, when I feed a much larger corpus of work to this script, here's what I end up with: a (8,621), e (12,419), i (7,497), o (9,195) and u (3,660). Again, "E" is the most common, followed by "O" and fairly closely by "A". If you ever play hangman, it's very useful to know letter frequencies!
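If you'd rather have the computer do the ranking than read it off the counts, a second "sort" at the end of the pipeline can order the tallies from most to least common. This is a sketch under the same assumptions as before: rank_letters is a made-up name, and the [[:alpha:]] filter is my addition to keep letters only.

```shell
# rank_letters is a hypothetical helper name. It tallies letters from
# standard input, then sorts the tallies numerically, largest first.
rank_letters() {
    fold -w1 |
        tr '[:upper:]' '[:lower:]' |
        grep '[[:alpha:]]' |    # assumption: letters only
        sort |
        uniq -c |
        sort -rn                # -n compares counts as numbers, -r reverses
}

# For "banana" the order is a (3), then n (2), then b (1).
printf 'banana\n' | rank_letters
```

Letters with identical counts can come out in either order depending on your version of sort, so don't read anything into how ties are arranged.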



© 2002 - 2010 by Dave Taylor. All Rights Reserved.
