As part of a project I’m working on, I find myself deep in a Linux shell script, needing to have a subroutine that converts a sentence of all lowercase to title case. You know, from “this is a test case” to “This is a Test Case”. Not every word, just the right ones. Doable?
That’s an interesting project to chew on because there are a couple of very different ways to address the problem. That’s pretty typical of Linux shell script programming, since Linux offers so many different commands.
The first thought I had was to use “sed” in a loop, basically doing something like “find [ ][a-z]” and replace it with “[ ][A-Z]”, but that strikes me as both inefficient and a solution that’ll make it just about impossible to skip words that shouldn’t be capitalized.
Instead, the more logical solution seems to be breaking the sentence down into individual words, testing the words against the “skip” words, then fixing each of the remaining words before the fixed sentence is reassembled.
The trickiest part is to fix the first letter of the word, right? Or is it…
In fact, here’s an easy way to break a word down into the first letter and the remaining letters:
firstletter=$(echo "$word" | cut -c1 | tr '[a-z]' '[A-Z]')
otherletters=$(echo "$word" | cut -c2-)
It’s one of my favorite commands, “cut”, and we’re using its ability to chop up what it’s given character-by-character. Then “tr” transliterates lowercase to uppercase.
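To see the two halves glued back together, here is a minimal sketch. The function name `capitalize_word` is made up for illustration, and it uses `printf` and the `[:lower:]`/`[:upper:]` character classes rather than `echo` and letter ranges, which is a substitution on my part and not something from the original answer.

```shell
# Hypothetical helper (name invented for this sketch): capitalize the
# first letter of one word and leave the rest of the word untouched.
capitalize_word() {
  word=$1
  # cut -c1 keeps only the first character; tr uppercases it
  firstletter=$(printf '%s\n' "$word" | cut -c1 | tr '[:lower:]' '[:upper:]')
  # cut -c2- keeps everything from the second character onward
  otherletters=$(printf '%s\n' "$word" | cut -c2-)
  printf '%s%s\n' "$firstletter" "$otherletters"
}

capitalize_word "test"
```

Run as written, this should print “Test”. A one-letter word such as “a” works too, since `cut -c2-` simply returns nothing.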
The main loop is pretty straightforward:
do
per-word code goes here
done
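For anyone who wants to see the loop in runnable form, here is one possible shape. The names `list_words` and `sentence` are assumptions made for this sketch, and the loop body just prints each word so you can see the splitting at work.

```shell
# Hypothetical helper: walk through a sentence one word at a time.
list_words() {
  sentence=$1
  # $sentence is left unquoted on purpose so the shell splits it on
  # whitespace. Caveat: a bare * in the input would also be expanded
  # as a filename pattern.
  for word in $sentence
  do
    # per-word code goes here; this sketch only prints the word
    printf '%s\n' "$word"
  done
}

list_words "this is a test case"
```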
The most interesting part is perhaps how to skip words that shouldn’t be capitalized. After thinking about a couple of possibilities, here’s what I came up with:
case $word in
the|and|an|or|a|of) /bin/echo -n "$word "; continue ;;
esac
The problem? If the first word is one of these stop words, it still needs to be capitalized, so it’s a bit more nuanced: these words should only be skipped if they aren’t the first word that appears in the sentence. Which means we’re going to need to keep track of how many words we’ve scanned…
Easily done, though. Super easy. Just add a conditional around the “case” statement:
The fastest way to keep incrementing the counter variable is to use the shell’s built-in mathematical capabilities:
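Here is a rough sketch of how the counter and the conditional might fit around the “case” statement. The names `mark_words` and `count` are invented for illustration, and instead of fixing the words the sketch only reports what it would do with each one, so the decision logic is easy to check on its own.

```shell
# Hypothetical helper: report, for each word, whether it would be
# capitalized or skipped. Stop words are skipped only when they are
# not the first word of the sentence.
mark_words() {
  count=0
  for word in $1
  do
    # $(( )) is the shell's built-in arithmetic expansion
    count=$(( count + 1 ))
    if [ "$count" -gt 1 ]; then
      case $word in
        the|and|an|or|a|of)
          printf '%s: skip\n' "$word"
          continue
          ;;
      esac
    fi
    printf '%s: capitalize\n' "$word"
  done
}

mark_words "the end of the road"
```

With that input, the first “the” is marked for capitalization while the second “the” and “of” are skipped, which is exactly the nuance described above.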
That’s 95% of things, so I’ll let you put all the pieces together properly to get it to work…
Hmm, I wish I found this before I got into 30 lines of almost-completed SED script!
I agree with you, Dave, that this is, indeed, a fascinating project to chew on! I’ve noticed that there are lots of subtleties that your answer didn’t touch on, though. I’ll outline those subtleties here as I see them, and I’ll also outline how I think a refined shell script that takes them into account should work. I haven’t tried to write a finished script, but there should be more than enough information in what follows for anyone with a little scripting experience to do that.
I also agree with you, Dave, when you said that the easiest way to attack this problem is probably to break the input sentence or text up into individual words, process each word to determine whether to capitalize it (or not), then reassemble all of the processed words back into a sentence, at the end of the shell script.
There are several additional considerations not covered in your answer to the original poster.
I wanted to refresh my memory about what the English language experts say is the proper way to do a title, so I looked it up. I did a little online research, but the rules for title capitalization that I liked the most were those given in the book, “The Chicago Manual of Style,” a well-respected writer’s reference work for such things. Here’s what “Chicago” has to say about this:
“The Chicago Manual of Style,” 14th ed., Sect. 7.127, pp. 282-283 (note that this isn’t the latest edition; I’m guessing this won’t matter for our purposes here; this is merely the edition I happen to have in my personal library):
“In regular title capitalization, also known as headline style, the first and last
words and all nouns, pronouns, adjectives, verbs, adverbs, and subordinating conjunctions
(‘if,’ ‘because,’ ‘as,’ ‘that,’ etc.) are capitalized. Articles (‘a,’ ‘an,’ ‘the’),
coordinating conjunctions (‘and,’ ‘but,’ ‘or,’ ‘for,’ ‘nor’), and prepositions,
regardless of length, are lowercased unless they are the first or last word of the title
or subtitle. The ‘to’ in infinitives is also lowercased. Long titles of works published
in earlier centuries may retain the original capitalization, except that any word in full
capitals should carry only an initial capital. No word in a quoted title should ever be
set in full capitals, regardless of how it appears on the title page of the book itself,
unless it is an acronym, such as WAC, UNICEF, or FORTRAN.”
Again, the above rules are from “Chicago,” and other references might have some differences.
I confess that, when I first read your answer to the Original Poster (OP), I had forgotten that one is supposed to capitalize not only the first word of a title, but also the LAST word, as well. But, when I read the above passage from “Chicago,” something clicked in the cobwebs of my memory: I seem to recall a school teacher or two along the way teaching us to do it this way. I further confess that, when I read the original question, I thought that “This is a Test Case” sure looked to me like it was properly capitalized for use as a title. However, if I’m reading the passage from “Chicago” correctly, the word “is” should also be capitalized, since it’s a verb, and Chicago says to capitalize verbs in titles. If that’s right, then the properly-capitalized title given as an example by the OP would read, “This Is a Test Case.” So, only the word “a,” which is an article, remains uncapitalized, because Chicago says not to capitalize articles.
So, this all gets down to how fussy one wants to be; this should be dictated by the end use of the script. Since the OP didn’t mention his/her end use, I can’t say whether these additional refinements are needed or not, for the specific use by the OP.
It’s a tradeoff between getting the most accurate title capitalization vs. how much code one wants to write (and test) to handle all the possible cases. Too, in a situation where getting accurate capitalization is of crucial importance, the script should be used only as a “time saver;” an actual human still needs to look at every title that’s produced by the script and make the final determination. (For example: What if the title contains a word like “iPad” or “iMac”? Then, the script, even with all the refinements I’m outlining here, would fail. It’d take an input line like “iPad killer apps” and probably come up with something like “IPad Killer Apps”. This is probably not what’s wanted! If I were capitalizing this “by hand,” I’d capitalize it, “iPad Killer Apps”.)
Still, the failure rate with all of the refinements I give here, and with good word list data files, should be very low.
Another problem: Some words can act as different parts of speech, depending on the context. So, just checking each input word against a word list breaks down in this type of situation. Human intelligence is probably the only answer, since the only way to know for sure what part of speech a word is, is by its context and its meaning in that context.
However, we luck out with one common situation: The word “to.” It can be used as a common preposition, as in “to the beach,” or as part of an infinitive, as in “to run.” In both instances, though, the “Chicago” capitalization rules say not to capitalize the “to.” So, even if a shell script were to misclassify the word “to” as a common preposition, when it’s really part of an infinitive, the shell script would still do the right thing. Serendipity is on our side here!
Anyway, my thought was that it still should be possible to do a pretty darned good mechanization of lowercase-to-title-case, following the rules in “Chicago,” by just adding a few straightforward refinements to the answer you’ve already provided to the OP. In addition to checking whether a word from the input is the first or last word in the title, each word in the input title also could be tested against each word in a file containing common articles, another file containing common conjunctions, and yet another file containing common prepositions, to help to determine whether to capitalize the word. It’s not hard to find lists of common parts of speech on the web; a little reformatting and you’ve got yourself a list of, let’s say, verbs, with one verb per line, in a text file. One such web site is yourdictionary.com (look in their “English Grammar Rules & Usage” section), but I’m sure there are other good web sources, too.
For example, you could have a file named, say, “articles.txt” that contains a list of common articles, with one word per line, for easy processing in Unix. My own articles file (so far) contains just these four words:
a
an
some
the
–and so on, for a “conjunctions.txt” file and a “prepositions.txt” file. (The extension “.txt” isn’t needed in Unix for processing a text file, of course, but it sure makes things easier when you want to examine or work on one of those files in a GUI text editor, such as TextEdit on the Mac.)
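As a sketch of how a word might be tested against one of these files, something like the following could work. The function name `in_word_list` is my own invention, and the demonstration builds a throwaway copy of the four-word articles list with `mktemp` (widely available, though not strictly POSIX) rather than assuming a real “articles.txt” exists.

```shell
# Hypothetical helper: succeed if word $1 appears on a line by itself
# in the list file $2.
in_word_list() {
  # -q: quiet, -x: match the whole line, -F: fixed string, not a regex
  grep -qxF -- "$1" "$2"
}

# Throwaway copy of the articles list, one word per line.
listfile=$(mktemp)
printf '%s\n' a an some the > "$listfile"

for word in the theory an
do
  if in_word_list "$word" "$listfile"; then
    printf '%s: listed\n' "$word"
  else
    printf '%s: not listed\n' "$word"
  fi
done

rm -f "$listfile"
```

The `-x` flag matters here: without it, “theory” would match because it contains “the”.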
Additional refinements could include checking each word of input to see if it’s already in all caps. If so, the script could assume that that word is an acronym, and just pass that word through to the output unchanged. This would probably work well enough for most mechanization purposes.
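One way the all-caps test might look is sketched below. The function name `is_all_caps` is hypothetical, and the sketch treats only words made up entirely of capital letters as acronyms, so something like “MP3” would not qualify; that is a simplifying assumption, not a rule from “Chicago.”

```shell
# Hypothetical helper: succeed if $1 is non-empty and contains nothing
# but capital letters.
is_all_caps() {
  case $1 in
    # empty string, or any character that is not an uppercase letter
    ""|*[![:upper:]]*) return 1 ;;
    *) return 0 ;;
  esac
}

for word in UNICEF Unicef test
do
  if is_all_caps "$word"; then
    printf '%s: pass through unchanged\n' "$word"
  else
    printf '%s: apply the normal rules\n' "$word"
  fi
done
```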
Further refinements would include:
o Testing whether the word being processed is the first word of the title, and, if so, capitalizing the first letter of the word (which you had already mentioned in your answer, Dave).
o Testing whether the word being processed is the last word of the title, and, if so, capitalizing the first letter of the word.
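Spotting the last word means knowing how many words there are before the loop starts. A minimal sketch of one way to do that follows, with made-up names (`word_positions`, `total`, `count`); it loads the words into the positional parameters so that `$#` gives the word count.

```shell
# Hypothetical helper: report which words sit in the first or last
# position and which are "middle" words.
word_positions() {
  set -f          # turn off filename expansion while splitting
  set -- $1       # split the title into positional parameters
  set +f
  total=$#
  count=0
  for word in "$@"
  do
    count=$(( count + 1 ))
    if [ "$count" -eq 1 ] || [ "$count" -eq "$total" ]; then
      printf '%s: always capitalize\n' "$word"
    else
      printf '%s: check word lists\n' "$word"
    fi
  done
}

word_positions "gone with the wind"
```

A single-word title falls out naturally, since the one word is both position 1 and position `total`.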
Once a full shell script has been written and any gross errors removed, you’d want to have some simple test cases against which to test the finished script. These test cases should include:
o At least one single-word title (to verify that the program logic works correctly when the first word of the title is also the last word)
o At least one two-word title (to verify the logic when there’s a first word and a last word, but no “middle” words in the title)
o At least several titles of three or more words each (to test the cases of first word, one or more middle words, last word)
o At least some titles that have acronyms in them; preferably some test cases with an acronym at the beginning of the title, at the end, and in the middle of the title
o Titles that, collectively, contain all parts of speech (these needn’t be all in the same title, as long as all of the different part-of-speech “cases” are covered)
o Etc.
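A tiny test harness along these lines might look like the sketch below. Everything in it is an assumption made for illustration: `title_case` is only a stand-in with a short built-in word list (not the file-based lists described above), `check` is an invented helper, and the expected results are my own reading of the “Chicago” rules quoted earlier.

```shell
# Stand-in implementation so the harness has something to exercise.
title_case() {
  set -f; set -- $1; set +f      # split the title without globbing
  total=$#
  count=0
  output=""
  for word in "$@"
  do
    count=$(( count + 1 ))
    keep=no
    case $word in
      *[![:upper:]]*) ;;         # has a non-capital: not an acronym
      *) keep=yes ;;             # all caps: pass through unchanged
    esac
    # Abbreviated stop list, applied to middle words only.
    if [ "$count" -gt 1 ] && [ "$count" -lt "$total" ]; then
      case $word in
        a|an|the|and|but|or|for|nor|of|to|with|in|on|at|by) keep=yes ;;
      esac
    fi
    if [ "$keep" = no ]; then
      first=$(printf '%s\n' "$word" | cut -c1 | tr '[:lower:]' '[:upper:]')
      rest=$(printf '%s\n' "$word" | cut -c2-)
      word=$first$rest
    fi
    output="$output $word"
  done
  printf '%s\n' "${output# }"    # drop the leading space
}

failures=0
check() {
  # $1 = input title, $2 = expected result
  actual=$(title_case "$1")
  if [ "$actual" = "$2" ]; then
    printf 'ok   %s\n' "$1"
  else
    printf 'FAIL %s (got "%s", wanted "%s")\n' "$1" "$actual" "$2"
    failures=$(( failures + 1 ))
  fi
}

check "jaws"                    "Jaws"                     # one word
check "the end"                 "The End"                  # two words
check "gone with the wind"      "Gone with the Wind"       # middle words
check "what dreams are made of" "What Dreams Are Made Of"  # stop word last
check "UNICEF and the world"    "UNICEF and the World"     # acronym first
check "a history of FORTRAN"    "A History of FORTRAN"     # acronym last
printf '%s failure(s)\n' "$failures"
```

Swapping in a real implementation only requires replacing the `title_case` function; the `check` lines stay the same.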
Anyways, Dave, please keep up the great work with this site!
Bruce Brown
Macintosh Computer Consultant
Exactly.
(actually, wow. quite a response!)