I am trying to write a script that can list all words of a specified length that match a specific pattern. For example, 6-letter words that start with “TH” and end with an “R”. Can you help me out? I’m on a Mac using the command line.
There are dozens of word games and puzzles that require you to figure out an “n” letter word that matches certain criteria, from crossword puzzles to Scrabble, Wordle to Keyword. Searching by definition or meaning is far more difficult, but if we just focus on words that match a specific pattern, it’s really a task perfectly suited for a shell script.
Tip: I’ve written a handy Keyword tutorial for the fun word puzzle Keyword from the Washington Post if you’re curious about how this tool is perfect for helping you puzzle out the tough ones.
The key is to have a good starting dictionary and it turns out that most Linux or related systems do have just such a file, typically located in /usr/share/dict/web2 (on Mac systems) or somewhere similar. Can’t find it? Try checking the “spell” command man page, or you can download the English language dictionary directly from the GNU “aspell” command download page.
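If you aren't sure which word list your system has, here is a minimal sketch that checks a couple of likely locations. The find_dict helper is hypothetical, and the second path (/usr/share/dict/words) is an assumption about common installs, so adjust the candidates for your machine:

```shell
#!/bin/sh
# Hypothetical helper: print the first readable file among the candidates.
find_dict() {
  for candidate in "$@"; do
    if [ -r "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# Candidate paths are assumptions; add your own if neither exists.
DICT=$(find_dict /usr/share/dict/web2 /usr/share/dict/words) \
  || echo "No dictionary file found; install or download one" >&2
echo "Using dictionary: ${DICT:-none}"
```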
EXAMINING THE DICTIONARY FILE
Let’s start by looking at the file /usr/share/dict/web2. Easily done:
$ head /usr/share/dict/web2
A
a
aa
aal
aalii
aam
Aani
aardvark
aardwolf
Aaron
As you can see, it’s a very simple list of words, though I’m unsure whether “aa” is actually a word in the common English dictionary. In fact, you can go down quite a rabbit hole trying to ascertain the best possible dictionary for word puzzles. A starting point is a Scrabble-approved dictionary, but finding one of those online, well, that’s your task. 😃
GREP AND REGULAR EXPRESSIONS
The command that’s going to do all the heavy lifting is grep, named after the ed line editor command g/re/p: “global (search) / regular expression / print”. Hidden in its name is its secret superpower: You can search for regular expressions rather than just simple patterns. There’s quite a complex RE language and even books written on the subject, but for our task all we need to know is that a dot matches exactly one character (for a word list, that means one letter) and that ‘^’ indicates the beginning of the matching line while ‘$’ matches the end.
In other words, to search for exactly the match “aardvark” with no variations (like “aardvarks”), the grep regular expression would be:
^aardvark$
and if you didn’t know the second, third, and fifth letter, you would express it as:
^a..d.ark$
That’s the key. If you are good at remembering the regular expression language, you need go no further than the simple command line:
$ grep '^a..d.ark$' /usr/share/dict/web2
aardvark
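To see why the ‘^’ and ‘$’ anchors matter, here's a small sketch that runs the same pattern with and without them. The four-word list is a stand-in I made up for illustration, not the real dictionary:

```shell
# Small stand-in word list (illustrative only)
words="aardvark
aardvarks
aardwolf
boardwalk"

# Anchored: the whole line must be exactly 8 characters in this shape
anchored=$(printf '%s\n' "$words" | grep '^a..d.ark$')

# Unanchored: the pattern may match anywhere inside a longer line
unanchored=$(printf '%s\n' "$words" | grep 'a..d.ark')

echo "anchored:   $anchored"
echo "unanchored:" $unanchored
```

Without the anchors, “aardvarks” sneaks in too, which is exactly what we don't want when the word length is fixed.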
But there are better and easier ways to work with a tool like this. For example, I like to utilize the command line arguments and if we use one letter per arg and dashes to represent unknown letters, a command invocation might be:
matchword a - - d - a r k
Easy to remember and use, right? Now, how to turn that command line invocation into the required regular expression…
STARTING ARGS INTO REGULAR EXPRESSION TOKENS
The shell makes it incredibly easy to work with starting arguments because you simply refer to them by ordinal position, so $1 is the first value, $2 the second, and so on. You can even embed it into a sequence, making it surprisingly easy:
grep "^$1$2$3$4$5$6$" /usr/share/dict/web2
Note the double quotes: inside single quotes the shell would not expand $1, $2, and so on. There are more sophisticated ways to deal with arguments that would be based on the actual number of arguments specified (stored in the $# variable) but since a referenced arg that hasn’t been specified is null, this works just fine. However, two issues:
- How do we turn ‘-’ into ‘.’?
- What happens to variable $10 and so on?
In the former case, we can use the “tr” translate command once we’ve built the argument:
pattern="$(echo "^$1$2$3$4$5$6$" | tr '-' '.')"
grep "$pattern" /usr/share/dict/web2
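A quick way to check the dash-to-dot translation on its own, as a minimal sketch: “set --” stands in for real command-line arguments here, and the sample letters are just an illustration.

```shell
# Stand-in for running: matchword t h - - - r
set -- t h - - - r

# Build the anchored pattern, then turn every dash into a regex dot
pattern="$(echo "^$1$2$3$4$5$6$" | tr '-' '.')"

echo "$pattern"
```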
In terms of the second, this is a special case: The shell will see $10$11 as ${1}0 and ${1}1, that is, the first argument followed by a literal 0 and then the first argument followed by a literal 1. Any two-digit argument number therefore has to be wrapped in curly braces, thusly:
pattern="$(echo "^$1$2$3$4$5$6$7$8$9${10}${11}$" | tr '-' '.')"
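If the two-digit behavior seems surprising, this short sketch makes it visible. The eleven sample letters are arbitrary stand-ins for real arguments:

```shell
# Simulate eleven command-line arguments (sample letters are illustrative)
set -- a b c d e f g h i j k

wrong="$10$11"       # parsed as ${1}0 and ${1}1
right="${10}${11}"   # braces select the tenth and eleventh arguments

echo "without braces: $wrong"
echo "with braces:    $right"
```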
In fact, the above covers both our needs but there’s one more potential hiccup: Upper case vs. lower case. Look again at the dictionary snippet and you’ll see some words include uppercase letters. The solution is to add that to the command above:
pattern="$(echo "^$1$2$3$4$5$6$7$8$9${10}${11}$" | tr '[A-Z]' '[a-z]' | tr '-' '.')"
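The fixed list of positional parameters stops at eleven letters. If you ever need longer words, one alternative (my own sketch, not part of the script in this article) is to loop over "$@" instead. The build_pattern name is hypothetical:

```shell
# Hypothetical helper: build the pattern from any number of arguments
build_pattern() {
  built="^"
  for letter in "$@"; do
    case "$letter" in
      -) built="${built}." ;;   # a dash means an unknown letter
      *) built="${built}$(printf '%s' "$letter" | tr '[:upper:]' '[:lower:]')" ;;
    esac
  done
  # Close the pattern with the end-of-line anchor
  printf '%s$\n' "$built"
}

build_pattern T H - - - R
build_pattern a - - d - a r k - - - - -
```

The trade-off is a few more lines of code in exchange for no upper limit on word length.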
But… that’s still only half the task, so here’s the other step:
tr '[:upper:]' '[:lower:]' < "$DICT" | grep "$pattern"
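Here's that pipeline run against a tiny stand-in dictionary so you can see the lowercasing and matching work together. The six sample words and the temporary file are assumptions for illustration only:

```shell
# Tiny stand-in dictionary (illustrative; the real file has far more entries)
DICT=$(mktemp)
printf '%s\n' Aaron aardvark Thaler thenar theme Thor > "$DICT"

pattern='^th...r$'

# Lowercase every dictionary entry, then keep lines matching the pattern
matches=$(tr '[:upper:]' '[:lower:]' < "$DICT" | grep "$pattern")
echo "$matches"

rm -f "$DICT"
```

Notice that “Thaler” only matches because the dictionary side was lowercased too.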
That’s basically everything required.
THE FULL SHELL SCRIPT
You can easily drop these commands into a script, adding a usage message if no args are included and a handy “paste” invocation at the end to create multi-column output:
#!/bin/sh
# Wordmatch - find matching words given letter-in-place options
# Usage: "wordmatch.sh 1 2 3 4 5"
#   where each argument is either a dash (unknown letter) or a letter
#   the length of the word is implied by the number of arguments specified

DICT="/usr/share/dict/web2"

if [ $# -lt 5 ] ; then
  echo "Usage: $(basename "$0") 1 2 3 4 5 [etc etc]"
  echo "  where each argument is a dash (unknown) or a letter"
  exit 1
fi

# the specified letters in location pattern (up to 11 letters)
pattern="$(echo "^$1$2$3$4$5$6$7$8$9${10}${11}$" | tr '[A-Z]' '[a-z]' | tr '-' '.')"

# and let's do it! (no eval needed: the pipeline runs fine as written)
tr '[:upper:]' '[:lower:]' < "$DICT" | grep "$pattern" | \
  paste - - - - - - -

exit 0
This works just fine in the latest version of macOS and should work great on any Linux system too, though you might need to tweak the DICT path to point at your system’s dictionary file.
Testing it, here are a few examples of command usage with the script called “wordmatch.sh”:
$ sh wordmatch.sh o - s c - - -
obscene obscure ooscope ooscopy
$ sh wordmatch.sh - - - t h
azoth   baith   barth   beath   beeth   berth   birth
booth   breth   brith   broth   cheth   cloth   couth
crith   cruth   cyath   death   depth   dryth   earth
edith   faith   fifth   filth   firth   forth   fouth
frith   froth   fulth   garth   girth   grith   heath
illth   keith   leath   leith   lenth   lewth   loath
lowth   meith   mirth   month   mooth   morth   mouth
mowth   neath   ninth   north   quoth   routh   scyth
sheth   sidth   sixth   slath   sloth   smeth   smith
smyth   snath   sooth   south   south   stith   swath
swith   teeth   tenth   tilth   tooth   troth   truth
tuath   walth   warth   width   worth   wrath   writh
wroth   yarth   yerth   yirth   youth
$ sh wordmatch.sh x - - - - -
xarque  xenial  xenian  xenium  xenomi  xeriff  xeroma
xoanon  xylate  xylene  xylina  xylite  xyloid  xyloma
xylose  xyloyl  xyster  xystos  xystum  xystus
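The doubled “south” in that second list is most likely the dictionary carrying both a capitalized and a lowercase entry, which become identical once everything is lowercased. That's my inference from the output, not something verified against the file. If the repeats bother you, one option is to add sort -u before paste, sketched here against a made-up five-word list:

```shell
# Stand-in dictionary with capitalized/lowercase pairs (illustrative)
DICT=$(mktemp)
printf '%s\n' North north South south sooth > "$DICT"

# sort -u drops the duplicates that lowercasing creates
unique=$(tr '[:upper:]' '[:lower:]' < "$DICT" | grep '^...th$' | sort -u)
echo "$unique" | paste - - - - - - -

rm -f "$DICT"
```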
Challenge solved! Oh, and the six letter words that start with ‘th’ and end with ‘r’? It’s a matter of just a few seconds to ascertain your options are thakur, thaler, thawer, themer, thenar, thulir, thunar, and thunor.