Can I do Bulk Substitution on the Linux Command Line?

I have a few hundred text files from an old research project and want to release them to a public domain site. Before I do so, however, I want to replace people’s names for privacy. So “Rick Deckard” or “Rick” or “Deckard” would become “Richard”, and so on. Help!

You can definitely do lots of text transformations directly from the command line with a Linux system, but what you’re asking about is actually a bit more tricky than it appears. The problem is that while recognizing the pattern “Rick Deckard” is easy enough, what if there’s a line break between the two words instead of a space? Indeed, if we didn’t test it thoroughly, a simple solution might actually turn “Rick Deckard” into “Richard Richard”. Not so good (though fixable post-substitution if needed).

There’s also the question of what language or utility to tap: I’m a fan of sed, the stream editor that’s designed for just this sort of task, or awk, a more sophisticated programming environment for Linux text management, but I can already hear some readers insist that this is a task for a Perl script. Maybe so. I’m a fan of simple, however, so we’ll aim for sed and see how it evolves.

The most obvious pattern to match is “Rick Deckard”. That covers the easiest matches, and you can widen it further by asking for case-insensitive matching if the files are inconsistent about capitalization (the command below is case-sensitive as written). The space can also be replaced by \s, a special regex notation that matches any single white space character, so a tab between the two words matches too. Punctuation between the words will fail to match, as we’d want. In sed, that’d look like:

$ sed -e "s/Rick\sDeckard/Richard/g" inputfile

Small tweak, however: \s is a GNU sed extension, so to make this maximally portable across Linux and Unix systems, the succinct \s notation should be replaced with the character class [[:space:]]. Let’s give this a test with a sample input file:

$ cat convertme.txt 
Police department bounty hunter Rick Deckard is assigned to retire
six androids of the highly intelligent Nexus-6 model. These androids
are difficult to detect, but Deckard hopes to earn enough bounty
money to buy a live animal to replace his lone electric sheep. Rick
Deckard visits the Rosen Association's headquarters in Seattle to
confirm the latest empathy test's accuracy. The test appears to
give a false positive on Eldon Rosen's niece, Rachael, meaning the
police have potentially been executing human beings. Rosen attempts
to blackmail Rick Deckard to get him to drop the case, but Deckard retests
Rachael and determines that Rachael is, indeed, an android.

$ sed -e "s/Rick[[:space:]]Deckard/RICHARD/g" convertme.txt 
Police department bounty hunter RICHARD is assigned to retire
six androids of the highly intelligent Nexus-6 model. These androids
are difficult to detect, but Deckard hopes to earn enough bounty
money to buy a live animal to replace his lone electric sheep. Rick
Deckard visits the Rosen Association's headquarters in Seattle to
confirm the latest empathy test's accuracy. The test appears to
give a false positive on Eldon Rosen's niece, Rachael, meaning the
police have potentially been executing human beings. Rosen attempts
to blackmail RICHARD to get him to drop the case, but Deckard retests
Rachael and determines that Rachael is, indeed, an android.

As expected, you can see that it matched the first occurrence of the full name (easy) and the second near the bottom, which includes a tab instead of a space (not quite as easy). It failed to identify the name wrapped across the end of the fourth line and the beginning of the fifth, but we’re going to defer that solution a bit.
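One thing the test above doesn’t exercise: [[:space:]] on its own matches exactly one white space character, so a run of several spaces between the two names would slip through. Here’s a minimal sketch of how that could be handled portably. The sample lines are made up for illustration, and the [Rr] and [Dd] bracket expressions are just one assumed way to tolerate a lowercase first letter without relying on GNU-only flags.

```shell
# Made-up sample: a tab between the names, then a run of three spaces.
sample=$(printf 'Rick\tDeckard and Rick   Deckard')

# [[:space:]][[:space:]]* reads as "one white space character, then zero
# or more of them", the portable spelling of "one or more".
result=$(printf '%s\n' "$sample" |
    sed -e 's/Rick[[:space:]][[:space:]]*Deckard/RICHARD/g')
printf '%s\n' "$result"

# Bracket expressions accept either case for the first letter only;
# an all-caps RICK DECKARD would still need its own rule.
lower=$(printf 'rick deckard\n' |
    sed -e 's/[Rr]ick[[:space:]][[:space:]]*[Dd]eckard/RICHARD/g')
printf '%s\n' "$lower"
```

Both forms stick to basic regular expressions, so they should behave the same on GNU and BSD versions of sed.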

To ensure it catches standalone uses of Rick and Deckard too, we’re going to simply add additional expressions to the sed invocation itself. Watch how this improves the result:

$ sed -e "s/Rick[[:space:]]Deckard/RICHARD/g;s/Rick/RICHARD2/g;
  s/Deckard/RICHARD3/g" convertme.txt 
Police department bounty hunter RICHARD is assigned to retire
six androids of the highly intelligent Nexus-6 model. These androids
are difficult to detect, but RICHARD3 hopes to earn enough bounty
money to buy a live animal to replace his lone electric sheep. RICHARD2
RICHARD3 visits the Rosen Association's headquarters in Seattle to
confirm the latest empathy test's accuracy. The test appears to
give a false positive on Eldon Rosen's niece, Rachael, meaning the
police have potentially been executing human beings. Rosen attempts
to blackmail RICHARD to get him to drop the case, but RICHARD3 retests
Rachael and determines that Rachael is, indeed, an android.

To make it obvious which rule matched where, I added a numeric suffix to each replacement. This works pretty darn well, actually, though the RICHARD2 RICHARD3 sequence is problematic. We’ll come back to it; for now, let’s move on to the bigger issue of wanting a file of name substitutions.

Since sed can read a file for its instructions through use of the -f file invocation, that means this set of basic rules can be neatly dropped into a file thusly:

s/Rick[[:space:]]Deckard/RICHARD/g
s/Rick/RICHARD/g
s/Deckard/RICHARD/g
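With a few hundred files to process, those rules can be applied in a loop. What follows is a sketch under some assumptions: names.sed is a hypothetical name for the rules file, one.txt and two.txt are tiny stand-ins for the real project files, and the converted copies go into a separate directory (here called anonymized) so the originals stay untouched.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# names.sed (hypothetical file name) holds the three rules shown above.
cat > names.sed <<'EOF'
s/Rick[[:space:]]Deckard/RICHARD/g
s/Rick/RICHARD/g
s/Deckard/RICHARD/g
EOF

# Two stand-in input files.
printf 'Rick Deckard is assigned to retire six androids.\n' > one.txt
printf 'Deckard retests Rachael.\n' > two.txt

# Write converted copies into a separate directory, leaving the
# originals exactly as they were.
mkdir anonymized
for f in *.txt; do
    sed -f names.sed "$f" > "anonymized/$f"
done

cat anonymized/one.txt anonymized/two.txt
```

Writing to a second directory rather than editing in place makes it easy to diff the before and after versions prior to releasing anything.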

And that means we can write another script to produce this sequence of commands, one that starts with a sequence like “Rick Deckard:RICHARD” and produces what’s needed. Then you can add “Eldon Tyrell:EDWARD”, “Rachael:ROSIE”, and so on. That’s way beyond the space I have for this article, however, so let’s just stop here. I’ve shown you the basics of producing sed name substitutions and you can go from here.
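For anyone who wants a starting point, here’s a rough sketch of what such a generator might look like. It’s only one possible approach, not a finished tool: names.map and names.sed are hypothetical file names, and it assumes every name is made of plain letters with no characters that are special to sed or the shell (no slashes, asterisks, ampersands, and so on).

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# names.map (hypothetical) holds one "name:replacement" pair per line.
cat > names.map <<'EOF'
Rick Deckard:RICHARD
Rachael:ROSIE
EOF

# Read each pair and print the matching sed rules.
while IFS=: read -r name replacement; do
    [ -n "$name" ] || continue    # skip blank lines
    # Multi-word names get a full-name rule first, with each space
    # turned into [[:space:]] so a tab between the words still matches.
    case $name in
        *" "*)
            full=$(printf '%s\n' "$name" | sed 's/ /[[:space:]]/g')
            printf 's/%s/%s/g\n' "$full" "$replacement"
            ;;
    esac
    # Then one rule per individual word of the name.
    for word in $name; do
        printf 's/%s/%s/g\n' "$word" "$replacement"
    done
done < names.map > names.sed

cat names.sed
```

For the two-line map above, the generated file contains the same three Rick Deckard rules shown earlier plus a single rule for Rachael, and it can be handed straight to sed -f.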

As for the RICHARD2 RICHARD3 problem, my solution would be to use tr to translate every \n to a single space, then search and substitute all occurrences of RICHARD2 RICHARD3 (or, perhaps, RICHARD RICHARD) with a single word. Then push the text back through fmt to get it back to rational line lengths. Works really well in most cases with text (txt) files!
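Here’s a minimal sketch of that idea, with one assumption about ordering: if tr runs before sed, the wrapped name becomes an ordinary “Rick Deckard” and the full-name rule catches it directly, so no cleanup rule is needed afterwards. The file names wrapped.txt and rejoined.txt are made up for the example. Be aware that joining every line also flattens blank lines between paragraphs, so this suits files that are a single block of text.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A short sample where the name is split across a line break.
cat > wrapped.txt <<'EOF'
money to buy a live animal to replace his lone electric sheep. Rick
Deckard visits the Rosen Association's headquarters in Seattle.
EOF

# 1. tr joins everything into one long line; echo restores the final
#    newline so sed sees a properly terminated line.
# 2. sed applies the same three rules as before.
# 3. fmt re-wraps the text to readable line lengths.
{ tr '\n' ' ' < wrapped.txt; echo; } |
    sed -e 's/Rick[[:space:]]Deckard/RICHARD/g' \
        -e 's/Rick/RICHARD/g' \
        -e 's/Deckard/RICHARD/g' |
    fmt > rejoined.txt

cat rejoined.txt
```

Where fmt chooses to break the lines can differ between systems, so the re-wrapped text may not line up exactly with the original.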

About the Author: Dave Taylor has been involved with the online world since the early days of the Internet. Author of over 20 technical books, he runs the popular AskDaveTaylor.com tech help site. You can also find his gadget reviews on YouTube and chat with him on Twitter as @DaveTaylor.

2 comments on “Can I do Bulk Substitution on the Linux Command Line?”

Bob Rankin says:

January 2, 2019 at 12:15 pm

That should work fine, as long as there are no Rickshaws or Rickrolls in the input file. 🙂

- Dave Taylor says:
  
  January 2, 2019 at 7:56 pm
  
  Yes, the best way to match words is to ensure that there are no alpha adjacent, either before or after. But that’s more of a book length piece! 🙂
