I have a few hundred text files from an old research project and want to release them to a public domain site. Before I do so, however, I want to replace people’s names for privacy. So “Rick Deckard” or “Rick” or “Deckard” would become “Richard”, and so on. Help!
You can definitely do lots of text transformations directly from the command line with a Linux system, but what you’re asking about is actually a bit trickier than it appears. The problem is that while recognizing the pattern “Rick Deckard” is easy enough, what if there’s a line break between the two words instead of a space? Indeed, if we didn’t test it thoroughly, a simple solution might actually turn Rick Deckard into Richard Richard. Not so good (though fixable post-substitution if needed).
There’s also the question of what language or utility to tap: I’m a fan of sed, the stream editor that’s designed for just this sort of task, or awk, a more sophisticated programming environment for Linux text management, but I can already hear some readers insist that this is a task for a Perl script. Maybe so. I’m a fan of simple, however, so we’ll aim for sed and see how it evolves.
The most obvious pattern to match is “rick deckard”. That covers the easiest matches if you also indicate that you want case-insensitive matching; the command below skips that step and simply spells the name with its usual capitalization. The space can also be replaced by \s, a special regex notation that means any single white space character, so a tab between the two words would match too. Any other punctuation will fail to match, as we’d want. In sed, that’d look like:
$ sed -e "s/Rick\sDeckard/Richard/g" inputfile
Small tweak, however: to make this maximally portable across Linux and Unix systems, the succinct \s notation should be replaced with the character class [[:space:]]. Let’s give this a test with a sample input file:
$ cat convertme.txt
Police department bounty hunter Rick Deckard is assigned to retire six
androids of the highly intelligent Nexus-6 model. These androids are
difficult to detect, but Deckard hopes to earn enough bounty money to
buy a live animal to replace his lone electric sheep. Rick
Deckard visits the Rosen Association's headquarters in Seattle to
confirm the latest empathy test's accuracy. The test appears to give a
false positive on Eldon Rosen's niece, Rachael, meaning the police have
potentially been executing human beings. Rosen attempts to blackmail
Rick Deckard to get him to drop the case, but Deckard retests Rachael
and determines that Rachael is, indeed, an android.

$ sed -e "s/Rick[[:space:]]Deckard/RICHARD/g" convertme.txt
Police department bounty hunter RICHARD is assigned to retire six
androids of the highly intelligent Nexus-6 model. These androids are
difficult to detect, but Deckard hopes to earn enough bounty money to
buy a live animal to replace his lone electric sheep. Rick
Deckard visits the Rosen Association's headquarters in Seattle to
confirm the latest empathy test's accuracy. The test appears to give a
false positive on Eldon Rosen's niece, Rachael, meaning the police have
potentially been executing human beings. Rosen attempts to blackmail
RICHARD to get him to drop the case, but Deckard retests Rachael
and determines that Rachael is, indeed, an android.
As expected, it matched the first occurrence of the full name (easy) and the second near the bottom, which includes a tab instead of a space (not quite as easy). It failed to identify the name wrapped across the end of the 4th line and the beginning of the 5th, but we’re going to defer that solution a bit.
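If you want to see those two behaviors in isolation, here’s a small sketch using made-up one-line inputs (the sample strings and variable names are mine, not part of convertme.txt). sed reads its input one line at a time, which is why a name split across a line break slips past the pattern while a tab in the middle of a line doesn’t.

```shell
# Name separated by a tab: [[:space:]] matches it, so the name is replaced.
tab_result=$(printf 'blackmail Rick\tDeckard today\n' |
  sed -e 's/Rick[[:space:]]Deckard/RICHARD/g')
printf '%s\n' "$tab_result"

# Name split across two lines: sed handles "Rick" and "Deckard" as parts of
# separate lines, so the two-word pattern never gets a chance to match.
wrap_result=$(printf 'sheep. Rick\nDeckard visits\n' |
  sed -e 's/Rick[[:space:]]Deckard/RICHARD/g')
printf '%s\n' "$wrap_result"
```

The first command prints the line with RICHARD in place of the name; the second prints both input lines unchanged.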
To ensure it catches standalone uses of Rick and Deckard too, we’re going to simply add additional expressions to the sed invocation itself. Watch how this improves the result:
$ sed -e "s/Rick[[:space:]]Deckard/RICHARD/g;s/Rick/RICHARD2/g; s/Deckard/RICHARD3/g" convertme.txt
Police department bounty hunter RICHARD is assigned to retire six
androids of the highly intelligent Nexus-6 model. These androids are
difficult to detect, but RICHARD3 hopes to earn enough bounty money to
buy a live animal to replace his lone electric sheep. RICHARD2
RICHARD3 visits the Rosen Association's headquarters in Seattle to
confirm the latest empathy test's accuracy. The test appears to give a
false positive on Eldon Rosen's niece, Rachael, meaning the police have
potentially been executing human beings. Rosen attempts to blackmail
RICHARD to get him to drop the case, but RICHARD3 retests Rachael
and determines that Rachael is, indeed, an android.
So you can see what happens, I added a numeric suffix to each replacement, which makes it obvious which rule matched. This works pretty darn well, actually, though the RICHARD2 RICHARD3 sequence is problematic. We’ll come back to it; for now, let’s move forward with the bigger issue of wanting a file of name substitutions.
Since sed can read a file for its instructions through use of the -f file invocation, that means this set of basic rules can be neatly dropped into a file thusly:
s/Rick[[:space:]]Deckard/RICHARD/g
s/Rick/RICHARD/g
s/Deckard/RICHARD/g
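To run a rules file like that across a whole directory of text files, a loop along these lines should do it. This is only a sketch: the names.sed file name and the notes and released directory names are placeholders I made up, and a temporary directory stands in for your real project folder.

```shell
# Hypothetical layout: originals in notes/, cleaned copies in released/.
workdir=$(mktemp -d)
mkdir -p "$workdir/notes" "$workdir/released"
printf 'Deckard retests Rachael.\n' > "$workdir/notes/sample.txt"

# The three substitution rules, one per line, saved as names.sed.
cat > "$workdir/names.sed" <<'EOF'
s/Rick[[:space:]]Deckard/RICHARD/g
s/Rick/RICHARD/g
s/Deckard/RICHARD/g
EOF

# Apply the same rules to every .txt file, leaving the originals untouched.
for f in "$workdir"/notes/*.txt; do
  sed -f "$workdir/names.sed" "$f" > "$workdir/released/$(basename "$f")"
done

cat "$workdir/released/sample.txt"
```

Writing the results to a separate directory rather than editing in place means you can compare originals against cleaned copies before releasing anything.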
And that means we can write another script to produce this sequence of commands, one that starts with a sequence like “Rick Deckard:RICHARD” and produces what’s needed. Then you can add “Eldon Tyrell:EDWARD”, “Rachael:ROSIE”, and so on. That’s way beyond the space I have for this article, however, so let’s just stop here. I’ve shown you the basics of producing sed name substitutions and you can go from here.
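For the curious, here’s a minimal sketch of what such a generator could look like. It’s one possible approach rather than a finished tool, and it rests on a few assumptions of mine: the mapping file holds one “Real Name:REPLACEMENT” pair per line, and the names contain nothing that sed treats specially (slashes, dots, asterisks, brackets). The file and variable names are made up for illustration.

```shell
# Hypothetical mapping file: one "Real Name:REPLACEMENT" pair per line.
map_path=$(mktemp)
rules_path=$(mktemp)
printf '%s\n' 'Rick Deckard:RICHARD' 'Rachael:ROSIE' > "$map_path"

# Assumes names contain no characters that are special to sed.
while IFS=: read -r name replacement; do
  [ -n "$name" ] || continue
  case $name in
    *' '*)
      # Multi-word name: emit the full-name rule first, allowing any single
      # whitespace character between the words.
      full=$(printf '%s' "$name" | sed 's/ /[[:space:]]/g')
      printf 's/%s/%s/g\n' "$full" "$replacement"
      ;;
  esac
  # Then emit one rule for each word of the name on its own.
  for word in $name; do
    printf 's/%s/%s/g\n' "$word" "$replacement"
  done
done < "$map_path" > "$rules_path"

cat "$rules_path"
```

The resulting file can be handed straight to sed -f. The full-name rule is written before the single-word rules on purpose: if “Rick” were replaced first, the two-word pattern would never find anything to match.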
In terms of the RICHARD2 RICHARD3 problem? My solution would be to try using tr to translate all \n to a single space, then search and substitute all occurrences of RICHARD2 RICHARD3 (or, perhaps, RICHARD RICHARD) with a single word. Then push the text back through fmt to get it back to rational line lengths. Works really well in most cases with text (txt) files!
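As a rough sketch, that pipeline might look like the following. The two-line sample and the variable names are made up for illustration, and the exact line breaks you get back depend on your version of fmt.

```shell
sample=$(mktemp)
printf '%s\n' \
  "buy a live animal to replace his lone electric sheep. Rick" \
  "Deckard visits the Rosen Association's headquarters in Seattle." > "$sample"

# 1. sed applies the three line-by-line rules, leaving RICHARD at the end
#    of one line and RICHARD at the start of the next.
# 2. tr turns every newline into a space, so the pair sits on one line.
# 3. A second sed collapses the doubled replacement into a single word.
# 4. fmt wraps the text back to readable line lengths.
cleaned=$(sed -e 's/Rick[[:space:]]Deckard/RICHARD/g;s/Rick/RICHARD/g;s/Deckard/RICHARD/g' "$sample" |
  tr '\n' ' ' |
  sed -e 's/RICHARD RICHARD/RICHARD/g' |
  fmt)
printf '%s\n' "$cleaned"
```

One thing to watch: tr removes every newline, including the blank lines between paragraphs, so files with multiple paragraphs come back as a single block of text.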
That should work fine, as long as there are no Rickshaws or Rickrolls in the input file. 🙂
Yes, the best way to match words is to ensure that there are no alphabetic characters adjacent, either before or after. But that’s more of a book length piece! 🙂
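Short of a book, here’s a hedged sketch of that idea using sed’s -E option for extended regular expressions. The test sentence is made up, and this is one way to approximate word matching, not the only one.

```shell
# Group 1 is the start of the line or a non-letter just before the name,
# group 2 is the name itself, and group 3 is a non-letter just after it or
# the end of the line. Groups 1 and 3 are put back unchanged.
bounded=$(printf 'Rick hailed a Rickshaw, then Deckard rickrolled Rick.\n' |
  sed -E 's/(^|[^[:alpha:]])(Rick|Deckard)([^[:alpha:]]|$)/\1RICHARD\3/g')
printf '%s\n' "$bounded"
```

Rickshaw survives because the letter after “Rick” blocks the match. One known gap: when two names are separated by a single character, as in “Rick Rick”, the first match uses up that separator and the second name is skipped, so a second pass may be needed. GNU sed also offers \b, \< and \> as word-boundary shortcuts, but those are extensions and may not work elsewhere.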