Ask Dave Taylor

Can I do Bulk Substitution on the Linux Command Line?

January 2, 2019 / Dave Taylor / Linux Shell Script Programming / 2 Comments

I have a few hundred text files from an old research project and want to release them to a public domain site. Before I do so, however, I want to replace people’s names for privacy. So “Rick Deckard” or “Rick” or “Deckard” would become “Richard”, and so on. Help!

You can definitely do lots of text transformations directly from the command line on a Linux system, but what you’re asking about is actually a bit trickier than it appears. The problem is that while recognizing the pattern “Rick Deckard” is easy enough, what if there’s a line break between the two words instead of a space? Indeed, if we didn’t test it thoroughly, a simple solution might actually turn Rick Deckard into Richard Richard. Not so good (though fixable post-substitution if needed).

There’s also the question of what language or utility to tap: I’m a fan of sed, the stream editor that’s designed for just this sort of task, or awk, a more sophisticated programming environment for Linux text management, but I can already hear some readers insist that this is a task for a Perl script. Maybe so. I’m a fan of simple, however, so we’ll aim for sed and see how it evolves.

The most obvious pattern to match is “Rick Deckard”. That covers the easiest matches, and a few more if you also ask for case-insensitive matching (GNU sed accepts an I flag on the s command for that). The space can also be replaced by \s, a special regex notation that matches any single white space character, so a tab between the two names is caught too. Any other punctuation will fail to match, as we’d want. In sed, that’d look like:

$ sed -e "s/Rick\sDeckard/Richard/g" inputfile

Small tweak, however: to make this maximally portable across Linux and Unix systems, the succinct \s notation should be replaced with the character class [[:space:]] instead. Let’s give this a test with a sample input file:

$ cat convertme.txt 
Police department bounty hunter Rick Deckard is assigned to retire
six androids of the highly intelligent Nexus-6 model. These androids
are difficult to detect, but Deckard hopes to earn enough bounty
money to buy a live animal to replace his lone electric sheep. Rick
Deckard visits the Rosen Association's headquarters in Seattle to
confirm the latest empathy test's accuracy. The test appears to
give a false positive on Eldon Rosen's niece, Rachael, meaning the
police have potentially been executing human beings. Rosen attempts
to blackmail Rick Deckard to get him to drop the case, but Deckard retests
Rachael and determines that Rachael is, indeed, an android.

$ sed -e "s/Rick[[:space:]]Deckard/RICHARD/g" convertme.txt 
Police department bounty hunter RICHARD is assigned to retire
six androids of the highly intelligent Nexus-6 model. These androids
are difficult to detect, but Deckard hopes to earn enough bounty
money to buy a live animal to replace his lone electric sheep. Rick
Deckard visits the Rosen Association's headquarters in Seattle to
confirm the latest empathy test's accuracy. The test appears to
give a false positive on Eldon Rosen's niece, Rachael, meaning the
police have potentially been executing human beings. Rosen attempts
to blackmail RICHARD to get him to drop the case, but Deckard retests
Rachael and determines that Rachael is, indeed, an android.

As expected, you can see that it matched the first occurrence of the full name (easy) and the second near the bottom, which includes a tab instead of a space (not quite as easy). It failed to identify the name wrapped across the end of the fourth line and the beginning of the fifth, but we’re going to defer that solution a bit.
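To see the tab case in isolation, here’s a minimal sketch. The sample lines and variable names are made up for illustration. It also shows that [[:space:]] matches exactly one white space character, so a run of two spaces needs the repeated form.

```shell
# Sample lines (hypothetical): one with a tab between the names,
# one with two spaces between them.
tab_line=$(printf 'blackmail Rick\tDeckard today')
wide_line='blackmail Rick  Deckard today'

# Exactly one white space character between the two words.
single='s/Rick[[:space:]]Deckard/RICHARD/g'
# One or more white space characters (portable spelling of "+").
multi='s/Rick[[:space:]][[:space:]]*Deckard/RICHARD/g'

tab_result=$(printf '%s\n' "$tab_line" | sed -e "$single")
wide_single=$(printf '%s\n' "$wide_line" | sed -e "$single")
wide_multi=$(printf '%s\n' "$wide_line" | sed -e "$multi")

# The tab line and the "multi" run are rewritten; the two-space line
# run through the "single" rule comes back unchanged.
printf '%s\n' "$tab_result" "$wide_single" "$wide_multi"
```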

To ensure it catches standalone uses of Rick and Deckard too, we’re going to simply add additional expressions to the sed invocation itself. Watch how this improves the result:

$ sed -e "s/Rick[[:space:]]Deckard/RICHARD/g;s/Rick/RICHARD2/g;
  s/Deckard/RICHARD3/g" convertme.txt 
Police department bounty hunter RICHARD is assigned to retire
six androids of the highly intelligent Nexus-6 model. These androids
are difficult to detect, but RICHARD3 hopes to earn enough bounty
money to buy a live animal to replace his lone electric sheep. RICHARD2
RICHARD3 visits the Rosen Association's headquarters in Seattle to
confirm the latest empathy test's accuracy. The test appears to
give a false positive on Eldon Rosen's niece, Rachael, meaning the
police have potentially been executing human beings. Rosen attempts
to blackmail RICHARD to get him to drop the case, but RICHARD3 retests
Rachael and determines that Rachael is, indeed, an android.

So you can see what happens, I added a numeric suffix to each replacement so it’s obvious which rule matched. This works pretty darn well, actually, though the RICHARD2 RICHARD3 sequence, where the name wrapped across two lines, is problematic. We’ll come back to it; for now, let’s move forward with the bigger issue of wanting a file of name substitutions.
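One caveat with the standalone rules: a bare s/Rick/.../ also fires inside longer words such as Rickshaw. Here’s a minimal sketch of one guard, assuming a sed that accepts -E for extended regular expressions (GNU and BSD sed both do). The sample sentence is invented for the test.

```shell
sample='Rick took a Rickshaw; Rickrolls annoy Rick.'

# Replace Rick only when no letter touches it on either side.
# \1 and \2 put back whatever non-letter (if any) sat next to the name.
guarded=$(printf '%s\n' "$sample" |
  sed -E 's/(^|[^[:alpha:]])Rick([^[:alpha:]]|$)/\1RICHARD\2/g')

printf '%s\n' "$guarded"
```

Because the neighboring character is consumed by each match, two names separated by a single character (“Rick Rick”) would need a second pass. For a one-name-at-a-time cleanup that’s usually an acceptable trade.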

Since sed can read a file for its instructions through use of the -f file invocation, that means this set of basic rules can be neatly dropped into a file thusly:

s/Rick[[:space:]]Deckard/RICHARD/g
s/Rick/RICHARD/g
s/Deckard/RICHARD/g
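As a minimal sketch of putting that file to work across a whole directory: the file name rules.sed, the notes and public directories, and the two sample files are all hypothetical. Writing the cleaned copies to a separate directory leaves the originals untouched.

```shell
# Hypothetical layout: originals in notes/, cleaned copies in public/.
workdir=$(mktemp -d)
mkdir "$workdir/notes" "$workdir/public"

# The three rules from above, saved as a sed script.
cat > "$workdir/rules.sed" <<'EOF'
s/Rick[[:space:]]Deckard/RICHARD/g
s/Rick/RICHARD/g
s/Deckard/RICHARD/g
EOF

# Two tiny stand-ins for the research files.
printf 'Rick Deckard retests Rachael.\n' > "$workdir/notes/a.txt"
printf 'Deckard hopes to earn a bounty.\n' > "$workdir/notes/b.txt"

# Run every .txt file through the same rules file.
for f in "$workdir"/notes/*.txt; do
  sed -f "$workdir/rules.sed" "$f" > "$workdir/public/$(basename "$f")"
done

cat "$workdir"/public/*.txt
```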

And that means we can write another script to produce this sequence of commands, one that starts with a sequence like “Rick Deckard:RICHARD” and produces what’s needed. Then you can add “Eldon Tyrell:EDWARD”, “Rachael:ROSIE”, and so on. That’s way beyond the space I have for this article, however, so let’s just stop here. I’ve shown you the basics of producing sed name substitutions and you can go from here.
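For the curious, here’s a rough sketch of what such a generator could look like. It’s one possible approach, not a finished tool: the file names names.txt and rules.sed are invented, and it assumes names contain no slashes or regex metacharacters and are separated from their replacement by a single colon.

```shell
workdir=$(mktemp -d)

# Hypothetical mapping file: one "Real Name:ALIAS" pair per line.
cat > "$workdir/names.txt" <<'EOF'
Rick Deckard:RICHARD
Eldon Tyrell:EDWARD
Rachael:ROSIE
EOF

while IFS=: read -r name alias; do
  [ -n "$name" ] || continue            # skip blank lines
  case "$name" in
    *" "*)
      # Multi-word name: emit the full-name rule first, allowing any
      # single white space character between the words.
      full=$(printf '%s\n' "$name" | sed 's/ /[[:space:]]/g')
      printf 's/%s/%s/g\n' "$full" "$alias"
      ;;
  esac
  # Then one rule per individual word of the name.
  for word in $name; do
    printf 's/%s/%s/g\n' "$word" "$alias"
  done
done < "$workdir/names.txt" > "$workdir/rules.sed"

cat "$workdir/rules.sed"

# Try the generated rules on a sample sentence.
anonymized=$(printf 'Eldon Tyrell met Rachael and Rick Deckard.\n' |
  sed -f "$workdir/rules.sed")
printf '%s\n' "$anonymized"
```

The full-name rule is written before the single-word rules on purpose: sed applies the rules in order, so the two-word match gets its chance before either word is replaced on its own.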

As for the RICHARD2 RICHARD3 problem, my solution would be to use tr to translate every \n to a single space, then search and substitute all occurrences of RICHARD2 RICHARD3 (or, perhaps, RICHARD RICHARD) with a single word. Then push the text back through fmt to restore rational line lengths. This works really well in most cases with plain text (txt) files!
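That pipeline might be sketched as follows, assuming the un-numbered RICHARD replacements and a short made-up sample in which the name wraps across two lines. The exact line breaks fmt produces depend on its default width, so treat the wrapping as approximate.

```shell
workdir=$(mktemp -d)

# Sample input (hypothetical) with the name split across a line break.
printf '%s\n' \
  'money to buy a live animal to replace his lone electric sheep. Rick' \
  'Deckard visits the Rosen Association headquarters in Seattle.' \
  > "$workdir/wrapped.txt"

# 1. Apply the name rules (leaves RICHARD at the end of one line and
#    RICHARD at the start of the next).
# 2. Join all lines into one with tr.
# 3. Collapse the doubled replacement into a single word.
# 4. Re-wrap the text with fmt.
cleaned=$(sed -e 's/Rick[[:space:]]Deckard/RICHARD/g;s/Rick/RICHARD/g;s/Deckard/RICHARD/g' \
    "$workdir/wrapped.txt" |
  tr '\n' ' ' |
  sed -e 's/RICHARD RICHARD/RICHARD/g' |
  fmt)

printf '%s\n' "$cleaned"
```

One thing to watch: tr turns every newline into a space, including the blank lines between paragraphs, so a multi-paragraph file comes back from fmt as one long paragraph unless you process it a paragraph at a time.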


About the Author: Dave Taylor has been involved with the online world since the early days of the Internet. Author of over 20 technical books, he runs the popular AskDaveTaylor.com tech help site. You can also find his gadget reviews on YouTube and chat with him on Twitter as @DaveTaylor.


2 comments on “Can I do Bulk Substitution on the Linux Command Line?”

  1. Bob Rankin says:
    January 2, 2019 at 12:15 pm

    That should work fine, as long as there are no Rickshaws or Rickrolls in the input file. 🙂

    • Dave Taylor says:
      January 2, 2019 at 7:56 pm

      Yes, the best way to match words is to ensure that there are no alpha adjacent, either before or after. But that’s more of a book length piece! 🙂

© 2023 by Dave Taylor. "Ask Dave Taylor®" is a registered trademark of Intuitive Systems, LLC.