Industry guru Dave Taylor offers tech support on technical and business topics, including iPhone, iPod, Microsoft Windows, Sony PSP, cellphones, online advertising, CSS, Web design, business, Unix, Linux, SEO, Mac OS X, and shell script programming.     


How do I search lots of files at once?

Dave, I need to conduct a number of searches through more than 2500 text files. Each search is for a different specific text string. The text files are on my Mac hard drive (Mac OS X 10.3.7) and are arranged into folders within folders within folders. I want the result in a new text file. I think I should be able to do this using some sort of grep or script but cannot figure out how to do it. Please help.

Dave's Answer:

Ah, the serendipity is marvelous! I've just been writing about the find command for the new Tiger edition of my best-selling book Learning Unix for Mac OS X so it's all very fresh in my mind.

Whenever you have a nested file hierarchy that you want to search, you should always, automatically reach for the find monkey wrench, coupled with its partner command xargs. But let's step through this slowly so you can see how these all work together, because we're going to use three different commands in a pipe to accomplish what you seek.

First off, the find command has some of the weirdest syntax in Unix, so if you want to learn more about it, use the man find command within Terminal. For now, just follow along. :-) To find all files below the current point in the file system that are HTML files, you'd use:

$ find . -name "*html" -print

Notice that because the pattern is *html rather than *.html, this also matches files with the suffix "shtml" (typically server-side include HTML). This generates a long list of filenames. To search through them for a specific pattern, you want to use the grep command, as you know, but the wrinkle is that you can't just do something like find | grep: grep isn't expecting a list of filenames from standard input (stdin), so it would search the names themselves rather than the contents of the files.
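If you'd like to see the pattern difference for yourself, here's a small sketch. The throwaway directory and the three sample filenames are made up purely for illustration:

```shell
# Throwaway directory with three hypothetical sample files.
demo=$(mktemp -d)
touch "$demo/index.html" "$demo/page.shtml" "$demo/notes.txt"

# "*html" matches any name ending in html, so the .shtml file is included.
loose=$(cd "$demo" && find . -name "*html" -print | sort)

# "*.html" requires a literal ".html" suffix, so page.shtml is skipped.
strict=$(cd "$demo" && find . -name "*.html" -print | sort)

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
rm -rf "$demo"
```

The first search should list both index.html and page.shtml, while the second lists only index.html.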

That's where our pal xargs comes in. The xargs command does expect a list of filenames from stdin: it hands them to another command as arguments, acting as a wrapper for Unix commands that don't read filenames from stdin themselves.
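To make the difference concrete, here's a minimal sketch using two made-up files: one whose name contains 2004 and one whose contents do. The filenames and the temporary directory are assumptions for the demo only:

```shell
# Two hypothetical files: one has 2004 in its contents, the other in its name.
demo=$(mktemp -d)
printf 'Copyright 2004\n' > "$demo/a.html"
printf 'nothing here\n'   > "$demo/2004.html"

# Without xargs, grep filters the list of names, so only 2004.html matches.
by_name=$(cd "$demo" && find . -name "*html" -print | grep 2004)

# With xargs, the names become arguments, so grep reads the files themselves.
# -l prints just the names of the files that contain a match.
by_content=$(cd "$demo" && find . -name "*html" -print | xargs grep -l 2004)

echo "matched by name:    $by_name"
echo "matched by content: $by_content"
rm -rf "$demo"
```

The plain pipe reports the file that merely has 2004 in its name, while the xargs version reports the file that actually contains the text.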

Putting them all together, here's how you could find all HTML files that have a 2004 copyright notice in them, just as a topical example:

$ find . -name "*html" -print | xargs grep 2004 | \
  grep '©'

Or, if you want to get fancy, use grep -E '(\© .* 2004)', which is probably better but a bit more complex, because it's a single regular expression rather than a simple pair of patterns. Either way, the result will be a list of filenames and the lines from those files that match.

Now, the last step of your task is to save that output to a new file, which can be done with a simple redirect:

$ find . -name "*html" -print | xargs grep 2004 | \
  grep '©' > copyright.2004.txt

I hope that gets you going. Drop that into a shell script, then tweak it to match the specific pattern and output file naming scheme you need to use.
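As a starting point, such a script might look something like the sketch below. The function name search_tree, its three arguments, and the *.txt filter are my own assumptions for illustration, not anything built in. It also swaps in -print0 and xargs -0, which pass filenames separated by NUL characters so that names containing spaces or quotes don't trip up xargs; check that your versions of find and xargs support them.

```shell
#!/bin/sh
# search_tree PATTERN ROOT OUTFILE
# Hypothetical helper: searches every *.txt file under ROOT for the fixed
# string PATTERN and writes "filename:matching line" results to OUTFILE.
search_tree() {
    pattern=$1
    root=$2
    outfile=$3

    # -print0 and -0 separate names with NUL bytes, so spaces and quotes
    # in filenames are passed through intact.
    # grep -F treats PATTERN as plain text rather than a regular expression.
    # /dev/null guarantees grep is given more than one file, so it always
    # prefixes each match with the filename.
    find "$root" -type f -name "*.txt" -print0 |
        xargs -0 grep -F -- "$pattern" /dev/null > "$outfile"
}

# Example: one call per search string, each with its own output file.
# search_tree "copyright 2004" . "$HOME/copyright.2004.txt"
```

It's safest to keep the output file outside the folder being searched, so that a later search doesn't end up reading the results of an earlier one.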


Reader Comments To Date: 6

Anacreo said, on January 15, 2005 3:17 AM:

I'd also recommend throwing a sed in there to take care of files with spaces in their names:

find ./ -name "*html" -print | sed -e 's/.*/"&"/' | xargs grep '2004' | grep '©'

the sed -e 's/.*/"&"/' will wrap each line in double quotes:

~ $ ls | sed -e 's/.*/"&"/'
"Desktop"
"DiscBlaze Temp Folder"
"Documents"
...

Nice Post btw... people forget the wizardry of [Uu]nix

Gary W. Longsine said, on May 26, 2005 12:14 AM:

Now that Tiger (Mac OS X 10.4) has shipped, people who find this helpful article on the net will want to know that a dramatically faster command line tool is included with Tiger.

mdfind uses the Spotlight index. Life has suddenly become too short to wait for the old fashioned UNIX find -- mdfind rocks.

If you can't remember how to use mdfind, just type it with no parameters, and it provides nice examples, like this:

$ mdfind
mdfind: no query specified.

Usage: mdfind [-live] [-onlyin directory] query
list the files matching the query
query can be an expression or a sequence of words

-live Query should stay active
-onlyin Search only within given directory

-0 Use NUL (``\0'') as a path separator, for use with xargs -0.

example: mdfind image
example: mdfind "kMDItemAuthor == '*MyFavoriteAuthor*'"
example: mdfind -live MyFavoriteAuthor
--- end of usage ---

Mike Kirby said, on May 30, 2007 1:43 AM:

As an addendum to the above comment: be aware that the Spotlight indexes are not necessarily complete or accurate. There are directories and files that do not get included in the index. That's the tradeoff you get for the speed... instead of giving you true, current information about your system, Spotlight gives you incomplete, not-always-up-to-date information about your system.

But on the upside, it fails to give you accurate information much faster than anything else out there!

BTW, when I tried them, none of the command lines above give me anything but "xargs: unterminated quote". The quest to be able to search my hard drive continues...

Mike Kirby said, on May 30, 2007 1:50 AM:

Question: Why is the technique in this article better than just:

$ cd TextfileDirectoryName
$ sudo grep -lr SearchString *.html > OutputTextFile.txt

?

Thanks,
Mike

Deepu said, on July 13, 2011 1:38 AM:

Hi,

I have a find command which lists files older than 1 year. As the total number of files being picked up is huge, the find command fails saying it cannot list the files.

Is it ok to use the find command inside a for loop?

Dave Taylor said, on July 13, 2011 1:15 PM:

You can use the find command within a loop, Deepu, but if there are too many files, you're hitting the buffer limit within the shell, and a loop won't solve that problem. Try running one search for files more than 18 months old and another for those 12-18 months old, for example, to chop things down more narrowly.

© 2002 - 2013 by Dave Taylor. All Rights Reserved.
"Ask Dave Taylor®" is a registered trademark of Intuitive Systems, LLC.