How can I filter robot crawler hits out of my Apache access_log file?
On a mailing list I'm on, a member recently asked: "Due to an abusive web crawler, I now have a 230MB Apache access_log file on my Web server. I tried to trim it down using grep, but I don't have enough disk space for the command to succeed. Help!"
This is a common problem with log files on Unix and Linux servers, actually. Among the many files that often grow without bounds are the Apache access_log and error_log files and the system /var/log/messages file (on some systems it's called system.log, but it serves the same purpose). If you don't pay attention, these files can quickly grow to be tens, hundreds, or even thousands of megabytes.
Once you have these huge log files that are eating up a significant percentage of your available disk space, your choices are quite limited, as you have learned.
If you have space, the obvious way to weed out the web crawler hits, assuming that you know a unique string that identifies those queries, is to do:
$ grep -v ptrn access_log > new_access_log
However, you don't have space, so here's how I would handle this sticky situation...
First, move the file to a new name:
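$ mv access_log bad_access_log
This first step separates the bloated file from the live log so it doesn't grow even bigger while you're working on it. Note that Apache keeps writing to the renamed file until it's restarted or gracefully reloaded, so do that before moving on. Then: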
$ gzip bad_access_log
This creates a compressed '.gz' version of the file that should be about 50% smaller. Now you've probably just freed up about 100MB of space, so you should be able to do something like this:
$ zcat bad_access_log.gz | grep -v ptrn > good_access_log
The zcat command (actually a link to the gzip program, but that's just geeky trivia you can safely ignore) uncompresses the file, but since it feeds the uncompressed result directly into the pipe, the full uncompressed copy never has to be written to disk while the data is processed.
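Because the uncompressed data only ever flows through the pipe, you can also inspect the compressed log without spending any disk space, for example to count how many crawler hits a pattern would remove before committing to it. Here's a rough sketch of that idea. The sample log lines and the 'BadBot' pattern are made up for illustration, and it uses "gzip -dc", which behaves like zcat but is a bit more portable:

```shell
#!/bin/sh
# Build a hypothetical sample log in a scratch directory:
# every tenth request comes from a made-up crawler called BadBot.
workdir=$(mktemp -d)
i=0
while [ "$i" -lt 1000 ]; do
  if [ $((i % 10)) -eq 0 ]; then agent='BadBot/1.0'; else agent='Mozilla/5.0'; fi
  echo "10.0.0.1 - - [01/Jan/2024:00:00:00 +0000] \"GET /page$i HTTP/1.1\" 200 512 \"-\" \"$agent\""
  i=$((i + 1))
done > "$workdir/bad_access_log"

# Compress it; gzip replaces the original with bad_access_log.gz.
gzip "$workdir/bad_access_log"

# Size of the compressed file on disk.
on_disk=$(wc -c < "$workdir/bad_access_log.gz" | tr -d ' ')

# Size of the uncompressed data, measured through the pipe (nothing written to disk).
uncompressed=$(gzip -dc "$workdir/bad_access_log.gz" | wc -c | tr -d ' ')

# Number of lines the pattern would remove. -F means "fixed string, not a regex".
crawler_hits=$(gzip -dc "$workdir/bad_access_log.gz" | grep -cF 'BadBot')

echo "on disk: $on_disk bytes, uncompressed: $uncompressed bytes, crawler hits: $crawler_hits"
```

If the hit count looks far too high or too low, the pattern probably isn't as unique as you thought, and it's better to find that out before you throw away the original.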
If that STILL doesn't work, you could also try:
$ zcat bad_access_log.gz | \
  grep -v ptrn | gzip > good_access_log.gz
Once that's done, remove the bad file and uncompress the good one with:
$ rm bad_access_log*
$ gunzip good*gz
and you should be good to go!
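To tie the steps together, here's a minimal sketch of the whole low-disk-space sequence run against a tiny sample log in a scratch directory, so you can try it safely before touching a real server. The log lines and the 'BadBot' pattern are hypothetical stand-ins for your own data, "gzip -dc" is used as an equivalent of zcat, and restarting Apache is deliberately left out:

```shell
#!/bin/sh
# Work in a scratch directory with a made-up three-line log.
workdir=$(mktemp -d)
printf '%s\n' \
  'host1 - - "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"' \
  'host2 - - "GET /x HTTP/1.1" 200 128 "-" "BadBot/1.0"' \
  'host3 - - "GET /y HTTP/1.1" 200 256 "-" "Mozilla/5.0"' \
  > "$workdir/access_log"

# Step 1: move the oversized log aside.
mv "$workdir/access_log" "$workdir/bad_access_log"

# Step 2: compress it in place to reclaim disk space.
gzip "$workdir/bad_access_log"

# Step 3: stream the log through grep -v and recompress the result,
# so neither the input nor the output is stored uncompressed.
gzip -dc "$workdir/bad_access_log.gz" \
  | grep -vF 'BadBot' \
  | gzip > "$workdir/good_access_log.gz"

# Step 4: drop the bad copy first, then expand the cleaned one.
rm "$workdir/bad_access_log.gz"
gunzip "$workdir/good_access_log.gz"

lines=$(wc -l < "$workdir/good_access_log" | tr -d ' ')
echo "kept $lines lines in $workdir/good_access_log"
```

Removing the bad archive before running gunzip matters when space is tight, since the cleaned log needs room to expand.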