Industry guru Dave Taylor offers free tech support on a wide variety of technical and business topics, including HTML, Apple iPhone, online advertising, Cascading Style Sheets, Web design, management, Unix, Linux, search engine optimization, online dating, Mac OS X, shell script programming and Microsoft Windows.

How can I filter robot crawler hits out of my Apache access_log file?

On a mailing list I'm on, a member recently asked: "Due to an abusive web crawler, I now have a 230MB Apache access_log file on my Web server. I tried to trim it down using grep, but I don't have enough disk space for the command to succeed. Help!"

Dave's Answer:

This is actually a common problem with log files on Unix and Linux servers. Among the many files that often grow without bounds are the Apache access_log and error_log files and the system /var/log/messages file (on some systems, such as Mac OS X, the equivalent file is called system.log). If you don't pay attention, these files can quickly grow to tens, hundreds, or even thousands of megabytes.

Once you have these huge log files that are eating up a significant percentage of your available disk space, your choices are quite limited, as you have learned.
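Before picking an approach, it helps to know how much room is actually left. Here is a minimal sketch, assuming the logs live in the directory named by LOG_DIR (a hypothetical variable that defaults to the current directory here; on a real server you would point it at your Apache log directory):

```shell
# LOG_DIR is a hypothetical variable; set it to your own log directory.
LOG_DIR="${LOG_DIR:-.}"

# Free space on the filesystem that holds the logs
# (-P keeps each filesystem on a single output line).
df -P "$LOG_DIR"

# The largest entries in that directory, biggest first.
ls -lS "$LOG_DIR" | head -6
```

Compare the "Available" column from df with the size of the log file: if the free space is smaller than the log, a plain grep into a new file cannot finish.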

If you have the space, the obvious way to weed out the web crawler hits, assuming you know a unique string that identifies those requests (shown here as the placeholder ptrn), is:

$ grep -v ptrn access_log > new_access_log

However, you don't have space, so here's how I would handle this sticky situation...

First, move the file to a new name:

$ mv access_log bad_access_log

This keeps the data you need to clean up under a separate name. Be aware that Apache holds its log file open, so it keeps writing to the renamed file until you restart it (a graceful restart, for example with apachectl graceful, makes it open a fresh access_log). Then:

$ gzip bad_access_log

creates a compressed '.gz' version of the file and removes the original. gzip needs room for the compressed copy while it works, but the result should be at least 50% smaller, and text logs often compress much further. Now you've probably freed up 100MB or more of space, so you should be able to do something like this:

$ zcat bad_access_log.gz | grep -v ptrn > good_access_log

The zcat command (actually a link to the gzip program on many systems, but that's just geeky info you can safely ignore) uncompresses the file, but since it feeds the uncompressed result directly into the pipe, no uncompressed copy is ever written to disk and no extra space is needed for it.
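If you'd like to see the pipeline above at work before pointing it at a real log, here is a small sketch that runs it on a throwaway file. Everything in it is made up for illustration: the three log lines, the scratch directory, and the string BadBot, which stands in for ptrn. It uses gzip -dc, the long-hand equivalent of zcat, because zcat on some systems only handles .Z files.

```shell
# Work in a scratch directory so no real log is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Three invented log lines; "BadBot" plays the role of the crawler's unique string.
printf '%s\n' \
  '10.0.0.1 - - [04/Dec/2004:10:00:00 -0700] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"' \
  '10.0.0.2 - - [04/Dec/2004:10:00:01 -0700] "GET /a HTTP/1.1" 200 128 "-" "BadBot/1.0"' \
  '10.0.0.3 - - [04/Dec/2004:10:00:02 -0700] "GET /b HTTP/1.1" 200 256 "-" "Mozilla/5.0"' \
  > bad_access_log

# Compress it; gzip replaces bad_access_log with bad_access_log.gz.
gzip bad_access_log

# Count the lines that would be thrown away, without writing anything to disk.
gzip -dc bad_access_log.gz | grep -c BadBot

# The filtering step itself: keep every line that does not match.
gzip -dc bad_access_log.gz | grep -v BadBot > good_access_log
```

Counting matches first with grep -c is a cheap sanity check: if the count is zero, or equal to the total number of lines, the pattern probably isn't the unique string you thought it was.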

If that STILL doesn't work, you could also try:

$ zcat bad_access_log.gz | \
 grep -v ptrn | gzip > good_access_log.gz

This keeps the filtered output compressed as it is written, so it needs even less space. Once that's done, remove the bad file and uncompress the good one with:

$ rm bad_access_log*
$ gunzip good*gz

and you should be good to go!
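Since the rm step is irreversible, one cautious variation is to delete the bad file only after checking that the filtered archive is intact. The sketch below is one way to do that, not the only one. It assumes a helper function with the hypothetical name filter_gz_log, uses gzip -t to test both archives, and passes -F to grep so the pattern is treated as a plain string rather than a regular expression.

```shell
# filter_gz_log PATTERN INPUT.gz OUTPUT.gz   (hypothetical helper name)
# Writes every line of INPUT.gz that does not contain PATTERN to OUTPUT.gz,
# and deletes INPUT.gz only if both archives pass gzip's integrity test.
filter_gz_log() {
    pattern=$1
    src=$2
    dst=$3

    # Refuse to start from a damaged archive.
    gzip -t "$src" || { echo "input is damaged: $src" >&2; return 1; }

    # -F: fixed-string match; -e: protects a pattern that starts with a dash.
    gzip -dc "$src" | grep -v -F -e "$pattern" | gzip > "$dst"

    # A truncated output (for example, after the disk filled up) fails this test,
    # in which case the original is left alone.
    gzip -t "$dst" || { echo "output is damaged: $dst" >&2; return 1; }

    rm -- "$src"
}
```

With the function defined, the last few steps above become something like filter_gz_log ptrn bad_access_log.gz good_access_log.gz && gunzip good_access_log.gz, and the bad file survives if anything goes wrong along the way.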



Comments

Worth noting is that some Unix and Linux systems have an additional command called zgrep and if you have that, then the sequence of

zcat bad_access_log.gz | grep -v ptrn | gzip > good_access_log.gz

can be simplified to

zgrep -v ptrn bad_access_log.gz | gzip > good_access_log.gz

Posted by: Dave Taylor at December 4, 2004 1:55 PM

© 2002 - 2010 by Dave Taylor. All Rights Reserved.
