Ask Dave Taylor

How do I find out what searches people did to end up on my Web site?

June 4, 2009 / Dave Taylor / HTML & Web Page Design, Linux Shell Script Programming, Wordpress Help / 7 Comments

I’ve been trying to figure out whether there’s a way that I can automate digging through the “referrals” on my site so I can see what searches people did to end up on one of my Web site pages. I’m running a Linux server and have Apache installed, so I get a huge log file with tons of info. But what I’d love is a simple script that will let me get email once a week with a sorted list of what searches people did to get to me. Doable?

There are lots of great applications that you can install on your server to get traffic statistics, programs that are going to do a far better job letting you visualize what’s going on than anything you can cobble together in a shell script. Further, there are also great utilities like Google Analytics that are free and quite easy to hook in (see: adding Google Analytics to your Web site).
You had a pretty specific request, however, so let’s have a look at how we could dig through the Apache log file to identify which hits are directly from Google and then how to extract them so that you get a clean summary in your mailbox.
First off, to have something run on a regular schedule, we’ll use the cron facility in Linux. It’s one of the very best features of a Linux system and if you have a Linux system, learning crontab is time very, very well spent.
But let’s start at the beginning. You’ll need to find where Apache is storing your log files, then you can just start out by searching for “google.com” with “grep”. The output lines are llooonnnggg:

$ grep google.com /home/taylor/www/logs/askdavetaylor.com-access_log | head -1
98.203.51.79 - - [01/Jun/2009:18:01:00 -0600] "GET /how_does_ebay_actually_work.html HTTP/1.1" 200 34599
"http://www.google.com/search?client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&
channel=s&hl=en&q=how+does+ebay+work&btnG=Google+Search"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10"

(I’ve added line breaks so it’s more readable, but the output is one long line in reality)
As you can see, there are many fields in this output, separated by spaces. If you count, space by space, you’ll see that the REFERRER field is #11, so we can isolate it by using the “cut” command:

$ grep google.com /home/taylor/www/logs/askdavetaylor.com-access_log | cut -f11 -d' ' | head -1
"http://www.google.com/search?client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&channel=s&
hl=en&q=how+does+ebay+work&btnG=Google+Search"
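Counting spaces works for this line, but it relies on the request itself containing no stray spaces. As a hedged alternative, the sketch below splits on the double-quote character instead, which puts the referrer in the fourth quote-delimited field of Apache's combined log format. The function name `referrer_of` and the shortened sample line are illustrative and not part of the original commands.

```shell
# Sketch: pull the referrer out of a combined-format log line by
# splitting on double quotes rather than counting spaces.
# referrer_of is a hypothetical helper name; it reads log lines on stdin.
referrer_of() {
  awk -F'"' '{ print $4 }'
}

# Sample line modeled on the one above (query string and user agent
# shortened for illustration).
sample='98.203.51.79 - - [01/Jun/2009:18:01:00 -0600] "GET /how_does_ebay_actually_work.html HTTP/1.1" 200 34599 "http://www.google.com/search?hl=en&q=how+does+ebay+work" "Mozilla/5.0 (Windows; U)"'

printf '%s\n' "$sample" | referrer_of
```

Unlike the “cut” version, this prints the referrer without its surrounding quotes, so there is no trailing quote to clean up later.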

That’s a bit more readable. Now let’s go further and observe that Google queries are name=value pairs separated by an ampersand (as are, of course, all CGI query URLs). Let’s break the URL down and see what we get:

$ grep google.com /home/taylor/www/logs/askdavetaylor.com-access_log | cut -f11 -d' ' | head -1 | tr '&' '\012'
"http://www.google.com/search?client=firefox-a
rls=org.mozilla%3Aen-US%3Aofficial
channel=s
hl=en
q=how+does+ebay+work
btnG=Google+Search"

One more step and I think, by George, we have something:

$ grep google.com /home/taylor/www/logs/askdavetaylor.com-access_log | cut -f11 -d' ' | head -1 | tr '&' '\012' | grep "q="
q=how+does+ebay+work

One heck of a command for a small bit of output, but once we tweak the “head -1” that has limited us to a single match, we can quickly see the searches behind, say, the first 20 Google hits in the log (“head -20”):

q=how+does+ebay+work
q=converting+wma+files+to+mp3
q=i+made+up+a+yahoo+email+address+for+myspace+now+i+cant+delete+it
q=can+i+use+two+wireless+routers
aq=0
oq=can+i+use+two+wireless+"
q=how+do+you+put+music+on+a+psp
aq=0
oq=how+do+you+put+music
aq=0
oq=ask+dave
q=ask+dave+taylor"

Uh oh, looks like that “grep” pattern isn’t specific enough: it also matches “aq=” and “oq=”. Instead we’ll anchor it to the start of the line with “^q=” and the results are more what we seek:

q=how+does+ebay+work
q=converting+wma+files+to+mp3
q=i+made+up+a+yahoo+email+address+for+myspace+now+i+cant+delete+it
q=can+i+use+two+wireless+routers
q=how+do+you+put+music+on+a+psp
q=ask+dave+taylor"
q=build+web+page+to+embed+youtube
q=parallels+cant+see+my+other+partition+
q=psp+won%27t+play+games+because+it+says+they+corrupted,+what+to+do%3F
q=broken+psp+screen
q=comcast+remote+codes"
q=how+can+i+get+on+myspace+at+school
q=iphone+photos+to+mac
q=installing+windows+on+bootcamp
q=how+to+get+spades+on+myspace
q=sony+psp+warranty+claim
q=how+to+download+music+to+psp
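One caveat: after splitting only on ampersands, “^q=” matches only when “q” follows an ampersand, so a referrer such as `http://www.google.com/search?q=...`, with the query as the very first parameter, is missed. One way around this, sketched below on made-up sample referrers, is to split on both `?` and `&` before filtering. The function name `query_terms` is hypothetical.

```shell
# Sketch: break each referrer URL at both '?' and '&' so that q= starts
# its own line whether it is the first parameter or a later one.
# query_terms is a hypothetical helper; it reads referrer URLs on stdin.
query_terms() {
  tr '?&' '\012\012' | grep '^q='
}

# Two made-up referrers: q= as the first parameter, and q= in the middle.
printf '%s\n' \
  '"http://www.google.com/search?q=broken+psp+screen&hl=en"' \
  '"http://www.google.com/search?hl=en&q=iphone+photos+to+mac&aq=0"' |
  query_terms
```

Both searches come through, while the `aq=0` parameter is still filtered out by the anchored pattern.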

Interesting, but what about getting a useful report from it? We need to clean things up a bit (remove the “q=” and replace each ‘+’ with a space), and we need to sort and tally things so that we can see the most common searches rather than every single search. This is done with “sed” and the power combination of “sort | uniq -c | sort -rn”:

$ grep google.com /home/taylor/www/logs/askdavetaylor.com-access_log | cut -f11 -d' ' | head -20 |
tr '&' '\012' | grep "^q=" | sed 's/q=//;s/+/ /g' | sort | uniq -c | sort -rn
   1 sony psp warranty claim
   1 psp won%27t play games because it says they corrupted, what to do%3F
   1 parallels cant see my other partition
   1 iphone photos to mac
   1 installing windows on bootcamp
   1 i made up a yahoo email address for myspace now i cant delete it
   1 how to get spades on myspace
   1 how to download music to psp
   1 how does ebay work
   1 how do you put music on a psp
   1 how can i get on myspace at school
   1 converting wma files to mp3
   1 comcast remote codes"
   1 can i use two wireless routers
   1 build web page to embed youtube
   1 broken psp screen
   1 ask dave taylor"

Still a few things to tweak, but let’s finally strip out that “head” and look at all the searches people have done to get to the site:

 108 convert wma to mp3
  43 myspace at school
  34 windows security alert
  34 how to convert wma to mp3
  33 virtual memory too low
  26 how do i delete my myspace
  26 google address book
  26 comcast remote codes
  24 converting wma to mp3
  23 how to install windows on mac

Nice. That’s great information and ready to use. At least, ready enough for this quick and dirty solution.
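Two of the remaining tweaks are visible in the earlier 20-line sample: a trailing double quote whenever “q=” is the last parameter in the referrer, and percent-encoded characters such as `%27` and `%3F`. The sketch below handles both, and also lowercases everything so that searches differing only in capitalization are tallied together. It decodes only a handful of common escapes rather than doing full URL decoding, and the function name `clean_terms` is hypothetical.

```shell
# Sketch: tidy raw "q=..." lines before tallying them.
# clean_terms is a hypothetical helper; it reads q= lines on stdin.
clean_terms() {
  sed -e 's/^q=//' \
      -e 's/"$//' \
      -e 's/+/ /g' \
      -e "s/%27/'/g" \
      -e 's/%3[Ff]/?/g' \
      -e 's/%20/ /g' \
      -e 's/ *$//' |
    tr '[:upper:]' '[:lower:]'
}

# Made-up input: the first two lines should collapse into one tally entry.
printf '%s\n' \
  'q=comcast+remote+codes"' \
  'q=Comcast+Remote+Codes' \
  'q=psp+won%27t+play+games%3F' |
  clean_terms | sort | uniq -c | sort -rn
```

Dropping `clean_terms` in place of the “sed” step would merge entries like `comcast remote codes"` and `comcast remote codes` into a single count.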
My resultant script, when I take the command sequence and drop it into a Bourne shell script file, is:

#!/bin/sh

# Referrers - shell script that generates an email of popular referrer searches from Google

logfile="/home/taylor/www/logs/askdavetaylor.com-access_log"
max=15

echo "Log file analysis for $(basename "$logfile"):"
echo ""

# Keep Google hits, isolate the referrer (field 11), split it into
# name=value pairs, keep the q= pair, then tally the searches.
grep google.com "$logfile" | \
  cut -f11 -d' ' | tr '&' '\012' | \
  grep "^q=" | sed 's/q=//;s/+/ /g' | \
  sort | uniq -c | sort -rn | head -$max

exit 0
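As a hedged variation, the same pipeline can be wrapped in a function that takes the log file and the number of results as arguments, which makes it easy to try against a small test log before pointing it at the real one. The function name `top_searches` and the temporary sample log are illustrative.

```shell
# Sketch: the report pipeline as a reusable function.
# Usage: top_searches LOGFILE [MAX]   (top_searches is a hypothetical name)
top_searches() {
  logfile=$1
  max=${2:-15}
  if [ ! -r "$logfile" ]; then
    echo "Cannot read log file: $logfile" >&2
    return 1
  fi
  grep 'google\.com' "$logfile" |
    cut -d' ' -f11 | tr '&' '\012' |
    grep '^q=' | sed 's/^q=//;s/+/ /g' |
    sort | uniq -c | sort -rn | head -n "$max"
}

# Try it on a three-line made-up log written to a temporary file.
tmplog=$(mktemp)
cat > "$tmplog" <<'EOF'
1.2.3.4 - - [01/Jun/2009:18:01:00 -0600] "GET /a.html HTTP/1.1" 200 100 "http://www.google.com/search?hl=en&q=broken+psp+screen&aq=0" "Mozilla/5.0"
1.2.3.5 - - [01/Jun/2009:18:02:00 -0600] "GET /a.html HTTP/1.1" 200 100 "http://www.google.com/search?hl=en&q=broken+psp+screen&aq=0" "Mozilla/5.0"
1.2.3.6 - - [01/Jun/2009:18:03:00 -0600] "GET /b.html HTTP/1.1" 200 100 "http://www.google.com/search?hl=en&q=iphone+photos+to+mac&aq=0" "Mozilla/5.0"
EOF
top_searches "$tmplog" 5
rm -f "$tmplog"
```

The readability check means a wrong path produces a clear error message in the weekly email instead of an empty report.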

Now, finally, use “crontab -e” to add a line to cron that invokes this new script on a weekly basis. It brings up your favorite $EDITOR with your cron file within – if you have one. Crontab entries are in the form: minute, hour, day-of-month, month, day-of-week, command, so let’s pick midnight on Mondays as our desired date and time.
In crontab, that looks like:

0 0 * * Mon command

There are two ways we can structure the command itself. We can just invoke the script, in which case the script itself will have to deal with turning the output into an email message, or we can do that within the crontab entry itself:

sh $SCRIPTS/referrers.sh | mail -s "Referrer report" taylor

That’s all there is to it. Make sure “SCRIPTS” is defined earlier in the crontab file, save and quit the edits, and you’re done. Monday morning you’ll have a report in your inbox.
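For the other option mentioned above, where the script itself turns its output into an email, a minimal sketch might look like the following. The `send_report` function and the `MAIL_CMD` variable are hypothetical names introduced here for illustration: `MAIL_CMD` defaults to `mail`, and setting it to `cat` prints the report instead of sending it, which is handy for a dry run.

```shell
# Sketch: let the script deliver its own report.
# send_report and MAIL_CMD are hypothetical names for illustration.
# Usage: some_command | send_report SUBJECT RECIPIENT
send_report() {
  subject=$1
  recipient=$2
  if [ "${MAIL_CMD:-mail}" = "mail" ]; then
    mail -s "$subject" "$recipient"
  else
    # Any other value is run as a plain filter, e.g. MAIL_CMD=cat.
    "$MAIL_CMD"
  fi
}

# Dry run: build a tiny report and preview it rather than mailing it.
MAIL_CMD=cat
{
  echo "Log file analysis for askdavetaylor.com-access_log:"
  echo ""
  echo " 108 convert wma to mp3"
} | send_report "Referrer report" taylor
```

With the mailing handled inside the script, the crontab command shrinks to just the `sh $SCRIPTS/referrers.sh` part.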


7 comments on “How do I find out what searches people did to end up on my Web site?”

  1. wileeeb says:
    October 2, 2009 at 2:38 pm

    Do you know how to block? delete? uninstall? bing? I’ve removed it as a search provider, yet, when I open MSN my computer uses BING instead of IE. Not once, ever, has bing helped me. I’ve had to switch to GOOGLE each time to get the help I needed quickly – thus – I would like to block bing permanently (without having to pay a hefty fee to a computer repair person).

  2. Michelle Rodriquez says:
    June 14, 2009 at 10:09 am

    My head was spinning after reading Dave’s answer. Yes I’m not into programming that’s why I use Google analytics. 🙂

  3. Ken Wynn says:
    June 6, 2009 at 11:19 pm

    Thanks Dave. Great info. I just caught an interview you did with Affilorama. It’s great to see someone with your longevity who still is passionate about what you do. It’s true that if you love what you do, you never work a day in your life. Thanks.

  4. Dave Taylor says:
    June 6, 2009 at 9:48 pm

    David, if you don’t have any files on your site with the word “google” in their name, you can indeed just search for google, rather than google.com. I don’t have that luxury: I have quite a few files that contain the word “google”.
    In addition, yeah, since there’s no way to know exactly what order the variables are going to be fed from the search box to the actual search engine, then “?q=” is just as likely as “&q=”: the first is the very first variable on the list, whereas the second is when it’s not the first.
    Thanks for the updates! 🙂

  5. David says:
    June 6, 2009 at 11:53 am

    Dave,
    One more thing..
    Further analysis of my logs showed that some google sites are using “?q=” instead of “&q=”. I find that livesearch and bing use “?q=”. Yahoo is using “?p=”.
    Using a modified version of your command, I’m able to show the search references from all of these sites.

  6. David says:
    June 6, 2009 at 9:58 am

    Hey Dave,
    Well done!
    Looking at my logs, I find other google search referrers coming from other google domains than google.com. I see searches from google domains such as: google.de, google.it, google.co.uk and google.fr.
    I changed the initial grep from google.com to google and counted a bunch more search terms.

  7. Matthew W. Perry says:
    June 4, 2009 at 10:46 am

    Dave you never miss the mark on your scripts. They are always clean and well explained, thank you!


This web site is for the purpose of disseminating information for educational purposes, free of charge, for the benefit of all visitors. We take great care to provide quality information. However, we do not guarantee, and accept no legal liability whatsoever arising from or connected to, the accuracy, reliability, currency or completeness of any material contained on this site or on any linked site. Further, please note that by submitting a question or comment you're agreeing to our terms of service, which are: you relinquish any subsequent rights of ownership to your material by submitting it on this site. Our lawyer says "Thanks for your cooperation."
© 2022 by Dave Taylor. "Ask Dave Taylor®" is a registered trademark of Intuitive Systems, LLC.