Dave Taylor answers free tech support questions about a wide variety of business and technical topics, including blogging, iPhone help, iPod help, AdSense, MySpace, Sony PSP help, MP3 players, Windows XP, Windows Vista, Linux, SEO, Mac OS X, Facebook, Twitter and LinkedIn.

Does Gmail do a good job of filtering spam?

Dave, I'm very tempted to move from Yahoo Mail to Google's Gmail, but I'm still unsure whether Gmail does a good job of filtering spam or not. I get lots of spam - darn it - and would love to hear that Gmail does a great job of filtering out all of this junk from my mailbox. What's your opinion?


Dave's Answer:

I've been quite impressed with both the functionality and spam filtering in Gmail, actually. Without me paying much attention, my spam folder fills up every day and only once in a blue moon is there a real, legitimate message buried in the junk. Similarly, not much spam makes it into my main mailbox either, so it appears to me that Google's figured out both sides of this problem.

My friend and colleague Aaron Dragushan of Wondermill has a pretty fascinating theory about how Gmail's spam filter works:

I noticed a pattern in the delivery of spam to my Gmail inbox: I would receive more spam if I left the account logged in (and refreshing itself) than if I logged in once a day. Since Gmail is simply magnificent at handling spam, I wondered what might be going on. Here's an idea, but first a little background. Most web email providers use a "This is Spam!" button so that users can help them identify spam.

For example, if you sent 50,000 messages to AOL users and 10% identified it as spam, they might block your IP address from sending mail to AOL. To reduce the load on users, they could hold back most of the messages and pass along only the first 5,000 to see what people think. A trial run, if you will. Based on that small sample they can still tell if it's spam and take appropriate action.

Gmail adds a twist and a dramatic improvement. Using that same example above, Gmail would deliver all 50,000 messages and then watch what happens. Let's say they saw that 250 people have actually seen the message, and 23.7% said it was spam. Their anti-spam system might say, "Good enough, we can act on that". And this is where something special happens.

Gmail quietly reaches into the inboxes of those other 49,750 people and gently nudges the message into the spam folder. Their key insight was that because they're a web-based email provider, they don't so much "deliver" mail as make it available for viewing. All the people who haven't logged in or refreshed their browser window haven't actually "received" the message yet. If Gmail makes it disappear, they'll never know it was there, and they won't have to trip over it. Huzzah!

And that's how Google harnesses the power of the group to help everyone.
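To make Aaron's theory concrete, here's a minimal sketch of the "deliver everything, then sweep" idea. It is only an illustration of the theory as described above, not how Gmail is known to work. The names `Mailbox`, `deliver` and `sweep_if_spam` are made up for this example, and the `min_views` and `threshold` defaults are assumptions borrowed from the numbers in the example (250 viewers, a 10% complaint rate).

```python
from dataclasses import dataclass, field


@dataclass
class Mailbox:
    """Hypothetical per-user mailbox: message IDs by folder, plus what the user has viewed."""
    inbox: set = field(default_factory=set)
    spam: set = field(default_factory=set)
    seen: set = field(default_factory=set)


def deliver(mailboxes, msg_id):
    """Make the message available in every recipient's inbox."""
    for box in mailboxes:
        box.inbox.add(msg_id)


def sweep_if_spam(mailboxes, msg_id, reports, min_views=250, threshold=0.10):
    """Move msg_id to spam for everyone who has not seen it yet.

    Acts only once at least `min_views` users have seen the message and the
    share of those viewers who reported it is at least `threshold`.
    Both cutoffs are assumptions for illustration.
    Returns the number of mailboxes that were swept.
    """
    viewers = sum(1 for box in mailboxes if msg_id in box.seen)
    if viewers < min_views or reports / viewers < threshold:
        return 0
    moved = 0
    for box in mailboxes:
        if msg_id in box.inbox and msg_id not in box.seen:
            box.inbox.discard(msg_id)
            box.spam.add(msg_id)
            moved += 1
    return moved
```

With 50,000 mailboxes, 250 viewers and 59 "This is Spam!" reports (about 23.6%), a call to `sweep_if_spam` would move the message out of the other 49,750 inboxes, which is exactly the scenario in the example.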

I can't attest to whether that's really how the Gmail spam algorithm works, but given that a typical junk message is sent to dozens, hundreds, or even thousands of users, and that it's not hard to match the same message in multiple mailboxes (I'd check the Message-ID and From address, personally), this certainly makes some intuitive sense.
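Here's a rough sketch of that matching idea, using Python's standard `email` module. It assumes the bulk copies really do share a Message-ID and From address, which a determined spammer could easily vary, so treat it as an illustration rather than a real filter. The helper names `fingerprint` and `group_by_fingerprint` are made up for this example.

```python
from collections import defaultdict
from email import message_from_string
from email.utils import parseaddr


def fingerprint(raw_message):
    """Return (Message-ID, lowercased From address) for one raw message."""
    msg = message_from_string(raw_message)
    message_id = (msg.get("Message-ID") or "").strip()
    _display_name, address = parseaddr(msg.get("From", ""))
    return (message_id, address.lower())


def group_by_fingerprint(mailboxes):
    """Map each fingerprint to the set of users whose mailbox holds a matching message.

    `mailboxes` is a dict of user name -> list of raw message strings.
    """
    groups = defaultdict(set)
    for user, messages in mailboxes.items():
        for raw in messages:
            groups[fingerprint(raw)].add(user)
    return groups
```

Any fingerprint that shows up for a large set of users is a candidate for the kind of group decision described above.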

Anyone want to test this theory out? Just stay logged in to Gmail for a day and count how many messages are automatically routed to your spam folder versus how many spam messages end up in your regular mailbox. Then, on another day, log in just once at the beginning and once at the end of the day, and compare filtered versus unfiltered spam messages.
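If you do run the experiment, the comparison boils down to one ratio per day: the share of all spam that reached your inbox. A small sketch, where the function names are made up for this example and the counts are whatever you tally by hand:

```python
def inbox_leak_rate(filtered, unfiltered):
    """Share of the day's spam that landed in the inbox instead of the spam folder."""
    total = filtered + unfiltered
    return unfiltered / total if total else 0.0


def compare_days(logged_in_all_day, logged_in_twice):
    """Difference in leak rate between the two days.

    Each argument is a (filtered, unfiltered) pair of hand-counted totals.
    A positive result means more spam reached the inbox on the day you
    stayed logged in, which is what the theory would predict.
    """
    return inbox_leak_rate(*logged_in_all_day) - inbox_leak_rate(*logged_in_twice)
```

One day of each is a tiny sample, so a small difference either way wouldn't prove much.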

In any case, yes, Gmail is very elegant and very well designed. The only limitation is that you still need to get an invitation to join, but fortunately I have quite a few I'm happy to share with Ask Dave Taylor readers: How do I invite people to join Gmail?

Note: this article was updated to reflect a useful refinement Aaron sent me regarding how he believes Gmail works.




Comments

Laughing Squid, who hosts my lame website (along with two misanthropemanor.com email addresses), uses SpamAssassin to, well, assassinate spam, with thus far 100% accuracy. In the five months [11/05] that Squid has been my host, I've had 0 false alarms, 0 junk messages labeled as good - and ALL spam has come to the public address that's a sitting duck out on every page of the site, ripe for the trollbots' taking; the other ID, given to only a select few who don't use Microsoft Windows and/or Outlook and have a clue what "BCC" is used for, has received no spam at all.

By comparison, my Comcast address - set up only so I could subscribe to Usenet and NEVER USED for email or displayed in newsgroup postings - with the "spam filter" engaged, lets through around 25 spam messages a week. (In other words, 100% of the mail I receive at Comcast is spam, and it's coming to an address that's never been given out.) At my two AT&T WorldNet addresses - which have been abandoned - around 75 messages per week slip through the "spam filter." Two GMail addresses - one for friends and family who *do* use MS Windows or Outlook, and one for subscriptions, registrations and any other places that might share my ID with third-parties - receive practically NO spam.

As for "dictionary attacks," I've reused the same user names across every commercial domain at which I have an email address; that is, myname@att.net, myname@comcast.net, myname@gmail.com.

Both AT&T and Comcast use Brightmail for filtering, and since both ISPs are set to just toss those messages which they actually deem to be spam, I have no idea what the total amount of spam received is. I don't know whether GMail uses a third-party product or in-house, proprietary filtering, but whatever it is, it works pretty damn well. My GMail accounts are set to save local copies of spam, and over time only a handful of messages have landed in the filtered folders.

Judging from the simple rules/analysis shown in the sample below (SpamAssassin's report attached to garden-variety spam received at this address), it seems like it wouldn't take six NASA scientists working around the clock for a week to come up with the magic algorithms to separate the good from the bad (or the bad from the good), with maybe some minor end-user white list tweaking to allow access for those Microsoft Outlook and AOHell friends and family who insist on using (or forwarding) a screaming rainbow of HTML crap, in-line graphics and animated smileys in every email message.

Why then, can't AOHell, MSN, WorldNet, SBC/Yahoo!, Comcast et al figure out how to at least minimize the amount of spam that lands in everyone's In Baskets?


> From: 樸s
> To: themisanthrope@misanthropemanor.com
> Subject: *****SPAM***** A٦bQN줽q8%HW꺤Oζ?Themisanthrope
> Date: Thu, 26 Nov 2009 13:31:45 +0200
> Spam detection software, running on the system "squid16.laughingsquid.net", has
> identified this incoming email as possible spam. The original message
> has been attached to this so you can view it (if it isn't spam) or block
> similar future email. If you have any questions, see
> the administrator of that system for details.
>
> Content analysis details: (26.7 points, 7.0 required)
>
> pts rule name description
> ---- ---------------------- --------------------------------------------------
> 4.3 RATWARE_EGROUPS Bulk email fingerprint (eGroups) found
> 0.3 RCVD_NUMERIC_HELO Received: contains a numeric HELO
> 0.0 HTML_MESSAGE BODY: HTML included in message
> 0.1 HTML_70_80 BODY: Message is 70% to 80% HTML
> 0.1 MIME_HTML_ONLY BODY: Message only has text/html MIME parts
> 2.2 HTML_IMAGE_ONLY_02 BODY: HTML: images with 0-200 bytes of words
> 1.8 MIME_QP_DEFICIENT RAW: Deficient quoted-printable encoding in body
> 0.7 HTTP_EXCESSIVE_ESCAPES URI: Completely unnecessary %-escapes inside a URL
> 2.4 HTTP_ESCAPED_HOST URI: Uses %-escapes inside a URL's hostname
> 2.7 SUBJ_ILLEGAL_CHARS Subject contains too many raw illegal characters
> 2.4 DATE_IN_FUTURE_96_XX Date: is 96 hours or more after Received: date
> 4.3 HEAD_ILLEGAL_CHARS Header contains too many raw illegal characters
> 3.0 FORGED_RCVD_NET_HELO Host HELO'd using the wrong IP network
> 1.2 MISSING_MIMEOLE Message has X-MSMail-Priority, but no X-MimeOLE
> 1.1 MIME_HTML_ONLY_MULTI Multipart message only has text/html MIME parts
> 0.1 MISSING_OUTLOOK_NAME Message looks like Outlook, but isn't
>
> The original message was not completely plain text, and may be unsafe to
> open with some email clients; in particular, it may contain a virus,
> or confirm that your address can receive spam. If you wish to view
> it, it may be safer to save it to a file and open it with an editor.
>
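The arithmetic behind a report like the one above is just a sum of rule points compared against the required score. Here's a minimal sketch that parses the quoted rule lines and adds them up. It only illustrates the scoring shown in the report and is not SpamAssassin's own code; `REPORT`, `rule_scores`, `total_score` and `is_spam` are names made up for this example.

```python
import re

# Rule lines copied from the report above.
REPORT = """\
> 4.3 RATWARE_EGROUPS Bulk email fingerprint (eGroups) found
> 0.3 RCVD_NUMERIC_HELO Received: contains a numeric HELO
> 0.0 HTML_MESSAGE BODY: HTML included in message
> 0.1 HTML_70_80 BODY: Message is 70% to 80% HTML
> 0.1 MIME_HTML_ONLY BODY: Message only has text/html MIME parts
> 2.2 HTML_IMAGE_ONLY_02 BODY: HTML: images with 0-200 bytes of words
> 1.8 MIME_QP_DEFICIENT RAW: Deficient quoted-printable encoding in body
> 0.7 HTTP_EXCESSIVE_ESCAPES URI: Completely unnecessary %-escapes inside a URL
> 2.4 HTTP_ESCAPED_HOST URI: Uses %-escapes inside a URL's hostname
> 2.7 SUBJ_ILLEGAL_CHARS Subject contains too many raw illegal characters
> 2.4 DATE_IN_FUTURE_96_XX Date: is 96 hours or more after Received: date
> 4.3 HEAD_ILLEGAL_CHARS Header contains too many raw illegal characters
> 3.0 FORGED_RCVD_NET_HELO Host HELO'd using the wrong IP network
> 1.2 MISSING_MIMEOLE Message has X-MSMail-Priority, but no X-MimeOLE
> 1.1 MIME_HTML_ONLY_MULTI Multipart message only has text/html MIME parts
> 0.1 MISSING_OUTLOOK_NAME Message looks like Outlook, but isn't
"""

# One quoted rule line: "> <points> <RULE_NAME> <description>"
RULE_LINE = re.compile(r"^>\s*(-?\d+\.\d+)\s+([A-Z0-9_]+)\s+(.*)$")


def rule_scores(report):
    """Return {rule_name: points} for every rule line in the report text."""
    scores = {}
    for line in report.splitlines():
        match = RULE_LINE.match(line)
        if match:
            scores[match.group(2)] = float(match.group(1))
    return scores


def total_score(report):
    """Sum of all rule points, rounded to one decimal like the report header."""
    return round(sum(rule_scores(report).values()), 1)


def is_spam(report, required=7.0):
    """True when the total reaches the required score (7.0 in the report above)."""
    return total_score(report) >= required
```

The sixteen rules add up to the 26.7 points shown in the report header, well past the 7.0 required.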
--
Misanthrope Manor
www.misanthropemanor.com
The point at which you veer off course. (tm)

Running a Windows enterprise was like working in the emergency room of Cook County Memorial. Working on Linux was like being a Maytag repair man.
- Blog posting, LXer, "Why do people switch to Linux?"

Posted by: The Misanthrope at November 26, 2005 5:14 PM

I have been using Gmail for a while now, and I have NEVER had any spam land in my inbox!!

(unlike hotmail)

Posted by: Hugh at November 11, 2006 11:39 AM

Gmail - It used to be very good at separating spam from real mail - But now... I get about 3 spam emails a day. It's the kind of spam where they have random words (e.g. and then volcano ducks she said no wisdom) or some crap like that and then they add attachments.

It's not doing a good job for me anymore!!!

Posted by: M at December 19, 2006 10:56 AM

I've been using Gmail for about a year already and I'm not getting spam. Not because of Gmail's spam filter, but because I'm not getting spam at all. However, I've had important emails marked as spam way too MANY times, and only God knows how many times I missed such emails just because they were deleted from the Spam folder before I noticed them.

The worst thing is that there is no option to turn off the Spam folder or to set up a POP download of all emails, including those that Gmail marks as spam! So the only thing I can do is log in to the web client often and check the spam folder...

Posted by: geza at February 28, 2007 6:19 AM

GMAIL SUCKS MY IMPORTANT EMAILS WERE SENT TO SPAM WHICH I DELETED.

Posted by: Jonathan Lee at April 13, 2007 5:20 PM

Try using Gmail for more than a year - You'll notice about 1 spam email per day making its way into your mailbox.

I'm tired of these people saying Gmail has such an amazing spam filter. It's not that great, people. Just today I got an email with "viagra" in the subject line - How Gmail didn't filter that is beyond me...

Posted by: J.C. Biggums at August 21, 2007 10:50 AM

I am a CS Professor who has been using Gmail since early on, and I find its spam treatment incredibly good.

Yesterday, as an in-class demo in setting up qmail, I sent some emails to several of my addresses, all of which go to different places but end up being forwarded to Gmail. They all ended up in spam, so it will be a bit interesting to sort out how. The originating address was one I had just created on a machine we were configuring.

I have followed your work since the elm days. Keep it up.

Posted by: Douglas Harris at October 9, 2007 9:27 AM

I think Gmail works very well. But I don't understand the following:

In my spam folder I currently have 9 mails received in the last 20 days from apparently the same spam sender, e.g.:

marcel.janssen@gmail.com (Joanne Anthony),
marcel.janssen@gmail.com (Herminia Sneed) etc.

Eventually Gmail will delete them, and that's OK, but THE QUESTION IS: since the Gmail software knows the address the spam comes from, why doesn't Gmail block the address?

And, to avoid errors, it could simply give me access to the list of recently banned senders whose spam I received.

Posted by: mk1 at October 28, 2007 1:38 AM

Gmail has been deleting SPAM in my account... dunno why, when my SPAM directory reaches ~900 it gets to 800 again... and I keep receiving SPAM.

Posted by: foobar at December 4, 2007 1:47 PM

GMail rules! In fact, Google rules. So much of what they do I am not only interested in but use on a regular basis, and I am constantly amazed at how they are improving their systems. GMail is just a case in point. I was just bitching about how many spam messages I get but forgot to bear in mind that ALL of the 400+ messages (accumulated over a month) in the spam box were spam. Only 3 had managed to find their way into my normal box.

But keeping the system working relies on all of us doing our part and notifying Google of any undetected spam. And remember, as long as you haven't selected 'Delete Forever', anything you do delete accidentally is stored in your bin folder for 30 days.

Pi

Posted by: Seaniepie at April 24, 2008 4:06 AM

one MAJOR problem:
Spam filtering cannot be turned off!!!

I'm reading my Gmail mail with MS Outlook.
Since some messages find their way to the spam folder by mistake, I don't get them in Outlook, and I have to sign in to the Gmail site to go over my spam folder!

Posted by: Saar at May 12, 2008 9:39 AM

My only real concern over the spam is that I wish the messages that are legitimately spam would simply not go to my spam folder, period. It is nice that the messages are deleted after 30 days, but those of us who don't check it every once in a while will have hundreds of spam messages, and going through them is painstaking work that I simply do not enjoy. I would rather have the occasional spam go to the filter than have the filter fill itself hourly as it does now.

Posted by: Ben at September 30, 2008 6:18 AM


© 2002 - 2008 by Dave Taylor. All Rights Reserved.

Note: This web site is for the purpose of disseminating information for educational purposes, free of charge, for the benefit of all visitors. We take great care to provide quality information. However, we do not guarantee, and accept no legal liability whatsoever arising from or connected to, the accuracy, reliability, currency or completeness of any material contained on this web site or on any linked site.
