Blog

(posted on 15 Mar 2008)

I've noticed that the amount of spam I get has been going up. Until about a month ago, I was receiving about 1000 spam messages a day, but that has risen to about 3000 per day over the last week or so. I have been using GMail to manage my email and it had been great at filtering out this spam: virtually no false positives (good messages going into the spam folder) and about 1-2% false negatives (spam not getting put into the spam folder). That left me with about 10-20 spam messages a day to deal with. Not too much overhead. Sometime over the last couple of days, Google must have changed their spam filters in some way. I suspect it was in response to increasing levels of spam. The net effect was that the false positives went from practically none to about 70%. In other words, about 70% of my legitimate email was going into the spam folder along with 3000 spam messages.

Well, that made GMail's spam filter just about useless. It was time to see if I could filter out some of this spam before it got to GMail, so that the spam folder would be small enough for occasional, manual false positive checks. So the first question is "How is it possible to get 3000 spam messages a day?" That's easy. I have two domain names that send all email, regardless of address, to my GMail account. I've had these for many years and use them to create ad hoc "BACN" email addresses for signing up for new services. I'll call these domains my BACN domains and use BACN.com generically. I embed a standard code and the website's domain name into the email address so that if I start to get spam, I know who to blame (and block). For example, my email address might look like this: asdfa.newwebsite.com@BACN.com. The "asdfa" code (not what I really use) is a string that I've embedded with the thought that at some time I could use it to help with my spam filtering. That time is now!
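As a rough illustration of the addressing scheme just described, here is a small Python sketch. It assumes the code always comes first in the local part, followed by a dot and the website's domain, as in the asdfa.newwebsite.com@BACN.com example. The function names make_bacn_address and blame are made up for this sketch, and "asdfa" and BACN.com are the same dummy values used above.

```python
def make_bacn_address(site, code="asdfa", domain="BACN.com"):
    """Build a per-site signup address, e.g. asdfa.newwebsite.com@BACN.com."""
    return f"{code}.{site.lower()}@{domain}"


def blame(address, code="asdfa"):
    """Return the website an address was created for.

    Returns None when the local part does not start with the code,
    which is what a made-up or harvested address would look like.
    """
    local, _, _domain = address.rpartition("@")
    prefix = code + "."
    if not local.lower().startswith(prefix):
        return None
    return local[len(prefix):].lower()


print(make_bacn_address("newwebsite.com"))        # asdfa.newwebsite.com@BACN.com
print(blame("asdfa.newwebsite.com@BACN.com"))     # newwebsite.com
print(blame("href.mailto@BACN.com"))              # None
```

If spam starts arriving at one of these addresses, the second function is all it takes to see which signup leaked it.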

I've learned a few things about spam from using these catchall BACN email setups. First, a number of websites have sold/given/lost their email lists to spammers. A few that come to mind are Napster, Bicycle.com, and my local gas and electricity company. It is also very interesting to see just how much spam is sent to made-up email accounts. I see a lot of random-looking strings as email accounts. Others look like they might be an account name from some other domain with my BACN domain tacked on the end. Others include HTML tags and attributes (like HREF or MAILTO) and are obviously due to HTML parsing errors when the spammers were trying to harvest email addresses from web pages.

Another factor in my large number of spam messages is that I manage several hundred domain names. Some are for my own projects, others are for clients, friends and relatives. A lot of these domains have legitimate email addresses that forward to me. I've yet to find any way to keep an email address spam-free short of never telling anyone about it and not using it. Also, each domain registration must have a legitimate contact email address, and it's really important that I get any legitimate email that is sent to these accounts. I have 3 email addresses that are used for this purpose, so they end up in the public whois registration database entries for those domains. The whois database is a favorite place for spammers to harvest email addresses, so these 3 addresses get spammed heavily.

So how to do some pretty brutal spam trimming? My solution is not for everyone. It involves Sendmail, Procmail and an extra GMail account. I happen to have the luxury (and the associated maintenance overhead) of a dedicated Debian Linux server that handles some of my clients' email and all of my own. I could run SpamAssassin or other Linux server spam filtering software, but I want to keep this simple to implement and manage. I've used these server-based spam filters in the past but found them to be overkill for a relatively small number of people. Spam filtering is not a service that I need to offer my clients. Most of the email that comes to this server just gets forwarded off to some other email account via a Sendmail virtusertable configuration file. Even my own email just gets forwarded to my GMail account. So my first line of defense against the spam was to create a local email account to which I forward all of my BACN. I then implemented a procmail filter that only forwards mail that has the special code "asdfa" in the To address field. What gets forwarded is what I call potentially good BACN. What gets left is pure spam and is discarded. Here is an example of that filter with dummy data and email addresses inserted:

# Forward mail whose To: header contains the code to the filtering GMail account
:0
* ^To:.*asdfa
! spamfilteraccount@gmail.com

spamfilteraccount@gmail.com is not a real GMail account (at least it's not mine) but just a placeholder for my real, spam-filtering-only GMail account. I forward my potentially good BACN to this GMail account along with my whois database email addresses and a few other heavily spammed accounts. I set up that GMail spam account to immediately forward all mail to my real GMail account. It only forwards messages that don't get caught in its spam filter. False positives in this stream of email are tolerable because this email is BACN plus some spam.
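For anyone who doesn't read procmail, the test that recipe performs can be mimicked in a few lines of Python. This is only a sketch of the matching logic, not something that runs on my server. The function name is_potentially_good_bacn and the sample messages are made up for illustration, and "asdfa" is still the dummy code.

```python
import re
from email import message_from_string

CODE = "asdfa"  # dummy code from this post, not the real one


def is_potentially_good_bacn(raw_message, code=CODE):
    """Return True when any To: header of the message contains the code.

    Like the procmail recipe, this looks only at the To: header and
    ignores case, so mail that reaches the address via Cc or Bcc
    would not match.
    """
    msg = message_from_string(raw_message)
    pattern = re.compile(re.escape(code), re.IGNORECASE)
    return any(pattern.search(value) for value in msg.get_all("To", []))


signup_mail = (
    "From: news@newwebsite.com\n"
    "To: asdfa.newwebsite.com@BACN.com\n"
    "Subject: Welcome\n"
    "\n"
    "Thanks for signing up.\n"
)
harvested_junk = (
    "From: someone@example.net\n"
    "To: href.mailto@BACN.com\n"
    "Subject: Buy now\n"
    "\n"
    "Cheap stuff.\n"
)

print(is_potentially_good_bacn(signup_mail))     # True
print(is_potentially_good_bacn(harvested_junk))  # False
```

The second sample is the kind of address that comes from a spammer's broken HTML parsing, and it fails the test because the code is nowhere in it.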

So now I have a 4-level spam filtering strategy.

1. A sendmail virtusertable file blocks some known spammed email addresses that I just don't need any more, like my bicycle.com website email address. It also forwards my main contact email addresses directly to my main GMail account. This short circuit of the process reduces the chances of false positives, and even if there are false positives, they will show up in my main GMail account. This account's spam folder won't get too diluted by spam, so I can occasionally check for them.
2. BACN+spam is sent to a local email account that has a procmail filter to strip out all email that doesn't have "asdfa" in the To field.
3. Potentially good BACN is sent to a special spam GMail account that is used to filter out real spam sent to BACN email addresses.
4. Finally, I use my main GMail account's spam filtering as a last line of defense, but I can still check it for false positives.
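To make the order of those decisions concrete, here is a toy Python sketch of the routing. It is a simplification under several assumptions: it decides on the recipient address alone (the real procmail step looks at the To header), every address in the sets below is hypothetical, and GMail's own filtering in the last two levels is not modeled at all. The function only names where a message would be sent next.

```python
# All of these addresses are hypothetical placeholders.
BLOCKED = {"asdfa.bicycle.com@bacn.com"}     # level 1: known spammed, no longer needed
MAIN_CONTACTS = {"me@example.com"}           # level 1: short circuit to main GMail
WHOIS_ADDRESSES = {"hostmaster@example.com"} # heavily spammed, sent to the filtering account
BACN_DOMAINS = {"bacn.com"}                  # catchall domains


def route(recipient, code="asdfa"):
    """Return the next hop for a recipient address."""
    addr = recipient.strip().lower()
    local, _, domain = addr.rpartition("@")
    if addr in BLOCKED:
        return "reject"
    if addr in MAIN_CONTACTS:
        return "main-gmail"
    if addr in WHOIS_ADDRESSES:
        return "filter-gmail"
    if domain in BACN_DOMAINS:
        # Level 2: only addresses carrying the code survive.
        return "filter-gmail" if code in local else "discard"
    return "unhandled"


for address in (
    "asdfa.bicycle.com@BACN.com",
    "me@example.com",
    "asdfa.newwebsite.com@BACN.com",
    "href.mailto@BACN.com",
):
    print(address, "->", route(address))
```

Anything that comes back as "filter-gmail" still has to get past two GMail spam filters before it lands in my inbox.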

I implemented this strategy about 3 hours ago. The procmail filter has caught about 200 messages since then. All spam. The GMail spam account has caught about 40 spam messages. All real spam sent to my BACN and whois accounts. My main GMail account has caught 5 spam messages and missed one that I had to manually mark as spam.

That feels much better!