I will be writing this article in steps, descending from a conceptual level down to the code. At the moment, the article only discusses the main ideas and concepts in use by my mail filter. While the code is readily available, you may want to wait until a future edition of this article discusses it; the code is somewhat unnecessarily complicated in places and a bit broken in others. You can subscribe to receive updates to this article using the following feeds:
Like many active participants of the Free software field, I have to deal with massive quantities of mail. To give you a vague idea, my email addresses receive a cumulative 44'000 messages/day (2007 year average), more than one every two seconds. Of those, around 93% are filtered out as spam at the perimeter, and three thousand or so hit my mail filter. I see about 100 of them each day, and I would not want that to be more.
The mail filter's job is to classify and handle these messages. For instance, a simple mail filter might consult SpamAssassin and take its advice to deliver the message to the inbox or the spam folder.
Another task of a mail filter would be to sort mails: mails to my university address should be delivered to the university folder, mails to my Debian address to the Debian folder, and mails about ponies to my high priority queue.
The mail filter can make decisions about the fate and final destination of an incoming message, judging from its headers and body, and it can consult other programmes. On a Unix platform, this opens the doors fairly wide for cool stuff.
The following article explains my mail filtering requirements, additional productivity features I delegated to the filter, and how I coded and put it all into production.
- Filtering on recipient address
- Identity and mail contexts
- My basic mail filtering requirements
- Additional features I want and need
Email headers can be trivially faked… except for one: the
recipient address. Instead of complex header analysis, subject
pattern matching, or other brittle approaches, I do most of my mail
filtering according to which recipient address was used in a given
message. I have made an effort in the past years to use specific
addresses for specific tasks. Thus, for anything research-related, I use the
university address, and when
somesite.com asks me for
an email address when I sign up for their newsletter, I use
Moreover, I sometimes make use of the plus-extension of any
decent mail server, and use e.g.
firstname.lastname@example.org as correspondence
address in a discussion on "sometopic". Unfortunately, I found that
the plus extension does not work with a plethora of services out on
the Internet, which led me to adopt the (better) equal-sign-based
approach shown above.
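To make the idea of address extensions concrete, here is a minimal Python sketch of how a filter might split a recipient address into its base user and extension. The exact address layout is an assumption on my part: the sketch supposes a local part of the form `user+extension` or `user=extension`, and the name `split_recipient` and the sample addresses are purely illustrative.

```python
from typing import NamedTuple, Optional


class Recipient(NamedTuple):
    user: str                 # base local part, e.g. "jane"
    extension: Optional[str]  # text after the separator, or None if absent
    domain: str


def split_recipient(address: str, separators: str = "+=") -> Recipient:
    """Split 'user<sep>extension@domain' into its parts.

    Assumption: the first '+' or '=' in the local part starts the extension.
    """
    if "@" not in address:
        raise ValueError(f"not an email address: {address!r}")
    local, _, domain = address.strip().lower().rpartition("@")
    # Positions of each separator that occurs after at least one character.
    positions = [i for i in (local.find(s) for s in separators) if i > 0]
    if not positions:
        return Recipient(local, None, domain)
    cut = min(positions)
    return Recipient(local[:cut], local[cut + 1:], domain)
```

With this in place, the extension (a topic or a site name) is available as a plain string that later filtering stages can match on, without looking at any other header.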
I do deal with different institutions and projects, and when there's no real overlap, I prefer to keep them separate from each other. However, instead of polling many different mailboxes, I have all mail arrive at a single point — a single point of failure, but the best solution I've found so far.
I like the single-inbox approach to mail: all mail arrives in a single mailbox and as you deal with the messages, you move them away or delete them. However, given my multiple identities, I prefer to assume a certain context and do not want to deal with unrelated messages. Therefore, I organise and monitor my inboxes such that I only see mail related to my current context, while other mail queues up in the inbox(es) related to other contexts; when I later switch contexts, I can then start to process the queue.
In addition, certain contexts have subcontexts. For instance, for the Debian project, I maintain a number of packages. I prefer to work in burst mode on those: I'll set aside a day in the future to work on a certain package, and on that day, I switch context to the Debian-subcontext for that package and process all the mails which have queued up in the associated mailbox.
With certain messages, it's not possible to identify a single context: my research is Debian-related, and certain messages apply both to my involvement in Debian and to my research. I have not found a good way to deal with this aspect. Obviously, tags would do the trick, but my mail client cannot handle tags satisfactorily, and a filesystem-based approach has not addressed the issue either. Until this problem is solved, I will stick to my suboptimal category hierarchy. I have looked at other clients, but none of them came close to mutt in terms of flexibility and suitability to my workflow.
My mail filter's main purposes are the tasks described below. If you've inspected mail filters before, none of them will seem particularly special to you, and you might immediately suggest procmail as the tool for the task. In fact, that's what I chose, but more on that later.
Sometimes, a message arrives at my filter more than once, which
happens mostly for list mail.
formail -D is
made for this task.
One problem I have yet to solve relates to my use of recipient addresses as a filter criterion: if I get copied on a list message, it's likely that the list message arrives after the mail sent to me directly. With default duplicate weeding, this means that the direct mail gets delivered to my inbox, while the list message is dropped, leaving list mail in my normal inbox. I'd prefer it to be the other way around, but have not found a reasonable approach to that.
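The kind of check that formail -D performs can be sketched in Python as a bounded cache of Message-IDs. This is only an illustration of the idea, not formail's implementation: formail keeps its cache in a file limited by size in bytes, whereas the hypothetical `MessageIdCache` below lives in memory and is limited by entry count.

```python
from collections import OrderedDict
from email import message_from_string
from email.message import Message


class MessageIdCache:
    """Remember recently seen Message-IDs and flag repeats."""

    def __init__(self, max_entries: int = 8192) -> None:
        self.max_entries = max_entries
        self._seen: "OrderedDict[str, None]" = OrderedDict()

    def is_duplicate(self, msg: Message) -> bool:
        msg_id = (msg.get("Message-ID") or "").strip()
        if not msg_id:
            return False  # nothing to compare against; treat as new
        if msg_id in self._seen:
            return True
        self._seen[msg_id] = None
        if len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)  # forget the oldest ID
        return False
```

Note that whichever copy arrives first wins, which is exactly why the directly addressed copy tends to survive and the list copy gets dropped.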
A large number of incoming messages are plain unwanted. This includes mail to addresses which have made it into spammer directories or matching obvious patterns, or messages I simply want to ignore (e.g. the monthly mailman password reminder). Those are all discarded at an early stage and hence don't cause much work for the filter.
Most posts to mailing lists are another type of unwanted mail. I am subscribed to approximately 300 mailing lists, which generate around two thirds of the incoming mail traffic that hits my filter. After spam filtering, about a thousand messages remain, but guess who won't be reading or even processing those! Yeah, not I.
At some point I thus came up with an approach I called "justme" then, and even though the name's a bad match, it stuck. The basic idea is simple: with a few exceptions, all mass mail is discarded, unless it
- comes from one of my machines, or
- was sent in reply to a message I sent, or
- mentions my name or nickname anywhere, or
- contains one or more keywords I defined, or
- is of (list-)administrative nature
My setup has the provision to exempt certain mailing lists from this filtering, in case I want to follow all discussions.
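The "justme" rules above can be sketched as a single predicate. This is a rough Python illustration under several assumptions of mine, not the actual filter: "my machines" are recognised by the host part of Message-IDs, administrative mail by a sender address ending in `-request` or `-owner`, and all names (`JustMe`, `my_hosts`, `exempt_lists`) are hypothetical.

```python
import re
from dataclasses import dataclass
from email import message_from_string
from email.message import Message
from email.utils import parseaddr
from typing import FrozenSet

# Captures the host part of every <local@host> token in a header value.
MSGID_HOST = re.compile(r"<[^<>@\s]+@([^<>\s]+)>")
# Captures the identifier inside the angle brackets of a List-Id header.
LIST_ID = re.compile(r"<([^<>\s]+)>")


@dataclass(frozen=True)
class JustMe:
    my_hosts: FrozenSet[str]   # assumption: lowercase hosts my mail clients run on
    names: FrozenSet[str]      # my name and nicknames
    keywords: FrozenSet[str]   # topics I care about
    exempt_lists: FrozenSet[str] = frozenset()  # lists I follow in full

    def _mine(self, header_value: str) -> bool:
        hosts = MSGID_HOST.findall(header_value or "")
        return any(h.lower() in self.my_hosts for h in hosts)

    def keep(self, msg: Message) -> bool:
        """Return True if a mass mail should survive the filter."""
        list_id = LIST_ID.search(msg.get("List-Id", ""))
        if list_id and list_id.group(1).lower() in self.exempt_lists:
            return True
        if self._mine(msg.get("Message-ID", "")):       # sent from my machines
            return True
        if self._mine(msg.get("In-Reply-To", "")) or \
                self._mine(msg.get("References", "")):  # reply to my message
            return True
        text = msg.as_string()  # headers and raw body; encoded parts stay encoded
        for word in self.names | self.keywords:
            if re.search(r"\b" + re.escape(word) + r"\b", text, re.IGNORECASE):
                return True
        # Assumption: list-administrative mail comes from -request/-owner.
        sender = parseaddr(msg.get("From", ""))[1].partition("@")[0].lower()
        return sender.endswith(("-request", "-owner"))
```

Anything for which `keep()` returns False would go to the discard archive rather than an inbox.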
Today's spam filters are rather advanced. SpamAssassin does an amazing job for me, and it gets infinitely better as soon as you start to train it properly. However, it comes at a fairly high cost, taking several seconds and CPU cycles to arrive at a verdict for a single message. Part of the motivation to write this mail filter was to address the limitations of my old setup, which was on its last legs and about to break down under the load.
I use pretty strict access control on my postfix mail server: it refuses 93% of incoming mail, according to some simple checks and a number of [[!wp DNSBL]]s. Some may call this setup overly fascist, and it's quite possible that I am rejecting legitimate mail, but I believe that those cases are for the sender to worry about, who should be sending mails with a standards-compliant mailer or via a reputable provider if s/he wants them to arrive. It reduces the load of my mail filter to less than a tenth.
Moreover, I already weed out a large quantity of unwanted mail,
as described above. Therefore, my spam filters only need to process a
fairly small number of messages, but I can still do
better. Specifically, there is little use in analysing
messages I know to be good, e.g. messages sent by
rss2email. Moreover, some companies send legitimate
mail which consistently trips the spam filters. Since they use an
exclusive address to reach me, and assuming they didn't sell it to
spammers (few companies I have ever dealt with do), I don't need to
inspect messages to this address for spamminess.
I am aware of whitelist features of most spam filters, but I chose to do this whitelisting even before invoking the spam filter, simply because it saves resources and provides more flexibility. My system load is back to normal anyway.
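As a sketch of whitelisting ahead of the spam filter, the following Python function only invokes the expensive check when neither the recipient address nor the sender is trusted. The names `is_spam`, `trusted_recipients`, `trusted_senders` and `spam_check` are hypothetical, and trusting a sender address is my assumption for locally generated mail such as rss2email output; it would not be safe for mail arriving from outside.

```python
from email import message_from_string
from email.message import Message
from email.utils import parseaddr
from typing import Callable, Collection


def is_spam(
    msg: Message,
    recipient: str,
    trusted_recipients: Collection[str],
    trusted_senders: Collection[str],
    spam_check: Callable[[Message], bool],
) -> bool:
    """Run the costly spam check only for mail that is not known to be good."""
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if recipient.lower() in trusted_recipients:
        return False  # exclusive address handed to a single company
    if sender in trusted_senders:
        return False  # assumption: locally generated, e.g. a feed-to-mail tool
    return spam_check(msg)  # e.g. a wrapper around the real spam filter
```

The point of the ordering is that the cheap set lookups come first, so the spam filter never sees the whitelisted messages at all.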
The Internet would be a much better place if an email were just an email, and all emails were created equal. Unfortunately, this is so far from reality that it makes me cry. Thus, at various points during the filtering of a message, I need to be able to handle special cases, or even rewrite messages to normalise them.
I keep a three-month rollover archive of copies of all mails that pass the aforementioned filters, which I call the "spool". This has often been helpful when I accidentally deleted a message with my mail client that I should have kept.
In addition, whenever my filters discard a message, they actually store it to a similar, but separate rollover archive called "discard". Those are simply mailboxes, but a script on the server runs over them and purges messages which are older than three months. This is extremely useful if mails were accidentally filtered, or I join in a discussion which has previously been weeded out — I can then just obtain the entire thread and work with it as usual.
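A purge script of this kind could look roughly like the following Python sketch. It assumes the archives are stored in Maildir format and approximates three months as 90 days; the function name `purge_old` is hypothetical, and the age is taken from each message's delivery time (the file's modification time), not from its Date header.

```python
import mailbox
import time
from typing import Optional

NINETY_DAYS = 90 * 24 * 60 * 60  # rough stand-in for "three months"


def purge_old(
    maildir_path: str,
    max_age_seconds: int = NINETY_DAYS,
    now: Optional[float] = None,
) -> int:
    """Delete messages delivered more than max_age_seconds ago.

    Returns the number of messages removed.
    """
    now = time.time() if now is None else now
    box = mailbox.Maildir(maildir_path, create=False)
    removed = 0
    for key in list(box.keys()):
        # MaildirMessage.get_date() is the delivery time in epoch seconds.
        if now - box[key].get_date() > max_age_seconds:
            box.remove(key)
            removed += 1
    return removed
```

Run from cron once a day against both the spool and the discard archive, this keeps each of them at a rolling window of recent mail.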
I also forward all my list mail to Gmail to be able to use their search interface once in a while, and to.
Finally, for messages which are still alive and undestined, the filter should determine the appropriate inbox and deliver the message to it. I can do this mostly using recipient addresses, so this is hardly worth writing about.
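For completeness, delivery by recipient address boils down to a table lookup. The Python sketch below is only an illustration: the function `choose_folder`, the `"@domain"` fallback convention and the folder names are all assumptions of mine, not part of the actual setup.

```python
from typing import Mapping


def choose_folder(
    recipient: str,
    routes: Mapping[str, str],
    default: str = "inbox",
) -> str:
    """Map a recipient address to a folder name.

    Exact addresses take precedence; a key of the form "@domain" acts as a
    catch-all for that domain (a convention invented for this sketch).
    """
    address = recipient.strip().lower()
    if address in routes:
        return routes[address]
    domain = address.rpartition("@")[2]
    return routes.get("@" + domain, default)
```

A routing table for two contexts might then read `{"jane@uni.example.edu": "university", "@debian.example.org": "debian"}`, with everything else falling through to the default inbox.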
Any complex setup can blow up, and when it does it's nice to be able to figure out the cause. Pained by memories of retrospective debugging sessions, I decided to
- Delaying delivery
- Logging
- Deferring
- Resubmission and cleanup
- Autotraining spam filters
- Tickler queue