A user-space filesystem for mail labeling

I am still looking for a solution that allows me to tag my mails similar to what Gmail offers.

Having thought about this for quite a bit now, I can see three solutions, implemented at different levels in the mail flow: as a filesystem, as a patch to the IMAP server (Dovecot), or as a patch to mutt. Since this post is about the filesystem, I'll concentrate on that, but I'll also touch upon the other two solutions briefly.

A bit of theory

First, let's dig into the theory and requirements for a bit. Instead of sorting mails into folders (one mail can only ever live in one folder), I want to tag my mails, such that a single message can have multiple tags and hence be relevant to multiple categories. If you want to think in terms of folders, then a single message could be attached to multiple folders, e.g. via hard links. This is what Mairix approximates. The problem with this approach is that simple operations, such as deleting a message (removing it from the enclosing folder) or marking it as read (renaming the file) only affect the message in the current folder, not its duplicates/clones in the other folders.
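To make the hard-link problem concrete, here is a minimal sketch in Python. The folder names and the message filename are made up for illustration, and the rename to cur/ with the ":2,S" suffix is the usual Maildir way of marking a message as seen.

```python
import os
import tempfile


def demo_hardlink_flag_drift():
    """Show that a flag change in one folder does not reach a hard-linked clone."""
    root = tempfile.mkdtemp()
    for folder in ("Debian", "PhD"):  # hypothetical folder names
        for sub in ("tmp", "new", "cur"):
            os.makedirs(os.path.join(root, folder, sub))

    name = "1234567890.M1P1.host"  # hypothetical Maildir filename
    debian_new = os.path.join(root, "Debian", "new", name)
    phd_new = os.path.join(root, "PhD", "new", name)

    with open(debian_new, "w") as fh:
        fh.write("Subject: test\n\nbody\n")

    # Attach the same message to a second folder via a hard link.
    os.link(debian_new, phd_new)

    # Mark the message as read in the Debian folder only: this is a rename
    # of one directory entry; the other entry keeps its old name.
    os.rename(debian_new, os.path.join(root, "Debian", "cur", name + ":2,S"))

    debian_cur = sorted(os.listdir(os.path.join(root, "Debian", "cur")))
    phd_still_new = sorted(os.listdir(os.path.join(root, "PhD", "new")))
    return name, debian_cur, phd_still_new
```

Both directory entries still point at the same data, but the "seen" flag lives in the filename, so the copy in the second folder stays unread.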

So it's best to stay with a single file per message and to associate tags with it. Here, two approaches seem plausible: (a) extracting tags from the individual message files, and (b) storing tags/message pairs in a separate database. I have a gut feeling that (a) is the better approach, especially since the files we're dealing with all have the same format. And in fact, there is a semi-standard header for email messages to store such tags: X-Label.

One more paragraph of theory: instead of simple tags, like Gmail provides, I want hierarchical tags, for instance:

X-Label: Debian::pkg::mdadm
X-Label: Debian::popcon, PhD::Debian::stats

Hierarchical tags work similarly to folders in that any message tagged as Debian::pkg::mdadm also belongs to the Debian::pkg and Debian categories. No rocket science.
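As a rough sketch of that expansion in Python: the function names are my own, and I am assuming that multiple tags in one header are separated by commas, as in the example above.

```python
def parse_labels(header_value):
    """Split one X-Label header value on commas into individual tags."""
    return [tag.strip() for tag in header_value.split(",") if tag.strip()]


def expand_tag(tag, sep="::"):
    """Return the tag's ancestors and the tag itself, most general first."""
    parts = tag.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]


def categories(header_values):
    """Collect every category implied by a message's X-Label headers."""
    result = set()
    for value in header_values:
        for tag in parse_labels(value):
            result.update(expand_tag(tag))
    return result
```

For instance, expand_tag("Debian::pkg::mdadm") gives Debian, Debian::pkg and Debian::pkg::mdadm, in that order.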

mutt

I mentioned earlier that this could be implemented in various places, and I'll start at the highest level, with mutt, the mail reader. For quite a while now, mutt has provided search operators for the X-Label header, such that you can search or limit your mailbox e.g. like so: ~y \\<PhD::Debian\\> | ~y \\<popcon\\>. The problem with this approach is that it reuses existing functionality, which I need on a regular basis: searching and limiting. Furthermore, I think this approach is hackish and brittle and exposes too much of the underlying workings.

Now it would probably be trivial to hack mutt and add a new layer of filters, which would be used in addition to the user-specified limit. At a high level, the user would set a tag expression, like PhD::Debian | ::popcon, and may additionally choose to limit the display of messages to those sent from the debian.org domain: ~f debian.org$. mutt would then assemble the pieces and apply the filter (~y \\<PhD::Debian\\> | ~y \\<popcon\\>) ~f debian.org$ to the current view.
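The assembly step could look roughly like the following Python sketch. It is not mutt code, and the function names are hypothetical. I am assuming the tag expression only uses | (OR), and that a leading :: simply means "this tag at any level", which is how the example above treats ::popcon.

```python
def tag_expr_to_pattern(tag_expr):
    """Turn a tag expression like 'PhD::Debian | ::popcon' into ~y patterns."""
    terms = []
    for term in tag_expr.split("|"):
        term = term.strip()
        if term.startswith("::"):
            # Assumption: a leading '::' matches the tag at any depth.
            term = term[2:]
        if term:
            # Doubled backslashes as written in the limit pattern above.
            terms.append(r"~y \\<" + term + r"\\>")
    return "(" + " | ".join(terms) + ")"


def build_limit(tag_expr, user_limit=""):
    """Combine the tag filter with the user's own limit pattern."""
    return (tag_expr_to_pattern(tag_expr) + " " + user_limit).strip()
```

Calling build_limit("PhD::Debian | ::popcon", "~f debian.org$") reproduces the combined filter shown above. A real implementation would also have to escape regex metacharacters in tag names, which this sketch ignores.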

With a reasonable user interface to specify or pick labels, as well as a simple label editor, this would almost cut the mustard. Thanks to the header_cache patch (which also indexes the X-Label header), mutt fares reasonably well on Maildirs with thousands of messages (well, not always), and binding keys like s (save) and d (delete) to appropriate actions to add and remove labels determined from the destination or the current tags filter, would make mutt a force to be reckoned with.

Dovecot

Unfortunately, implementing this in mutt would mean that I'd be bound to the mail reader forever (and I like to pretend that's not the case, haha!).

An alternative would be to implement this on the IMAP server, which already provides IMAP folders which are not necessarily bound to physical, filesystem folders and can intercept all commands that manipulate messages or their status.

But while in theory this works fine, it'll break when offline tools, such as offlineimap, come into play, as those (need to) instantiate every message they download, which brings us right back to the problem with duplicate/cloned files discussed before.

A filesystem idea

Trying to stay independent of the tools I use, I started playing with the idea of implementing mail message tagging at the filesystem level, using Fuse. The user would run a daemon process against a given Maildir holding all mail (the base store) and export a virtual directory hierarchy in which tags become Maildirs, backed by a cache of X-Label:filename pairs.

The virtual directory hierarchy should also export the familiar filesystem semantics:

Unlike tagfs, which uses an external SQLite database for tags, this filesystem should not use an external database for tags, but rather a run-time cache. That way, tags can be propagated to other machines with IMAP synchronisation tools.
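To give an idea of what such a run-time cache might look like, here is a minimal sketch using Python's standard mailbox module. The function name and the shape of the cache (tag mapped to a set of message keys) are my assumptions, not a design decision; the cache is rebuilt purely from the X-Label headers in the base store.

```python
import mailbox
from collections import defaultdict


def build_label_cache(maildir_path):
    """Scan a Maildir and map each tag (and its ancestors) to message keys."""
    cache = defaultdict(set)
    box = mailbox.Maildir(maildir_path, factory=None, create=False)
    for key, msg in box.iteritems():
        # A message may carry several X-Label headers, each with several tags.
        for value in msg.get_all("X-Label", []):
            for tag in value.split(","):
                tag = tag.strip()
                if not tag:
                    continue
                parts = tag.split("::")
                # Register the message under the tag and every ancestor.
                for i in range(1, len(parts) + 1):
                    cache["::".join(parts[:i])].add(key)
    return dict(cache)
```

Since everything in the cache is derived from the message files themselves, it can be thrown away and rebuilt on any machine that has a copy of the Maildir.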

Initially, I thought such a filesystem could even be deployed "underneath" the IMAP server, but then we obviously run into the same problems with offline tools as before.

Potentially, such a filesystem could be incredibly powerful as it could do its work for any user agent directly dealing with Maildirs (such as mutt), and it could also be used as the basis for an IMAP server which is only ever accessed synchronously (in online rather than offline mode). In the latter case, one would have to ensure that filenames don't change across IMAP operations, or use the message IDs instead from the start.

But as good as it may sound, getting it to work right will require quite some attention, I think. Especially guaranteeing filesystem atomicity, as required for Maildirs, might be impossible through the Fuse layer.
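For reference, the atomicity in question comes from the way Maildir delivery works: a message is written to tmp/ and only then moved into new/ with a single rename, so readers never see a half-written file. A simplified Python sketch follows; the unique-name scheme here is my own shorthand rather than the full Maildir naming convention, and a Fuse layer would have to offer the same guarantee for its virtual directories.

```python
import itertools
import os
import socket
import time

_sequence = itertools.count()  # keeps names unique within one process


def deliver(maildir, data):
    """Write a message to tmp/, then rename it into new/ in one step."""
    # Simplified unique name: timestamp, pid, per-process counter, hostname.
    unique = "%d.P%dQ%d.%s" % (
        time.time(), os.getpid(), next(_sequence), socket.gethostname())
    tmp_path = os.path.join(maildir, "tmp", unique)
    new_path = os.path.join(maildir, "new", unique)

    # "x" mode fails if the file already exists instead of overwriting it.
    with open(tmp_path, "xb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())

    # rename() within one filesystem is atomic: the message appears in
    # new/ either completely or not at all.
    os.rename(tmp_path, new_path)
    return new_path
```

The sketch assumes tmp/ and new/ live on the same filesystem, which is exactly the property that becomes hard to guarantee once the directories are virtual.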

The sad end

Unfortunately, I don't have the time to implement any of the above, but if I did, I would probably try to patch mutt first, even though such a patch may not make it into the upstream source any time soon. The filesystem idea sounds cooler and cleaner, but it does not yet support OR/NOT queries, and it will be quite a task to implement and debug. On the other hand, dealing with Maildirs of several thousand messages in mutt is a bit of a pain if those Maildirs are being updated externally (e.g. mail is delivered to them).

At least the idea is out now. If you are interested and want to work on this, please let me know so that I can link you up with other interested people.

And if you have any other input, please let me know. And no, I don't really want to install emacs to read my mail.

NP: Dream Theater: Train of Thought

Update: I've created a mailing list for public discussion of this topic, so please subscribe if you're interested.

Thomas Viemann suggested using IMAP flags for the task, as these are standardised. However, I see two show-stoppers here: first, I am not sure whether it would be possible to set those flags from procmail on delivery (sieve can supposedly do it, but I don't consider sieve adequate for my needs). And second, offline tools, such as offlineimap, which map IMAP folders to filesystem Maildirs, would have exactly the same problem with representing affiliation of a single message with multiple tags. You can read up on the discussion in the archives of the offlineimap mailing list, and soon the new mailtags list.

Martin Scholl had the idea to extend dbmail with the concept of virtual folders (or stored searches). Naturally, as a database-backed mail system, it would be trivial and fast to implement, but this approach poses two problems for me: first, it also falls short with respect to offline tools (see above), and second, it would require me to replace my entire mail infrastructure. I am not even considering the implications of using a database for mail storage.

Update: I just found pytagsfs and have suggested to the author to abstract the tags store, so that e.g. an sqlite database can be used instead of tags stored in files (which means a lot of redundancy when you have multiple files belonging to a collection, such as a music album).