Information management 2.0

Dear Lazyweb: my information management is in dire need of an upgrade. In particular, I am looking for two features, the second of which is an extension of the first. So let's start with the first:

Given a certain issue — a job, a project, … — I write and receive emails, send and receive letters, make and receive phone calls, and author all kinds of documents. In addition, I may have other tidbits of information, such as bookmarks and contact information of the involved people, which are related to the issue at hand.

I can store the documents and letters in folders on my filesystem, but emails, bookmarks, and contact information are isolated from the filesystem by the applications through which I access them, even though they may be stored in standard formats. I do not want to create redundant information by copying emails to the respective folders in the filesystem, not only because of the error potential in forgetting that one message which you later need.

What I want is a high-level view — ideally command-line based — into my information, such that for any given project, I can easily and immediately access all related correspondence, documents, and data. I almost want something like a customer relation management system (CRM), except that I don't want to organise my data contact-centric, but rather issue-centric: an issue may involve nobody, a dozen other people, or be about one single person.

I took a brief look at beaglefs and was briefly impressed that it did not depend on roughly 5 gigabytes of GNOME/Mono, but then also decided that it's not what I wanted, because it's too filesystem-centric and I also do not want a hundred mounted pseudo filesystems in my home directory.

Beagle, on the other hand, does have half a million dependencies and plugins for almost every other application that has half a million dependencies, but it's basically unusable to people who use simple, non-graphical tools for their information management. I am going to check strigi-daemon and doodle, but I don't really expect anything different.

Does anyone know of a high-level tool that I could use to tie together all my information sources/types?

Then, on to the second feature I want: tags. You know it, tags are web 2.0 and web 2.0 is sexy and since I want to be sexy, I need tags. They are also marginally useful and I could use them to link the email from a friend about Debconf7 to his name, Debian, the conference, as well as the client we're going to be visiting on the way back. Or I could associate the slides for my talk on method diffusion with Debian, the conference, and my Ph.D. research. Marginally useful but soooo sexy.

Tags should be hierarchical, such that foo::bar::baz is a subset of foo and foo::bar (and a query for foo would automatically yield results for foo::bar::baz too). Wildcards should be supported, such that I could search for foo::+::baz and also find foo::bla::baz, and of course you should be able to perform the standard logical functions, such as AND/OR/NOT on them. Enrico calls this "faceted categorisation" and I would never question him on such topics.
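
To make the matching rule concrete, here is a minimal Python sketch. It is not an existing tool: the function names are made up for illustration, and it assumes that `+` stands for exactly one level of the hierarchy and that a query matches every tag at or below the level it names.

```python
SEP = "::"
WILDCARD = "+"  # assumption: "+" matches exactly one hierarchy level


def tag_matches(query: str, tag: str) -> bool:
    """Return True if `tag` lies at or below `query` in the tag hierarchy."""
    q_parts = query.split(SEP)
    t_parts = tag.split(SEP)
    # A tag that is shallower than the query cannot be inside it.
    if len(t_parts) < len(q_parts):
        return False
    # Compare level by level; extra trailing levels of the tag are ignored,
    # which is what makes "foo" also yield "foo::bar::baz".
    return all(q == WILDCARD or q == t for q, t in zip(q_parts, t_parts))


def search(query: str, tagged: dict) -> list:
    """Return the names of all items carrying at least one matching tag.

    `tagged` maps an item name to the set of tags attached to it.
    """
    return sorted(
        name
        for name, tags in tagged.items()
        if any(tag_matches(query, t) for t in tags)
    )
```

With this, `search("foo", ...)` finds anything tagged `foo::bar::baz`, and `search("foo::+::baz", ...)` finds `foo::bla::baz`. AND/OR/NOT would be layered on top by combining the results of several such lookups.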

I do like the idea of a filesystem, backed by an index, but otherwise using metadata stored in extended filesystem attributes, which is what BeagleFS does. However, I'd like it to be ad-hoc, such that the query is not a mount option, but rather determined by the filesystem syscalls. For instance, the directory listing of /TAGS/foo::bar/ should contain all files tagged with foo::bar; /TAGS/-foo::bar/, /TAGS/foo::bar,+::baz/, and /TAGS/foo::+.bla::+/ could be the shell-friendly ways to encode NOT, OR, and AND.
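
As a rough illustration of how such a path could be evaluated, the following Python sketch resolves a `/TAGS/…/` path against an in-memory index. Everything in it is an assumption for the sake of the example: the function names, the dictionary standing in for the real index, the precedence (`,` for OR binds more loosely than `.` for AND, and `-` negates a single term), and the restriction that tag names contain neither `.` nor `,`. A real implementation would answer the FUSE readdir call from the same logic.

```python
SEP = "::"
WILDCARD = "+"  # assumption: matches exactly one hierarchy level


def tag_matches(query: str, tag: str) -> bool:
    """True if `tag` lies at or below `query` in the tag hierarchy."""
    q_parts, t_parts = query.split(SEP), tag.split(SEP)
    if len(t_parts) < len(q_parts):
        return False
    return all(q == WILDCARD or q == t for q, t in zip(q_parts, t_parts))


def term_holds(term: str, tags: set) -> bool:
    """Evaluate one term; a leading '-' means NOT."""
    negate = term.startswith("-")
    pattern = term[1:] if negate else term
    hit = any(tag_matches(pattern, t) for t in tags)
    return hit != negate


def query_holds(component: str, tags: set) -> bool:
    """Evaluate a path component: ',' separates OR branches, '.' ANDs terms."""
    return any(
        all(term_holds(term, tags) for term in branch.split("."))
        for branch in component.split(",")
    )


def list_directory(path: str, index: dict) -> list:
    """Return the names that a listing of /TAGS/<query>/ would contain.

    `index` maps a file name to the set of tags attached to it.
    """
    prefix = "/TAGS/"
    if not path.startswith(prefix):
        raise ValueError("not a tag query path: " + path)
    component = path[len(prefix):].strip("/")
    return sorted(name for name, tags in index.items()
                  if query_holds(component, tags))
```

Because the query lives entirely in the path, nothing has to be mounted per query: an `ls /TAGS/foo::bar/` is just a lookup with a different argument.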

If you think about it for a minute, then it's actually not that hard and should be a weekend job to implement on top of FUSE, but it'll only really work when each filesystem node corresponds to exactly one entity. Contacts and bookmarks are trivial to implement that way (VCards and .lnk files), but e-mail is not: with the two popular storage backends mailbox and Maildir, an e-mail message is either a set of lines in a file, or a file in a directory hierarchy.

At first, Maildir may sound like the ticket, but when you consider that a message's new status is encoded in the path and not in the message's filename or content, while the filename is used to encode other status and index information, such as deleted or a message's size, it quickly becomes a nightmare — this is why I consider mairix to be more or less unusable other than for quick, one-time searches.
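
The following Python sketch shows where a Maildir message's state actually lives. It assumes the common conventions: `new/` and `cur/` subdirectories, status flags after `:2,` in the filename, and an optional `S=<bytes>` size field, which is an extension used by some servers rather than something every Maildir carries. The function name and the sample path are invented for the example.

```python
from pathlib import PurePosixPath

# Standard single-letter Maildir flags found after ":2," in the filename.
FLAG_NAMES = {
    "D": "draft",
    "F": "flagged",
    "P": "passed",
    "R": "replied",
    "S": "seen",
    "T": "trashed",
}


def describe(path: str) -> dict:
    """Collect the message state that is spread over directory and filename."""
    p = PurePosixPath(path)
    subdir = p.parent.name                     # "new", "cur" or "tmp"
    base, _, info = p.name.partition(":2,")    # flags follow ":2,"
    fields = base.split(",")                   # unique name, then extras
    # Assumption: size, if present, is an extra field of the form "S=1234".
    size = next(
        (int(f[2:]) for f in fields[1:] if f.startswith("S=")), None
    )
    return {
        "unique": fields[0],
        "is_new": subdir == "new",             # encoded in the path only
        "flags": sorted(FLAG_NAMES[c] for c in info if c in FLAG_NAMES),
        "size": size,
    }
```

Every time a mail client marks a message as read or deleted, either the directory or the filename changes, so any external index or tag that refers to the message by path goes stale.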

To be able to properly deal with email in the filesystem I am proposing, one would need to use mailboxes that only ever stored one message (which would be a waste of space and is not supported by any of dozens of e-mail processing or reading tools), or a new format that used only extended filesystem attributes for metadata, never location of a file or its name. And that does not exist yet, isn't debugged, let alone supported by any tools.

Where to, Lazyweb? Comments welcome.

NP: Gazpacho: When Earth Lets Go

Update: as many have noted, of course I am looking for something like the semantic web, or closely related to it… however, given the current state of things, that doesn't get us any further. Also, I don't want any of the Web or social and community stuff that seems to be an inseparable part of the semantic web or the OWL. I just want the semantics for a single-user environment. Thus, NEPOMUK, which three people have pointed to, is not what I am looking for.

On the issue of one-mail-per-file, Remi Vanicat pointed me to Gnus' nnml storage:

If you use this back end, Gnus will split all incoming mail into files, one
file for each mail, and put the articles into the corresponding directories
under the directory specified by the ``nnml-directory`` variable.

As long as nnml is a Gnus-specific format, it's useless to me. I want to continue using standard tools like mutt, procmail, formail, and IMAP servers.