Storing sensitive data on untrusted remote systems

Dear lazyweb: I desperately need a way to synchronise sensitive data on untrusted remote systems.

I found this article detailing how to do encrypted off-site backups to Amazon S3. Inspired, I came up with a method that uses encfs in --reverse mode together with rsync. The gist of the backupninja script driving this is:

# create a temporary mountpoint and make sure it is cleaned up on exit
DIR="$(mktemp -d -t encfs-rsync-mountpoint.XXXXXXXX)"
trap '[ -d "$DIR" ] && fusermount -z -u "$DIR" && rmdir "$DIR"' EXIT

# expose /srv/backups as an encrypted view on $DIR
echo xxxxxxxxx | \
  encfs --reverse --stdinpass /srv/backups "$DIR" || die encfs mount

# push the encrypted view to the untrusted host
rsync --archive --one-file-system --hard-links --acls --xattrs \
  --delete-during --rsync-path='rsync --fake-super' \
  "$DIR"/ dest:/path/to/destination/dir \
  || die rsync

It seemed that this approach had only one downside: encfs cannot IV-chain filenames in --reverse mode, so two files named foo in different directories end up with the same cypher name. That is something I could live with.
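A quick way to see this for yourself, assuming encfs is installed; the directories, file names, and password here are made up, and the invocation mirrors the one in the script above:

mkdir -p plain/a plain/b cipher
echo one > plain/a/foo
echo two > plain/b/foo
# the first run creates plain/.encfs6.xml; encfs may still ask configuration
# questions, in which case pick the standard mode
echo xxxxxxxxx | encfs --reverse --stdinpass "$PWD/plain" "$PWD/cipher"
# without filename IV-chaining, the encrypted basenames of a/foo and b/foo
# come out identical
find cipher -type f
fusermount -u "$PWD/cipher"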

And so I started to transfer the 450 GB of sensitive data to an untrusted system. But several hours into the process, I found a severe problem: because encfs produces longer filenames than the plain source, it ran up against the limits of the ext3 directory index, which apparently cannot be resized, but which also should not be reached, according to this thread. Note that we're talking about directories of 50,000 to 250,000 entries, and filenames up to 128 bytes. These are big numbers, but I don't consider them extraordinary: I think every filesystem ought to be able to store millions of files per directory, with filenames a lot longer than 128 bytes. We're in the age of terabyte consumer disks, after all.

Update: I think the problem is that the destination filesystem has a 1k block size, since it was originally intended to be used as Maildir storage. Theodore Ts'o explains in the aforementioned thread that the block size b (in kilobytes) determines the maximum size n of the directory index:

n = 200,000 × b³

which is 200,000 for 1k blocks, 1.6 million for 2k blocks, and 12.8 million for 4k blocks. I don't know where the 200,000 constant comes from.
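To see whether a given target filesystem is affected, the block size (and the dir_index feature) can be read off the superblock; a quick sketch, assuming the destination lives on the hypothetical device /dev/sdb1:

# block size and directory-index feature of the destination filesystem
tune2fs -l /dev/sdb1 | grep -E '^(Block size|Filesystem features)'

# plug the block size into the formula above: n = 200,000 * b^3, b in kilobytes
b=$(( $(tune2fs -l /dev/sdb1 | awk '/^Block size:/ { print $3 }') / 1024 ))
echo $(( 200000 * b * b * b ))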

I reported the fault to the Debian bug tracker, but until this is solved, my problem persists, and I need to look elsewhere.

How would you synchronise 450 GB of sensitive data to a remote, untrusted system?

I have tried duplicity and had to discard it due to its crawling speed (it ran for 4 days before the network connection had a hiccough), and because an interrupted run cannot be resumed.

Anything else?

Update: I should make the following two points explicit:

Tzafrir Cohen pointed me to rsyncrypto, which looks interesting, but it cannot talk to remote machines (and thus requires brittle hacking), and it also uses ECB-style IV-chaining, which makes for weak cryptography of the file contents (as opposed to the file names, where the analogous shortcoming of encfs was something I could accept).
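For what it's worth, the brittle hacking would boil down to encrypting into a local staging tree and then shipping that tree with rsync; a rough sketch, in which the paths, the certificate backup.crt, and the exact rsyncrypto argument order are assumptions of mine (check rsyncrypto(1) before relying on it):

# encrypt the plain tree into a local staging tree; rsyncrypto keeps one
# symmetric key per file, wrapped with the public key from backup.crt
rsyncrypto -r /srv/backups /srv/backups.enc /srv/backups.keys backup.crt \
  || die rsyncrypto

# then push only the encrypted staging tree to the untrusted host
rsync --archive --delete-during /srv/backups.enc/ dest:/path/to/destination/dir \
  || die rsync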

Update: after reformatting the target filesystem with 4k blocks, the next issue was filename length. Unfortunately, there seems to be no way to store files with names longer than ~170 bytes, since the crypto algorithm blows them up beyond the usual filesystem-specific maximum filename length of 255 bytes. No cookie. Still no usable solution.
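To find out beforehand whether a source tree will run into this, one can check its longest file name; a small sketch, assuming GNU find and the /srv/backups tree from the script above:

# length in bytes of the longest file name under /srv/backups;
# per the above, anything much over ~170 bytes will not survive encryption
find /srv/backups -printf '%f\n' \
  | LC_ALL=C awk '{ if (length($0) > max) max = length($0) } END { print max }'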