Dear lazyweb: I desperately need a way to synchronise sensitive data on untrusted remote systems.
I found this article detailing how to do
encrypted off-site backups to Amazon S3. Inspired, I came up
with a method that used encfs in --reverse
mode, and rsync. The gist of the backupninja script to
drive this is:
DIR="$(mktemp -d -t encfs-rsync-mountpoint.XXXXXXXX)"
trap '[ -d "$DIR" ] && fusermount -z -u "$DIR" && rmdir "$DIR"' EXIT
echo xxxxxxxxx | \
encfs --reverse --stdinpass /srv/backups "$DIR" || die encfs mount
rsync --archive --one-file-system --hard-links --acls --xattrs \
--delete-during --rsync-path='rsync --fake-super' \
"$DIR"/ dest:/path/to/destination/dir \
|| die rsync
It seemed that this approach only had one downside:
encfs cannot IV-chain filenames in
--reverse mode, so two files foo in
different directories would have the same cypher-name. This is
something I could live with.
And so I started to transfer the 450GB of sensitive data to an
untrusted system. But several hours into the process, I found a
severe problem: because encfs uses longer filenames
than the plain source, it ran up against the limits of the
ext3 directory index, which apparently cannot be
resized, but which also should not be reached, according to
this thread. Note that we’re talking about directories of
50,000 to 250,000 entries, and filenames up to 128 bytes. They’re
big numbers, but I don’t consider them extraordinary. I think every
filesystem ought to be capable of storing millions of files per
directory with filenames a lot larger than 128 bytes. We’re in the
age of terabyte consumer disks after all.
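To see how close a source tree comes to such numbers before starting a transfer, something like the following could be used. It is only a rough sketch: it assumes GNU find (for -printf), and dir_entry_counts is a made-up helper name, not part of backupninja or encfs.

```shell
#!/bin/sh
# Print "count directory" pairs for every non-empty directory under ROOT,
# largest first. Relies on GNU find's -printf; %h is the directory that
# contains each entry. Directory names containing newlines are not handled.
dir_entry_counts() {
    find "$1" -mindepth 1 -printf '%h\n' | sort | uniq -c | sort -rn
}
```

Running dir_entry_counts /srv/backups | head would then list the ten most populated directories.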
Update: I think the problem is that the
destination filesystem has a 1k block size, since it was originally
intended to be used as Maildir storage.
Theodore Tso explains in the aforementioned thread that the
block size b (in kilobytes) determines the size of the
directory index:
n = 200,000 × b³
which is 200,000 for 1k blocks, 1.6 million for 2k blocks, and 12.8 million for 4k blocks. I don’t know where the 200,000 constant comes from.
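The arithmetic is easy to reproduce. The sketch below merely evaluates the rule of thumb quoted above for a block size given in kilobytes; dir_index_limit is a name I made up for illustration, and the result is an estimate, not a limit read from the filesystem.

```shell
#!/bin/sh
# Approximate ext3 directory index capacity for a block size in kilobytes,
# using the n = 200,000 x b^3 rule of thumb from the thread.
dir_index_limit() {
    b=$1
    echo $(( 200000 * b * b * b ))
}

for b in 1 2 4; do
    printf '%sk blocks: %s entries\n' "$b" "$(dir_index_limit "$b")"
done
```

On a system with GNU coreutils, stat -f -c %S on the mount point should report the block size in bytes, which has to be divided by 1024 before it is passed in.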
I reported the fault to the Debian bug tracker, but until this is solved, my problem persists, and I need to look elsewhere.
How would you synchronise 450GB of sensitive data on a remote, untrusted system?
I have tried duplicity and had to discard it because of its crawling speed (it ran for 4 days before the network connection had a hiccough) and because it cannot be resumed.
Anything else?
Update: I should make the following two requirements explicit:
- I am not looking for a tool which would require me to shove 450GB across the wire for every invocation of the backup. Thus, something like rsync is essential.
- I need a scriptable tool, so something like TrueCrypt seems out of the question, especially since it isn’t in Debian’s archive yet (and thus not in Debian stable).
Tzafrir Cohen pointed me to rsyncrypto,
which looks interesting, but cannot talk to remote machines (and
thus requires brittle hacking), and also uses
ECB-style IV-chaining, which makes for weak cryptography used
for the file’s contents (as opposed to the file names, which is a
shortcoming of encfs I could accept).
Update: after reformatting the target
filesystem with 4k blocks, the next issue was filename
length. Unfortunately, there seems to be no solution to storing
files with names longer than ~170 bytes, since the crypto-algorithm
blows those up beyond the usual filesystem-specific maximum
filename length of 255 bytes. No cookie. Still no usable solution.
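Files that would trip over this can at least be found before a transfer starts. This is a minimal sketch, assuming a find that supports -mindepth and taking the rough 170-byte figure above as the default threshold; long_names is a hypothetical helper name, and the threshold is my estimate rather than a value from the encfs documentation.

```shell
#!/bin/sh
# List paths under ROOT whose final component is longer than MAX bytes.
# MAX defaults to 170, the approximate limit observed above (an estimate).
long_names() {
    root=$1
    max=${2:-170}
    # LC_ALL=C makes awk count bytes rather than multibyte characters.
    # Names containing newlines are not handled by this simple sketch.
    find "$root" -mindepth 1 | LC_ALL=C awk -F/ -v max="$max" 'length($NF) > max'
}
```

For example, long_names /srv/backups would print every path that needs renaming before the encrypted copy can be stored.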

