I am through with XFS, once and for all. Well, at
least for laptops. I still think it’s a good filesystem when you
can ensure that the power never goes, and your hardware is
reliable, but it’s just not adequate for laptops or even
desktops.
I ran into some serious
problems a while ago, but managed to recover. Two nights ago,
however, three XFS filesystems on my laptop decided to
blow up and left my system thoroughly broken. I guess as the
hibernate
maintainer, I should really start doing my tests somewhere else
than my main system…
It all started out with a dist-upgrade and this
output:
dpkg: error processing /var/cache/apt/archives/dpkg_1.13.22_i386.deb
(--unpack): unable to make backup link of
'./usr/share/man/man1/dpkg-deb.1.gz' before installing new version: Unknown
error 990
Looking at /usr/share/man/man1, I started to
anticipate the apocalypse:
# ls -l /usr/share/man/man1
total 7956
?????????? ? ? ? ? ? ? 7zr.1.gz
?????????? ? ? ? ? ? ? 822-date.1.gz
?????????? ? ? ? ? ? ? CA.pl.1ssl.gz
?????????? ? ? ? ? ? ? Defoma::Common.1.gz
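Question marks like these are what ls typically prints when it can read the names in a directory but cannot stat() the entries themselves. As a minimal sketch (the helper name unstatable_entries is hypothetical, not part of any tool mentioned here), something like this lists the names in a directory whose metadata can no longer be read:

```python
import os
import sys


def unstatable_entries(directory):
    """Return the names in `directory` whose metadata cannot be read.

    Hypothetical helper for illustration: it only reports entries for
    which lstat() raises an error; it does not repair anything.
    """
    broken = []
    for name in sorted(os.listdir(directory)):
        try:
            # lstat() does not follow symlinks, so a dangling symlink
            # is not counted as a broken entry.
            os.lstat(os.path.join(directory, name))
        except OSError:
            broken.append(name)
    return broken


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for name in unstatable_entries(target):
        print(name)
```

Run against a healthy directory this prints nothing; any output is a reason to stop writing to that filesystem and look at the kernel log.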
So I look at the log, and amidst kernel oops notices, there’s this lovely cookie:
Filesystem "hda6": Corruption of in-memory data detected. Shutting down
filesystem: hda6
Filesystem hda6 is /usr, so at that
time I figured “it could have been worse”, booted to single user,
and remade the filesystem with the intention to simply reinstall
all packages… when I found /var/lib/dpkg/info to be in
similar condition. The rest of /var seemed fine, but I
resolved then that there was no hope in reviving this system.
Fortunately I brought an external drive that had just enough
free space to hold my /home and some other stuff, but
since USB is really slow when it comes to shifting large
amounts of data, I decided to do something productive in the
meantime and to answer some outstanding mails. It wasn’t difficult to
get SSH back up, so I started to work on a remote machine and used
the time efficiently.
Some time later, though, I got confused in the midst of
screen sessions and was browsing my home directory on
the laptop, thinking I was elsewhere (my home directories are
mostly synchronised), when I noticed a directory in similar
condition as the above. Oh shit. Imagine my pain and fear as I
first thought my remote machine was also dying, imagine the sigh
when I found out I was on the local filesystem, and imagine the
shock when I realised that /home was also
affected by the XFS breakage…
A glance around /var confirmed that the
XFS breakage was actually spreading and had now
affected three filesystems on this machine. Fortunately, by that
time, I had copied everything to the external drive, and decided to
put my laptop and myself to sleep.
I woke the next morning to the task of reinstalling the thing and decided to be optimistic about it. After all, a reinstall would mean I could finally try partman-crypto and encrypt my laptop’s data to protect against leaking sensitive stuff in the case of loss or theft of my laptop.
The installation was not as painless as I had hoped, but that
was mainly because I ran into a known problem with the graphical
installer and partman-crypto, which does not allow one to
set up volumes with random encryption keys (e.g. swap; see the
forthcoming announcement for the beta3 release of the installer),
and a bunch of smaller bugs. I had to restart the installation with
the traditional frontend to get what I wanted, but other than that,
I was
very impressed with what our installer development team has
accomplished! And a special round of gratitude to Frans Pop for not
losing his patience while helping me on several occasions
throughout the process.
Now, 24 hours after the incident, I am back to normal with a
fresh laptop and no data lost (except for one directory,
which I pulled from a mirrored remote machine; it had no local
modifications, so why did XFS screw it up anyway?).
The fonts are all jaggy, so there’s something I have to figure out.
All things considered, I am sad to have lost 24 hours, but I can
also relax more now, without fear of further XFS
breakage or loss of private data.
Update: Oh, and despite this, I did
choose ext3 for all my laptop’s filesystems. JFS was
really cumbersome and slow last time I tried it, and I surely would
not touch ReiserFS after experiencing serious data loss on numerous
occasions.
Update: Two responses so far. Full ack for Julien (except for him laughing at me), Ingo’s post warrants a reply though.
First, ext3 is also journaled, and if you’re about
to say “yeah, but it’s a hack on top of ext2”, well…
ext2 is damn mature, and journaling isn’t really
rocket science, so that “hack” isn’t going to be too complicated.
In fact, I like the idea of journaling being an option
rather than a built-in feature.
Second, of course you’re supposed to keep backups. But since you keep backups, my top requirement of a filesystem is not “how to get the data back”, but “how to ensure it does not break”. If it breaks, I can reinstall and restore from backup, but that’s a certain amount of time lost. If it doesn’t break, well, that’s like stealing a little something back from death then, isn’t it?
Third, I do follow the linux-xfs mailing
list, but so what? I did not have
write cache enabled, and I was running the 2.6.17.7 kernel at
the time of the mishap.
Lastly, you point to “excellent tools” to recover the
filesystem. I am not sure how excellent xfs_repair
really is when it reports “bad magic number 0x0 on dir inode
4696727” during the run, claims to have fixed it, I mount
the filesystem, unmount it, run xfs_repair again and
get the same message.
No filesystem is perfect, and as we know from Biella’s
problems (among many others), ext3 is no exception.
But we did get her data off! So then it’s really an open
field again, crap filesystem against crap filesystem. I guess at
this point it helps to know that ext3 actually follows
VFS semantics, while on XFS, a completed
sync() syscall does not actually mean it has written
the data to disk (see e.g. #317479). And then there are
bugs like #239111…
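Whatever the filesystem does with a global sync(), an application can at least ask for its own files to be flushed explicitly. The following is a minimal sketch, assuming Linux; the helper name durable_write is hypothetical, and whether the data really ends up on the platter still depends on the filesystem and the drive:

```python
import os


def durable_write(path, data):
    """Write `data` (bytes) to `path` and ask the kernel to flush it.

    Hypothetical helper for illustration. fsync() is a request to the
    kernel; it cannot make up for a filesystem or drive that
    acknowledges writes it has not actually completed.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # flush the file's data and metadata
    finally:
        os.close(fd)

    # Also flush the containing directory, so that the directory entry
    # for a newly created file is requested to be written out too.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

This does not settle the argument above; it only narrows what an application has to trust, from "sync() flushed everything" to "fsync() on this one file did what it says".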
ext3 it is for now. If that lets me down, I’ll try
JFS. If that fails and no one has actually implemented a proper
filesystem, I might have a go myself. Haha.
Update: Alceste Scalas adds:
Ingo is right when he says that every filesystem has bugs – but bugs apart, the design of Ext3 (i.e. its physical-block journaling) makes it a far more reliable choice for desktop and laptop PCs, especially for people without a UPS. An Ext3 filesystem could only crash because of a bug or a hardware failure, while an XFS filesystem can be trashed even without bugs or hardware failures, due to the unavoidable consequences of a power loss on PC-class hardware.
He also alerted me to this mailing
list post, which compares data=journal journaling
of ext3 (which almost no one does for performance
reasons) with XFS and ReiserFS.
Update: You may also be interested in this post.
Update: Otavio Salvador points me to this FAQ entry,
which SGI must have added very recently. It explains how to deal with
the directory corruption that was part of my problem. I guess I
would have liked to know earlier, but I consider the outcome with
dm-crypt + ext3 a win anyhow.
Update: Martin Steigerwald pointed me to the bug report about the pre-2.6.17.7 XFS kernel bug.

