XFS and zeroed files

Erich complains: "If the file is damaged, just fill it up with zeroes?" (in response to my recent post on filesystem problems).

Update: I rewrote the second sentence of the next paragraph (up until and including the footnote) to get things right. Thanks to Russell Cattelan for providing the info.

This must be the most misunderstood feature of XFS. What happens is that XFS logs all metadata changes to the journal, except for the inode size, which gets flushed to disk immediately for performance reasons [*]_. At this point, the file will actually be a sparse file, which is nothing more than a file whose metadata lists it as being of a different size than it currently is (I realise that "sparse" does not really apply when the file is "overfull", i.e. when the metadata lists it as smaller than it really is, but I lack a good word for that). The disk extents get allocated only when the data actually hits the disk (that's XFS's famous delayed allocation mechanism). If the power fails before the data has been flushed to disk and the journal entry cleared, XFS will serve zeroes, rather than the potentially random or sensitive data that is actually on disk. This is a good thing.

.. [*] since *almost every* write() changes the file size, it would be a massive performance hit if every size change was logged. However, XFS actually violates its own journaling rules by doing this.
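To get a feel for what "metadata size ahead of the data" looks like, here is a minimal sketch that creates such a file on purpose. It does not reproduce the crash scenario; it only shows that a file can have a recorded size without allocated data, and that reading the gap yields zeroes. It assumes GNU coreutils (``truncate``, ``stat -c``), and the scratch file is just a placeholder.

```shell
#!/bin/sh
# Sketch only: a sparse file made by hand, not an XFS crash reproduction.
f=$(mktemp)                    # hypothetical scratch file

# Set the size in the inode without writing any data.
truncate -s 1M "$f"

size=$(stat -c %s "$f")        # size recorded in the metadata, in bytes
blocks=$(stat -c %b "$f")      # blocks actually allocated (filesystem dependent)
echo "size=$size bytes, allocated blocks=$blocks"

# Strip all NUL bytes and count what is left: anything non-zero would show up here.
nonzero=$(( $(tr -d '\000' < "$f" | wc -c) ))
echo "non-zero bytes read back: $nonzero"

rm -f "$f"
```

The number of allocated blocks depends on the filesystem, so treat that figure as informational only.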

You can run into more or less the same problem with any journaling filesystem; the others just don't serve zeroes. Instead, they give you the data that's physically on the medium. Imagine the situation when the corrupt /etc/motd suddenly becomes a window to your previous /etc/shadow contents... I really prefer how XFS handles that. Sometimes you do get the old data back with the other filesystems, but this is because the filesystems may reuse the blocks of the old file. So it's a trade-off, and your choice between security and, uh, convenience.

The only way to protect against this is to use "physical-block journaling" (as opposed to "logical journaling"), which is only supported by ext3 as far as I know (option data=journal), at a massive performance loss. See this mailing list post by Theodore Ts'o for more info. Thanks to Alceste Scalas for sharing it with me.

PS: Thanks to Wessel Dankers and Eric Sandeen for their help on #xfs.

Update: Again thanks to Russell Cattelan, I just received word that code has been added to the 2.6.17 kernel which will flush an inode at close if it was truncated at some previous point. This should take care of many of the zeroed-file issues.

Update: ReiserFS also supports data=journal. Thanks to Hans Kratz for pointing this out. He also adds some more information, parts of which I've written about previously:

One of the major problems of all journalled filesystems, especially on notebooks and desktop machines, leading to filesystem corruption is the hard disk writeback cache (enabled by default). With the writeback cache, some writes are not immediately written to disk. Now if a crash happens, some of the data may not have hit the disk. The journalling filesystems, however, need a way to rely on certain writes having hit the disk to ensure filesystem consistency.

There are two solutions: enable write barrier support in the filesystems (available for ext3 since around 2.6.8, mount option: barrier=1; probably available for the other filesystems as well by now) or disable the write cache altogether with hdparm/sdparm.
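The two options might look roughly like this on the command line. This is a sketch under assumptions: the device names and mount point are placeholders, the barrier syntax is the ext3 one quoted above, and the script only prints the commands unless ``DRY_RUN=0`` is set (in which case it needs root). Pick one of the two solutions; you would not normally need both.

```shell
#!/bin/sh
# Sketch only. With DRY_RUN=1 (the default here) commands are printed, not run.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

DISK=/dev/sdX      # hypothetical whole-disk device
PART=/dev/sdX1     # hypothetical ext3 partition on that disk
MNT=/mnt/data      # hypothetical mount point

# Solution 1: mount with write barriers enabled.
run mount -o barrier=1 "$PART" "$MNT"

# Solution 2: switch off the drive's write-back cache instead.
run hdparm -W0 "$DISK"            # ATA/SATA drives
run sdparm --clear=WCE "$DISK"    # SCSI drives
```

Note that the write cache setting may not survive a power cycle on every drive, so it is worth checking it again after a reboot.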

XFS supports barriers as well, but unfortunately not yet on non-physical media, like RAID or dm-crypt. ext3 seems to work okay on both.

Enable your barriers today!

Update: Martin Steigerwald pointed out that ext3 can't do barriers on non-physical devices either:

root@shambala:~ -> mount -o barrier=1 /dev/shambala/ext3 /mnt/zeit
root@shambala:~ -> touch /mnt/zeit/barriertestfile
root@shambala:~ -> umount /mnt/zeit

Feb 28 20:31:46 shambala kernel: kjournald starting.  Commit interval 5 seconds
Feb 28 20:31:46 shambala kernel: EXT3 FS on dm-0, internal journal
Feb 28 20:31:46 shambala kernel: EXT3-fs: mounted filesystem with ordered data mode.
Feb 28 20:32:02 shambala kernel: JBD: barrier-based sync failed on dm-0 - disabling barriers

The barriers are disabled right after the write access initiated by touch.
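If you want to check your own machines for this, a small filter along these lines can pull the affected device names out of the kernel log. It is a sketch that assumes the message has exactly the wording shown in the log excerpt above; ``barrier_failures`` is a made-up helper name, and other kernel versions or filesystems may phrase the message differently.

```shell
#!/bin/sh
# Sketch only: list devices for which JBD gave up on barriers.
# Reads kernel log lines on stdin and prints each device name once.
barrier_failures() {
    sed -n 's/.*JBD: barrier-based sync failed on \([^ ]*\) - disabling barriers.*/\1/p' | sort -u
}

# The JBD line from the log excerpt above, fed in as sample input.
sample='Feb 28 20:32:02 shambala kernel: JBD: barrier-based sync failed on dm-0 - disabling barriers'
printf '%s\n' "$sample" | barrier_failures    # prints: dm-0
```

On a live system you would feed it the real log instead, for example ``dmesg | barrier_failures`` (which may require root).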

The device mapper code (which gets used for software RAID from kernel 2.6.23 onwards) even makes that explicit.