Filesystems and Power Failures

This technote includes:

- Introduction
- Guidelines for using hard drives in an environment subject to abrupt power failures
- Recipe for creating hard drive corruption
- How to limit the possible hard drive corruption
- How to repair hard disk corruption
- Power failures while writing

Introduction

How do we make sure that the hard disk integrity is maintained during power failures?

The DOS, EXT2, and QNX 4 filesystems aren't currently power-safe, so file or partition corruption on a hard drive could occur during a power failure. However, you can take a number of measures to reduce the amount of HDD corruption that could result.

The Power-Safe (fs-qnx6.so) filesystem uses a copy-on-write (COW) technique to always maintain an uncorrupted version of the filesystem, even if a power failure occurs. For more information, see the Filesystems chapter of the System Architecture guide.

The rest of this document explains:

- when corruption could occur
- how to reduce HDD corruption

Guidelines for using hard drives in an environment subject to abrupt power failures

A lot of work in recent years has gone into making the QNX 4 filesystem design more resistant to power-failure scenarios (through the costly use of synchronised writes to enforce ordering), but it wasn't designed to handle corrupted sectors (either real bad blocks or virtual ECC ones).

This type of corruption during power failures occurs only when a lack of atomic sector writes causes an ECC error to manifest as a bad block. On hard drives that retain valid sector contents, the filesystems use ordered writes whenever a metadata update spans multiple sectors. If power is lost partway through (so that only some of the writes were made), the result errs on the side of lost resources rather than filesystem corruption.

For example, when growing a file in the QNX 4 filesystem, the blocks are first marked as used in the bitmap (a write to the bitmap) and then assigned to the file (a write to the inode). If a power failure occurs between the two writes, some blocks could be marked as used in the bitmap without belonging to any file. With the alternative ordering, the file would be using blocks that are still free in the bitmap; those blocks could then be allocated to another file, resulting in cross-linked files.
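
The effect of this write ordering can be illustrated with a toy model. The following Python sketch is not QNX code: `ToyDisk`, `grow()`, and the other names are hypothetical, and the bitmap and inode are reduced to an in-memory list and dictionary. It only shows why a power failure between the two metadata writes leaks a block with the bitmap-first ordering, and leaves a file pointing at a free block with the reverse ordering.

```python
"""Toy model of metadata write ordering when growing a file (not QNX code)."""


class PowerFailure(Exception):
    """Raised to simulate power being lost between two metadata writes."""


class ToyDisk:
    def __init__(self, nblocks):
        self.bitmap = [False] * nblocks   # True = block marked as used
        self.inodes = {}                  # file name -> list of block numbers

    def grow(self, name, block, bitmap_first=True, fail_between=False):
        """Assign `block` to `name` using two separate metadata writes."""
        def mark_used():
            self.bitmap[block] = True

        def assign_to_file():
            self.inodes.setdefault(name, []).append(block)

        steps = [mark_used, assign_to_file]
        if not bitmap_first:
            steps.reverse()
        steps[0]()
        if fail_between:
            raise PowerFailure
        steps[1]()

    def _owned(self):
        return {b for blocks in self.inodes.values() for b in blocks}

    def leaked_blocks(self):
        """Blocks marked used in the bitmap that no file owns (lost space)."""
        owned = self._owned()
        return [b for b, used in enumerate(self.bitmap) if used and b not in owned]

    def exposed_blocks(self):
        """Blocks owned by a file but still free in the bitmap (cross-link risk)."""
        return sorted(b for b in self._owned() if not self.bitmap[b])


def crash_while_growing(bitmap_first):
    """Lose power between the two writes; return (leaked, exposed) blocks."""
    disk = ToyDisk(nblocks=8)
    try:
        disk.grow("log", block=3, bitmap_first=bitmap_first, fail_between=True)
    except PowerFailure:
        pass
    return disk.leaked_blocks(), disk.exposed_blocks()


if __name__ == "__main__":
    print("bitmap first:", crash_while_growing(True))
    print("inode first: ", crash_while_growing(False))
```

In this model, the bitmap-first ordering leaves block 3 leaked but harmless, while the inode-first ordering leaves block 3 owned by the file yet still available for allocation to another file.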

Our recent investigations do show possible physical I/O errors (ECC errors) with loss of data. It's very difficult for the current QNX 4 filesystem, which overwrites metadata in-place, to prevent this situation.

Based on our investigation of multiple examples of corrupted filesystems, we have compiled this document with recommendations that should help limit catastrophic damage to the hard disk (i.e., damage that prevents it from being mounted).

Implementing some of the strategies described below to limit catastrophic filesystem corruption, together with the associated recovery procedures, should be less risky to projects than introducing radical filesystem changes.

Recipe for creating hard drive corruption

Hard drive corruption always arises from a power failure (e.g., a crank scenario, or a dead battery and a stopped alternator, when no large capacitors are available) that occurs while the drive is physically writing a file or a directory, as opposed to while data is being written to the driver cache or the drive cache.

Hard drive block corruption is a generic problem for drives that don't offer an atomic sector update guarantee. HDDs that do offer atomic sector updates either leave the original data unchanged or completely write the new block content. On HDDs that don't offer this capability, a write could be interrupted by an emergency head unload after only half a block has been written; the block then becomes unreadable because its ECC doesn't match. This could affect any block and cause reads of it to fail with what appears to be a bad-block I/O error. Our QNX 4 disk filesystem makes no guarantee in the presence of physical I/O errors of this type.
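
To make the difference concrete, here is a minimal Python simulation of a sector write that is interrupted by a power failure. It is a sketch under stated assumptions, not a model of any real drive: the `Sector` class and `demo()` are hypothetical, a CRC-32 stands in for the drive's ECC, and the atomic guarantee is modeled simply as the old contents surviving.

```python
import zlib

SECTOR_SIZE = 512


def ecc(data):
    """CRC-32 used as a stand-in for the drive's per-sector ECC."""
    return zlib.crc32(data)


class Sector:
    def __init__(self, data):
        self.data = bytes(data)
        self.ecc = ecc(self.data)

    def write(self, new_data, atomic, power_fails_after=None):
        """Overwrite the sector; optionally lose power after N bytes."""
        if power_fails_after is None:
            # Normal case: data and ECC are both updated.
            self.data, self.ecc = bytes(new_data), ecc(new_data)
        elif atomic:
            # Atomic update guarantee, modeled as: old contents survive.
            return
        else:
            # Torn write: some new bytes land, the ECC is never updated.
            n = power_fails_after
            self.data = bytes(new_data[:n]) + self.data[n:]

    def read(self):
        if ecc(self.data) != self.ecc:
            raise OSError("ECC mismatch: sector reads as a bad block")
        return self.data


def demo(atomic):
    """Return the first byte read back after an interrupted write, or None."""
    sector = Sector(b"A" * SECTOR_SIZE)
    sector.write(b"B" * SECTOR_SIZE, atomic=atomic, power_fails_after=256)
    try:
        return sector.read()[:1]
    except OSError:
        return None
```

With `atomic=True` the old data is still readable after the interrupted write; with `atomic=False` the half-written sector fails its check, and every later read reports an error until the sector is rewritten.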

Various types of hard drive corruption could occur, depending on the scenario:

Corruption	Effects
File corruption	Loss of data in the file, or the inability to open the file. If this happens to data or configuration files, then some systems might not be able to restart themselves.
Directory corruption	Loss of files.
Root block corruption	Inability to mount the disk partition or boot from it.
`.inodes` corruption	Loss of files, or loss of long filenames.
`.bitmap` corruption	Inability to grow or delete files.

How to limit the possible hard drive corruption

There are several ways to reduce the amount of hard drive corruption during a power failure. The best way to prevent corruption is obviously to avoid writing to the hard drive as much as possible, or to mount the partition read-only. However, if writing to a hard disk can't be avoided, the following guidelines will help reduce (but not completely eliminate) catastrophic corruption:

- Use hard drives that offer an atomic sector update guarantee.
- Design the filesystem layout on the target systems so that the root block and the first block of the root directory are never written, so that the system can always boot.
  The root block holds the inodes for the special filesystem structures (root directory, .bitmap, .inodes, .boot and .altboot). The root directory holds links back to this block.
  - To avoid writing to the .inodes file, pre-grow it with the dinit command, like this:
    dinit -i block_number
    so there's no need to grow it at runtime. (The block_number is the number of blocks to pre-grow .inodes by, where each block can store 8 entries.)
  - Immediately following the dinit, make a .placeholder file in the root directory to fill out the first block and make sure that any application files placed in the root directory occupy a different sector.
- Don't store any “working” (i.e., writable) files in the root directory of any partition. Use subdirectories one level below the root directory to store writable/readable “working” files and directories.
- Refrain from creating long filenames (longer than 16 characters), to avoid writing to the .inodes file. (This might not always be possible: someone might insert a USB stick containing MP3 files with long names.)
- Use io-blk's noatime option so that a file's directory entry isn't updated when the only change is the access time.
- As a general rule, close files, flush the disk cache, and unmount partitions before shutting down the target (slay devb-xxx will perform all this).
- Use io-blk's marking=none option on the block driver to stop the initial mounting of the partitions from writing to the root block (i.e., the dirty bit). With this option you can't rely on the dirty bit, so it's left up to the customer to determine whether power was lost and the filesystems weren't shut down correctly, and to check the filesystem at the next boot.
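
As a rough aid for the pre-grow guideline, the following Python sketch estimates a block count to pass to dinit -i from a list of expected filenames. It uses the two figures given above (8 entries per block, and names longer than 16 characters) and assumes, for illustration only, that each long-named file needs exactly one .inodes entry and that name length is counted in characters. The function names and the `headroom` parameter are hypothetical; other users of .inodes aren't accounted for.

```python
import math

ENTRIES_PER_BLOCK = 8    # from the dinit note above
SHORT_NAME_LIMIT = 16    # names longer than this cause a write to .inodes


def needs_inodes_entry(filename):
    """True if the name is long enough to require the .inodes file."""
    return len(filename) > SHORT_NAME_LIMIT


def pregrow_blocks(filenames, headroom=0):
    """Estimate the block count for `dinit -i`.

    Assumption: one .inodes entry per long-named file, plus `headroom`
    spare entries for files that aren't known in advance.
    """
    entries = sum(1 for name in filenames if needs_inodes_entry(name)) + headroom
    return math.ceil(entries / ENTRIES_PER_BLOCK)
```

For example, a file list containing two long names with `headroom=10` gives 12 entries, which rounds up to 2 blocks.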

How to repair hard disk corruption

Our disk filesystems make no guarantee in the presence of physical bad blocks (which is what the power failure results in). I/O errors of this type aren't handled at the driver or filesystem level, but they could trigger a notification from the block driver (devb-*) to a user application that would attempt to repair the error (by writing to the bad block) or, in the worst case, reformat the disk. This would require a high-level application to monitor these errors and repair them using application knowledge. The disk filesystems can't magically recover from any/all physical bad blocks.

The application layer should also determine whether the system was shut down correctly or not and take corrective measures as necessary. For example, it could start chkfsys, a useful recovery utility that checks the disk integrity:

# chkfsys -v -f -m /dev/hdxtxx

For more information, see its entry in the Utilities Reference.
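
One possible way for the application layer to detect an unclean shutdown is a session marker that is created at startup and removed only during an orderly shutdown. The Python sketch below illustrates the idea; it is an assumption-laden example, not a QNX facility. The function names, the marker path, and the `run` parameter are hypothetical, and the marker would need to live somewhere that is itself safe to write (writing it to the partition being protected would reintroduce the risk this technote describes). The chkfsys options are the ones shown above.

```python
import os
import subprocess


def filesystem_check_command(device):
    """Build the chkfsys invocation shown above for the given device."""
    return ["chkfsys", "-v", "-f", "-m", device]


def check_after_unclean_shutdown(marker_path, device, run=subprocess.run):
    """Run chkfsys if the previous session did not shut down cleanly.

    Returns True if a check was started, False otherwise.
    """
    # A marker left over from the last session means no clean shutdown.
    unclean = os.path.exists(marker_path)
    if unclean:
        run(filesystem_check_command(device), check=False)
    # (Re)create the marker for the session that is now starting.
    with open(marker_path, "w") as marker:
        marker.write("running\n")
        marker.flush()
        os.fsync(marker.fileno())
    return unclean


def mark_clean_shutdown(marker_path):
    """Call as the last step of an orderly shutdown."""
    if os.path.exists(marker_path):
        os.remove(marker_path)
```

The `run` parameter exists so that the check can be exercised without actually invoking chkfsys; at boot the application would call `check_after_unclean_shutdown()` with the real device path before starting anything that depends on the filesystem.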

Power failures while writing

What happens when we switch the power off, and files are still open for write access? Do we get invalid files / bad blocks?

If the power is physically switched off without taking the proper precautions, the following could happen:

- If files are open for write access but nothing is being written to the drive, you'll lose any unsaved changes (for instance, what's in the cache) if the power loss occurs before that cache is automatically synced to the drive (umount or sync will force that to happen).
  This shouldn't corrupt the file. However, note that if the file was being grown, the inode is marked as BUSY on disk until you close the file. If there's a power failure, you can get EBADFSYS errors for that file. In that case, you can use the qnx4 unbusy option to instead just truncate it back to the old size (see the fs-qnx4.so documentation for more details).
  If this is a concern, you could also reduce the write delay so that dirty blocks are flushed out of the cache sooner (see io-blk's delwri option for more details).
- If files are open for reading/writing and you're currently reading files, corruption can't occur as long as the access time for that file isn't updated in the directory entry (use io-blk's noatime option).
- If files are open for reading/writing, and you're currently writing to the drive (physically, as opposed to the write cache), then in the best case you can corrupt the file; in the worst case, you can corrupt the directory entry where that file is. If that directory is the root directory, then you could corrupt the root block too. That is the absolute worst case because you may not be able to boot from the disk any more. The chkfsys utility should be able to recover from the file or directory corruption.
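
For the first case above, an application can shorten the window during which changes exist only in a cache by explicitly syncing each file it writes. The following Python sketch shows the general pattern using the standard `flush()` and `os.fsync()` calls; the `write_durably()` name is hypothetical. It assumes the block driver honors the sync request, and it doesn't make the physical write atomic: a power failure during the write itself can still corrupt the file as described in the last case.

```python
import os


def write_durably(path, payload):
    """Write `payload` and ask the OS to push it to the drive before returning."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()               # Python's userspace buffer -> OS/driver cache
        os.fsync(f.fileno())    # OS/driver cache -> drive
    return len(payload)
```

Syncing after every write trades throughput for a smaller loss window, so it is best reserved for small, important files such as configuration data.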

For more information, see the Backing Up and Recovering Data chapter of the QNX Neutrino User's Guide.