
Disk & File Recovery


Introduction

The QNX filesystem achieves high throughput without sacrificing reliability. Although the filesystem is designed to be as robust as possible, there will always be situations in the real world where disk corruption will occur. Hardware will fail eventually, power will be interrupted, and so on.

The QNX filesystem has been designed to tolerate such catastrophes. It is based on the principle that the filesystem as a whole should be consistent at all times. While most data is held in the buffer cache and written after only a short delay, critical filesystem data is written immediately. Updates to directories, inodes, extent blocks, and the bitmap are forced to disk to ensure that the filesystem structure on disk is never corrupt (i.e. the data on disk should never be internally inconsistent).

If a crash occurs, you can use the following file maintenance and recovery utilities, each described later in this chapter: fdisk, dinit, chkfsys, dcheck, zap, and spatch.

These utilities will let you determine whether any damage was done to files that were open for writing at the time of the crash. These same utilities can also fix such damage, and in many cases will completely restore the filesystem.

Sometimes the damage may be more severe. For example, it's possible that a hard disk will develop a bad block in the middle of a file, or worse, in the middle of a directory or some other critical block.

Again, the utilities we've provided can help you determine the extent of such damage. You can often rebuild the filesystem in such a way as to avoid the damaged areas. In this case, some data will be lost, but with some effort, a large portion of the affected data may be recovered.

Making a recovery floppy

You should always have a recovery floppy on hand in case, for whatever reason, a machine won't boot from hard disk.


Note: This procedure applies only to QNX systems that were shipped on diskette. If your QNX system came on CD-ROM, refer to the technote in /etc/readme/technotes/qnx_install, which documents a script for creating a boot floppy.

Before you begin, make sure that you're logged in as root and that Fsys.floppy is running.

Now follow these steps:

  1. Insert a QNX boot disk in your floppy drive.
  2. Copy the image to a temporary file on your hard disk:
    dd if=/dev/fd0 of=/tmp/floppy_image
  3. Insert a blank floppy in the drive. Format the floppy:
    fdformat -k0 -z2 /dev/fd0
  4. Copy the image (from your temp file) to the floppy:
    dd if=/tmp/floppy_image of=/dev/fd0
  5. Run dcheck to check the new floppy:
    dcheck /dev/fd0

    If this fails, retry steps 3 and 4 (fdformat and dd); if it fails twice, try a new floppy.

  6. Mount the floppy drive filesystem:
    mount /dev/fd0 /fd
  7. To make some room on the floppy, remove the following:
  8. Now copy these useful utilities to /fd/bin:
    cp /bin/sin /fd/bin/sin
    cp /bin/zap /fd/bin/zap
    cp /bin/rm /fd/bin/rm
    cp /bin/ls /fd/bin/ls
    cp /bin/spatch /fd/bin/spatch
    cp /bin/chkfsys /fd/bin/chkfsys
    cp /usr/bin/elvis /fd/bin/elvis
  9. Create a /etc directory:
    mkdir /fd/etc
  10. Copy a termcap file:
    cp /etc/termcap /fd/etc/termcap

    You should also edit the termcap file and remove the entries you won't need. The only entry you'll need is the one for QNX.

  11. Now we need to create two links:
    cd /fd/bin
    ln -s elvis vi
    ln -s fcat melt
  12. Finally, you'll need to modify the system initialization file (/fd/etc/config/sysinit) so that it now contains these lines:
    Dev -n 10 &
    Dev.con -n 4 -O 256 &
    reopen /dev/con1
    export PATH=/ram:.:/bin:/usr/bin
    export HOME=/
    dinit /dev/ram
    mount /dev/ram /ram
    prefix -A /pipe=/ram
    prefix -A /tmp=/ram
    fcat /util.tar.F | pax -vr
    cp /bin/esh /ram/sh
    melt -z </etc/logo.F
    rtc hw
    echo Welcome to QNX 4.25
    ontty /dev/con1 /bin/sh
    ontty /dev/con2 /bin/sh

That's it! Keep your recovery floppy in a safe place. If you ever need to use it, simply insert the floppy in a dead machine and power on -- the machine will boot QNX from the floppy.
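The retry rule in steps 3 to 5 (format, copy, check, and try again once before giving up on the floppy) can be sketched as a small shell function. This is only an illustration, not a script shipped with QNX: write_floppy is a hypothetical name, and the device and image paths are simply the ones used in the steps above.

```shell
# Sketch only: write_floppy is a hypothetical helper, not a QNX utility.
# FLOPPY and IMAGE default to the paths used in the numbered steps.
FLOPPY=${FLOPPY:-/dev/fd0}
IMAGE=${IMAGE:-/tmp/floppy_image}

# Returns 0 as soon as dcheck passes, 1 after two failed attempts.
write_floppy() {
    attempt=1
    while [ "$attempt" -le 2 ]; do
        # Steps 3, 4 and 5: format, copy the image, verify the result.
        fdformat -k0 -z2 "$FLOPPY" &&
        dd if="$IMAGE" of="$FLOPPY" &&
        dcheck "$FLOPPY" && return 0
        echo "attempt $attempt failed" >&2
        attempt=$((attempt + 1))
    done
    echo "failed twice; try a new floppy" >&2
    return 1
}
```

Calling write_floppy with a blank floppy in the drive would then replace steps 3 to 5; the remaining steps still have to be done by hand.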

Overview of QNX disk structure

In this section, we describe how the QNX filesystem stores data on a disk. Reading this section should help you recognize and possibly correct filesystem damage if you ever have to rebuild a filesystem.

If you have a C development package, the header file <sys/fsys.h> contains the definitions for all terms used in this section.

For an overall description of the QNX filesystem, see the Filesystem Manager chapter in System Architecture.

Partition components

A QNX filesystem may be an entire disk (in the case of floppies) or it may be one of many partitions on a hard disk. Within a disk partition, a QNX filesystem contains the following components:

[Illustration: components of a QNX filesystem within a disk partition]

The following blocks are always found, in this order, on a QNX disk partition:

Loader block

The loader block is the first block of a QNX partition. It contains the bootstrap loader that loads the QNX OS into memory.

Root block

The root block is the second block of a QNX partition. It contains the directory entry for the root (/), the inode entries for the inode file, and a label field.

Bitmap blocks

Several consecutive blocks follow the root block. The bitmap blocks form the bitmap for the QNX partition. One bit exists for each block on the partition, thus one bitmap block will be used for every 4096 disk blocks (corresponding to 2M of disk space).

If the value of a bit is zero, its corresponding block is unused. Unused bits at the end of the last bitmap block (for which there are no corresponding disk blocks) are turned on.

Bit assignments start with the least-significant bit of byte 0 of the first bitmap block -- which corresponds to QNX block #1.
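The bitmap arithmetic described above can be worked through with two small shell functions. This is a sketch under the figures given in this section (512-byte blocks, 4096 bits per bitmap block, and bit 0 of byte 0 standing for QNX block #1); the function names are hypothetical.

```shell
# Sketch only: hypothetical helpers based on the numbers in this section.

# bitmap_blocks N: bitmap blocks needed for a partition of N disk blocks
# (one bitmap block per 4096 disk blocks, rounded up).
bitmap_blocks() {
    echo $(( ($1 + 4095) / 4096 ))
}

# bitmap_pos N: print "bitmap-block byte bit" for QNX block N.
# Block #1 maps to the least-significant bit of byte 0 of bitmap block 0.
bitmap_pos() {
    i=$(( $1 - 1 ))
    echo "$(( i / 4096 )) $(( (i % 4096) / 8 )) $(( i % 8 ))"
}
```

For example, bitmap_blocks 2880 prints 1 (a 1.44M floppy needs a single bitmap block), and bitmap_pos 4097 prints "1 0 0", the first bit of the second bitmap block.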

Root directory

The root directory follows the bitmap blocks. The root directory is a "normal" directory (see the "Directories" section). It is initially created by the dinit utility with enough room for 32 directory entries (4 blocks).
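The figures above (32 entries in 4 blocks) imply eight directory entries per 512-byte block, or 64 bytes per entry. That entry size is inferred from the numbers in this section rather than taken from <sys/fsys.h>, so treat the following as a rough sizing sketch; dir_blocks is a hypothetical name.

```shell
# Sketch only: assumes 8 directory entries per 512-byte block,
# as implied by "32 directory entries (4 blocks)".
# dir_blocks N: blocks needed to hold N directory entries.
dir_blocks() {
    echo $(( ($1 + 7) / 8 ))
}
```

With these assumptions, dir_blocks 32 prints 4, matching the initial size that dinit gives the root directory.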

As the following illustration shows, the root directory (/) contains directory entries for several special files that always exist in a QNX filesystem. The dinit utility creates these files when the filesystem is first initialized.

[Illustration: contents of the root directory]

File: Description:
/. A link to the / directory
/.. Also a link to the / directory
/.bitmap Represents a read-only file consisting of the bitmap blocks.
/.inodes A normal file of at least one block on a floppy/RAM disk and 16 blocks on other disks, /.inodes is a collection of inode entries. The first entry is reserved and used as a signature/info area. The first bytes of the /.inodes file are "IamTHE.inodeFILE".
/.boot Represents an OS image file that will be loaded into memory during the standard boot process. This file will be of zero length if no boot file exists.
/.altboot Represents an OS image file that will be loaded into memory during the alternate boot process. This file will be of zero length if no alternate boot file exists.
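When you are rebuilding a damaged filesystem, the 16-byte signature gives a quick way to confirm that a file (or a block copied off the disk into a file) really is the start of /.inodes. The check below is a minimal sketch using dd; has_inodes_signature is a hypothetical name, and it assumes the signature sits at offset 0 exactly as described in the table above.

```shell
# Sketch only: hypothetical helper, not a QNX utility.
# has_inodes_signature FILE: succeeds if FILE starts with the
# /.inodes signature.
has_inodes_signature() {
    sig=$(dd if="$1" bs=16 count=1 2>/dev/null)
    [ "$sig" = "IamTHE.inodeFILE" ]
}
```

A typical use would be `has_inodes_signature /hd/.inodes && echo "signature found"`, with /hd being wherever the partition is mounted.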

Directories

A directory is simply a file that has special meaning to the filesystem. A directory file contains a collection of directory entries as shown in the following illustration:

[Illustration: layout of a directory entry]

The type of directory entry is determined by the bits in the d_status field, as follows:

Bit 3 (_FILE_LINK)    Bit 0 (_FILE_USED)    Comment:
0                     0                     unused directory entry
0                     1                     normal, used directory entry
1                     0                     link to an entry in /.inodes (which should be used)
1                     1                     invalid
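If you are reading d_status bytes off the raw disk, the table above can be applied mechanically. The following sketch decodes a status value using only the two bits listed (bit 0 for _FILE_USED, bit 3 for _FILE_LINK); entry_type is a hypothetical name, and any other bits in d_status are ignored.

```shell
# Sketch only: hypothetical helper that applies the d_status table.
# entry_type STATUS: classify a directory entry from its d_status value.
entry_type() {
    used=$(( $1 & 1 ))           # bit 0: _FILE_USED
    link=$(( ($1 >> 3) & 1 ))    # bit 3: _FILE_LINK
    case "$link$used" in
        00) echo "unused" ;;
        01) echo "normal" ;;
        10) echo "link" ;;
        11) echo "invalid" ;;
    esac
}
```

For example, entry_type 8 prints "link" and entry_type 9 prints "invalid".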

The first directory entry is always for the file "." and includes a directory signature ("I[heart-symbol]QNX"). The hexadecimal equivalent of the [heart-symbol] character is 0x03. This entry refers to the directory itself by pointing to the entry within the parent directory that describes this directory.

The second entry is always for the ".." file. This entry refers to the parent directory by pointing to the first block of the parent directory.

Every directory entry either defines a file or points to an entry within the /.inodes file. Inode entries are used when the filename exceeds 16 characters or when two or more names are linked to a single file.

The first extent (if any) of a file is described in the directory/inode entry. Additional file extents require a linked list of extent blocks whose header is also in the directory/inode entry. Each extent block in the chain points to between 1 and 60 extents.

Links

Files with names greater than 16 characters and links to other files are implemented with a special form of directory entry. These entries are identified with the _FILE_LINK bit (0x08) of the d_status field being set.

For these files, a portion of the directory entry is moved into the /.inodes file.

[Illustration: a link-type directory entry and its /.inodes entry]

Extent blocks

Extent blocks are used for any file that has more than a single extent. The directory entry di_xblk points to one of these extent blocks, which in turn defines where the second and subsequent extents are to be found.

An extent block is exactly one 512-byte disk block with the following form:

[Illustration: layout of an extent block]

Each extent block contains:

The first extent block also contains a redundant pointer to the first file extent (also described within the directory/inode entry). This lets you recover all data in the file by locating this block alone.

Files

Files or file extents are groupings of blocks described by directory/inode entries; they have no structure imposed on them by the QNX filesystem.

Most files in QNX have the following overall structure:

[Illustration: overall structure of a file]

File maintenance utilities

fdisk

The fdisk utility creates and maintains the partition block on a hard disk. This block is compatible with other operating systems and may be maintained by their versions of fdisk (although ours has the advantage of recognizing QNX-specific information). If the partition loader is missing or damaged, fdisk can create it.


Note: We recommend you keep a hard copy of the partition table information for every disk in your network.

dinit

The dinit utility creates (but Fsys maintains) the following:

chkfsys

The chkfsys utility is your principal filesystem maintenance tool. This utility:

The chkfsys utility assumes that the root block is valid. If the root block isn't valid, chkfsys will complain and give up -- you'll need to try restoring the root block with the dinit utility.

dcheck

The dcheck utility verifies that a disk has been correctly formatted by attempting to read every block on the drive. When the -m option is specified, dcheck removes any bad blocks from the disk allocation bitmap (/.bitmap).

If the file /.bad_blks is found, dcheck will update the bitmap and recreate the /.bad_blks file. You can run dcheck a few times to increase your chances of bad blocks being recognized and added to the /.bad_blks file.
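Running dcheck several times in a row is easy to script. The loop below is a minimal sketch: repeat_dcheck is a hypothetical name, the default of three passes is arbitrary, and whether dcheck returns a nonzero status when it finds bad blocks is an assumption.

```shell
# Sketch only: hypothetical wrapper around dcheck.
# repeat_dcheck DEVICE [PASSES]: run "dcheck -m" on DEVICE several times.
repeat_dcheck() {
    dev=$1
    passes=${2:-3}
    i=1
    while [ "$i" -le "$passes" ]; do
        # Assumption: a nonzero status means this pass hit a problem.
        dcheck -m "$dev" || echo "pass $i returned a nonzero status" >&2
        i=$((i + 1))
    done
}
```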

zap

The zap utility lets root remove files or directories from the filesystem without returning the used blocks to the free list. You might do this for several reasons, including the following:

Recovering a zapped file

If you zapped a file in error, it's sometimes possible to recover the zapped file using the zap utility with the -u option immediately after the deletion. You can recover a zapped file using zap under these conditions:

spatch

The spatch utility lets you browse the raw disk and patch minor problems. You can sometimes cure transient disk problems by reading and writing the failing block with spatch.

Disk recovery procedures

Using chkfsys

The chkfsys utility is your principal tool for checking and restoring a potentially damaged filesystem. It can identify and correct a host of minor problems as well as verify the integrity of the disk system as a whole.

Normally, chkfsys requires that the filesystem be idle and that no files be currently open on that device. You'll have to shut down any processes that have opened files or that may need to open files while chkfsys is running.

To run chkfsys on a mount point, you'd simply type:

chkfsys /

The utility scans the entire disk partition from the root down, building an internal copy of the bitmap and verifying the consistency of all files and directories it finds in the process.

When it has finished processing all files, chkfsys compares the internal bitmap to the bitmap on the disk. If they match, chkfsys is finished. If any discrepancies are found, chkfsys will -- upon your approval -- rewrite the bitmap with data consistent with the files it was able to find and verify.

In addition to verifying block allocation (bitmap), chkfsys attempts to fix any problems it finds during the scan. For example, chkfsys can:

When to run chkfsys

It's a good idea to run chkfsys as part of your regularly scheduled maintenance procedures -- this lets you verify that the data on your disk is intact. For example, you might consider running chkfsys on your network servers every time they boot. An automated check on the filesystem at boot time guarantees that chkfsys will attempt to fix any problems it finds during the scan. To automate this process, add chkfsys to the server's sysinit.node file.

It's especially important to run chkfsys after a system crash, power outage, or unexpected system reboot so that you can identify whether any files have been damaged. The chkfsys utility checks the "clean" flag on the disk to determine whether the filesystem was in a consistent state when the system went down.

The clean flag is stored on disk and is maintained by the system. The flag is turned off whenever a file is opened for update and is turned on after all open files have been closed and the associated data has been flushed from cache to disk. When the clean flag is set, chkfsys assumes that the filesystem is intact. If chkfsys finds the clean flag off, it tries to fix the problem.

The chkfsys utility supports a -u option, which overrides a set clean flag and tells chkfsys to run unconditionally. You might want to override the clean flag when:

Using chkfsys on a live system

The chkfsys utility normally requires exclusive use of the filesystem to provide a comprehensive verification of the disk.


Caution: There is some risk to running chkfsys on a live system -- both chkfsys and the filesystem are reading and possibly writing the same blocks on the disk. Also, the filesystem has internal cached data about files and directories that can't be updated when chkfsys makes a change. But static changes, in place, on files or directories that Fsys doesn't currently have opened will probably not cause problems.

If you're running an application that can't afford downtime or you couldn't run chkfsys because files were open for updating, try to run chkfsys with the -f option:

chkfsys -f /dev/hd0t77

This invokes a special read-only mode of chkfsys. It will give you a feeling for the overall sanity of your filesystem.

Recovering from a bad block in the middle of a file

Hard disks occasionally develop bad blocks as they age. In some cases, you might be able to recover most or even all the data in a file containing a bad block.

Some bad blocks are the result of power failures or of weak media on the hard disk. In these cases, sometimes simply reading then rewriting a block will "restore" the block for a short period of time. This may allow you to copy the entire file somewhere else before the block goes bad again. This procedure certainly can't hurt, and is often worth a try.

To examine the blocks within a file, you use the spatch utility. When you get to a bad block, spatch should report an error, but it may have actually read a portion of "good" bytes from that block. Writing that same block back will often succeed.

At the same time, spatch will rewrite a correct CRC (Cyclic Redundancy Check) that will make the block good again (but with possibly incorrect data).

You can then copy the entire file somewhere else, and then zap the previously damaged file. To complete the procedure, you mark the marginal block as bad (by adding it to the /.bad_blks file), then run chkfsys to recover the remaining good blocks.

If this procedure fails, you can use the spatch utility to copy as much of the file as possible to another file, and then zap the bad file and run chkfsys.
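As a rough alternative to copying the file block by block with spatch, the standard dd options conv=noerror,sync keep copying past read errors. This is a swapped-in technique, not the spatch procedure described above: salvage is a hypothetical name, and whether the dd on your QNX system honours these POSIX conversion options is an assumption you should check first.

```shell
# Sketch only: uses POSIX dd options, not spatch.
# salvage SRC DST: copy SRC to DST in 512-byte blocks without stopping
# at read errors. With "sync", a short or unreadable block is padded
# with NUL bytes, so DST can end up slightly longer than SRC.
salvage() {
    dd if="$1" of="$2" bs=512 conv=noerror,sync 2>/dev/null
}
```

After a copy such as `salvage /hd/data/log /tmp/log.copy` (both paths are illustrative), you would still zap the damaged file and run chkfsys as described above.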

What to do if your system will no longer boot

If a previously working QNX system suddenly stops working and will no longer boot, then one of the following may have occurred:

The following steps can help you identify the problem. Where possible, corrective actions are suggested.

Step 1 -- Try booting from floppy or across the network

If you have a network to boot over, try booting your machine over the network. Once the machine is booted, you'll need to log in as root and then start up a local filesystem:

Fsys &

If you don't have a network, you'll need to boot from your recovery floppy (described earlier in this chapter) or from the QNX boot floppy that was used to install your system onto the hard disk. The filesystem will already be running in this case, and you'll be logged in as root.

Step 2 -- Start the hard disk driver

You now have to start the appropriate hard disk driver. For example, to start a driver for an Adaptec series 4 SCSI adapter, you would type:

Fsys.aha4scsi &

If you're using another type of driver, enter its name instead.

This should create a block special file called /dev/hd0 that represents the entire hard disk.

Step 3 -- Run fdisk

Running the fdisk utility will immediately give you useful information about the state of your hard disk.

The fdisk utility might report one of several types of problems:

Problem: Error reading block 1
Probable cause: Either the disk controller or the hard disk itself has failed.
Remedy: If the disk is good, replacing the controller card might let you continue using the disk. Otherwise, you'll have to replace the hard drive, reinstall QNX, and restore your files from backup.

Problem: Wrong disk parameters
Probable cause: Your hardware has probably "lost" its information about this hard drive -- likely because the battery for the CMOS memory is running low.
Remedy: Rerunning the hardware setup procedure (or the programmable option select procedure on a PS/2) will normally clear this up. Of course, replacing the battery will make this a more permanent fix.

Problem: Bad partition information
Probable cause: If the disk size is reported correctly by fdisk, but the partition information is wrong, then the data in block 1 of the physical disk has somehow been damaged.
Remedy: Use fdisk to recreate the correct partition information. It's a good idea to write down or print out a hard copy of the correct partition information in case you ever have to do this step.

Step 4 -- Mount the partition and the filesystem

At this point, you have verified that the hardware is working (at least for block 1) and that a valid partition is defined for QNX. You now need to create a block special file for the QNX partition itself and to mount the block special file as a QNX filesystem:

mount -p /dev/hd0 /dev/hd0t77 /hd 

This should create a volume called /dev/hd0t77. Depending on the state of the QNX partition, the mount may or may not fail. If the partition information is correct, there shouldn't be any problem. Since the root (/) already exists (on a floppy or on a remote disk on the network), we've mounted the local hard disk partition as a filesystem with the name /hd.

Your goal now would be to run the chkfsys utility on the disk to examine -- and possibly fix -- the filesystem.


Note: If you booted from floppy and you don't suspect there's any damage to the filesystem on your hard disk (e.g. the system was unable to boot because of a simple error introduced in the boot file or system initialization file), you can change the root prefix to your hard disk partition at this point with the following command, which will resume normal operation of the system:

/hd/bin/prefix -R /=/hd/

If you run this command, you can skip the rest of this section.


If the mount fails...

If the mount fails, the first portion of the QNX partition is probably damaged (since Fsys will refuse to mount what it considers to be a corrupted filesystem).

In this case, you can use the dinit utility to overlay enough good information onto the disk to satisfy Fsys:

dinit -hr /dev/hd0t77

The -r option tells dinit to rewrite:

You should now be able to reissue the mount command and once again try to create a mount point for a QNX filesystem called /hd.

After doing this, you'll need to rebuild the bitmap with chkfsys, even on a good partition.
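The mount, repair, and remount sequence described above can be summarized as a short function. It is a sketch of the order of operations only: mount_qnx_partition is a hypothetical name, the device names are the ones used in this chapter, and because dinit -hr overwrites the start of the partition, you would normally want to confirm each step by hand instead of running it unattended.

```shell
# Sketch only: hypothetical wrapper around the commands in this step.
# Returns 0 once /dev/hd0t77 is mounted as /hd, 1 if it can't be.
mount_qnx_partition() {
    # First attempt: the normal mount.
    mount -p /dev/hd0 /dev/hd0t77 /hd && return 0

    # The mount failed: overlay enough good information to satisfy Fsys.
    echo "mount failed; rewriting the start of the partition with dinit" >&2
    dinit -hr /dev/hd0t77 || return 1

    # Second attempt. The bitmap must still be rebuilt with chkfsys.
    mount -p /dev/hd0 /dev/hd0t77 /hd
}
```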

Step 5 -- Run chkfsys

At least a portion of your QNX filesystem should now be accessible. You can use chkfsys to examine the filesystem and recover as much data as possible.

If the hard disk is mounted as /hd (e.g. the machine boots from floppy), enter:

/hd/bin/chkfsys /hd

If the hard disk is mounted as / (e.g. a network boot), enter:

chkfsys /

In either case, you should make note of any problems reported and allow chkfsys to fix as much as it can. What you do next depends on the result of running chkfsys.

If the disk is unrecoverable

If, for any reason, your disk is completely unrecoverable, read the next section, "Recovering lost files and directories." In some cases, you may need to reinstall QNX from floppy and restore your disk from your backup files.

If significant portions of the filesystem are irreparably damaged, or important files are lost, then restoring from backup might be your best alternative.

If the filesystem is intact

If your filesystem is intact, yet the machine still refuses to boot from hard disk, then either the partition loader or the QNX loader is probably damaged.

To rewrite a partition loader, use fdisk:

fdisk /dev/hd0 loader

To rewrite the QNX loader, use dinit:

dinit -b /dev/hd0t77

You should now be able to boot your system.

Recovering lost files and directories

You may sometimes find that files or directories have been completely lost due to disk corruption. If after running chkfsys you know that certain key files or directories were not recovered, then you might be able to use spatch to recover some or all of this data.

Before attempting this, you should first familiarize yourself with the details of a QNX filesystem (see "Overview of QNX disk structure" in this chapter). You should also study the documentation for the spatch utility in the Utilities Reference.

