Filesystems

QNX Neutrino provides a rich variety of filesystems. Like most service-providing processes in the OS, these filesystems execute outside the kernel; applications use them by communicating via messages generated by the shared-library implementation of the POSIX API.

These filesystems are resource managers as described in this book. Each filesystem adopts a portion of the pathname space (called a mountpoint) and provides filesystem services through the standard POSIX API (open(), close(), read(), write(), lseek(), etc.). Filesystem resource managers take over a mountpoint and manage the directory structure below it. They also check the individual pathname components for permissions and for access authorizations.

Filesystems and pathname resolution

You can seamlessly locate and connect to any service or filesystem that's been registered with the process manager. When a filesystem resource manager registers a mountpoint, the process manager creates an entry in the internal mount table for that mountpoint and its corresponding server ID (i.e. the nd, pid, chid identifiers).

This table effectively joins multiple filesystem directories into what users perceive as a single directory. The process manager handles the mountpoint portion of the pathname; the individual filesystem resource managers take care of the remaining parts of the pathname. Filesystems can be registered (i.e. mounted) in any order.

When a pathname is resolved, the process manager contacts all the filesystem resource managers that can handle some component of that path. The result is a collection of file descriptors that can resolve the pathname.

If the pathname represents a directory, the process manager asks all the filesystems that can resolve the pathname for a listing of files in that directory when readdir() is called. If the pathname isn't a directory, then the first filesystem that resolves the pathname is accessed.

For more information on pathname resolution, see the section “Pathname management” in the chapter on the Process Manager in this guide.

Filesystem classes

Filesystems as shared libraries

Since it's common to run many filesystems under QNX Neutrino, they have been designed as a family of drivers and shared libraries to maximize code reuse. This means the cost of adding an additional filesystem is typically smaller than might otherwise be expected.

Once an initial filesystem is running, the incremental memory cost for additional filesystems is minimal, since only the code to implement the new filesystem protocol would be added to the system.

As shown in this diagram, the filesystems and io-blk are implemented as shared libraries (essentially passive blocks of code resident in memory), while the devb-* driver is the executing process that calls into the libraries. In operation, the driver process starts first and invokes the block-level shared library (io-blk.so). The filesystem shared libraries may be dynamically loaded later to provide filesystem interfaces and services.

A “filesystem” shared library implements a filesystem protocol or “personality” on a set of blocks on a physical disk device. The filesystems aren't built into the OS kernel; rather, they're dynamic entities that can be loaded or unloaded on demand.

For example, a removable storage device (PCCard flash card, floppy disk, removable cartridge disk, etc.) may be inserted at any time, with any of a number of filesystems stored on it. While the hardware the driver interfaces to is unlikely to change dynamically, the on-disk data structure could vary widely. The dynamic nature of the filesystem copes with this very naturally.

io-blk

Most of the filesystem shared libraries ride on top of the Block I/O module (io-blk.so). This module also acts as a resource manager and exports a block-special file for each physical device. For a system with two hard disks the default files would be:

These files represent each raw disk and may be accessed using all the normal POSIX file primitives (open(), close(), read(), write(), lseek(), etc.). Although the io-blk module can support a 64-bit offset on seek, the driver interface is 32-bit, allowing access to 2-terabyte disks.

Builtin RAM disk

The io-blk module supports an internal RAM-disk device that can be created via a command-line option (blk ramdisk=size). Since this RAM disk is internal to the io-blk module (rather than created and maintained by an additional device driver such as devb-ram), performance is significantly better than that of a dedicated RAM-disk driver.

By incorporating the RAM-disk device directly at the io-blk layer, the device's data memory parallels the main cache, so I/O operations to this device can bypass the buffer cache, eliminating a memory copy yet still retaining coherency. Contrast this with a driver-level implementation (e.g. devb-ram) where transparently presenting the RAM as a block device involves additional memory copies and duplicates data in the buffer cache. Inter-DLL callouts are also eliminated. In addition, there are benefits in terms of installation footprint for systems that have a hard disk and also want a RAM disk — only the single driver is needed.

Partitions

QNX Neutrino complies with the de facto industry standard for partitioning a disk. This allows a number of filesystems to share the same physical disk. Each partition is also represented as a block-special file, with the partition type appended to the filename of the disk it's located on. In the above “two-disk” example, if the first disk had a QNX partition and a DOS partition, while the second disk had only a QNX partition, then the default files would be:

Type	Filesystem
1	DOS (12-bit FAT)
4	DOS (16-bit FAT; partitions <32 MB)
5	DOS Extended Partition (enumerated but not presented)
6	DOS 4.0 (16-bit FAT; partitions ≥32 MB)
7	OS/2 HPFS
7	Windows NT
11	DOS 32-bit FAT; partitions up to 2047 GB
12	Same as Type 11, but uses Logical Block Address `Int 13h` extensions
14	Same as Type 6, but uses Logical Block Address `Int 13h` extensions
15	Same as Type 5, but uses Logical Block Address `Int 13h` extensions
77	QNX POSIX partition (secondary)
78	QNX POSIX partition (secondary)
79	QNX POSIX partition
99	UNIX
131	Linux (Ext2)
175	Apple Macintosh HFS or HFS Plus
177	QNX Power-Safe POSIX partition (secondary)
178	QNX Power-Safe POSIX partition (secondary)
179	QNX Power-Safe POSIX partition

Buffer cache

The io-blk shared library implements a buffer cache that all filesystems inherit. The buffer cache attempts to store frequently accessed filesystem blocks in order to minimize the number of times a system has to perform a physical I/O to the disk.

Read operations are synchronous; write operations are usually asynchronous. When an application writes to a file, the data enters the cache, and the filesystem manager immediately replies to the client process to indicate that the data has been written. The data is then written to the disk.

Critical filesystem blocks such as bitmap blocks, directory blocks, extent blocks, and inode blocks are written immediately and synchronously to disk.

Applications can modify write behavior on a file-by-file basis. For example, a database application can cause all writes for a given file to be performed synchronously. This would ensure a high level of file integrity in the face of potential hardware or power problems that might otherwise leave a database in an inconsistent state.

Filesystem limitations

POSIX defines the set of services a filesystem must provide. However, not all filesystems are capable of delivering all those services:

Filesystem	Access date	Modification date	Status change date	Filename length^a	Permissions	Directories	Hard links	Soft links	Decompression on read
Image	No	No	No	255	Yes	No	No	No	No
RAM	Yes	Yes	Yes	255	Yes	No	No	No	No
ETFS	Yes	Yes	Yes	91	Yes	Yes	No	Yes	No
QNX 4	Yes	Yes	Yes	48^b	Yes	Yes	Yes	Yes	No
Power-Safe	Yes	Yes	Yes	510	Yes	Yes	Yes	Yes	No
DOS	Yes^c	Yes	No	8.3^d	No	Yes	No	No	No
NTFS	Yes	Yes	No	255	No	Yes	No	No	Yes
CD-ROM	Yes^e	Yes^e	Yes^e	207^f	Yes^e	Yes	No	Yes^e	No
UDF	Yes	Yes	Yes	254	Yes	Yes	No	No	No
HFS	Yes	Yes	Yes	255^g	Yes	Yes	No	No	No
FFS3	No	Yes	Yes	255	Yes	Yes	No	Yes	Yes
NFS	Yes	Yes	Yes	—^h	Yes^h	Yes	Yes^h	Yes^h	No
CIFS	No	Yes	No	—^h	Yes^h	Yes	No	No	No
Ext2	Yes	Yes	Yes	255	Yes	Yes	Yes	Yes	No

^a Our internal representation for file names is UTF-8, which uses a variable number of bytes per character. Many on-disk formats instead use UCS2, which is a fixed number (2 bytes). Thus a length limit in characters may be 1, 2, or 3 times that number in bytes, as we convert from on-disk to OS representation. The lengths for the QNX 4, Power-Safe, and EXT2 filesystems are in bytes; those for NTFS, HFS, UDF, CD/Joliet and DOS/VFAT are in characters.

Image filesystem

Every QNX Neutrino system image provides a simple read-only filesystem that presents the set of files built into the OS image.

Since this image may include both executables and data files, this filesystem is sufficient for many embedded systems. If additional filesystems are required, they would be placed as modules within the image where they can be started as needed.

RAM “filesystem”

Every QNX system also provides a simple RAM-based “filesystem” that allows read/write files to be placed under /dev/shmem.

Note that /dev/shmem isn't actually a filesystem. It's a window onto the shared memory names that happens to have some filesystem-like characteristics.

This RAM filesystem finds the most use in tiny embedded systems where persistent storage across reboots isn't required, yet where a small, fast, temporary-storage filesystem with limited features is called for.

The filesystem comes for free with procnto and doesn't require any setup. You can simply create files under /dev/shmem and grow them to any size (depending on RAM resources).

Although the RAM filesystem itself doesn't support hard or soft links or directories, you can create a link to it by using process-manager links. For example, you could create a link to a RAM-based /tmp directory:

This tells procnto to create a process manager link to /dev/shmem known as “/tmp.” Application programs can then open files under /tmp as if it were a normal filesystem.

In order to minimize the size of the RAM filesystem code inside the process manager, this filesystem specifically doesn't include “big filesystem” features such as file locking and directory creation.

Embedded transaction filesystem (ETFS)

ETFS implements a high-reliability filesystem for use with embedded solid-state memory devices, particularly NAND flash memory. The filesystem supports a fully hierarchical directory structure with POSIX semantics as shown in the table above.

ETFS is a filesystem composed entirely of transactions. Every write operation, whether of user data or filesystem metadata, consists of a transaction. A transaction either succeeds or is treated as if it never occurred.

Transactions never overwrite live data. A write in the middle of a file or a directory update always writes to a new unused area. In this way, if the operation fails part way through (due to a crash or power failure), the old data is still intact.

Some log-based filesystems also operate under the principle that live data is never overwritten. But ETFS takes this to the extreme by turning everything into a log of transactions. The filesystem hierarchy is built on the fly by processing the log of transactions in the device. This scan occurs at startup, but is designed such that only a small subset of the data is read and CRC-checked, resulting in faster startup times without sacrificing reliability.

Transactions are position-independent in the device and may occur in any order. You could read the transactions from one device and write them in a different order to another device. This is important because it allows bulk programming of devices containing bad blocks that may be at arbitrary locations.

This design is well-suited for NAND flash memory. NAND flash is shipped with factory-marked bad blocks that may occur in any location.

Inside a transaction

Each transaction consists of a header followed by data. The header contains the following:

Types of storage media

Although best for NAND devices, ETFS also supports other types of embedded storage media by using driver classes as follows:

Class	CRC	ECC	Wear-leveling erase	Wear-leveling read	Cluster size
NAND 512+16	Yes	Yes	Yes	Yes	1 KB
NAND 2048+64	Yes	Yes	Yes	Yes	2 KB
RAM	No	No	No	No	1 KB
SRAM	Yes	No	No	No	1 KB
NOR	Yes	No	Yes	No	1 KB

Although ETFS can support NOR flash, we recommend instead the FFS3 filesystem (devf-*), which is designed explicitly for NOR flash devices.

Reliability features

ETFS is designed to survive across a power failure, even during an active flash write or block erase. The following features contribute to its reliability:

Dynamic wear-leveling

Flash memory allows a limited number of erase cycles on a flash block before the block will fail. This number can be as low as 100,000. ETFS tracks the number of erases on each block. When selecting a block to use, ETFS attempts to spread the erase cycles evenly over the device, dramatically increasing its life. The difference can be extreme: from usage scenarios of failure within a few days without wear-leveling to over 40 years with wear-leveling.

Static wear-leveling

Filesystems often consist of a large number of static files that are read but not written. These files will occupy flash blocks that have no reason to be erased. If the majority of the files in flash are static, this will cause the remaining blocks containing dynamic data to wear at a dramatically increased rate.

ETFS notices these underworked static blocks and forces them into service by copying their data to an overworked block. This solves two problems: It gives the overworked block a rest, since it now contains static data, and it forces the underworked static block into the dynamic pool of blocks.

CRC error detection

Each transaction is protected by a cyclic redundancy check (CRC). This ensures quick detection of corrupted data, and forms the basis for the rollback operation of damaged or incomplete transactions at startup. The CRC can detect multiple bit errors that may occur during a power failure.

ECC error correction

On a CRC error, ETFS can apply error correction coding (ECC) to attempt to recover the data. This is suitable for NAND flash memory, in which single-bit errors may occur during normal usage. An ECC error is a warning signal that the flash block the error occurred in may be getting weak, i.e. losing charge.

ETFS will mark the weak block for a refresh operation, which copies the data to a new flash block and erases the weak block. The erase recharges the flash block.

Read degradation monitoring with automatic refresh

Each read operation within a NAND flash block weakens the charge maintaining the data bits. Most devices support about 100,000 reads before there's danger of losing a bit. The ECC will recover a single-bit error, but may not be able to recover multi-bit errors.

ETFS solves this by tracking reads and marking blocks for refresh before the 100,000 read limit is reached.

Transaction rollback

When ETFS starts, it processes all transactions and rolls back (discards) the last partial or damaged transaction. The rollback code is designed to handle a power failure during a rollback operation, thus allowing the system to recover from multiple nested faults. The validity of a transaction is protected by CRC codes on each transaction.

Atomic file operations

ETFS implements a very simple directory structure on the device, allowing significant modifications with a single flash write. For example, the move of a file or directory to another directory is often a multistage operation in most filesystems. In ETFS, a move is accomplished with a single flash write.

Automatic file defragmentation

Log-based filesystems often suffer from fragmentation, since each update or write to an existing file causes a new transaction to be created. ETFS uses write-buffering to combine small writes into larger write transactions in an attempt to minimize fragmentation caused by lots of very small transactions. ETFS also monitors the fragmentation level of each file and will do a background defragmenting operation on files that do become badly fragmented. Note that this background activity will always be preempted by a user data request in order to ensure immediate access to the file being defragmented.

QNX 4 filesystem

The QNX 4 filesystem (fs-qnx4.so) is a high-performance filesystem that shares the same on-disk structure as in the QNX 4 RTOS.

The QNX 4 filesystem implements an extremely robust design, utilizing an extent-based, bitmap allocation scheme with fingerprint control structures to safeguard against data loss and to provide easy recovery. Features include:

Since the release of 6.2.1, the 48-character filename limit has increased to 505 characters via a backwards-compatible extension. The same on-disk format is retained, but new systems will see the longer name, old ones will see a truncated 48-character name.

For more information, see “QNX 4 filesystem” in the Working with Filesystems chapter of the QNX Neutrino User's Guide.

Power-Safe filesystem

The Power-Safe filesystem, supported by the fs-qnx6.so shared object, is a reliable disk filesystem that can withstand power failures without losing or corrupting data. It was designed for and is intended for traditional rotating hard disk drive media.

Problems with existing disk filesystems

Although existing disk filesystems are designed to be robust and reliable, there's still the possibility of losing data, depending on what the filesystem is doing when a catastrophic failure (such as a power failure) occurs:

Copy-on-write filesystem

To address the problems associated with existing disk filesystems, the Power-Safe filesystem never overwrites live data; it does all updates using copy-on-write (COW), assembling a new view of the filesystem in unused blocks on the disk. The new view of the filesystem becomes “live” only when all the updates are safely written on the disk. Everything is COW: both metadata and user data are protected.

To see how this works, let's consider how the data is stored. A Power-Safe filesystem is divided into logical blocks, the size of which you can specify when you use mkqnx6fs to format the filesystem. Each inode includes 16 pointers to blocks. If the file is smaller than 16 blocks, the inode points to the data blocks directly. If the file is any bigger, those 16 blocks become pointers to more blocks, and so on.

The final block pointers to the real data are all in the leaves and are all at the same level. In some other filesystems — such as EXT2 — a file always has some direct blocks, some indirect ones, and some double indirect, so you go to different levels to get to different parts of the file. With the Power-Safe filesystem, all the user data for a file is at the same level.

If you change some data, it's written in one or more unused blocks, and the original data remains unchanged. The list of indirect block pointers must be modified to refer to the newly used blocks, but again the filesystem copies the existing block of pointers and modifies the copy. The filesystem then updates the inode — once again by modifying a copy — to refer to the new block of indirect pointers. When the operation is complete, the original data and the pointers to it remain intact, but there's a new set of blocks, indirect pointers, and inode for the modified data:

A superblock is a global root block that contains the inodes for the system bitmap and inodes files. A Power-Safe filesystem maintains two superblocks:

The working superblock can include pointers to blocks in the stable superblock. These blocks contain data that hasn't yet been modified. The inodes and bitmap for the working superblock grow from it.

A snapshot is a consistent view of the filesystem (simply a committed superblock). To take a snapshot, the filesystem:

To mount the disk at startup, the filesystem simply reads the superblocks from disk, validates their CRCs, and then chooses the one with the higher sequence number. There's no need to run chkfsys or replay a transaction log. The time it takes to mount the filesystem is the time it takes to read a couple of blocks.

If the drive doesn't support synchronizing, fs-qnx6.so can't guarantee that the filesystem is power-safe. Before using this filesystem on devices — such as USB/Flash devices — other than traditional rotating hard disk drive media, check to make sure that your device meets the filesystem's requirements. For more information, see “Required properties of the device” in the entry for fs-qnx6.so in the Utilities Reference.

Performance

The performance of the filesystem depends on how much buffer cache is available, and on the frequency of the snapshots. Snapshots occur periodically (every 10 seconds, or as specified by the snapshot option to fs-qnx6.so), and also when you call sync() for the entire filesystem, or fsync() for a single file.

Synchronization is at the filesystem level, not at that of individual files, so fsync() is potentially an expensive operation; the Power-Safe filesystem ignores the O_SYNC flag.

You can also turn snapshots off if you're doing some long operation, and the intermediate states aren't useful to you. For example, suppose you're copying a very large file into a Power-Safe filesystem. The cp utility is really just a sequence of basic operations:

If the file is big enough so that copying it spans snapshots, you have on-disk views that include the file not existing, the file existing at a variety of sizes, and finally the complete file copied and its IDs and permissions set:

Each snapshot is a valid point-in-time view of the filesystem (i.e. if you've copied 50 MB, the size is 50 MB, and all data up to 50 MB is also correctly copied and available). If there's a power failure, the filesystem is restored to the most recent snapshot. But the filesystem has no concept that the sequence of open(), write(), and close() operations is really one higher-level operation, cp. If you want the higher-level semantics, disable the snapshots around the cp, and then the middle snapshots won't happen, and if a power failure occurs, the file will either be complete, or not there at all.

For information about using this filesystem, see “Power-Safe filesystem” in the Working with Filesystems chapter of the QNX Neutrino User's Guide.

DOS Filesystem

The DOS Filesystem, fs-dos.so, provides transparent access to DOS disks, so you can treat DOS filesystems as though they were POSIX filesystems. This transparency allows processes to operate on DOS files without any special knowledge or work on their part.

The structure of the DOS filesystem on disk is old and inefficient, and lacks many desirable features. Its only major virtue is its portability to DOS and Windows environments. You should choose this filesystem only if you need to transport DOS files to other machines that require it. Consider using the QNX filesystem alone if DOS file portability isn't an issue or in conjunction with the DOS filesystem if it is.

If there's no DOS equivalent to a POSIX feature, fs-dos.so will either return an error or a reasonable default. For example, an attempt to create a link() will result in the appropriate errno being returned. On the other hand, if there's an attempt to read the POSIX times on a file, fs-dos.so will treat any of the unsupported times the same as the last write time.

DOS version support

The fs-dos.so program supports both floppies and hard disk partitions from DOS version 2.1 to Windows 98 with long filenames.

DOS text files

DOS terminates each line in a text file with two characters (CR/LF), while POSIX (and most other) systems terminate each line with a single character (LF). Note that fs-dos.so makes no attempt to translate text files being read. Most utilities and programs won't be affected by this difference.

Note also that some very old DOS programs may use a Ctrl-Z (^Z) as a file terminator. This character is also passed through without modification.

QNX-to-DOS filename mapping

An attempt to create a file that contains one of these invalid characters will return an error. DOS (8.3 format) also expects all alphabetical characters to be uppercase, so fs-dos.so maps these characters to uppercase when creating a filename on disk. But it maps a filename to lowercase by default when returning a filename to a QNX Neutrino application, so that QNX Neutrino users and programs can always see and type lowercase (via the sfn=sfn_mode option).

Handling filenames

You can specify how you want fs-dos.so to handle long filenames (via the lfn=lfn_mode option):

If you use the ignore option, you can specify whether or not to silently truncate filename characters beyond the 8.3 limit.

International filenames

The DOS filesystem supports DOS “code pages” (international character sets) for locale filenames. Short 8.3 names are stored using a particular character set (typically the most common extended characters for a locale are encoded in the 8th-bit character range). All the common American as well as Western and Eastern European code pages (437, 850, 852, 866, 1250, 1251, 1252) are supported. If you produce software that must access a variety of DOS/Windows hard disks, or operate in non-US-English countries, this feature offers important portability — filenames will be created with both a Unicode and locale name and are accessible via either name.

The DOS filesystem supports international text in filenames only. No attempt is made to be aware of data contents, with the sole exception of Windows “shortcut” (.LNK) files, which will be parsed and translated into symbolic links if you've specified that option (lnk=lnk_mode).

DOS volume labels

DOS uses the concept of a volume label, which is an actual directory entry in the root of the DOS filesystem. To distinguish between the volume label and an actual DOS directory, fs-dos.so reports the volume label according to the way you specify its vollabel option. You can choose to:

DOS-QNX permission mapping

DOS doesn't support all the permission bits specified by POSIX. It has a READ_ONLY bit in place of separate READ and WRITE bits; it doesn't have an EXECUTE bit. When a DOS file is created, the DOS READ_ONLY bit will be set if all the POSIX WRITE bits are off. When a DOS file is accessed, the POSIX READ bit is always assumed to be set for user, group, and other.

Since you can't execute a file that doesn't have EXECUTE permission, fs-dos.so has an option (exe=exec_mode) that lets you specify how to handle the POSIX EXECUTE bit for executables.

File ownership

Although the DOS file structure doesn't support user IDs and group IDs, fs-dos.so (by default) won't return an error code if an attempt is made to change them. An error isn't returned because a number of utilities attempt to do this and failure would result in unexpected errors. The approach taken is “you can change anything to anything since it isn't written to disk anyway.”

The posix= options let you set stricter POSIX checks and enable POSIX emulation. For example, in POSIX mode, an error of EINVAL is flagged for attempts to do any of the following:

If you set the posix option to emulate (the default) or strict, you get the following benefits:

CD-ROM filesystem

The CD-ROM filesystem provides transparent access to CD-ROM media, so you can treat CD-ROM filesystems as though they were POSIX filesystems. This transparency allows processes to operate on CD-ROM files without any special knowledge or work on their part.

The fs-cd.so manager implements the ISO 9660 standard as well as a number of extensions, including Rock Ridge (RRIP), Joliet (Microsoft), and multisession (Kodak Photo CD, enhanced audio).

We've deprecated fs-cd.so in favor of fs-udf.so, which now supports ISO-9660 filesystems in addition to UDF. For information about UDF, see “Universal Disk Format (UDF) filesystem,” later in this chapter.

FFS3 filesystem

The FFS3 filesystem drivers implement a POSIX-like filesystem on NOR flash memory devices. The drivers are standalone executables that contain both the flash filesystem code and the flash device code. There are versions of the FFS3 filesystem driver for different embedded systems hardware as well as PCMCIA memory cards.

The naming convention for the drivers is devf-system, where system describes the embedded system. For example, the devf-800fads driver is for the 800FADS PowerPC evaluation board.

To find out what flash devices we currently support, please refer to the following sources:

Customization

Along with the prebuilt flash filesystem drivers, including the “generic” driver (devf-generic), we provide the libraries and source code that you'll need to build custom flash filesystem drivers for different embedded systems. For information on how to do this, see the Customizing the Flash Filesystem chapter in the Building Embedded Systems book.

Organization

The FFS3 filesystem drivers support one or more logical flash drives. Each logical drive is called a socket, which consists of a contiguous and homogeneous region of flash memory. For example, in a system containing two different types of flash device at different addresses, where one flash device is used for the boot image and the other for the flash filesystem, each flash device would appear in a different socket.

Each socket may be divided into one or more partitions. Two types of partitions are supported: raw partitions and flash filesystem partitions.

Raw partitions

A raw partition in the socket is any partition that doesn't contain a flash filesystem. The driver doesn't recognize any filesystem types other than the flash filesystem. A raw partition may contain an image filesystem or some application-specific data.

The filesystem will make accessible through a raw mountpoint (see below) any partitions on the flash that aren't flash filesystem partitions. Note that the flash filesystem partitions are available as raw partitions as well.

Filesystem partitions

A flash filesystem partition contains the POSIX-like flash filesystem, which uses a QNX proprietary format to store the filesystem data on the flash devices. This format isn't compatible with either the Microsoft FFS2 or PCMCIA FTL specification.

The filesystem allows files and directories to be freely created and deleted. It recovers space from deleted files using a reclaim mechanism similar to garbage collection.

Mountpoints

When you start the flash filesystem driver, it will by default mount any partitions it finds in the socket. There can only be one instance of the driver per flash device, and the driver must be provided with the full size of the flash chip. Note that you can specify the mountpoint using mkefs or flashctl (e.g. /flash).

Mountpoint	Description
`/dev/fsX`	raw mountpoint socket X
`/dev/fsXpY`	raw mountpoint socket X partition Y
`/fsXpY`	filesystem mountpoint socket X partition Y
`/fsXpY/.cmp`	filesystem compressed mountpoint socket X partition Y

Features

The FFS3 filesystem supports many advanced features, such as POSIX compatibility, multiple threads, background reclaim, fault recovery, transparent decompression, endian-awareness, wear-leveling, and error-handling.

POSIX

The filesystem supports the standard POSIX functionality (including long filenames, access privileges, random writes, truncation, and symbolic links) with the following exceptions:

These design compromises allow this filesystem to remain small and simple, yet include most features normally found with block device filesystems.

Storage efficiency

The filesystem uses a fixed-size 32-byte extent header for each piece of user data stored on the filesystem. The filesystem has a 512-byte “append buffer” that's used to coalesce small appends to a file before flushing them out to disk. There's no other caching of user data in the filesystem (read or write). The maximum extent size is currently limited to 4096 bytes of data (plus the 32-byte header) for files which are created at runtime, although mkefs can create 16 KB extents.

Wear leveling

Wear leveling is performed on blocks within a given partition. If the flash is split into multiple partitions, then each partition has a smaller pool to level across. Blocks never migrate from one partition to another.

The most common type of reclaim in the filesystem (a foreground reclaim) always does a double-reclaim operation. The first reclaim always takes the block in the partition with the lowest erase count, and swaps it with the spare. Then the block that requires the reclaim is swapped with the spare. This has shown to result in near-perfect wear leveling when foreground reclaims are used.

Background reclaim

The FFS3 filesystem stores files and directories as a linked list of extents, which are marked for deletion as they're deleted or updated. Blocks to be reclaimed are chosen using a simple algorithm that finds the block with the most space to be reclaimed while keeping level the amount of wear of each individual block. This wear-leveling increases the MTBF (mean time between failures) of the flash devices, thus increasing their longevity.

The background reclaim process is performed when there isn't enough free space. The reclaim process first copies the contents of the reclaim block to an empty spare block, which then replaces the reclaim block. The reclaim block is then erased. Unlike rotating media with a mechanical head, proximity of data isn't a factor with a flash filesystem, so data can be scattered on the media without loss of performance.

Fault recovery

The filesystem has been designed to minimize corruption due to accidental loss-of-power faults. Updates to extent headers and erase block headers are always executed in carefully scheduled sequences. These sequences allow the recovery of the filesystem's integrity in the case of data corruption.

Note that properly designed flash hardware is essential for effective fault-recovery systems. In particular, special reset circuitry must be in place to hold the system in “reset” before power levels drop below critical. Otherwise, spurious or random bus activity can form write/erase commands and corrupt the flash beyond recovery.

Rename operations are guaranteed atomic, even through loss-of-power faults. This means, for example, that if you lost power while giving an image or executable a new name, you would still be able to access the file via its old name upon recovery.

When the FFS3 filesystem driver is started, it scans the state of every extent header on the media (in order to validate its integrity) and takes appropriate action, ranging from a simple block reclamation to the erasure of dangling extent links. This process is merged with the filesystem's normal mount procedure in order to achieve optimal bootstrap timings.

Compression/decompression

For fast and efficient compression/decompression, you can use the deflate and inflator utilities, which rely on popular deflate/inflate algorithms.

The deflate algorithm combines two algorithms. The first takes care of removing data duplication in files; the second algorithm handles data sequences that appear the most often by giving them shorter symbols. Those two algorithms provide excellent lossless compression of data and executable files. The inflate algorithm simply reverses what the deflate algorithm does.

The deflate utility is intended for use with the filter attribute for mkefs. You can also use it to precompress files intended for a flash filesystem.

The inflator resource manager sits in front of the other filesystems that were previously compressed using the deflate utility. It can almost double the effective size of the flash memory.

Compressed files can be manipulated with standard utilities such as cp or ftp — they can display their compressed and uncompressed size with the ls utility if used with the proper mountpoint. These features make the management of a compressed flash filesystem seamless to a systems designer.

Flash errors

As flash hardware wears out, its write state-machine may find that it can't write or erase a particular bit cell. When this happens, the error status is propagated to the flash driver so it can take proper action (i.e. mark the bad area and try to write/erase in another place).

This error-handling mechanism is transparent. Note that after several flash errors, all writes and erases that fail will eventually render the flash read-only. Fortunately, this situation shouldn't happen before several years of flash operation. Check your flash specification and analyze your application's data flow to flash in order to calculate its potential longevity or MTBF.

Endian awareness

The FFS3 filesystem is endian-aware, making it portable across different platforms. The optimal approach is to use the mkefs utility to select the target's endian-ness.

Utilities

The filesystem supports all the standard POSIX utilities such as ls, mkdir, rm, ln, mv, and cp. There are also some QNX Neutrino utilities for managing the flash:

System calls

The filesystem supports all the standard POSIX I/O functions such as open(), close(), read(), and write(). Special functions such as erasing are supported using the non-POSIX devctl() function.

NFS filesystem

The Network File System (NFS) allows a client workstation to perform transparent file access over a network. It allows a client workstation to operate on files that reside on a server across a variety of operating systems. Client file access calls are converted to NFS protocol requests, and are sent to the server over the network. The server receives the request, performs the actual filesystem operation, and sends a response back to the client.

The Network File System operates in a stateless fashion by using remote procedure calls (RPC) and TCP/IP for its transport. Therefore, to use fs-nfs2 or fs-nfs3, you'll also need to run the TCP/IP client for Neutrino.

Any POSIX limitations in the remote server filesystem will be passed through to the client. For example, the length of filenames may vary across servers from different operating systems. NFS (versions 2 and 3) limits filenames to 255 characters; mountd (versions 1 and 3) limits pathnames to 1024 characters.

Although NFS (version 2) is older than POSIX, it was designed to emulate UNIX filesystem semantics and happens to be relatively close to POSIX. If possible, you should use fs-nfs3 instead of fs-nfs2.

CIFS filesystem

Formerly known as SMB, the Common Internet File System (CIFS) allows a client workstation to perform transparent file access over a network to a Windows system, or a UNIX system running an SMB server. Client file access calls are converted to CIFS protocol requests and are sent to the server over the network. The server receives the request, performs the actual filesystem operation, and sends a response back to the client.

The fs-cifs manager uses TCP/IP for its transport. Therefore, to use fs-cifs (SMBfsys in QNX 4), you'll also need to run the TCP/IP client for Neutrino.

Linux Ext2 filesystem

The Ext2 filesystem (fs-ext2.so) provides transparent access to Linux disk partitions. This implementation supports the standard set of features found in Ext2 versions 0 and 1.

Sparse file support is included in order to be compatible with existing Linux partitions. Other filesystems can only be “stacked” read-only on top of sparse files. There are no such restrictions on normal files.

If an Ext2 filesystem isn't unmounted properly, a filesystem checker is usually responsible for cleaning up the next time the filesystem is mounted. Although the fs-ext2.so module is equipped to perform a quick test, it automatically mounts the filesystem as read-only if it detects any significant problems (which should be fixed using a filesystem checker).

Universal Disk Format (UDF) filesystem

The Universal Disk Format (UDF) filesystem provides access to recordable media, such as CD, CD-R, CD-RW, and DVD. It's used for DVD video, but can also be used for backups to CD, and so on. For more information, see http://osta.org/specs/index.htm.

Apple Macintosh HFS and HFS Plus

The Apple Macintosh HFS (Hierarchical File System) and HFS Plus are the filesystems on Apple Macintosh systems.

The fs-mac.so shared object provides read-only access to HFS and HFS Plus disks on a QNX Neutrino system. The following variants are recognized: HFS, HFS Plus, HFS Plus in an HFS wrapper, HFSX, and HFS/ISO-9660 hybrid. It also recognizes HFSJ (HFS Plus with journal), but only when the journal is clean, not when it's dirty from an unclean shutdown.

Windows NT filesystem

The NT filesystem is used on Microsoft Windows NT and later. The fs-nt.so shared object provides read-only access to NTFS disks on a QNX Neutrino system.

Inflator pass-through filesystem

QNX Neutrino provides an Inflator pass-through filesystem, which is a resource manager that sits in front of other filesystems and inflates files that were previously deflated (using the deflate utility).

The inflator utility is typically used when the underlying filesystem is a flash filesystem. Using it can almost double the effective size of the flash memory.

If a file is being opened for a read, inflator attempts to open the file itself on an underlying filesystem. It reads the first 16 bytes and checks for the signature of a deflated file. If the file was deflated, inflator places itself between the application and the underlying filesystem. All reads return the original file data before it was deflated.

From the application's point of view, the file appears to be uncompressed. Random seeks are also supported. If the application does a stat() on the file, the size of the inflated file (the original size before it was deflated) is returned.

Running more than one pass-through filesystem or resource manager on overlapping pathname spaces might cause deadlocks.