TAR Filesystem

The .tar file resource manager is similar to the previous chapter's RAM-disk filesystem manager, and they share quite a bit of code. However, it illustrates a different way of managing files — as a virtual filesystem map of a physical file.

This resource manager lets you cd into a .tar (or, through the magic of the zlib compression library, into a .tar.gz) file, and perform ls, cp, and other commands, as if you had gone through the trouble of (optionally uncompressing and) unpacking the .tar file into a temporary directory. (That's a small lie; a directory entry can't be both a file and a directory at the same time, so you can't cd into the .tar file, but the resource manager creates a .tar.dir file that you can cd into.)

In the Filesystems appendix, I present background information about filesystem implementation within the resource manager framework. You might want to refer to that before, during, or after you read this chapter.

I assume that you've read the RAM-disk Filesystem chapter, because anything that's common between these two resource managers isn't repeated here.

This chapter includes:

Requirements
Design
The code
Variations on a theme
References

Requirements

The requirements for this project stemmed from my desire to “just cd into a .tar file.” I presented a similar, but much more basic, version of the .tar filesystem when I wrote the Writing a Resource Manager course for QSSL. In that course, students were asked to parse a .tar file (with lots of helper code already in place), and write a tiny virtual filesystem that would work like the resource manager presented in this chapter. Unfortunately, due to the limitations on class time, the resource manager presented in that class was very basic — it handled only one type of .tar file (the POSIX standard one, not the GNU one, which has a slightly different format), it mounted only one .tar at a time, and — the most severe limitation — the .tar file could have files only in the “current” directory (i.e. it didn't support subdirectories).

This resource manager remedies all that.

Design

To understand the design of this resource manager, it's necessary to understand the basics of a .tar file, and the two common formats.

Way back in the dim dark early days of UNIX (I think it must have been in the roaring twenties or something), huge 30cm (1 foot) diameter 9-track magnetic tapes were all the rage. In order to transport data from one machine to another, or to make backups, many files would be archived onto these tapes. The name “TAR” stood for Tape ARchive, and the format of the .tar file was an attempt to accommodate tape media. The tapes worked best in block mode; rather than write individual bytes of data, the system would read or write 512-byte (or other sized) blocks of data. When you consider the technology of the day, it made a lot of sense — here was this moving magnetic tape that passed over a head at a certain speed. While the tape was moving, it was good to pump data to (or from) the tape. When the tape stopped, it would be awkward to position it exactly into the middle of a block of data, so most tape operations were block-oriented (with the notable exception of incremental tape drives, but we won't go into that here). A typical magnetic tape could hold tens to hundreds of megabytes of data, depending on the density (number of bits per inch).

The .tar format, therefore, was block-oriented, and completely sequential (as opposed to random access). For every file in the archive, there was a block-aligned header describing the file (permissions, name, owner, date and time, etc.), followed by one or more blocks consisting of the file's data.

Creating a .tar file

Follow along in this command-line session, and I'll show you the resulting .tar file:

# ls -la
total 73
drwxrwxr-x  2 root      root   4096 Aug 17 17:31 ./
drwxrwxrwt  4 root      root   4096 Aug 17 17:29 ../
-rw-rw-r--  1 root      root   1076 Jan 14  2003 io_read.c
-rw-rw-r--  1 root      root    814 Jan 12  2003 io_write.c
-rw-rw-r--  1 root      root   6807 Feb 03  2003 main.c
-rw-rw-r--  1 root      root  11883 Feb 03  2003 tarfs.c
-rw-rw-r--  1 root      root    683 Jan 12  2003 tarfs.h
-rw-rw-r--  1 root      root   6008 Jan 15  2003 tarfs_io_read.c

# tar cvf x.tar *
io_read.c
io_write.c
main.c
tarfs.c
tarfs.h
tarfs_io_read.c

# ls -l x.tar
-rw-rw-r--  1 root      root  40960 Aug 17 17:31 x.tar

Here I've taken some of the source files in a directory and created a .tar file (called x.tar) that ends up being 40960 bytes — a nice multiple of 512 bytes, as we'd expect.

Each of the files is prefixed by a header in the .tar file, followed by the file content, aligned to a 512-byte boundary.

This is what each header looks like:

Offset  Length  Field Name
0       100     name
100     8       mode
108     8       uid
116     8       gid
124     12      size
136     12      mtime
148     8       chksum
156     1       typeflag
157     100     linkname
257     6       magic
263     2       version
265     32      uname
297     32      gname
329     8       devmajor
337     8       devminor
345     155     prefix
500     12      filler
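
In C, the header maps naturally onto a structure of fixed-width character arrays. Here's a sketch of what such a declaration might look like; the field names match the way the code later refers to them (t.name, t.size, and so on), but the actual ustar_t typedef in <tar.h> may spell things differently:

// A sketch of the 512-byte tar header; the real ustar_t lives in
// <tar.h> and may use slightly different field names.  All of the
// numeric fields are ASCII octal.
typedef struct ustar_s
{
    char    name [100];         // name of the stored entity
    char    mode [8];           // permissions plus entity type
    char    uid [8];            // user ID
    char    gid [8];            // group ID
    char    size [12];          // size in bytes (0 for links/symlinks)
    char    mtime [12];         // modification time (seconds since 1970)
    char    chksum [8];         // header checksum (ignored here)
    char    typeflag;           // entity type (see the field descriptions)
    char    linkname [100];     // what a link points to
    char    magic [6];          // "ustar"
    char    version [2];
    char    uname [32];         // user name
    char    gname [32];         // group name
    char    devmajor [8];
    char    devminor [8];
    char    prefix [155];
    char    filler [12];        // pads the header out to 512 bytes
} ustar_t;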

Here's a description of the fields that we're interested in for the filesystem (all fields are ASCII octal unless noted otherwise):

name
The name of the stored entity (plain ASCII).
mode
The mode: read, write, execute permissions, as well as what the entity is (a file, symlink, etc.).
uid
The user ID.
gid
The group ID.
size
The size of the resource (symlinks and links get a 0 size).
typeflag
POSIX says one thing, GNU says another. Under POSIX, this is supposed to be one of the single characters “g,” “x,” or “0.” Under GNU, this is one of the single ASCII digits zero through seven, or an ASCII NUL character, indicating different types of entities. Sigh — “The nice thing about standards is there are so many to choose from.”
mtime
The modification time.
linkname
The name of the file that this file is linked to (or blank if not linked), in plain ASCII.

We've skipped a bunch of fields, such as the checksum, because we don't need them for our filesystem. (For the checksum, for example, we're simply assuming that the file has been stored properly — in the vast majority of cases, it's not actually on an antique 9-track tape — so data integrity shouldn't be a problem!)

What I meant above by “ASCII octal” fields is that the value of the number is encoded as a sequence of ASCII digits in base 8. Really.
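
In other words, to get a number back out of a header field, you parse it as base 8. The resource manager uses sscanf() with the %o conversion for this; an equivalent standalone helper (tar_octal() is a made-up name, not something from the source) might look like:

#include <stdlib.h>
#include <string.h>

// Decode an ASCII-octal header field of a given width.  The field may
// be terminated by a NUL or a space, or may fill its entire width, so
// copy it into a NUL-terminated scratch buffer before converting.
static unsigned long
tar_octal (const char *field, int width)
{
    char    buf [32];

    if (width > (int) sizeof (buf) - 1) {
        width = sizeof (buf) - 1;
    }
    memcpy (buf, field, width);
    buf [width] = '\0';
    return (strtoul (buf, NULL, 8));
}

Decoding the size field of the header shown below ("00000002064") this way gives 1076 decimal.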

For example, here's the very first header in the sample x.tar that we created above (addresses on the left-hand side, as well as the dump contents, are in hexadecimal, with printable ASCII characters on the right-hand side):

0000000 69 6f 5f 72 65 61 64 2e 63 00 00 00 00 00 00 00 io_read.c.......
0000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0000060 00 00 00 00 30 31 30 30 36 36 34 00 30 30 30 30 ....0100664.0000
0000070 30 30 30 00 30 30 30 30 30 30 30 00 30 30 30 30 000.0000000.0000
0000080 30 30 30 32 30 36 34 00 30 37 36 31 31 31 34 31 0002064.07611141
0000090 34 36 35 00 30 31 31 33 33 34 00 20 30 00 00 00 465.011334..0...
00000A0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000B0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000C0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000D0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000E0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000F0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0000100 00 75 73 74 61 72 20 20 00 72 6f 6f 74 00 00 00 .ustar...root...
0000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0000120 00 00 00 00 00 00 00 00 00 72 6f 6f 74 00 00 00 .........root...
0000130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0000140 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0000160 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0000180 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00001A0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00001B0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00001C0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00001D0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00001E0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00001F0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

Here are the fields that we're interested in for our .tar filesystem:

Offset 0
The name of the entity, in this case io_read.c.
Offset 100 (0x64 hex)
The ASCII octal digits 0100664 representing S_IFREG (indicating this is a regular file), with a permission mode of 0664 (indicating it's readable by everyone, but writable by the owner and group only).
Offset 108 (0x6C)
The ASCII octal digits 0000000 representing the user ID (in this case, 0, or root).
Offset 116 (0x74)
The ASCII octal digits 0000000 representing the group ID (0000000, or group 0).
Offset 124 (0x7C)
The ASCII octal digits 00000002064 (or decimal 1076) representing the size of the entity. (This does present a limit to the file size of 77777777777 octal, or 8 gigabytes — not bad for something invented in the days of hard disks the size of washing machines with capacities of tens of megabytes!).
Offset 136 (0x88)
The ASCII octal digits 07611141465 (or decimal 1042596661 seconds after January 1st, 1970, which really converts to “Tue Jan 14 21:11:01 EST 2003.”)

The one interesting wrinkle has to do with items that are in subdirectories.

Depending on how you invoked tar when the archive was created, you may or may not have the directories listed individually within the .tar file. What I mean is that if you add the file dir/spud.txt to the archive, the question is, is there a tar header corresponding to dir? In some cases there will be, in others there won't, so our .tar filesystem will need to be smart enough to create any intermediate directories that aren't explicitly mentioned in the headers.

Note that in all cases the full pathname is given; that is, we will always have dir/spud.txt. We never have a header for the directory for dir followed by a header for the file spud.txt; we'll always get the full path to each and every component listed in a .tar file.

Let's stop and think about how this resource manager compares to the RAM-disk resource manager in the previous chapter. If you squint your eyes a little bit, and ignore a few minor details, you can say they are almost identical. We need to:

The only thing that's really different is that instead of storing the file contents in RAM, we're storing them on disk! (The fact that this is a read-only filesystem isn't really a difference, it's a subset.)

The code

Let's now turn our attention to the code in the .tar filesystem. We'll begin by looking at the data structures that are different from the ones in the RAM disk, and then we'll look at the C modules that are different.

The structures

We still use an extended attributes structure in the .tar filesystem, because we still need to store additional information (like the contents of directories and symlinks). The main modification is the way we store files — we don't. We store only a reference to the file, indicating where in the .tar file the actual file begins, and its name. (The base attributes structure still stores the usual stuff: permissions, file times, etc., and most importantly, the size.)

typedef struct fes_s
{
    char                *name;      // name of the on-disk .tar file
    off_t               off;        // offset of the file's data within it
}   fes_t;

typedef struct cfs_attr_s
{
    iofunc_attr_t       attr;

    int                 nels;
    int                 nalloc;
    union {
        struct des_s    *dirblocks;
//      iov_t           *fileblocks;
        fes_t           vfile;
        char            *symlinkdata;
    } type;
} cfs_attr_t;

As you can see, the cfs_attr_t is almost identical to the RAM-disk structure. In fact, I left the fileblocks member commented-out to show the evolution from one source base to the next.

So our data for the file is now stored in a fes_t data type. All that the fes_t data type contains is the name of the .tar file (where the data is actually stored), and the offset into the .tar — we don't need to store the size of the file, because that's already stored normally in the plain (not extended) attributes structure.

The other difference from the RAM-disk filesystem is that we store the fes_t directly, rather than a pointer to it (as we did with the fileblocks IOV array). That's because the fes_t doesn't ever grow, and storing a 4-byte pointer to a tiny, fixed-size malloc()'d data region will take more space than just storing the 8-byte item directly.

The functions

Any functions that we don't mention here are either exactly the same as the RAM-disk version, or have such minor changes that they're not worth mentioning.

Overall operation is the same as the RAM-disk resource manager — we parse the command-line options, set up some data structures, and enter the resource manager main loop.

The c_mount() function

The first function called is the c_mount() function, which is responsible for accepting the name of a .tar file to operate on, and where to manifest its contents. There's a “mount helper” utility (the file m_main.c) that we'll look at later.

The beginning part of c_mount() is the same as the RAM disk, so we'll just point out the tail-end section that's different:

…

    // 1) allocate an attributes structure
    if (!(cfs_attr = malloc (sizeof (*cfs_attr)))) {
        return ENOMEM;
    }

    // 2) initialize it
    iofunc_attr_init (&cfs_attr -> attr,
                      S_IFDIR | 0555, NULL, NULL);
    cfs_attr_init (cfs_attr);
    cfs_attr -> attr.inode = (int) cfs_attr;
    cfs_a_mknod (cfs_attr, ".", S_IFDIR | 0555, NULL);
    cfs_a_mknod (cfs_attr, "..", S_IFDIR | 0555, NULL);

    // 3) load the tar file
    if (ret = analyze_tar_file (cfs_attr, spec_name)) {
        return (ret);
    }

    // 4) Attach the new pathname with the new value
    ret = resmgr_attach (dpp, &resmgr_attr, mnt_point,
                         _FTYPE_ANY, _RESMGR_FLAG_DIR,
                         &connect_func, &io_func,
                         &cfs_attr -> attr);
    if (ret == -1) {
        free (cfs_attr);
        return (errno);
    }
    return (EOK);
}

The code walkthrough is as follows:

  1. Just like in the RAM disk, we allocate our attributes structure. (This is for synchronization with the RAM-disk source; after this point we diverge.)
  2. The initialization is almost the same as the RAM disk, except we use mode 0555 because we are a read-only filesystem. We could lie and show 0777 if we wanted to, but we'd just run into grief later on (see below).
  3. All of the tar-specific work is done in analyze_tar_file(), which we'll look at next. The function can return a non-zero value if it detects something it doesn't like — if that's the case, we just return it to the resource-manager framework.
  4. Finally, we attach the mount point using resmgr_attach() as per normal (same as the RAM disk).

Part of the grief mentioned in step 2 above actually turns out to have a useful side-effect. If you were to reinstate the RAM-disk portion of the extended attributes structure — even though it's not implemented in the current filesystem manager — you could implement a somewhat “modifiable” .tar filesystem. If you ever went to write to a file, the resource manager would copy the data from the .tar version of the file, and then use the fileblocks member rather than the vfile member for subsequent operations. This might be a fairly simple way of making a few small modifications to an existing tar file, without the need to uncompress and untar the file. You'd then need to re-tar the entire directory structure to include your new changes. Try it and see!
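
If you do try this, the heart of it would be a routine that, on the first write, pulls the file's data out of the .tar and into RAM-disk blocks, then flips the extended attributes structure over to using fileblocks. Here's a rough sketch, assuming you've reinstated the fileblocks member and have the RAM disk's cfs_block_alloc() available; error handling and last-block trimming are omitted, and none of this is in the shipped source:

// Sketch only: turn a .tar-backed file into RAM-disk blocks on first
// write, so that later operations can use fileblocks instead of vfile.
static int
materialize_vfile (cfs_attr_t *a)
{
    gzFile  fd;
    iov_t   *blocks;
    int     nblocks;
    int     i;

    nblocks = (a -> attr.nbytes + BLOCKSIZE - 1) / BLOCKSIZE;
    if ((blocks = calloc (nblocks, sizeof (iov_t))) == NULL) {
        return (ENOMEM);
    }

    // pull the file's data out of the on-disk .tar
    if ((fd = gzopen (a -> type.vfile.name, "r")) == NULL) {
        free (blocks);
        return (errno);
    }
    gzseek (fd, a -> type.vfile.off, SEEK_SET);
    for (i = 0; i < nblocks; i++) {
        SETIOV (&blocks [i], cfs_block_alloc (a), BLOCKSIZE);
        gzread (fd, blocks [i].iov_base, BLOCKSIZE);
    }
    gzclose (fd);

    // from here on, this entry behaves like a RAM-disk file
    a -> type.fileblocks = blocks;
    a -> nels = a -> nalloc = nblocks;
    return (EOK);
}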

The analyze_tar_file() function

At the highest level, the analyze_tar_file() function opens the .tar file, processes each file inside by calling add_tar_entry(), and then closes the .tar file. There's a wonderful library called zlib, which lets us open even compressed files and pretend that they are just normal, uncompressed files. That's what gives us the flexibility to open either a .tar or a .tar.gz file with no additional work on our part. (The limitation of the library is that seeking may be slow, because decompression may need to occur.)

int
analyze_tar_file (cfs_attr_t *a, char *fname)
{
    gzFile  fd;
    off_t   off;
    ustar_t t;
    int     size;
    int     sts;
    char    *f;

    // 1) the .tar (or .tar.gz) file must exist :-)
    if ((fd = gzopen (fname, "r")) == NULL) {
        return (errno);
    }

    off = 0;
    f = strdup (fname);

    // 2) read the 512-byte header into "t"
    while (gzread (fd, &t, sizeof (t)) > 0 && *t.name) {
        dump_tar_header (off, &t);

        // 3) get the size
        sscanf (t.size, "%o", &size);
        off += sizeof (t);

        // 4) add this entry to the database
        if (sts = add_tar_entry (a, off, &t, f)) {
            gzclose (fd);
            return (sts);
        }

        // 5) skip the data for the entry
        off += ((size + 511) / 512) * 512;
        gzseek (fd, off, SEEK_SET);
    }
    gzclose (fd);

    return (EOK);
}

The code walkthrough is:

  1. The zlib library makes things look just like an fopen() call.
  2. We read each header into the variable t, and optionally dump the header if debugging is enabled.
  3. We convert the ASCII octal size of the file's data (which follows the header), and advance the offset past the header itself.
  4. The real work is done in add_tar_entry().
  5. The best part is that we skip the file content, which makes loading fast.

In step 5 we skip the file content. I'm surprised that not all of today's tar utilities do this when they're dealing with files — doing a tar tvf to get a listing of the tar file takes forever for huge files!

The add_tar_entry() function

Header analysis is done by the add_tar_entry() function. If this looks a little bit familiar it's because the code is based on the pathwalk() function from the RAM-disk resource manager, with logic added to process the tar file header.

static int
add_tar_entry (cfs_attr_t *a, off_t off, ustar_t *t, char *tarfile)
{
  des_t   output [_POSIX_PATH_MAX];
  int     nels;
  char    *p;
  int     i;
  int     mode;

  // 1) first entry is the mount point
  output [0].attr = a;
  output [0].name = NULL;

  // 2) break apart the pathname at the slashes
  nels = 1;
  for (p = strtok (t -> name, "/"); p; p = strtok (NULL, "/"), nels++) {
    if (nels >= _POSIX_PATH_MAX) {
      return (E2BIG);
    }
    output [nels].name = p;
    output [nels].attr = NULL;
  }

  // 3) analyze each pathname component
  for (i = 1; i < nels; i++) {

    // 4) sanity check
    if (!S_ISDIR (output [i - 1].attr -> attr.mode)) {
      return (ENOTDIR);       // effectively an internal error
    }

    // 5) check to see if the element exists...
    if (!(output [i].attr = search_dir (output [i].name,
                                        output [i-1].attr))) {
      mode = parse_mode (t);

      // 6) intermediate directory needs to be created...
      if (S_ISDIR (mode) || (i + 1 < nels)) {
        output [i].attr = cfs_a_mkdir (output [i - 1].attr,
                                       output [i].name, NULL);
        tar_to_attr (t, &output [i].attr -> attr);
        // kludge for implied "directories"
        if (S_ISREG (output [i].attr -> attr.mode)) {
          output [i].attr -> attr.mode =
            (output [i].attr -> attr.mode & ~S_IFREG) | S_IFDIR;
        }

      // 7) add a regular file
      } else if (S_ISREG (mode)) {
        output [i].attr = cfs_a_mkfile (output [i - 1].attr,
                                        output [i].name, NULL);
        tar_to_attr (t, &output [i].attr -> attr);
        output [i].attr -> nels = output [i].attr -> nalloc = 1;
        output [i].attr -> type.vfile.name = tarfile;
        output [i].attr -> type.vfile.off = off;

      // 8) add a symlink
      } else if (S_ISLNK (mode)) {
        output [i].attr = cfs_a_mksymlink (output [i - 1].attr,
                                           output [i].name, NULL);
        tar_to_attr (t, &output [i].attr -> attr);
        output [i].attr -> type.symlinkdata = strdup (t -> linkname);
        output [i].attr -> attr.nbytes = strlen (t -> linkname);

      } else {
        // code prints an error message here...
        return (EBADF);         // effectively an internal error
      }
    }
  }

  return (EOK);
}

The code walkthrough is:

  1. Just as in the RAM disk's pathwalk(), the first entry in the output array is the mount point.
  2. And, just as in the RAM disk's pathwalk(), we break apart the pathname at the / delimiters.
  3. Then we use the for loop to analyze each and every pathname component.
  4. This step is a basic sanity check — we're assuming that the .tar file came from a normal, sane, and self-consistent filesystem. If that's the case, then the parent of an entry will always be a directory; this check ensures that that's the case. (It's a small leftover from pathwalk().)
  5. Next we need to see if the component exists. Just as in pathwalk(), we need to be able to verify that each component in the pathname exists, not only for sanity reasons, but also for implied intermediate directories.
  6. In this case, the component does not exist, and either the header says it's a directory, or there are further pathname components after it. The latter is the case of the implied intermediate directory; we simply create an intermediate directory and “pretend” that everything is just fine.
  7. In this case, the component does not exist, and it's a regular file. We just go ahead and create the regular file. Note that we don't check to see if there are any components following this regular file, for two reasons. In a sane, consistent filesystem, this wouldn't happen anyway (so we're really not expecting this case in the .tar file). More importantly, that error will be caught in step 4 above, when we go to process the next entry and discover that its parent isn't a directory.
  8. In this case, we create a symbolic link.
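
Two helpers do the conversion work in the code above: parse_mode() extracts the entity type and permissions, and tar_to_attr() copies the header fields into the base attributes structure. They aren't shown in this walkthrough; a rough sketch of them follows (the real versions in tarfs.c may differ, and probably also consult the typeflag field to cope with the POSIX/GNU differences mentioned earlier):

// Sketches of the helpers used above; the real versions in tarfs.c
// may differ (for example, by also examining t -> typeflag).
static int
parse_mode (ustar_t *t)
{
    unsigned    mode;

    sscanf (t -> mode, "%o", &mode);
    return (mode);
}

static void
tar_to_attr (ustar_t *t, iofunc_attr_t *attr)
{
    unsigned    v;

    sscanf (t -> mode, "%o", &v);   attr -> mode = v;
    sscanf (t -> uid, "%o", &v);    attr -> uid = v;
    sscanf (t -> gid, "%o", &v);    attr -> gid = v;
    sscanf (t -> size, "%o", &v);   attr -> nbytes = v;
    sscanf (t -> mtime, "%o", &v);
    attr -> mtime = attr -> atime = attr -> ctime = v;
}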

The io_read() function and related utilities

The .tar filesystem's io_read() is the standard one that we've seen in the RAM disk — it decides if the request is for a file or a directory, and calls the appropriate function.

The .tar filesystem's tarfs_io_read_dir() is the exact same thing as the RAM disk version — after all, the directory entry structures in the extended attributes structure are identical.

The only function that's different is the tarfs_io_read_file() function to read the data from the .tar file on disk.

int
tarfs_io_read_file (resmgr_context_t *ctp, io_read_t *msg,
iofunc_ocb_t *ocb)
{
  int     nbytes;
  int     nleft;
  iov_t   *iovs;
  int     niovs;
  int     i;
  int     pool_flag;
  gzFile  fd;

  // we don't do any xtypes here...
  if ((msg -> i.xtype & _IO_XTYPE_MASK) != _IO_XTYPE_NONE) {
    return (ENOSYS);
  }

  // figure out how many bytes are left
  nleft = ocb -> attr -> attr.nbytes - ocb -> offset;

  // and how many we can return to the client
  nbytes = min (nleft, msg -> i.nbytes);

  if (nbytes) {

    // 1) open the on-disk .tar file
    if ((fd = gzopen (ocb -> attr -> type.vfile.name, "r")) == NULL) {
      return (errno);
    }

    // 2) calculate number of IOVs required for transfer
    niovs = (nbytes + BLOCKSIZE - 1) / BLOCKSIZE;
    if (niovs <= 8) {
      iovs = mpool_malloc (mpool_iov8);
      pool_flag = 1;
    } else {
      iovs = malloc (sizeof (iov_t) * niovs);
      pool_flag = 0;
    }
    if (iovs == NULL) {
      gzclose (fd);
      return (ENOMEM);
    }

    // 3) allocate blocks for the transfer
    for (i = 0; i < niovs; i++) {
      SETIOV (&iovs [i], cfs_block_alloc (ocb -> attr), BLOCKSIZE);
      if (iovs [i].iov_base == NULL) {
        for (--i ; i >= 0; i--) {
          cfs_block_free (ocb -> attr, iovs [i].iov_base);
        }
        gzclose (fd);
        return (ENOMEM);
      }
    }

    // 4) trim last block to correctly read last entry in a .tar file
    if (nbytes & (BLOCKSIZE - 1)) {
      iovs [niovs - 1].iov_len = nbytes & (BLOCKSIZE - 1);
    }

    // 5) get the data
    gzseek (fd, ocb -> attr -> type.vfile.off + ocb -> offset, SEEK_SET);
    for (i = 0; i < niovs; i++) {
      gzread (fd, iovs [i].iov_base, iovs [i].iov_len);
    }
    gzclose (fd);

    // return it to the client
    MsgReplyv (ctp -> rcvid, nbytes, iovs, i);

    // update flags and offset
    ocb -> attr -> attr.flags |= IOFUNC_ATTR_ATIME
                              | IOFUNC_ATTR_DIRTY_TIME;
    ocb -> offset += nbytes;
    for (i = 0; i < niovs; i++) {
      cfs_block_free (ocb -> attr, iovs [i].iov_base);
    }
    if (pool_flag) {
      mpool_free (mpool_iov8, iovs);
    } else {
      free (iovs);
    }
  } else {
    // nothing to return, indicate End Of File
    MsgReply (ctp -> rcvid, EOK, NULL, 0);
  }

  // already done the reply ourselves
  return (_RESMGR_NOREPLY);
}

Many of the steps here are common with the RAM disk version, so only steps 1 through 5 are documented here:

  1. Notice that we keep the .tar on-disk file closed, and open it only as required. This is an area for improvement, in that you might find it slightly faster to have a certain cache of open .tar files, and maybe rotate them on an LRU-basis. We keep it closed so we don't run out of file descriptors; after all, you can mount hundreds (up to 1000 — a Neutrino limit) of .tar files with this resource manager (see the note below).
  2. We're still dealing with blocks, just as we did in the RAM-disk filesystem, because we need a place to transfer the data from the disk file. We calculate the number of IOVs we're going to need for this transfer, and then allocate the iovs array.
  3. Next, we call cfs_block_alloc() to get blocks from the block allocator, then we bind them to the iovs array. In case of a failure, we free all the blocks and fail ungracefully. A better failure mode would have been to shrink the client's request size to what we can handle, and return that. However, when you analyze this, the typical request size is 32k (8 blocks), and if we don't have 32k lying around, then we might have bigger troubles ahead.
  4. The last block is probably not going to be exactly 4096 bytes in length, so we need to trim it. Nothing bad would happen if we were to gzread() the extra data into the end of the block — the client's transfer size is limited to the size of the resource stored in the attributes structure. So I'm just being extra paranoid.
  5. And in this step, I'm being completely careless; we simply gzread() the data with no error-checking whatsoever into the blocks! :-)

The rest of the code is standard; return the buffer to the client via MsgReplyv(), update the access flags and offset, free the blocks and IOVs, etc.


Note: In step 1, I mentioned a limit of 1000 open file descriptors. This limit is controlled by the -F parameter to procnto (the kernel). In version 6.2.1 of Neutrino, whatever value you pass to -F is the maximum (and default), and you cannot go higher than that value. In version 6.3 of Neutrino, whatever value you pass to -F is the default, and you can go higher. You can change the value (to be lower in 6.2.1, or to be lower or higher in 6.3) via the setrlimit() function, using the RLIMIT_NOFILE resource constant.
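
For what it's worth, the caching improvement mentioned in step 1 doesn't take much code. Here's a sketch of a small cache of open handles, keyed by .tar filename and recycled on a crude least-recently-used basis. The cached_gzopen() name is made up and none of this is in the shipped source; tarfs_io_read_file() would call it in place of gzopen(), and would skip the gzclose().

// Sketch: a tiny cache of open gzFile handles, keyed by .tar filename
// and recycled least-recently-used first.  Not in the shipped source.
#define FD_CACHE_SIZE   8

typedef struct fd_cache_s
{
    char    *name;          // which .tar file this handle belongs to
    gzFile  fd;
    int     stamp;          // last-use counter, for LRU replacement
} fd_cache_t;

static fd_cache_t   fd_cache [FD_CACHE_SIZE];
static int          fd_stamp;

static gzFile
cached_gzopen (const char *name)
{
    int     i;
    int     victim = 0;

    for (i = 0; i < FD_CACHE_SIZE; i++) {
        // already open?  just bump its stamp and hand it back
        if (fd_cache [i].name && !strcmp (fd_cache [i].name, name)) {
            fd_cache [i].stamp = ++fd_stamp;
            return (fd_cache [i].fd);
        }
        if (fd_cache [i].stamp < fd_cache [victim].stamp) {
            victim = i;
        }
    }

    // not cached; evict the least-recently-used slot and reuse it
    if (fd_cache [victim].name) {
        gzclose (fd_cache [victim].fd);
        free (fd_cache [victim].name);
    }
    fd_cache [victim].fd = gzopen (name, "r");
    fd_cache [victim].name = fd_cache [victim].fd ? strdup (name) : NULL;
    fd_cache [victim].stamp = ++fd_stamp;
    return (fd_cache [victim].fd);
}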

The mount helper program

One of the things that makes the .tar filesystem easy to use is the mount helper program. For example, I snarf newsgroups and store them in compressed .tar archives. So, I might have directories on my machine like /news/comp/os/qnx. As I accumulate articles, I store them in batches, say every 1000 articles. Within that directory, I might have files like 003xxx.tar.gz and 004xxx.tar.gz which are two compressed .tar files containing articles 3000 through 3999 and 4000 through 4999.

Using the .tar filesystem resource manager, I'd have to specify the following command lines to mount these two files:

mount -T tarfs /news/comp/os/qnx/003xxx.tar.gz 003xxx.tar.dir
mount -T tarfs /news/comp/os/qnx/004xxx.tar.gz 004xxx.tar.dir

That's not too bad for a handful of files, but it gets to be a real pain when you have hundreds of files.

The find fans will of course realize that it could be done “easily” using find:

find /news/comp/os/qnx -type f -name "*.tar.gz" \
    -exec mount -T tarfs {} {}.dir \;

Which is ever-so-slightly different (the {}.dir creates an absolute path rather than a relative path, but that's okay).

However, for casual use, it would be really nice to just say:

mount_tarfs -m *.tar.gz

to mount all of the compressed archives in the current directory.

This is the job of the mount helper program. The mount helper is actually used in two modes. In one mode (as seen above) it's invoked standalone. In the second mode, it's invoked automatically by Neutrino's mount command:

mount -T tarfs source.tar[.gz] [dirname]

The code for mount_tarfs is remarkably simple: main() does the usual call to the option-processing function, and then we determine whether we're mounting a single file or multiple files. We optionally generate the target's extension if required, and call the library function mount().
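
The attach itself boils down to one call to mount() per archive; something like this sketch (assuming the Neutrino mount() prototype from <sys/mount.h>, and ignoring the option-string handling that the real m_main.c does):

#include <sys/mount.h>

// Sketch: what mount_tarfs ultimately does for each archive.  The
// "tarfs" type string is what our resource manager registered for.
static int
do_mount (const char *tarfile, const char *dirname)
{
    return (mount (tarfile, dirname, 0, "tarfs", NULL, 0));
}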

Variations on a theme

The original RAM-disk filesystem was used to develop a framework that would support enough of the resource-manager functions to be useful. This framework was then adapted to the .tar filesystem, with very few changes.

In this section, I'd like to give you some further ideas about what can be done with the framework and with filesystems (virtual or otherwise) in general. There are several filesystems discussed here:

A virtual filesystem for USENET news (VFNews)
Strange and unusual filesystems (an indexed filesystem and command-executing symlinks)
A secure (encrypted) filesystem
A line-based filesystem

Virtual filesystem for USENET news (VFNews)

I have created a virtual filesystem for USENET news (“VFNews”) under QNX 4, but I haven't ported it to Neutrino.

I'll describe what VFNews does and how it works. Most news services these days are NNTP-based, high-speed, article-at-a-time services, rather than the traditional bulk systems they were up until the mid-1990s, so the actual end product is of limited interest. However, you'll see an interesting way of increasing the efficiency of a processing method by several orders of magnitude.

How does USENET news work?

People from around the world post messages (articles) to the various newsgroups. Their news system then distributes these articles to neighboring machines. Neighboring machines distribute the articles to their neighbors, and so on, until the articles have propagated all the way around the world. Machines check the incoming articles to see if they already have a copy, and quietly delete duplicates.

Historically, a program like cnews was responsible for the on-disk management of the news articles, and performed two major operations:

  1. Get the news articles, and store them on disk
  2. Delete old news articles.

Let's look at a typical system. As each article arrives (whether by UUCP, NNTP, or some other means), the article's “header” portion is scanned, and the news software determines where (i.e. into which newsgroups) that article should be stored.

A long time ago, when there wasn't all that much news traffic, it seemed like a good idea to just store one article per file. The newsgroup names got converted into pathnames, and everything was simple. For example, if I had an incoming article for comp.os.qnx, I would pick the next article number for that newsgroup (say 1143), and store the new article in a file called /var/spool/news/comp/os/qnx/1143. (The /var/spool/news part is just the name of the directory where all of the incoming news articles live — it's up to each site to determine which directory that is, but /var/spool/news is common.)

The next article that came in for that newsgroup would go into /var/spool/news/comp/os/qnx/1144, and so on.

So why is this a problem?

There are a number of reasons why this isn't an ideal solution. Articles can arrive in any order — we're not always going to get all of the articles for comp.os.qnx, then all of the articles for comp.os.rsx11, then comp.os.svr4, etc. This means that as the articles arrive, the news storage program is creating files in an ad-hoc manner, all over the disk. Not only that, it's creating from a few hundred thousand to many millions of files per day, depending on the size of the feed! (This works out to tens to hundreds or more files per second! The poor disk — and filesystem — is getting quite a workout.)

Given the volume of news these days, even terabyte-sized disks would fill up fairly quickly. So, all news systems have an expiry policy, which describes how long various articles hang around before being deleted. This is usually tuned based on the newsgroup, and can range from a few days to weeks or even months for low-traffic newsgroups. With current implementations, the expiry processing takes a significant amount of time; sometimes, the expiry processing will take so long that it appears that the machine is doing nothing but expiring!

The problem is that tons of files are being created in random places on the disk each second, and roughly the same number of files are being deleted from other random places each second. This is suboptimal, and is exacerbated by the fact that each article even gets copied around a few times before ending up in its final location.

How can this possibly be made better?

As you are no doubt suspecting, we can do much better by knowing something about the problem domain, and putting our RAM-disk filesystem framework to good use.

One of the requirements, though, is that we still maintain the same directory structure as the old systems (i.e. the /var/spool/news structure).

Operation

As mentioned above, there are two main things that happen in the news-processing world: news comes in, and news expires.

The first trick is to realize that most news expires unread, and all news expires at some point. So by looking at the header for the article, we can determine when the article will expire, and place it in a file with all the other articles that will expire at the same time:

/var/spool/vfnews/20030804.news
/var/spool/vfnews/20030805.news
/var/spool/vfnews/20030806.news

Here we have three files, with the filenames representing the date that the batch of news articles expires, ranging from August 4, 2003 to August 6, 2003.

All we need to do is make a virtual filesystem that knows how to index into a real on-disk file, at a certain offset, for a certain length, and present those contents as if they were the real contents of the virtual file. Well, we've just done the exact same thing with the .tar filesystem! Effectively, (with a few optimizations) what we're doing is very similar to creating several different .tar files, and placing articles into those files. The files are named based on when they expire. When an article is added to the end of a file, we track its name (like /var/spool/news/comp/os/qnx/1145), the name of the bulk file we put it into, and its offset and size. (Some of the optimizations stem from the fact that we don't need to be 512-byte aligned, and that we don't need a header in the article's storage file.)


Figure: Layout of a news bulk file. Articles from different newsgroups are stored in a bulk file that expires on August 4, 2003.
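
The bookkeeping for each virtual article is tiny; conceptually it's just the .tar filesystem's fes_t with a length added. Here's a sketch of the idea, not the actual QNX 4 implementation:

// Sketch: the per-article bookkeeping VFNews keeps in memory;
// conceptually the .tar filesystem's fes_t, minus the 512-byte
// alignment and header requirements.
typedef struct article_s
{
    char    *vname;         // e.g. /var/spool/news/comp/os/qnx/1145
    char    *bulkfile;      // e.g. /var/spool/vfnews/20030804.news
    off_t   off;            // where the article starts in the bulk file
    size_t  size;           // how long the article is
} article_t;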

When the time comes to expire old news, we simply remove the bulk file and any in-memory references to it.

This is the real beauty of this filesystem, and why we gain so much improvement by doing things this way rather than the original way. We've changed our disk access from writing tons of tiny files all over the disk to writing sequentially to a few very large files. For expiration, we've done the same thing — we've changed from deleting tons of tiny files all over the disk to deleting one very large file when it expires. (The in-memory virtual filesystem deletion of the files is very fast; orders of magnitude faster than releasing blocks on disk.)


Note: To give you an idea of the efficiency of this approach, consider that this system was running on a QNX 4 box in the early 1990s, with a 386 at 20 MHz, and a “full” news feed (19.2 kilobaud Trailblazer modem busy over 20 hours per day). When running with cnews, the hard disk was constantly thrashing. When running with VFNews, there was no disk activity, except every five seconds when a cache was flushed. In those days, system administrators would run cnews “expiry” once a day because of the overhead. I was able to run it once per hour with no noticeable impact on the CPU/disk! Also, ISPs would replace their hard disks every so often due to the amount of disk thrashing that was occurring.

Strange and unusual filesystems

Another very interesting thing that you can do with filesystems is completely abuse the assumptions about what a filesystem can do, and thus come up with strange and unusual filesystems. (Cue Twilight Zone theme music…)

The easiest entity to abuse is the symbolic link. Effectively, it's a back door into the filesystem. Since you control what happens in the c_link() entry point when the symlink is created, you control the interpretation of the symlink. This gives tremendous potential for abuse — as I like to quote Dr Seuss:

Then he got an idea!

An awful idea!

The Grinch got a wonderful, awful idea!

Indexed filesystem

For example, we could implement what can generally be called an indexed filesystem. This shares the characteristics of the .tar and VFNews filesystems. In an indexed filesystem, you create a symlink that contains the name of the file being accessed, the start offset, and the size.

For example:

# ln -s @/etc/data:1024:44 spud

This creates a symlink with the value @/etc/data:1024:44. If a regular filesystem tried to open this, it would yield an ENOENT immediately, as there is no path that starts with an at-sign.

However, since we control the c_link() function call, we can look at the value of the symlink, and determine that it's something special. The @ is our escape character. When you process the c_link() function call, instead of creating a symlink as you would normally, you can create anything you like. In this case, we'd create a plain, ordinary-looking file, that internally had the information that we passed to the symlink.

You can now see how this worked in the .tar filesystem — the information that we stored in the extended attributes structure told us the name of the on-disk file (the .tar or .tar.gz file) and the offset into that file where the data begins; the regular (non-extended) attributes structure gave us the length.

It's a “simple matter of programming” for you to do the same thing with the indexed filesystem. The only funny thing that happens, though, is that after we create the special symlink, the file looks like a regular file:

# ln -s @/etc/data:1024:44 spud
# ls -lF spud
-r--r--r--  1 root    root    44 Aug 16 17:41 spud

Normally, you'd expect to see:

# ls -lF spud
-r--r--r--  1 root  root  18 Aug 16 17:41 spud@ -> @/etc/data:1024:44

but since we converted the symlink to a plain file in the c_link() function, you'll never see that.
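
As a sketch of what the c_link() side of this might look like, here's a helper that recognizes and decodes the special symlink value. The helper name and its interface are made up; creating the ordinary-looking file entry from the result is the part that depends on your extended attributes structure.

// Sketch: decode an "indexed" symlink value of the form
// "@<file>:<offset>:<length>".  Returns 1 if it's ours, 0 if it's a
// normal symlink, -1 if it's malformed.  Names are illustrative only.
static int
parse_indexed_link (const char *value, char *path, size_t pathlen,
                    off_t *off, off_t *len)
{
    const char  *p;
    const char  *q;

    if (*value != '@') {
        return (0);                     // not ours; normal symlink
    }
    p = strchr (value + 1, ':');        // end of the backing filename
    q = p ? strchr (p + 1, ':') : NULL; // separates offset from length
    if (q == NULL) {
        return (-1);
    }
    snprintf (path, pathlen, "%.*s", (int) (p - (value + 1)), value + 1);
    *off = strtoll (p + 1, NULL, 10);
    *len = strtoll (q + 1, NULL, 10);
    return (1);
}

For the example above, it would yield /etc/data, an offset of 1024, and a length of 44.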

Executing commands

Don't despair, further abuses are possible!

We can create a symlink that has a ! or | as the first character. We can take this to mean, “Execute the following command and return the standard output of the command as the file content” (in the case of !) or “When writing data to the file, pipe it through the following command” (in the case of |).

Or, with the creative use of shell escape sequences, you can have the filename in the symlink actually be the name of a command to execute; the standard output of that command is the filename that gets used as the actual value of the symlink:

# ln -s \"redirector\" spud
# ls -l spud
-r--r--r--  1 root  root   44 Aug 16 17:41 spud@ -> /dev/server1
# ls -l spud
-r--r--r--  1 root  root   44 Aug 16 17:41 spud@ -> /dev/server2
# ls -l spud
-r--r--r--  1 root  root   44 Aug 16 17:41 spud@ -> /dev/server3

In this manner, you could implement some load-sharing functionality all by using what appears to be a “normal” symlink (the double quote character in "redirector" is what's used to tell the c_link() code that this command's standard output should be used as the value of the symlink). Our little program, redirector, simply has a printf() that outputs different server names.

And the fun doesn't end there, because when you're processing your c_link() handler, you have access to all of the client's information — so you can make decisions based on who the client is, what node they are based on, etc.

You're limited purely by your imagination and what you can get away with at a code review.

Secure filesystem

A more practical filesystem, however, is a secure (or encrypted) filesystem. In this filesystem, you use the underlying facilities of the disk-based filesystem for your backing store (the place where you actually store the data), and you present an encryption/decryption layer on top of the backing store's filesystem.

In essence, this is a modification of the .tar filesystem (in that we aren't actually storing the files in memory as in the RAM disk), with the added challenge that we are also allowing writes as well as file/directory creation.

An interesting facet of this filesystem would be to encrypt the filenames themselves. You could use something like the Rijndael/AES algorithm for all of your encryption, and then use base-64 encoding on the resulting (binary) filenames so that they can be stored by a conventional filesystem.

The reason you'd want to encrypt the filenames as well is to prevent “known plain-text” attacks. For example, if you didn't encrypt the filenames, an intruder could look at the files, and seeing a whole bunch of .wav files, could immediately deduce that there's a standard header present, giving them a portion of your file in plain text with which to begin their attack.

Line-based filesystem

Another potentially useful spin on filesystems is to make one that's explicitly able to manage text files. Whether you do it with a backing store (on-disk) implementation, or as a RAM disk, depends on your requirements. The characteristic of text files is that they are sequences of line-feed-separated entities, with the data on each line somewhat independent of the data on the previous or following lines. Consider a form-letter, where the only things that might change are the salutation, the account number, and perhaps an amount that's already overdue.

By creating a line-based filesystem, you could change the dynamic components of the form-letter without having to readjust all of the other bytes in the file. If you combine this with something like HTML or PostScript, you can start to see the potential.

The basis of line-based filesystems lies in the same technology used in text editors: an array of pointers-to-characters forms the “lines.” What each pointer points to is a complete line of text, and that's where you'd plug in your dynamic content. The rest of the lines wouldn't need to change. By implementing it as an array of IOVs, you can pass this directly to the MsgReplyv() function that you'd use to return the data to the client in your io_read() handler.
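
Here's a sketch of the idea (the names are made up): each line of the file is one IOV, replacing a line is a pointer swap, and the reply is the array itself.

// Sketch: a text file held as one iov_t per line (each line includes
// its trailing newline).  Names here are illustrative only.
typedef struct textfile_s
{
    iov_t   *lines;         // one IOV per line
    int     nlines;
} textfile_t;

// Point line 'lineno' at new text; the caller keeps 'text' valid.
static void
set_line (textfile_t *tf, int lineno, char *text)
{
    SETIOV (&tf -> lines [lineno], text, strlen (text));
}

// In the io_read() handler (ignoring offsets and sizes for clarity),
// the whole file goes back to the client in one shot:
//     MsgReplyv (ctp -> rcvid, total_nbytes, tf -> lines, tf -> nlines);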

References

Dr. Dobb's Journal, May 1996, “Improving USENET News Performance” describes the original design of the Virtual Filesystem for USENET News.

Header files

The following header files contain useful things for filesystem handlers:

<devctl.h>
Contains the definition for devctl(), and also defines the component flags used to create a command.
<dirent.h>
Contains the directory structure type used by readdir().
<sys/dcmd_blk.h>
Contains the FSYS devctl() block commands.
<sys/disk.h>
Defines partition_entry_t.
<sys/dispatch.h>, <sys/iofunc.h>
Used by resource managers.
<sys/fs_stats.h>
Defines the fs_stats structure returned by the FSYS block command DCMD_FSYS_STATISTICS.
<sys/mount.h>
Defines the mount flags.
<tar.h>
Defines the layout of the .tar header.
<zlib.h>
Contains the definitions for the transparent compression library.

Functions

See the following functions:

as well as the following in the Neutrino C Library Reference:

Books

An excellent introduction to cryptography is Applied Cryptography — Protocols, Algorithms, and Source Code in C by Bruce Schneier (John Wiley & Sons, Inc., 1996, ISBN 0-471-11709-9)