Fine-Tuning Your System

This chapter includes:

Getting the system's status
Improving performance
Faster boot times
Filesystems and block I/O (devb-*) drivers
How small can you get?

Getting the system's status

Neutrino includes the following utilities that you can use to fine-tune your system:

hogs: List the processes that are hogging the CPU
pidin (Process ID INfo): Display system statistics
ps: Report process status
top: Display system usage (Unix)

For details about these utilities, see the Utilities Reference.

For more detailed and accurate data, use tracelogger and the System Analysis Toolkit (see the SAT User's Guide). The SAT logs kernel events, the changes to your system's state, using a specially instrumented version of the kernel (procnto*-instr).

If you have the Integrated Development Environment on your system, you'll find that it's the best tool for determining how you can improve your system's performance. For more information, see the IDE User's Guide.

Improving performance

If you run hogs, you'll get a rough idea of which processes are using the most CPU time. For example:

```
$ hogs -n -% 5
PID             NAME  MSEC  PIDS SYSTEM
1                     1315   53%    43%
6          devb-eide   593   24%    19%
54358061        make   206    8%     6%

1                     2026   83%    67%
6          devb-eide   294   12%     9%

1                     2391   75%    79%
6          devb-eide   335   10%    11%
54624301   htmlindex   249    7%     8%

1                     1004   24%    33%
54624301   htmlindex  2959   71%    98%

54624301   htmlindex  4156   96%   138%

54624301   htmlindex  4225   96%   140%

54624301   htmlindex  4162   96%   138%

1                       71   35%     2%
6          devb-eide    75   37%     2%

1                     3002   97%   100%
```

Let's look at this output. The first iteration indicates that process 1 is using 53% of the CPU. Process 1 is always the process manager, procnto. In this case, it's the idle thread that's using most of the CPU. The entry for devb-eide reflects disk I/O. The make utility is also using the CPU.

In the second iteration, procnto and devb-eide use most of the CPU, but the next few iterations show that htmlindex (a program that creates the keyword index for our online documentation) gets up to 96% of the CPU. When htmlindex finishes running, procnto and devb-eide use the CPU while the HTML files are written. Eventually, procnto — including the idle thread — gets almost all of the CPU.

You might be alarmed that htmlindex takes up to 96% of the CPU, but it's actually a good thing: if you're running only one program, it should get most of the CPU time.

If your system is running several processes at once, hogs could be more useful. It can tell you which of the processes is using the most CPU, and then you could adjust the priorities to favor the threads that are most important. (Remember that in Neutrino, priorities are a property of threads, not of processes.) For more information, see “Priorities” in the Using the Command Line chapter.
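Because priorities belong to threads, a program can also adjust its own threads from within. The following is a minimal sketch, assuming the standard POSIX calls pthread_getschedparam() and pthread_setschedparam(); the helper names clamp_priority() and adjust_own_priority() are hypothetical and exist only for illustration. Whether the change is permitted depends on the scheduling policy and on the caller's privileges.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <string.h>

/* Keep a requested priority inside the range [lo, hi]. */
int clamp_priority(int wanted, int lo, int hi)
{
    if (wanted < lo)
        return lo;
    if (wanted > hi)
        return hi;
    return wanted;
}

/*
 * Hypothetical helper: change the calling thread's priority by `delta`,
 * keeping its current scheduling policy. Returns 0 on success or an
 * errno-style code (for example EPERM if the caller lacks permission).
 */
int adjust_own_priority(int delta, int *new_priority)
{
    int policy = SCHED_OTHER;
    struct sched_param param;
    memset(&param, 0, sizeof param);

    int rc = pthread_getschedparam(pthread_self(), &policy, &param);
    if (rc != 0)
        return rc;

    /* Ask the system which priorities are valid for this policy. */
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);
    if (lo == -1 || hi == -1)
        return errno;

    param.sched_priority =
        clamp_priority(param.sched_priority + delta, lo, hi);

    rc = pthread_setschedparam(pthread_self(), policy, &param);
    if (rc == 0 && new_priority != NULL)
        *new_priority = param.sched_priority;
    return rc;
}
```

A thread that does time-critical work might call adjust_own_priority(1, NULL) at startup and check the return code; clamping to the range reported by the system avoids requesting a value the policy doesn't allow.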

Here are some other tips to help you improve your system's performance:

You can use pidin to get information about the processes that are running on your system. For example, you can get the arguments used when starting the process, the state of the process's threads, and the memory that the process is using.
The number of threads doesn't affect system reaction time as much as the number of threads at a given priority. The key to performing realtime operations properly is to set up your realtime threads with the priorities required to ensure the system response that you need.
Do you need to run Photon? If not, you can prevent Photon from starting when you boot. Type:
```
touch /etc/system/config/nophoton
```
and reboot. This reduces the number of processes that the system runs when it starts.

Faster boot times

Here are a few tips to help you speed up booting:

If your system's setup is static, you can set up its device drivers yourself, instead of running the enumerators.
Remove as much as you can from the system-initialization files, and from the OS image if necessary.

For more information, see the Controlling How Neutrino Starts chapter in this guide.

Filesystems and block I/O (`devb-*`) drivers

Here are the basic steps to improving the performance of your filesystems and block I/O (devb-*) drivers:

Optimize disk hardware and driver options. This is most important on non-x86 targets and systems without hard drives (e.g. Microdrive, Compact Flash). Not using the fastest available DMA mode (or degrading to PIO) can easily affect the speed by a factor of ten. For more information, see Connecting Hardware.
Optimize the filesystem options:
- Determine how you want to balance system robustness and performance (see below).
- Concentrate on the cache and vnode (filesystem-independent inodes) options; the other sizes scale themselves to these.
- The default cache is 15% of the total system RAM, to a maximum of 512 MB. This is too large for floppy drivers (devb-fdc) and RAM drivers (devb-ram), but might be too small for intensive use.
- Set the commit option (either globally or as a mount option) to force or disable synchronous writes.
- Consider using a RAM disk for temporary files (e.g. /tmp).
Optimize application code:
- Read and write in large chunks (16–32 KB is optimal).
- Read and write in multiples of a disk block on block boundaries (typically 512 bytes, but you can use stat() or statvfs() to determine the value at runtime).
- Avoid standard I/O where possible; use open(), read(), and write(), instead of fopen(), fread(), and fwrite(). The f* functions use an extra layer of buffering. The default size is given by BUFSIZ; you can use setvbuf() to specify a different buffer size.
- Pregrow files, if you know their ultimate sizes.
- Use direct I/O (DMA to user space).
- Use filenames that are no longer than 16 characters. If you do this, the filesystem won't use the .inodes file, so there won't be any inter-block references. In addition, there will be one less disk write, and hence, one less chance of corruption if the power fails.
  Long filenames (i.e. longer than 48 characters) especially slow down the filesystem.
- Use the -i option to dinit to pregrow the .inodes file, which eliminates the runtime window of manipulating its metadata during a potential power loss.
- Big directories are slower than small ones, because the filesystem uses a linear search.
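As a rough illustration of the block-size advice above, an application can ask stat() for the preferred block size and round its I/O chunk to a whole multiple of it. This is only a sketch: the helper names round_to_block() and pick_chunk_size() are hypothetical, and the 512-byte fallback is an assumption taken from the typical value mentioned in the list.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/stat.h>

/* Round `size` down to a whole multiple of `block` (at least one block). */
size_t round_to_block(size_t size, size_t block)
{
    if (block == 0)
        return size;
    size_t rounded = (size / block) * block;
    return rounded > 0 ? rounded : block;
}

/*
 * Hypothetical helper: pick an I/O chunk size near `target` bytes that is
 * a multiple of the preferred block size reported by stat() for `path`.
 * Falls back to 512-byte blocks if stat() fails.
 */
size_t pick_chunk_size(const char *path, size_t target)
{
    struct stat st;
    size_t block = 512;

    if (stat(path, &st) == 0 && st.st_blksize > 0)
        block = (size_t)st.st_blksize;
    return round_to_block(target, block);
}
```

For example, pick_chunk_size("/tmp", 32 * 1024) yields a buffer size close to the 16–32 KB range suggested above while staying aligned to the filesystem's block size.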

Performance and robustness

When you design or configure a filesystem, you have to balance performance and robustness:

Robustness involves synchronizing the user's operation with its implementation, and with the successful response to the user.
For example, the creation of a new file — via creat() — may perform all the physical disk writes that are necessary to add that new filename into a directory on the disk filesystem and only then reply back to the client.
Performance may decouple the actual implementation of the operation from the reply.
For example, writing data into a file — via write() — might immediately reply to the client, but leave the data in a write-behind in-memory cache in an attempt to merge with later writes and construct a large, contiguous run for a single sequential disk access (but until that occurs, the data is vulnerable to loss if the power fails).

You must decide on the balance between robustness and performance that's appropriate for your installation, expectations, and requirements.

Metadata updates

Metadata is data about data, or all the overhead and attributes involved in storing the user data itself, such as the name of a file, the physical blocks it uses, modification and access timestamps, and so on.

The most expensive operation of a filesystem is in updating the metadata. This is because:

The metadata is typically located on different disk cylinders from the data, and its own pieces (bitmaps, inodes, directory entries) are separated from one another, so updating it incurs seek delays.
The metadata is usually written to the disk with more urgency than user data (because the metadata affects the integrity of the filesystem structure) and hence may incur a transfer delay.

Almost all operations on the filesystem (even reading file data, unless you've specified the noatime option — see io-blk.so in the Utilities Reference) involve some metadata updates.

Ordering the updates to metadata

Some filesystem operations affect multiple blocks on disk. For example, consider the situation of creating or deleting a file. Most filesystems separate the name of the file (or link) from the actual attributes of the file (the inode); this supports the POSIX concept of hard links, multiple names for the same file.

Typically, the inodes reside in a fixed location on disk (the .inodes file for fs-qnx4.so, or in the header of each cylinder group for fs-ext2.so).

Creating a new filename thus involves allocating a free inode entry and populating it with the details for the new file, and then placing the name for the file into the appropriate directory. Deleting a file involves removing the name from the parent directory and marking the inode as available.

These operations must be performed in this order to prevent corruption should there be a power failure between the two writes; note that for creation the inode should be allocated before the name, as a crash would result in an allocated inode that isn't referenced by any name (an “orphaned resource” that a filesystem's check procedure can later reclaim). If the operations were performed the other way around and a power failure occurred, the result would be a name that refers to a stale or invalid inode, which is undetectable as an error. A similar argument applies, in reverse, for file deletion.

For traditional filesystems, the only way of ordering these writes is to perform the first one (or, more generally, all but the last one of a multiple-block sequence) synchronously (i.e. immediately and waiting for I/O to complete before continuing). A synchronous write is very expensive, because it involves a disk-head seek, interrupts any active sequential disk streaming, and blocks the thread until the write has completed — potentially milliseconds of dead time.

Throughput

Another key point is the performance of sequential access to a file, or raw throughput, where a large amount of data is written to a file (or an entire file is read). The filesystem itself can detect this type of sequential access and attempt to optimize the use of the disk, by doing:

read-ahead on reads, so that the disk is being accessed for the predicted new data while the user processes the original data
write-behind of writes to allow a large amount of dirty data to be coalesced into a single contiguous multiple-block write

The most efficient way of accessing the disk for high performance is through the standard POSIX routines that work with file descriptors (open(), read(), and write()), because these allow direct access to the filesystem with no interference from libc.

If you're concerned about performance, we don't recommend that you use the standard I/O (<stdio.h>) routines that work with FILE variables, because they introduce another layer of code and another layer of buffering. In particular, the default buffer size is BUFSIZ, or 1 KB, so all access to the disk is carved up into chunks of that size, causing a large amount of overhead for passing messages and switching contexts.

There are some cases when the standard I/O facilities are useful, such as when processing a text file one line or character at a time, in which case the 1 KB of buffering provided by standard I/O greatly reduces the number of messages to the filesystem. You can improve performance by using

setvbuf() to increase the buffering size
fileno() to access the underlying file descriptor directly and to bypass the buffering during performance-critical sections
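The two techniques just listed might look like the following sketch. The helper names open_with_buffer() and write_unbuffered() are hypothetical; only fopen(), setvbuf(), fflush(), fileno(), and write() are standard calls. Note that setvbuf() has to be called before any other operation on the stream, and that buffered output should be flushed before writing to the descriptor directly so that the bytes stay in order.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path` with standard I/O and replace the
 * default BUFSIZ buffer with a fully buffered one of `bufsize` bytes.
 * Passing NULL as the buffer lets the library allocate it.
 */
FILE *open_with_buffer(const char *path, const char *mode, size_t bufsize)
{
    FILE *fp = fopen(path, mode);
    if (fp == NULL)
        return NULL;
    if (setvbuf(fp, NULL, _IOFBF, bufsize) != 0) {
        fclose(fp);
        return NULL;
    }
    return fp;
}

/*
 * Write `len` bytes straight to the descriptor under `fp`, bypassing the
 * stdio buffer. Pending buffered output is flushed first so that the
 * bytes reach the file in order. Returns 0 on success, -1 on error.
 */
int write_unbuffered(FILE *fp, const void *data, size_t len)
{
    const char *p = data;

    if (fflush(fp) != 0)
        return -1;
    while (len > 0) {
        ssize_t n = write(fileno(fp), p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

A program could use ordinary fputs() or fprintf() calls for small headers and switch to write_unbuffered() for the large payload in a performance-critical section.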

You can also optimize performance by accessing the disk in suitably sized chunks: large enough to minimize the overheads of Neutrino's context-switching and message-passing, but not so large that they exceed disk driver limits for blocks per operation or incur overheads in large message-passing. An optimal size is 32 KB.

You should also access the file on block boundaries for whole multiples of a disk sector (since the smallest unit of access to a disk/block device is a single sector, partial writes will require a read/modify/write cycle); you can get the optimal I/O size by calling statvfs(), although most disks are 512 bytes/sector.
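To make the effect of chunk size concrete, here is a minimal sketch that reads a file with read() in chunks of a caller-chosen size and counts how many calls returned data. The helper name read_in_chunks() is hypothetical, and the call count is only a rough stand-in for the number of messages sent to the filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the whole of `path` with read() in chunks of
 * `chunk` bytes and return the number of bytes seen, or -1 on error.
 * If `calls` is non-NULL it receives the number of read() calls that
 * returned data.
 */
long long read_in_chunks(const char *path, size_t chunk, long *calls)
{
    char *buf = malloc(chunk);
    if (buf == NULL)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    long long total = 0;
    long count = 0;
    ssize_t n;
    while ((n = read(fd, buf, chunk)) > 0) {
        total += n;
        count++;
    }

    close(fd);
    free(buf);
    if (calls != NULL)
        *calls = count;
    return n < 0 ? -1 : total;
}
```

Calling it once with a 1 KB chunk and once with a 32 KB chunk on the same file shows the difference in the number of requests; the chunk size itself should be a multiple of the sector or block size so that every access starts on a block boundary.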

Finally, for very high performance situations (video streaming, etc.) it's possible to bypass all buffering in the filesystem and perform DMA directly between the user data areas and the disk. But note these caveats:

The disk and disk driver must support such access.
No coherency is offered between data transferred directly and any data in the filesystem buffer cache.
Some POSIX semantics (such as file access or modification time updates) are ignored.

We don't currently recommend that you use DMA unless absolutely necessary: not all disk drivers correctly support it, there's no facility to query a disk driver for the DMA-safe requirements of its interface, and naive users can get themselves into trouble!

In some situations, where you know the total size of the final data file, it can be advantageous to pregrow it to this size, rather than allow it to be automatically extended piecemeal by the filesystem as it is written to. This lets the filesystem see a single explicit request for allocation instead of many implicit incremental updates; some filesystems may be able to exploit this and allocate the file in a more optimal/contiguous fashion. It also reduces the number of metadata updates needed during the write phase, and so, improves the data write performance by not disrupting sequential streaming.

The POSIX function to extend a file is ftruncate(); the standard requires this function to zero-fill the new data space, meaning that the file is effectively written twice, so this technique is suitable when you can prepare the file during an initial phase where performance isn't critical. There's also a non-POSIX devctl() to extend a file without zero-filling it, which provides the above benefits without the cost of erasing the contents; see DCMD_FSYS_PREGROW_FILE in <sys/dcmd_blk.h>.
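The POSIX variant of pregrowing can be sketched as follows. The helper name open_pregrown() is hypothetical; it uses only open(), fstat(), and ftruncate(). The non-zero-filling devctl() is specific to Neutrino and is deliberately not shown, since its exact calling convention should be taken from <sys/dcmd_blk.h> rather than guessed.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create (or open) `path` and extend it to `size`
 * bytes with ftruncate(), leaving the descriptor open for the later
 * overwrite phase. An existing file that is already at least `size`
 * bytes long is left alone. Returns the descriptor, or -1 on error.
 */
int open_pregrown(const char *path, off_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }

    /* Only grow; never shrink a file that is already large enough. */
    if (st.st_size < size && ftruncate(fd, size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

An application would typically call this during initialization, when speed isn't critical, and then overwrite the already-allocated blocks during the performance-sensitive phase.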

Configuration

You can control the balance between performance and robustness on either a global or per-file basis:

Specifying the O_SYNC bit when opening a file causes all I/O operations on that file (both data and metadata) to be performed synchronously.
The fsync() and sync() functions let you flush the filesystem write-behind cache on demand; otherwise, any dirty data is flushed from cache under the control of the global blk delwri= option (the default is two seconds — see io-blk.so in the Utilities Reference).
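The two per-file choices above can be sketched side by side. The helper names write_all() and save_record() are hypothetical; the sketch assumes only the standard open(), write(), fsync(), and close() calls. With O_SYNC every write() waits for the disk, whereas the second path lets the data sit in the write-behind cache and calls fsync() once, at the point where the data must be safe.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Write all of `buf`, retrying after short writes. 0 on success, -1 on error. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper showing the two per-file choices:
 *   sync_every_write != 0: open with O_SYNC, so each write() is synchronous;
 *   sync_every_write == 0: use the write-behind cache, then call fsync()
 *                          once when the data must be on disk.
 * Returns 0 on success, -1 on error.
 */
int save_record(const char *path, const void *buf, size_t len,
                int sync_every_write)
{
    int flags = O_WRONLY | O_CREAT | O_TRUNC;
    if (sync_every_write)
        flags |= O_SYNC;

    int fd = open(path, flags, 0644);
    if (fd < 0)
        return -1;

    int rc = write_all(fd, buf, len);
    if (rc == 0 && !sync_every_write)
        rc = fsync(fd);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Which path is appropriate depends on the balance discussed earlier: O_SYNC favors robustness for every individual write, while a single fsync() at a checkpoint keeps most of the benefit of write-behind caching.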

You control the global configuration with the commit= option, either to io-blk.so as an option to apply to all filesystems, or via the mount command as an option to apply to a single instance of a mounted filesystem. The levels are none, low, medium, and high, which differ in the degree to which metadata is written synchronously versus asynchronously, or even time-delayed.

At any level less robust than the default (i.e. medium), the filesystem doesn't guarantee the same level of integrity following an unexpected power loss, because multiple-block updates might not be ordered correctly.

The sections that follow illustrate the effects of different configurations on performance.

Block I/O `commit` level

This table illustrates how the commit= level affects the time it takes to create and delete a file on an x86 PIII-450 machine with a UDMA-2 EIDE disk, running a QNX 4 filesystem. The table shows how many empty (0 KB) files could be created and deleted per second:

`commit` level	Number created	Number deleted
`high`	866	1221
`medium`	1030	2703
`low`	1211	2710
`none`	1407	2718

Note that at the commit=high level, all disk writes are synchronous, so there's a noticeable cost in updating the directory entries and the POSIX mtime on the parent directory. At the commit=none level, all disk writes are time-delayed in the write-behind cache, and so multiple files can be created/deleted in the in-memory block without requiring any physical disk access at all (so, of course, any power failure here would mean that those files wouldn't exist when the system is restarted).

Record size

This example illustrates how the record size affects sequential file access on an x86 PIII-725 machine with a UDMA-4 EIDE disk, using the QNX 4 filesystem. The table lists the rates, in megabytes per second, of writing and reading a 256 MB file:

Record size	Writing	Reading
1 KB	14	16
2 KB	16	19
4 KB	17	24
8 KB	18	30
16 KB	18	35
32 KB	19	36
64 KB	18	36
128 KB	17	37

Note that the sequential read rate more than doubles (from 16 to 37 MB/s) with a suitable record size. This is because the overheads of context-switching and message-passing are reduced; consider that reading the 256 MB file 1 KB at a time requires 262,144 _IO_READ messages, whereas with 16 KB records, it requires only 16,384 such messages; 1/16th of the non-negligible overheads.
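The message counts quoted above follow from a simple division, rounding up for a final partial record. A small sketch of the arithmetic (the function name messages_needed() is hypothetical):

```c
#include <stdint.h>

/*
 * Number of read requests needed to transfer `file_size` bytes in records
 * of `record_size` bytes, rounding up for a final partial record.
 */
uint64_t messages_needed(uint64_t file_size, uint64_t record_size)
{
    if (record_size == 0)
        return 0;
    return (file_size + record_size - 1) / record_size;
}
```

For the 256 MB file in the table, messages_needed(256ULL * 1024 * 1024, 1024) gives 262,144 and messages_needed(256ULL * 1024 * 1024, 16 * 1024) gives 16,384, matching the figures in the text.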

Write performance doesn't show the same dramatic change, because the user data is, by default, placed in the write-behind buffer cache and written in large contiguous runs under timer control (using O_SYNC would illustrate a difference). The limiting factor here is the periodic need for synchronous update of the bitmap and inode for block allocation as the file grows (see below for a case study of overwriting an already-allocated file).

Double buffering

This example illustrates the effect of double-buffering in the standard I/O library on an x86 PIII-725 machine with a UDMA-4 EIDE disk, using the QNX 4 filesystem. The table shows the rate, in megabytes per second, of writing and reading a 256 MB file, with a record size of 8 KB:

Scenario	Writing	Reading
File descriptor	18	31
Standard I/O	13	16
setvbuf()	17	30

Here, you can see the effect of the default standard I/O buffer size (BUFSIZ, or 1 KB). When you ask it to transfer 8 KB, the library implements the transfer as 8 separate 1 KB operations. Note how the standard I/O case closely matches the above benchmark (see “Record size,” above) for a 1 KB record, and the file-descriptor case is the same as the 8 KB scenario.

When you use setvbuf() to force the standard I/O buffering up to the 8 KB record size, then the results come closer to the optimal file-descriptor case (the small difference is due to the extra code complexity and the additional memcpy() between the user data and the internal standard I/O FILE buffer).

File descriptor vs standard I/O

Here's another example that compares access using file descriptors and standard I/O on an x86 PIII-725 machine with a UDMA-4 EIDE disk, using the QNX 4 filesystem. The table lists the rates, in megabytes per second, for writing and reading a 256 MB file, using file descriptors and standard I/O:

Record size (bytes)	FD write	FD read	Stdio write	Stdio read
32	1.5	1.7	10.9	12.7
64	2.8	3.1	11.7	14.3
128	5.0	5.6	12.0	15.1
256	8.0	9.0	12.4	15.2
512	10.8	12.9	13.2	16.0
1024	14.1	16.9	13.1	16.3
2048	16.1	20.6	13.2	16.5
4096	17.1	24.0	13.9	16.5
8192	18.3	31.4	14.0	16.4
16384	18.1	37.3	14.3	16.4

Notice how the read() access is very sensitive to the record size; this is because each read() maps to an _IO_READ message and is basically a context-switch and message-pass to the filesystem; when only small amounts of data are transferred each time, the OS overhead becomes significant.

Since standard I/O access using fread() uses a 1 KB internal buffer, the number of _IO_READ messages remains constant, regardless of the user record size, and the throughput resembles that of the file-descriptor 1 KB access in all cases (with slight degradation at smaller record sizes due to the increased number of libc calls made). Thus, you should consider the anticipated file-access patterns when you choose from these I/O paradigms.
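As an example of an access pattern where standard I/O is the better fit, consider processing a text file one line at a time. In the sketch below (the function name count_lines() and the 256-byte line array are arbitrary choices for illustration), each fgets() call is served from the stdio buffer, so the filesystem sees one request per buffer refill rather than one per line.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: count the lines in a text file with fgets().
 * Returns the line count, or -1 if the file can't be opened.
 */
long count_lines(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    char line[256];
    long lines = 0;
    int in_line = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        size_t len = strlen(line);
        /* A line longer than the array arrives in pieces; count it once,
           on the piece that ends with the newline. */
        if (len > 0 && line[len - 1] == '\n') {
            lines++;
            in_line = 0;
        } else {
            in_line = 1;
        }
    }
    if (in_line)
        lines++;    /* final line without a trailing newline */

    fclose(fp);
    return lines;
}
```

Doing the same job with read() one byte or one short line at a time would issue a separate request for each small transfer, which is the slow end of the file-descriptor columns in the table above.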

Pregrowing a file

This example illustrates the effect of pregrowing a data file on an x86 PIII-725 machine with a UDMA-4 EIDE disk, using the QNX 4 filesystem. The table shows the times, in milliseconds, required to create and write a 256 MB file in 8 KB records:

Scenario	Creation	Write	Total
write()	0	15073	15073 (15 seconds)
ftruncate()	13908	8510	22418 (22 seconds)
devctl()	55	8479	8534 (8.5 seconds)

Note how extending the file incrementally as a result of each write() call is slower than growing it with a single ftruncate() call, as the filesystem can allocate larger/contiguous data extents, and needs to update the inode metadata attributes only once. Note also how the time to overwrite already allocated data blocks is much less than that for allocating the blocks dynamically (the sequential writes aren't interrupted by the periodic need to synchronously update the bitmap).

Although the total time to pregrow and overwrite is worse than growing, the pregrowth could be performed during an initialization phase where speed isn't critical, allowing for better write performance later.

The optimal case is to pregrow the file without zero-filling it (using a devctl()) and then overwrite with the real data contents.

Fine-tuning USB storage devices

If your environment hosts large (e.g. media) files on USB storage devices, you should ensure that your configuration allows sufficient RAM for read-ahead processing of large files, such as MP3 files. You can change the configuration by adjusting the cache and vnode values that devb-umass passes to io-blk.so with the blk option.

A reasonable starting configuration for the blk option is: cache=512k,vnode=256. You should, however, establish benchmarks for key activities in your environment, and then adjust these values for optimal performance.

How small can you get?

The best way to reduce the size of your system is to use our IDE to create an OS image. The System Builder perspective includes a tool called the Dietician that can help “slim down” the libraries included in the image. For more information, see the IDE User's Guide, as well as Building Embedded Systems.