Fine-Tuning Your System

This chapter includes:

Getting the system's status

Getting the system's status

Neutrino includes the following utilities that you can use to fine-tune your system:

hogs
    List the processes that are hogging the CPU
pidin (Process ID INfo)
    Display system statistics
ps
    Report process status
top
    Display system usage (Unix)

For details about these utilities, see the Utilities Reference.

For more detailed and accurate data, use tracelogger and the System Analysis Toolkit (see the SAT User's Guide). The SAT logs kernel events (the changes to your system's state) using a specially instrumented version of the kernel, procnto*-instr.


Note: If you have the Integrated Development Environment on your system, you'll find that it's the best tool for determining how you can improve your system's performance. For more information, see the IDE User's Guide.

Improving performance

If you run hogs, you'll get a rough idea of which processes are using the most CPU time. For example:

$ hogs -n -% 5
PID             NAME  MSEC  PIDS SYSTEM
1                     1315   53%    43%
6          devb-eide   593   24%    19%
54358061        make   206    8%     6%

1                     2026   83%    67%
6          devb-eide   294   12%     9%

1                     2391   75%    79%
6          devb-eide   335   10%    11%
54624301   htmlindex   249    7%     8%

1                     1004   24%    33%
54624301   htmlindex  2959   71%    98%

54624301   htmlindex  4156   96%   138%

54624301   htmlindex  4225   96%   140%

54624301   htmlindex  4162   96%   138%

1                       71   35%     2%
6          devb-eide    75   37%     2%

1                     3002   97%   100%

Let's look at this output. The first iteration indicates that process 1 is using 53% of the CPU. Process 1 is always the process manager, procnto. In this case, it's the idle thread that's using most of the CPU. The entry for devb-eide reflects disk I/O. The make utility is also using the CPU.

In the second iteration, procnto and devb-eide use most of the CPU, but the next few iterations show that htmlindex (a program that creates the keyword index for our online documentation) gets up to 96% of the CPU. When htmlindex finishes running, procnto and devb-eide use the CPU while the HTML files are written. Eventually, procnto — including the idle thread — gets almost all of the CPU.

You might be alarmed that htmlindex takes up to 96% of the CPU, but it's actually a good thing: if you're running only one program, it should get most of the CPU time.

If your system is running several processes at once, hogs is even more useful: it can tell you which process is using the most CPU, so that you can adjust the priorities to favor the threads that are most important. (Remember that in Neutrino, priorities are a property of threads, not of processes.) For more information, see “Priorities” in the Using the Command Line chapter.
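
If you control the source of the programs involved, a thread can also adjust its own priority through the standard POSIX scheduling calls. Here's a minimal sketch; the function name and the one-level bump are purely illustrative, and a real application would choose a policy and priority that suit its design, within the range reported by sched_get_priority_min() and sched_get_priority_max():

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Nudge the calling thread's priority up by one level. */
int bump_my_priority(void)
{
    struct sched_param param;
    int policy, rc;

    rc = pthread_getschedparam(pthread_self(), &policy, &param);
    if (rc != 0) {
        fprintf(stderr, "pthread_getschedparam: %s\n", strerror(rc));
        return -1;
    }

    param.sched_priority++;    /* favor this thread slightly */

    rc = pthread_setschedparam(pthread_self(), policy, &param);
    if (rc != 0) {
        fprintf(stderr, "pthread_setschedparam: %s\n", strerror(rc));
        return -1;
    }
    return 0;
}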

Here are some other tips to help you improve your system's performance:

Faster boot times

Here are a few tips to help you speed up booting:

For more information, see the Controlling How Neutrino Starts chapter in this guide.

Filesystems and block I/O (devb-*) drivers

Here are the basic steps to improving the performance of your filesystems and block I/O (devb-*) drivers:

  1. Optimize disk hardware and driver options. This is most important on non-x86 targets and systems without hard drives (e.g. Microdrive, Compact Flash). Not using the fastest available DMA mode (or degrading to PIO) can easily affect the speed by a factor of ten. For more information, see Connecting Hardware.
  2. Optimize the filesystem options:
  3. Optimize application code:

Performance and robustness

When you design or configure a filesystem, you have to balance performance and robustness:

You must decide on the balance between robustness and performance that's appropriate for your installation, expectations, and requirements.

Metadata updates

Metadata is data about data, or all the overhead and attributes involved in storing the user data itself, such as the name of a file, the physical blocks it uses, modification and access timestamps, and so on.

The most expensive operation in a filesystem is updating the metadata. This is because:

Almost all operations on the filesystem (even reading file data, unless you've specified the noatime option — see io-blk.so in the Utilities Reference) involve some metadata updates.

Ordering the updates to metadata

Some filesystem operations affect multiple blocks on disk. For example, consider the situation of creating or deleting a file. Most filesystems separate the name of the file (or link) from the actual attributes of the file (the inode); this supports the POSIX concept of hard links, multiple names for the same file.

Typically, the inodes reside in a fixed location on disk (the .inodes file for fs-qnx4.so, or in the header of each cylinder group for fs-ext2.so).

Creating a new filename thus involves allocating a free inode entry and populating it with the details for the new file, and then placing the name for the file into the appropriate directory. Deleting a file involves removing the name from the parent directory and marking the inode as available.

These operations must be performed in this order to prevent corruption should there be a power failure between the two writes. For creation, the inode should be allocated before the name: a crash between the writes would leave an allocated inode that isn't referenced by any name (an “orphaned resource” that the filesystem's check procedure can later reclaim). If the operations were performed the other way around, a power failure would leave a name that refers to a stale or invalid inode, which is undetectable as an error. A similar argument applies, in reverse, to file deletion.

For traditional filesystems, the only way of ordering these writes is to perform the first one (or, more generally, all but the last one of a multiple-block sequence) synchronously (i.e. immediately and waiting for I/O to complete before continuing). A synchronous write is very expensive, because it involves a disk-head seek, interrupts any active sequential disk streaming, and blocks the thread until the write has completed — potentially milliseconds of dead time.

Throughput

Another key point is the performance of sequential access to a file, or raw throughput, where a large amount of data is written to a file (or an entire file is read). The filesystem itself can detect this type of sequential access and attempt to optimize the use of the disk, by doing:

The most efficient way of accessing the disk for high performance is through the standard POSIX routines that work with file descriptors — open(), read(), and write() — because these allow direct access to the filesystem with no interference from libc.

If you're concerned about performance, we don't recommend that you use the standard I/O (<stdio.h>) routines that work with FILE variables, because they introduce another layer of code and another layer of buffering. In particular, the default buffer size is BUFSIZ, or 1 KB, so all access to the disk is carved up into chunks of that size, causing a large amount of overhead for passing messages and switching contexts.

There are some cases where the standard I/O facilities are useful, such as when you're processing a text file one line or character at a time; in such cases, the 1 KB of buffering provided by standard I/O greatly reduces the number of messages sent to the filesystem. You can improve performance further by using setvbuf() to increase the size of the buffer that standard I/O uses.
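
For instance, in a line-at-a-time filter like the following hypothetical sketch, most fgets() calls are satisfied from the stream's internal buffer rather than by a separate message to the filesystem:

#include <stdio.h>

/* Count the lines in a text file, reading one line at a time.
   Standard I/O's internal buffering means most fgets() calls are
   satisfied from memory rather than by a message to the filesystem. */
int count_lines(const char *path)
{
    FILE *fp = fopen(path, "r");
    char line[512];
    int count = 0;

    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof(line), fp) != NULL)
        count++;

    fclose(fp);
    return count;
}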

You can also optimize performance by accessing the disk in suitably sized chunks: large enough to minimize the overheads of Neutrino's context-switching and message-passing, but not so large as to exceed the disk driver's limits on blocks per operation or the overheads of large message-passing. An optimal size is 32 KB.

You should also access the file on block boundaries for whole multiples of a disk sector (since the smallest unit of access to a disk/block device is a single sector, partial writes will require a read/modify/write cycle); you can get the optimal I/O size by calling statvfs(), although most disks are 512 bytes/sector.
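
For example, here's a minimal sketch of a read loop that asks the filesystem for its preferred I/O size with fstatvfs() and otherwise falls back to 32 KB chunks. The function name and the fallback size are illustrative, and error handling is kept to a minimum:

#include <sys/statvfs.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>

#define DEFAULT_CHUNK (32 * 1024)   /* a good general-purpose size */

/* Read an entire file in large chunks, preferring the filesystem's
   reported I/O size if it's bigger than the 32 KB default.
   Returns the number of bytes read, or -1 on error. */
long long read_whole_file(const char *path)
{
    struct statvfs vfs;
    size_t chunk = DEFAULT_CHUNK;
    long long total = 0;
    ssize_t n;
    char *buf;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    if (fstatvfs(fd, &vfs) == 0 && vfs.f_bsize > chunk)
        chunk = vfs.f_bsize;

    buf = malloc(chunk);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    while ((n = read(fd, buf, chunk)) > 0)
        total += n;

    free(buf);
    close(fd);
    return total;
}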

Finally, for very high performance situations (video streaming, etc.) it's possible to bypass all buffering in the filesystem and perform DMA directly between the user data areas and the disk. But note these caveats:

We don't currently recommend that you use DMA unless it's absolutely necessary: not all disk drivers support it correctly, there's no facility to query a disk driver for the DMA-safe requirements of its interface, and naive users can get themselves into trouble!

In some situations, where you know the total size of the final data file, it can be advantageous to pregrow it to this size, rather than allow it to be automatically extended piecemeal by the filesystem as it is written to. This lets the filesystem see a single explicit request for allocation instead of many implicit incremental updates; some filesystems may be able to exploit this and allocate the file in a more optimal/contiguous fashion. It also reduces the number of metadata updates needed during the write phase, and so, improves the data write performance by not disrupting sequential streaming.

The POSIX function to extend a file is ftruncate(); the standard requires this function to zero-fill the new data space, meaning that the file is effectively written twice, so this technique is suitable when you can prepare the file during an initial phase where performance isn't critical. There's also a non-POSIX devctl() that extends a file without zero-filling it, which provides the above benefits without the cost of erasing the contents; the DCMD_FSYS_PREGROW_FILE command, which is defined in <sys/dcmd_blk.h>, takes the file size, as an off64_t, as its argument. For example:

#include <fcntl.h>
#include <devctl.h>
#include <sys/dcmd_blk.h>

int fd;
off64_t sz;

fd = open( ... );   /* the data file, opened for writing */
sz = ... ;          /* the final size of the file, in bytes */

devctl(fd, DCMD_FSYS_PREGROW_FILE, &sz, sizeof(sz), NULL);

Configuration

You can control the balance between performance and robustness on either a global or per-file basis:

The sections that follow illustrate the effects of different configurations on performance.

Block I/O commit level

This table illustrates how the commit= option affects the time it takes to create and delete a file on an x86 PIII-450 machine with a UDMA-2 EIDE disk, running a QNX 4 filesystem. The table shows how many 0 KB files could be created and deleted per second:

commit level Number created Number deleted
high 866 1221
medium 1030 2703
low 1211 2710
none 1407 2718

Note that at the commit=high level, all disk writes are synchronous, so there's a noticeable cost in updating the directory entries and the POSIX mtime on the parent directory. At the commit=none level, all disk writes are time-delayed in the write-behind cache, and so multiple files can be created/deleted in the in-memory block without requiring any physical disk access at all (so, of course, any power failure here would mean that those files wouldn't exist when the system is restarted).
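
For example, assuming a typical devb-eide setup, you could select a lower (faster, but less robust) commit level by passing the option to io-blk.so on the driver's command line; check io-blk.so in the Utilities Reference for the exact spelling and for the other options your driver needs:

devb-eide blk commit=low &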

Record size

This example illustrates how the record size affects sequential file access on an x86 PIII-725 machine with a UDMA-4 EIDE disk, using the QNX 4 filesystem. The table lists the rates, in megabytes per second, of writing and reading a 256 MB file:

Record size Writing Reading
1 KB 14 16
2 KB 16 19
4 KB 17 24
8 KB 18 30
16 KB 18 35
32 KB 19 36
64 KB 18 36
128 KB 17 37

Note that the sequential read rate more than doubles when you use a suitable record size. This is because the overheads of context-switching and message-passing are reduced: reading the 256 MB file 1 KB at a time requires 262,144 _IO_READ messages, whereas with 16 KB records, it requires only 16,384 such messages, or 1/16th of the non-negligible overheads.

Write performance doesn't show the same dramatic change, because the user data is, by default, placed in the write-behind buffer cache and written in large contiguous runs under timer control — using O_SYNC would illustrate a difference. The limiting factor here is the periodic need for synchronous updates of the bitmap and inode for block allocation as the file grows (see below for a case study of overwriting an already-allocated file).
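
If you want to see that difference yourself, opening the file with O_SYNC forces each write() to complete on disk before the call returns. Here's a hypothetical helper; the flags and permissions are just one possible choice:

#include <fcntl.h>

/* Open a data file so that each write() is synchronous: with O_SYNC,
   write() doesn't return until the data has been physically written
   to the disk, bypassing the write-behind cache's time delay. */
int open_sync(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
}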

Double buffering

This example illustrates the effect of double-buffering in the standard I/O library on an x86 PIII-725 machine with a UDMA-4 EIDE disk, using the QNX 4 filesystem. The table shows the rate, in megabytes per second, of writing and reading a 256 MB file, with a record size of 8 KB:

Scenario Writing Reading
File descriptor 18 31
Standard I/O 13 16
setvbuf() 17 30

Here, you can see the effect of the default standard I/O buffer size (BUFSIZ, or 1 KB). When you ask it to transfer 8 KB, the library implements the transfer as 8 separate 1 KB operations. Note how the standard I/O case matches the above benchmark (see “Record size,” above) for a 1 KB record, and how the file-descriptor case is the same as the 8 KB scenario.

When you use setvbuf() to force the standard I/O buffering up to the 8 KB record size, then the results come closer to the optimal file-descriptor case (the small difference is due to the extra code complexity and the additional memcpy() between the user data and the internal standard I/O FILE buffer).
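
Here's a minimal sketch of that setvbuf() call, wrapped in a hypothetical helper. POSIX requires setvbuf() to be called after the stream is opened but before any other operation is performed on it:

#include <stdio.h>

#define RECORD_SIZE (8 * 1024)

/* Open a stream for writing with a buffer that matches the record
   size, so each 8 KB fwrite() becomes one message to the filesystem
   instead of eight 1 KB ones. */
FILE *open_for_records(const char *path)
{
    FILE *fp = fopen(path, "w");

    if (fp != NULL) {
        /* Passing NULL lets the library allocate the buffer. */
        if (setvbuf(fp, NULL, _IOFBF, RECORD_SIZE) != 0) {
            /* fall back silently to the default BUFSIZ buffer */
        }
    }
    return fp;
}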

File descriptor vs standard I/O

Here's another example that compares access using file descriptors and standard I/O on an x86 PIII-725 machine with a UDMA-4 EIDE disk, using the QNX 4 filesystem. The table lists the rates, in megabytes per second, for writing and reading a 256 MB file:

Record size (bytes) FD write FD read Stdio write Stdio read
32 1.5 1.7 10.9 12.7
64 2.8 3.1 11.7 14.3
128 5.0 5.6 12.0 15.1
256 8.0 9.0 12.4 15.2
512 10.8 12.9 13.2 16.0
1024 14.1 16.9 13.1 16.3
2048 16.1 20.6 13.2 16.5
4096 17.1 24.0 13.9 16.5
8192 18.3 31.4 14.0 16.4
16384 18.1 37.3 14.3 16.4

Notice how the read() access is very sensitive to the record size; this is because each read() maps to an _IO_READ message and is basically a context-switch and message-pass to the filesystem; when only small amounts of data are transferred each time, the OS overhead becomes significant.

Since standard I/O access using fread() uses a 1 KB internal buffer, the number of _IO_READ messages remains constant, regardless of the user record size, and the throughput resembles that of the file-descriptor 1 KB access in all cases (with slight degradation at smaller record sizes due to the increased number of libc calls made). Thus, you should consider the anticipated file-access patterns when you choose from these I/O paradigms.

Pregrowing a file

This example illustrates the effect of pregrowing a data file on an x86 PIII-725 machine with a UDMA-4 EIDE disk, using the QNX 4 filesystem. The table shows the times, in milliseconds, required to create and write a 256 MB file in 8 KB records:

Scenario Creation Write Total
write() 0 15073 15073 (15 seconds)
ftruncate() 13908 8510 22418 (22 seconds)
devctl() 55 8479 8534 (8.5 seconds)

Note how the write phase is slower when the file is extended incrementally, as a side effect of each write() call, than when it has been grown in advance with a single ftruncate() call: pregrowing lets the filesystem allocate larger, contiguous data extents, and it needs to update the inode's metadata attributes only once. Note also how the time to overwrite already-allocated data blocks is much less than that for allocating the blocks dynamically (the sequential writes aren't interrupted by the periodic need to synchronously update the bitmap).

Although the total time to pregrow and overwrite is worse than growing, the pregrowth could be performed during an initialization phase where speed isn't critical, allowing for better write performance later.

The optimal case is to pregrow the file without zero-filling it (using a devctl()) and then overwrite with the real data contents.

Fine-tuning USB storage devices

If your environment hosts large files (e.g. media files such as MP3s) on USB storage devices, ensure that your configuration allows sufficient RAM for read-ahead processing of those files. You can change the configuration by adjusting the cache and vnode values that devb-umass passes to io-blk.so with the blk option.

A reasonable starting configuration for the blk option is: cache=512k,vnode=256. You should, however, establish benchmarks for key activities in your environment, and then adjust these values for optimal performance.
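
For example, assuming a typical devb-umass invocation for your target (see devb-umass in the Utilities Reference for the other options you may need), you'd pass this starting configuration like this:

devb-umass blk cache=512k,vnode=256 &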

How small can you get?

The best way to reduce the size of your system is to use our IDE to create an OS image. The System Builder perspective includes a tool called the Dietician that can help “slim down” the libraries included in the image. For more information, see the IDE User's Guide, as well as Building Embedded Systems.