Neutrino includes the following utilities that you can use to fine-tune your system:
For details about these utilities, see the Utilities Reference.
For more detailed and accurate data, use tracelogger and the System Analysis Toolkit (see the SAT User's Guide). The SAT logs kernel events, the changes to your system's state, using a specially instrumented version of the kernel (procnto*-instr).
If you have the Integrated Development Environment on your system, you'll find that it's the best tool for determining how you can improve your system's performance. For more information, see the IDE User's Guide.
If you run hogs, you'll get a rough idea of which processes are using the most CPU time. For example:
$ hogs -n -% 5
     PID         NAME  MSEC  PIDS  SYSTEM
       1               1315   53%     43%
       6    devb-eide   593   24%     19%
54358061         make   206    8%      6%

       1               2026   83%     67%
       6    devb-eide   294   12%      9%

       1               2391   75%     79%
       6    devb-eide   335   10%     11%
54624301    htmlindex   249    7%      8%

       1               1004   24%     33%
54624301    htmlindex  2959   71%     98%

54624301    htmlindex  4156   96%    138%

54624301    htmlindex  4225   96%    140%

54624301    htmlindex  4162   96%    138%

       1                 71   35%      2%
       6    devb-eide    75   37%      2%

       1               3002   97%    100%
Let's look at this output. The first iteration indicates that process 1 is using 53% of the CPU. Process 1 is always the process manager, procnto. In this case, it's the idle thread that's using most of the CPU. The entry for devb-eide reflects disk I/O. The make utility is also using the CPU.
In the second iteration, procnto and devb-eide use most of the CPU, but the next few iterations show that htmlindex (a program that creates the keyword index for our online documentation) gets up to 96% of the CPU. When htmlindex finishes running, procnto and devb-eide use the CPU while the HTML files are written. Eventually, procnto — including the idle thread — gets almost all of the CPU.
You might be alarmed that htmlindex takes up to 96% of the CPU, but it's actually a good thing: if you're running only one program, it should get most of the CPU time.
If your system is running several processes at once, hogs could be more useful. It can tell you which of the processes is using the most CPU, and then you could adjust the priorities to favor the threads that are most important. (Remember that in Neutrino, priorities are a property of threads, not of processes.) For more information, see “Priorities” in the Using the Command Line chapter.
Here are some other tips to help you improve your system's performance:
and reboot. This reduces the number of processes that the system runs when it starts.
Here are a few tips to help you speed up booting:
For more information, see the Controlling How Neutrino Starts chapter in this guide.
Here are the basic steps to improving the performance of your filesystems and block I/O (devb-*) drivers:
Long filenames (i.e. longer than 48 characters) especially slow down the filesystem.
When you design or configure a filesystem, you have to balance performance and robustness:
For example, the creation of a new file — via creat() — may perform all the physical disk writes that are necessary to add that new filename into a directory on the disk filesystem and only then reply back to the client.
For example, writing data into a file — via write() — might immediately reply to the client, but leave the data in a write-behind in-memory cache in an attempt to merge with later writes and construct a large, contiguous run for a single sequential disk access (but until that occurs, the data is vulnerable to loss if the power fails).
You must decide on the balance between robustness and performance that's appropriate for your installation, expectations, and requirements.
Metadata is data about data, or all the overhead and attributes involved in storing the user data itself, such as the name of a file, the physical blocks it uses, modification and access timestamps, and so on.
The most expensive operation of a filesystem is updating the metadata. This is because:
Almost all operations on the filesystem (even reading file data, unless you've specified the noatime option — see io-blk.so in the Utilities Reference) involve some metadata updates.
Some filesystem operations affect multiple blocks on disk. For example, consider the situation of creating or deleting a file. Most filesystems separate the name of the file (or link) from the actual attributes of the file (the inode); this supports the POSIX concept of hard links, multiple names for the same file.
Typically, the inodes reside in a fixed location on disk (the .inodes file for fs-qnx4.so, or in the header of each cylinder group for fs-ext2.so).
Creating a new filename thus involves allocating a free inode entry and populating it with the details for the new file, and then placing the name for the file into the appropriate directory. Deleting a file involves removing the name from the parent directory and marking the inode as available.
These operations must be performed in this order to prevent corruption if the power fails between the two writes. For creation, the inode should be allocated before the name is written: a crash at that point leaves an allocated inode that isn't referenced by any name (an “orphaned resource” that a filesystem's check procedure can later reclaim). If the operations were performed the other way around and the power failed, the result would be a name that refers to a stale or invalid inode, which can't be detected as an error. A similar argument applies, in reverse, for file deletion.
For traditional filesystems, the only way of ordering these writes is to perform the first one (or, more generally, all but the last one of a multiple-block sequence) synchronously (i.e. immediately and waiting for I/O to complete before continuing). A synchronous write is very expensive, because it involves a disk-head seek, interrupts any active sequential disk streaming, and blocks the thread until the write has completed — potentially milliseconds of dead time.
Another key point is the performance of sequential access to a file, or raw throughput, where a large amount of data is written to a file (or an entire file is read). The filesystem itself can detect this type of sequential access and attempt to optimize the use of the disk, by doing:
The most efficient way of accessing the disk for high performance is through the standard POSIX routines that work with file descriptors (open(), read(), and write()), because these allow direct access to the filesystem with no interference from libc.
If you're concerned about performance, we don't recommend that you use the standard I/O (<stdio.h>) routines that work with FILE variables, because they introduce another layer of code and another layer of buffering. In particular, the default buffer size is BUFSIZ, or 1 KB, so all access to the disk is carved up into chunks of that size, causing a large amount of overhead for passing messages and switching contexts.
There are some cases when the standard I/O facilities are useful, such as when you're processing a text file one line or character at a time. In that case, the 1 KB of buffering provided by standard I/O greatly reduces the number of messages to the filesystem. You can improve performance by using setvbuf() to increase the buffer size.
You can also optimize performance by accessing the disk in suitably sized chunks: large enough to minimize the overheads of Neutrino's context-switching and message-passing, but not so large that you exceed the disk driver's limits for blocks per operation or incur the overheads of passing large messages. An optimal size is 32 KB.
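As a rough illustration of the chunked access described above, here's a minimal sketch that reads a file through a file descriptor in 32 KB pieces. The function name read_in_chunks() and the CHUNK_SIZE macro are hypothetical names chosen for this example; the chunk size is the figure suggested in the text, and you should benchmark it on your own system.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Chunk size suggested by the text; an assumption to tune per system. */
#define CHUNK_SIZE (32 * 1024)

/*
 * Read a whole file through a file descriptor in CHUNK_SIZE pieces.
 * Returns the number of bytes read, or -1 on error. If ncalls isn't
 * NULL, it receives the number of read() calls that returned data.
 */
long long read_in_chunks(const char *path, long *ncalls)
{
    char *buf = malloc(CHUNK_SIZE);
    long long total = 0;
    long calls = 0;
    ssize_t n;
    int fd;

    if (buf == NULL)
        return -1;

    fd = open(path, O_RDONLY);
    if (fd == -1) {
        free(buf);
        return -1;
    }

    /* Each read() is one request to the filesystem. */
    while ((n = read(fd, buf, CHUNK_SIZE)) > 0) {
        total += n;
        calls++;
    }

    close(fd);
    free(buf);

    if (ncalls != NULL)
        *ncalls = calls;
    return (n < 0) ? -1 : total;
}
```

A real program would process the contents of buf inside the loop; the sketch only counts bytes and calls so that you can compare different values of CHUNK_SIZE.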
You should also access the file on block boundaries for whole multiples of a disk sector (since the smallest unit of access to a disk/block device is a single sector, partial writes will require a read/modify/write cycle); you can get the optimal I/O size by calling statvfs(), although most disks are 512 bytes/sector.
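The following sketch shows one way to query the block size with statvfs() and to trim a transfer length to a whole multiple of it. The helper names preferred_io_size() and round_down_to_block() are hypothetical, and using the f_bsize member as the "optimal I/O size" is an assumption; check how your filesystem fills in f_bsize and f_frsize.

```c
#include <sys/statvfs.h>

/*
 * Return the block size that the filesystem holding path reports in
 * f_bsize, or 0 if statvfs() fails.
 */
unsigned long preferred_io_size(const char *path)
{
    struct statvfs vfs;

    if (statvfs(path, &vfs) == -1)
        return 0;
    return (unsigned long)vfs.f_bsize;
}

/*
 * Round len down to a whole multiple of block, so that a transfer
 * doesn't end with a partial block. A block of 0 leaves len unchanged.
 */
unsigned long round_down_to_block(unsigned long len, unsigned long block)
{
    if (block == 0)
        return len;
    return len - (len % block);
}
```

For example, with a 512-byte block, a 1000-byte request would be trimmed to 512 bytes, and the remainder handled separately.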
Finally, for very high performance situations (video streaming, etc.) it's possible to bypass all buffering in the filesystem and perform DMA directly between the user data areas and the disk. But note these caveats:
We don't currently recommend that you use DMA unless absolutely necessary: not all disk drivers correctly support it, there's no facility to query a disk driver for the DMA-safe requirements of its interface, and naive users can get themselves into trouble!
In some situations, where you know the total size of the final data file, it can be advantageous to pregrow it to this size, rather than allow the filesystem to extend it piecemeal as it's written to. This lets the filesystem see a single explicit request for allocation instead of many implicit incremental updates; some filesystems may be able to exploit this and allocate the file more contiguously. It also reduces the number of metadata updates needed during the write phase, and so improves the data write performance by not disrupting sequential streaming.
The POSIX function to extend a file is ftruncate(); the standard requires this function to zero-fill the new data space, meaning that the file is effectively written twice, so this technique is suitable when you can prepare the file during an initial phase where performance isn't critical. There's also a non-POSIX devctl() to extend a file without zero-filling it, which provides the above benefits without the cost of erasing the contents; see DCMD_FSYS_PREGROW_FILE in <sys/dcmd_blk.h>.
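Here's a minimal sketch of the POSIX approach, using ftruncate() to pregrow a file before any data is written. The function name open_pregrown() is hypothetical. The non-POSIX devctl() variant isn't shown, because its exact arguments aren't described here.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or open) a file and set its length to final_size bytes up
 * front. Returns an open descriptor positioned at offset 0, or -1 on
 * error. Note that if the file already exists and is longer than
 * final_size, ftruncate() shortens it.
 */
int open_pregrown(const char *path, off_t final_size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd == -1)
        return -1;

    /* POSIX requires the extended region to read back as zeros. */
    if (ftruncate(fd, final_size) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

After this call, subsequent write() calls overwrite blocks that are already allocated instead of extending the file one record at a time.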
You can control the balance between performance and robustness on either a global or per-file basis:
The fsync() and sync() functions let you flush the filesystem write-behind cache on demand; otherwise, any dirty data is flushed from cache under the control of the global blk delwri= option (the default is two seconds — see io-blk.so in the Utilities Reference).
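As an illustration of flushing on demand, the sketch below writes a buffer and then calls fsync() so that the data doesn't wait in the write-behind cache for the delayed-write timer. The function name write_durably() is hypothetical, and the error handling is deliberately simple (for instance, it doesn't retry after an interrupted call).

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes to fd, then block until fsync() reports that the
 * file's data has been flushed. Returns 0 on success, -1 on error.
 */
int write_durably(int fd, const void *data, size_t len)
{
    const char *p = data;

    /* write() may transfer fewer bytes than requested, so loop. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }

    /* Flush this file's dirty data now instead of waiting. */
    return fsync(fd);
}
```

Calling fsync() after every write costs performance, so you might reserve it for the points where your application can't afford to lose data.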
At any level less robust than the default (i.e. medium), the filesystem doesn't guarantee the same level of integrity following an unexpected power loss, because multiple-block updates might not be ordered correctly.
The sections that follow illustrate the effects of different configurations on performance.
This table illustrates how the commit= option affects the time it takes to create and delete a file on an x86 PIII-450 machine with a UDMA-2 EIDE disk, running a QNX 4 filesystem. The table shows how many 0 KB files could be created and deleted per second:
|commit level||Number created||Number deleted|
Note that at the commit=high level, all disk writes are synchronous, so there's a noticeable cost in updating the directory entries and the POSIX mtime on the parent directory. At the commit=none level, all disk writes are time-delayed in the write-behind cache, and so multiple files can be created/deleted in the in-memory block without requiring any physical disk access at all (so, of course, any power failure here would mean that those files wouldn't exist when the system is restarted).
This example illustrates how the record size affects sequential file access on an x86 PIII-725 machine with a UDMA-4 EIDE disk, using the QNX 4 filesystem. The table lists the rates, in megabytes per second, of writing and reading a 256 MB file:
Note that the sequential read rate doubles based on use of a suitable record size. This is because the overheads of context-switching and message-passing are reduced; consider that reading the 256 MB file 1 KB at a time requires 262,144 _IO_READ messages, whereas with 16 KB records, it requires only 16,384 such messages; 1/16th of the non-negligible overheads.
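The message counts quoted above follow from dividing the file size by the record size. Here's a small sketch of that arithmetic; the function name requests_needed() is hypothetical, and it assumes one request per record.

```c
/*
 * Number of read requests needed to transfer file_size bytes in
 * records of record_size bytes, rounding up for a final partial
 * record. Returns 0 if record_size is 0.
 */
unsigned long long requests_needed(unsigned long long file_size,
                                   unsigned long long record_size)
{
    if (record_size == 0)
        return 0;
    return (file_size + record_size - 1) / record_size;
}
```

For a 256 MB file (268,435,456 bytes), requests_needed() gives 262,144 for 1 KB records and 16,384 for 16 KB records, which is the 16:1 ratio mentioned in the text.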
Write performance doesn't show the same dramatic change, because the user data is, by default, placed in the write-behind buffer cache and written in large contiguous runs under timer control; using O_SYNC would illustrate a difference. The limiting factor here is the periodic need for synchronous update of the bitmap and inode for block allocation as the file grows (see below for a case study of overwriting an already-allocated file).
This example illustrates the effect of double-buffering in the standard I/O library on an x86 PIII-725 machine with a UDMA-4 EIDE disk, using the QNX 4 filesystem. The table shows the rate, in megabytes per second, of writing and reading a 256 MB file, with a record size of 8 KB:
Here, you can see the effect of the default standard I/O buffer size (BUFSIZ, or 1 KB). When you ask it to transfer 8 KB, the library implements the transfer as 8 separate 1 KB operations. Note how the standard I/O case matches the above benchmark (see “Record size,” above) for a 1 KB record, and the file-descriptor case is the same as the 8 KB scenario.
When you use setvbuf() to force the standard I/O buffering up to the 8 KB record size, then the results come closer to the optimal file-descriptor case (the small difference is due to the extra code complexity and the additional memcpy() between the user data and the internal standard I/O FILE buffer).
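A minimal sketch of the setvbuf() technique follows. The function name open_buffered() is hypothetical. The caller supplies the buffer, which must remain valid until the stream is closed.

```c
#include <stdio.h>

/*
 * Open a stream and give it a fully buffered, caller-supplied buffer
 * of size bytes in place of the BUFSIZ default. The buffer must stay
 * valid until fclose() is called on the stream.
 * Returns NULL on failure.
 */
FILE *open_buffered(const char *path, const char *mode,
                    char *buf, size_t size)
{
    FILE *fp = fopen(path, mode);

    if (fp == NULL)
        return NULL;

    /* setvbuf() must come before any other operation on the stream. */
    if (setvbuf(fp, buf, _IOFBF, size) != 0) {
        fclose(fp);
        return NULL;
    }
    return fp;
}
```

For example, you could declare a static char array of 8 * 1024 bytes and pass it as buf, so that each 8 KB fread() or fwrite() can be handled by a single buffer fill or flush.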
Here's another example that compares access using file descriptors and standard I/O on an x86 PIII-725 machine with a UDMA-4 EIDE disk, using the QNX 4 filesystem. The table lists the rates, in megabytes per second, for writing and reading a 256 MB file with each method:
|Record size||FD write||FD read||Stdio write||Stdio read|
Notice how the read() access is very sensitive to the record size; this is because each read() maps to an _IO_READ message and is basically a context-switch and message-pass to the filesystem; when only small amounts of data are transferred each time, the OS overhead becomes significant.
Since standard I/O access using fread() uses a 1 KB internal buffer, the number of _IO_READ messages remains constant, regardless of the user record size, and the throughput resembles that of the file-descriptor 1 KB access in all cases (with slight degradation at smaller record sizes due to the increased number of libc calls made). Thus, you should consider the anticipated file-access patterns when you choose from these I/O paradigms.
This example illustrates the effect of pregrowing a data file on an x86 PIII-725 machine with a UDMA-4 EIDE disk, using the QNX 4 filesystem. The table shows the times, in milliseconds, required to create and write a 256 MB file in 8 KB records:
|Method||Pregrow time (ms)||Write time (ms)||Total time (ms)|
|write()||0||15073||15073 (15 seconds)|
|ftruncate()||13908||8510||22418 (22 seconds)|
|devctl()||55||8479||8534 (8.5 seconds)|
Note how extending the file incrementally as a result of each write() call is slower than growing it with a single ftruncate() call, as the filesystem can allocate larger/contiguous data extents, and needs to update the inode metadata attributes only once. Note also how the time to overwrite already allocated data blocks is much less than that for allocating the blocks dynamically (the sequential writes aren't interrupted by the periodic need to synchronously update the bitmap).
Although the total time to pregrow with ftruncate() and then overwrite is worse than the time to grow the file incrementally, the pregrowth could be performed during an initialization phase where speed isn't critical, allowing for better write performance later.
The optimal case is to pregrow the file without zero-filling it (using a devctl()) and then overwrite with the real data contents.
If your environment hosts large (e.g. media) files on USB storage devices, you should ensure that your configuration allows sufficient RAM for read-ahead processing of large files, such as MP3 files. You can change the configuration by adjusting the cache and vnode values that devb-umass passes to io-blk.so with the blk option.
A reasonable starting configuration for the blk option is: cache=512k,vnode=256. You should, however, establish benchmarks for key activities in your environment, and then adjust these values for optimal performance.
The best way to reduce the size of your system is to use our IDE to create an OS image. The System Builder perspective includes a tool called the Dietician that can help “slim down” the libraries included in the image. For more information, see the IDE User's Guide, as well as Building Embedded Systems.