The high-performance NICs supported by this patch are all at least 10/100 Mbit, and some are gigabit. What makes them high-performance, however, isn't raw link speed: it's their ability to function independently of the CPU, bus-mastering packets into and out of the large main memory for buffering, which low-performance NICs by design can't do.
There are two critical data-transfer interfaces to a high-performance NIC, both of which you must tune correctly to avoid packet loss under load: the bus-master DMA interface that moves packets between the NIC's on-chip FIFOs and main memory, and the transmit and receive descriptor rings in main memory that the driver shares with the NIC.
If the latency to schedule the NIC as bus master is excessive, the FIFO will underrun for transmit, or overflow for receive. Either causes a packet to be lost.
Excessive bus-master scheduling latency used to be more of a problem in QNX 4, where other devices (e.g. disk controllers) programmed with excessive DMA burst lengths would "park" themselves on the bus. This doesn't appear to be as much of a problem in QNX Neutrino, but keep it in mind if you're suffering from mysterious packet loss; the nicinfo output can often give you a clue here.
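If you suspect this kind of bus-master starvation, one rough check (a sketch only; en0 is a placeholder interface name, and the exact counter names vary from driver to driver) is to snapshot the nicinfo counters before and after a period of load and see whether FIFO overrun/underrun or similar error counters are climbing:

    # Snapshot NIC statistics around a period of traffic; rising
    # FIFO overrun/underrun counters between snapshots hint at
    # bus-master scheduling latency (counter names vary by driver).
    nicinfo en0 > before.txt
    # ... run or wait for the traffic that triggers the loss ...
    nicinfo en0 > after.txt
    diff before.txt after.txt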
For receive, this usually happens when a high-priority thread (e.g. priority greater than 21) runs READY and hogs the CPU for an extended period of time. This prevents io-pkt* from being scheduled; meanwhile, the NIC continues to bus-master received packets into main memory, and the receive descriptor ring eventually fills up, after which newly arriving packets are lost.
For transmit, this usually happens when there's an extremely large burst of transmit activity (e.g. on a server), possibly combined with some kind of backup or congestion (e.g. PAUSE frames), which fills the transmit descriptor ring faster than the NIC can drain it onto the wire.
In this network driver patch, the drivers for the high-performance NICs are generally configured with a default of 64 transmit descriptors and 128 receive descriptors. You can change these using the transmit=XXXX and receive=XXXX command-line options to the drivers, as shown below. Generally, the minimum allowed is 16, and the maximum is 2048. Because of the hardware design, stick to a power of 2, such as 16, 32, 64, 128, 256, 512, 1024, or 2048.
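For example, to bring the stack up with larger rings, pass the options to the driver on the io-pkt* command line. This is only a sketch: e1000 is a placeholder driver name, and io-pkt-v4-hc a placeholder io-pkt* variant; substitute whatever your hardware and system actually use.

    # Start io-pkt with 1024-entry transmit and receive rings.
    io-pkt-v4-hc -d e1000 transmit=1024,receive=1024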
Transmit buffer descriptors are quite small, generally in the range of 8 to 64 bytes, so the cost of increasing the transmit=XXXX value to, say, 1024 for a server (which sees large bursts of transmitted data) is quite small:
(1024 - 64) x 32 = 30,720 bytes
for a transmit descriptor of 32 bytes.
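You can double-check that figure with shell arithmetic, assuming the same 32-byte descriptor and the default ring size of 64:

    # Extra memory = (new ring size - default size) x descriptor size
    echo $(( (1024 - 64) * 32 ))    # prints 30720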
Receive buffer descriptors are similarly small; however, there's a catch. For each receive descriptor, the driver must allocate a 1,500-byte Ethernet packet buffer. Because the packet buffers must be aligned, they aren't permitted to cross a 4 KB page boundary, so in reality io-pkt* allocates a 2 KB buffer for each 1,500-byte Ethernet packet.
So, the cost of increasing the receive descriptor ring from the default 128 to 1024, with an almost insignificant 32-byte receive descriptor, is:
(1024 - 128) x (32 + 2048) = 1,863,680 bytes
or almost 2 MB. That's nowhere near as much as the filesystem grabs by default for its cache, but it's still a significant amount of memory on a memory-constrained embedded system.
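The same shell arithmetic confirms the receive-side figure, where each descriptor drags a 2 KB packet buffer along with its own 32 bytes:

    # Extra memory = (new size - default) x (descriptor + 2 KB buffer)
    echo $(( (1024 - 128) * (32 + 2048) ))    # prints 1863680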
For a memory-constrained system, carefully select the sizes of the transmit and receive descriptor rings so that they're as small as possible yet lose no packets under load, given the scheduling latency that io-pkt* experiences on your system.
Obviously, reducing the receive descriptor ring saves far more memory than reducing the transmit descriptor ring.
In an application where memory is of no concern but maximum performance is, transmit and receive descriptor rings of 1024 or even 2048 entries are generally used.
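Such a maximum-performance configuration might look like the following (same caveats as before: the driver name and io-pkt* variant are placeholders):

    # Maximum-size rings for a throughput-oriented system.
    io-pkt-v4-hc -d e1000 transmit=2048,receive=2048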
Most of the time, bigger is better. There is, however, a potential catch: for some benchmarks, such as RFC2544 (fast forwarding), we've observed that excessively large descriptor rings decrease performance because of cache thrashing.
However, that's really getting out there. Most of the time, you simply need to configure the transmit and receive descriptor ring sizes to suit your application, so that minimum memory is consumed and no packets are lost.