Fine-tuning your network drivers

This technical note is intended to help you tune your network drivers for increased performance or reduced memory footprint.

First, we need to talk about network driver interface hardware chips—ASICs—which are sometimes referred to as NICs (network interface controllers, or network interface chips).

At the risk of oversimplifying, we can categorize NICs into two groups: high-performance and low-performance. We aren't talking about the media bit rate (10, 100 or 1000 Mbit) but rather the ability of the complete system, when using the NIC, to avoid packet loss.

Hardware engineers work very hard to design media that rarely lose a packet. And there is a gain, or amplifying effect: when you lose 1% of your packets, you don't lose 1% of your throughput via a protocol, you lose around 50% of your throughput, due to the incurred software-level protocol timeouts and/or retransmissions.

High-performance NICs

What we call "high-performance" NICs have the ability, in a loaded system, to not lose any packets. They generally do this by using transmit and receive descriptor rings in main memory, which in turn point to packet buffers also in main memory.

High-performance NICs use bus-master DMA (direct memory access) to transfer packet data to and from main memory entirely independent of the CPU, using the descriptor rings as laundry lists of packet transmit and receive requests to carry out.

Thus, large scheduling latencies in software that service the NIC (e.g., io-pkt*) can be tolerated.

Low-performance NICs

What we call "low-performance" NICs have been observed by users, in loaded systems, to consistently lose packets, with corresponding poor data throughput performance. These NICs don't use descriptor rings and DMA, but for simplicity, instead attempt to buffer the entire packet in a (usually limited) on-chip buffer area.

Unfortunately, these low-performance NICs, because of their low cost and size, are very attractive to board designers. Examples of these (usually older, obsolete) NICs include:

On a fast (e.g., 2 GHz) lightly loaded machine, these low-performance NICs can function adequately, without packet loss.

However on a slower (e.g., 100 MHz) machine that's CPU-bound with applications that may increase the scheduling latency of io-pkt*, packet loss during receive can often result because the limited hardware buffer overflows.

You shouldn't use these NICs where you need high-performance data throughput. You should use them only for low-cost debug and diagnostic ports, which are often removed for production versions of boards.

If you're using NFS used with one of these low-performance NICs, you can get a great improvement by using the -B4096 or even -B2048 option to fs-nfs. Qnet in QNX Neutrino 6.3 and later generally automatically goes into "windowed mode" with these NICs to try to avoid packet loss.

Tuning high-performance NIC drivers

Common examples of high-performance NICs include:

These NICs are all at least 10/100 Mbit, and some are gigabit, but what makes them high-performance is their ability to function independently of the CPU and use the large CPU main memory for packet buffering, which the low performance NICs by design can't.

There are two critical data-transfer interfaces to a high-performance NIC, which you must tune correctly to avoid packet loss under load:

PHY probing

Almost all of the network drivers have been optimized for performance with respect to PHY probing.

Formerly, network drivers would periodically (e.g., every two or three seconds) communicate via the MII to the PHY chip connected to the NIC, to determine the speed and duplex of the current media connection.

The problem is that, while the PHY is being probed, packet loss can occur. The drivers now contain an optimization to not probe the PHY, as long as there have recently been some packets received. This gives maximum performance for most users.

However, there is a nasty scenario: the NIC is connected to a 100 Mbit full-duplex link. The cable is rapidly unplugged and immediately replugged into a 10 Mbit half-duplex hub, which also has a steady stream of (e.g. broadcasted) received packets. In this scenario, because of the steady stream of received packets, the network driver won't probe the PHY, and will still think it's in 100 Mbit full-duplex. This is a problem, because the NIC isn't listening before it transmits; it's still full-duplex, on a half-duplex link. Excessive collisions and out-of-window collisions will result in packet loss.

However, if you leave the cable unplugged for three seconds before plugging it into another hub, the driver probes the PHY and relearns the media parameters and reprograms the NIC with the appropriate duplex.

If you need to rapidly unplug and replug the cable into network boxes with different duplexes, you should specify probe_phy=1 to the network driver, to force it to always periodically probe the PHY. Packet loss may result during this probing, but you will know that the driver is always in sync with the PHY with respect to the media.

For maximum performance, the default is probe_phy=0.

Speed and duplex

Most (but not all) of the drivers support more than one Ethernet speed and duplex. The most common are 10 and 100 Mbit, and half and full duplex, though the newest NICs support 1000 Mbit (gigabit) Ethernet as well.

All of the drivers let you specify speed=XX and duplex=Z, where XX is 10, 100 or 1000, and Z is 0 (zero) for half-duplex, and 1 (one) for full-duplex. Generally most 10 Mbit links are half-duplex (to old hubs or repeaters, which is the original Xerox blue-book Ethernet) and most 100 Mbit links are full-duplex (switches with point-to-point connections). However, for maximum confusion, you will occasionally see 10 Mbit/full-duplex, and 100 Mbit/half-duplex, but not very often.

If you don't specify speed and duplex, the driver attempts to auto-negotiate the speed and duplex to the fastest possible, by an IEEE specification. Most Ethernet hardware produced in the last few years is compliant with the IEEE specification, but there is some older hardware around that isn't.

In the absence of auto-negotiation, the PHY can figure out the speed pretty easily. However, the duplex is another matter. If auto-negotiation isn't supported, the remote device is assumed to be older and thus half-duplex, not full-duplex.

The moral of the story is that 99% of the time you shouldn't specify speed and duplex; the auto-negotiation should automatically figure it out for you. If you run the nicinfo utility, it will tell you what the auto-negotiated speed and duplex is, and most of the time, it will be correct.

Note: It's crucial that both devices, at both ends of the link, use the same speed and duplex, otherwise heavy packet loss can occur (see above).

So, you should specify speed and duplex to the network driver only if you have older, perhaps broken or nonstandard Ethernet hardware, and you manually control both ends of the Ethernet link. For example, a managed hub or a cross-over cable.