The high-performance NICs supported by this patch are all at least 10/100 Mbit, and some are gigabit. What makes them high-performance, however, isn't raw link speed: it's their ability to function independently of the CPU, bus-mastering packets into and out of the large main memory for buffering, which low-performance NICs by design can't do.
There are two critical data-transfer interfaces to a high-performance NIC, both of which you must tune correctly to avoid packet loss under load: the bus-master DMA interface that moves packets between the NIC's on-chip FIFOs and main memory, and the transmit and receive descriptor rings in main memory that the driver shares with the NIC.
If the latency to schedule the NIC as bus master is excessive, the FIFO will underrun for transmit, or overflow for receive. Either causes a packet to be lost.
Excessive bus-master scheduling latency used to be more of a problem in QNX 4, where other devices (e.g. disk controllers) programmed with excessive DMA burst lengths would "park" themselves on the bus. This doesn't appear to be as much of a problem in QNX Neutrino, but keep it in mind if you're suffering from mysterious packet loss; the nicinfo output can often give you a clue here.
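If you suspect this kind of bus-master starvation, one rough check (a sketch only; en0 is a placeholder interface name, and the exact counter names vary from driver to driver) is to snapshot the nicinfo counters before and after a period of load and see whether FIFO overrun/underrun or similar error counters are climbing:

    # Snapshot NIC statistics around a period of traffic; rising
    # FIFO overrun/underrun counters between snapshots hint at
    # bus-master scheduling latency (counter names vary by driver).
    nicinfo en0 > before.txt
    # ... run or wait for the traffic that triggers the loss ...
    nicinfo en0 > after.txt
    diff before.txt after.txt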
For receive, this usually happens when a high-priority thread (e.g. priority greater than 21) runs READY and hogs the CPU for an extended period of time. This prevents io-pkt* from being scheduled; meanwhile, the NIC continues to bus-master received packets into main memory, and the receive descriptor ring eventually fills up, after which newly arriving packets are lost.
For transmit, this usually happens when there's an extremely large burst of transmit activity (e.g. on a server), possibly combined with some kind of backup or congestion (e.g. PAUSE frames), which fills the transmit descriptor ring faster than the NIC can drain it onto the wire.
In this network driver patch, the drivers for the high-performance NICs are generally configured with a default of 64 transmit descriptors and 128 receive descriptors. You can change these using the transmit=XXXX and receive=XXXX command-line options to the drivers, as shown below. Generally, the minimum allowed is 16, and the maximum is 2048. Because of the hardware design, stick to a power of 2, such as 16, 32, 64, 128, 256, 512, 1024, or 2048.
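For example, to bring the stack up with larger rings, pass the options to the driver on the io-pkt* command line. This is only a sketch: e1000 is a placeholder driver name, and io-pkt-v4-hc a placeholder io-pkt* variant; substitute whatever your hardware and system actually use.

    # Start io-pkt with 1024-entry transmit and receive rings.
    io-pkt-v4-hc -d e1000 transmit=1024,receive=1024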
Transmit buffer descriptors are quite small, generally in the range of 8 to 64 bytes, so the cost of increasing the transmit=XXXX value to, say, 1024 for a server (which sees large bursts of transmitted data) is quite small:
(1024 - 64) x 32 = 30,720 bytes
for a transmit descriptor of 32 bytes.
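You can double-check that figure with shell arithmetic, assuming the same 32-byte descriptor and the default ring size of 64:

    # Extra memory = (new ring size - default size) x descriptor size
    echo $(( (1024 - 64) * 32 ))    # prints 30720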
Receive buffer descriptors are similarly small; however, there's a catch. For each receive descriptor, the driver must allocate a 1,500-byte Ethernet packet buffer. Because the packet buffers must be aligned, they aren't permitted to cross a 4 KB page boundary, so in reality io-pkt* allocates a 2 KB buffer for each 1,500-byte Ethernet packet.
So, the cost of increasing the receive descriptor ring from the default 128 to 1024, with an almost insignificant 32-byte receive descriptor, is:
(1024 - 128) x (32 + 2048) = 1,863,680 bytes
or almost 2 MB. That's nowhere near as much as the filesystem grabs by default for its cache, but it's still a significant amount of memory on a memory-constrained embedded system.
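The same shell arithmetic confirms the receive-side figure, where each descriptor drags a 2 KB packet buffer along with its own 32 bytes:

    # Extra memory = (new size - default) x (descriptor + 2 KB buffer)
    echo $(( (1024 - 128) * (32 + 2048) ))    # prints 1863680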
For a memory-constrained system, carefully select the sizes of the transmit and receive descriptor rings so that they're as small as possible yet lose no packets under load, given the scheduling latency that io-pkt* experiences on your system.
Obviously, reducing the receive descriptor ring saves far more memory than reducing the transmit descriptor ring.
In an application where memory is of no concern but maximum performance is, transmit and receive descriptor rings of 1024 or even 2048 entries are generally used.
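Such a maximum-performance configuration might look like the following (same caveats as before: the driver name and io-pkt* variant are placeholders):

    # Maximum-size rings for a throughput-oriented system.
    io-pkt-v4-hc -d e1000 transmit=2048,receive=2048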
Most of the time, bigger is better. There is, however, a potential catch: for some benchmarks, such as RFC2544 (fast forwarding), we've observed that excessively large descriptor rings decrease performance because of cache thrashing.
However, that's really getting out there. Most of the time, you simply need to configure the transmit and receive descriptor ring sizes to suit your application, so that minimum memory is consumed and no packets are lost.