Improving the network throughput
Increasing the iflib Rx budget
Many of the PCI network drivers use the iflib framework. This framework has a number of sysctl and tunable parameters that you can set, including rx_budget. The rx_budget tunable determines the maximum number of packets that the Rx task queue thread can process in a batch.
Under heavy load, increasing this value can improve the Rx performance.
For example:
# sysctl -w dev.ix.0.iflib.rx_budget=100
A value of 0 means the default of 16 is used.
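You can also check the current setting before changing it. The dev.ix.0 instance is taken from the example above; substitute the driver name and unit number for your own interface, and note that the value shown here is only illustrative:
# sysctl dev.ix.0.iflib.rx_budget
dev.ix.0.iflib.rx_budget: 0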
Considerations
Increasing the number of packets processed in a batch increases the latency on other packet flows in the network stack.
Changing the thread priorities
Generally speaking, there are three threads involved in the transmission and reception of a packet:
- A resource manager thread talking to the client application
- A task queue thread performing most of the work
- An interrupt service thread (IST)
The qnx.taskq_prio and qnx.ist_prio parameters control the priorities of the task queue thread and the interrupt service thread:
# sysctl -d qnx.taskq_prio
qnx.taskq_prio: taskq thread priority
# sysctl -d qnx.ist_prio
qnx.ist_prio: interrupt thread priority
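For example, you could adjust the two priorities relative to each other. This sketch assumes the parameters are writable at run time via sysctl -w; the values shown are only illustrative and must fit your system's overall priority scheme:
# sysctl -w qnx.taskq_prio=21
# sysctl -w qnx.ist_prio=22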
Considerations
Changing the relative priorities may increase Rx performance at the expense of Tx performance and vice versa. Take care when you set priorities to make sure you do not starve other functionality of CPU time.
Changing the network dispatch service (netisr) policies
The netisr code is invoked as part of packet processing. It has three different dispatch policies (descriptions taken from the code):
- NETISR_DISPATCH_DEFERRED
- All work is deferred for a netisr, regardless of context (may be overridden by protocols).
- NETISR_DISPATCH_HYBRID
- If the executing context allows direct dispatch, and we're running on the CPU the work would be performed on, then direct dispatch it if it wouldn't violate ordering constraints on the workstream.
- NETISR_DISPATCH_DIRECT
- If the executing context allows direct dispatch, always direct dispatch. (The default.)
Also from the code: Notice that changing the global policy could lead to short periods of misordered processing, but this is considered acceptable as compared to the complexity of enforcing ordering during policy changes. Protocols can override the global policy (when they're not doing that, they select NETISR_DISPATCH_DEFAULT).
The NETISR_DISPATCH_DIRECT policy guarantees that all packet ordering constraints are met, but means that the netisr code is invoked directly by the task queue thread.
When the policy is NETISR_DISPATCH_DEFERRED, the netisr occurs in its own thread. This adjustment divides the work between the task queue and netisr threads, which potentially increases the throughput.
You set the netisr (kernel network dispatch service) dispatch policy via a sysctl variable. For more information, go to the sysctl entry in the Utilities Reference.
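For example, in the FreeBSD code that io-sock is derived from, the policy is exposed through the net.isr.dispatch sysctl, which accepts the values direct, hybrid, and deferred. Assuming that name applies to your io-sock version, switching to deferred dispatch would look like this:
# sysctl net.isr.dispatch
net.isr.dispatch: direct
# sysctl -w net.isr.dispatch=deferred
net.isr.dispatch: direct -> deferred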
Considerations
The NETISR_DISPATCH_DEFERRED policy may violate ordering constraints.
Additional work needs to be performed when passing the work between the threads in a deferred dispatch scenario. If the task queue thread isn't already saturated, deferring may result in worse throughput or increased CPU usage.
Increasing the UDS stream buffer sizes
Unix Domain Sockets (UDS) have sysctl parameters that define their socket buffer sizes. For SOCK_STREAM types, the transmit and receive buffers default to 8 kB. For example:
# sysctl net.local.stream
net.local.stream.sendspace: 8192
net.local.stream.recvspace: 8192
If large amounts of data are to be transferred over a stream UDS using reads and writes of more than 8 kB, increasing the sysctl parameter values can reduce the number of message passes and improve throughput. The following example increases them to 64 kB:
# sysctl -w net.local.stream.sendspace=65536
net.local.stream.sendspace: 8192 -> 65536
# sysctl -w net.local.stream.recvspace=65536
net.local.stream.recvspace: 8192 -> 65536
Considerations
Increasing the socket buffer sizes increases the memory consumed by each UDS. In a system with a large number of UDS, this additional memory usage can be significant.
Disabling the ICMP redirects for improved routing performance
By default, io-sock sends ICMP redirects when it acts as a router. The redirect functionality in io-sock requires the packets to go via a slower, full-functionality path in the code. If you don't need to send redirects when routing, disable them; this lets the packets be forwarded via a fast-forwarding path. For example:
# sysctl net.inet.ip.redirect=0
net.inet.ip.redirect: 1 -> 0
# sysctl net.inet6.ip6.redirect=0
net.inet6.ip6.redirect: 1 -> 0
You can use protocol statistics to find out the number of packets forwarded and fast forwarded and the number of redirects sent. For example:
# netstat -s -p ip
ip:
...
0 packets forwarded (0 packets fast forwarded)
...
0 redirects sent
Considerations
ICMP redirects are required by RFC 792, but they're rarely needed in a modern network with a single router on a network segment.
Enabling the TCP ACK threshold settings
LRO (large receive offload)
The io-sock networking manager supports the large receive offload (LRO) feature, which reassembles incoming network packets into larger buffers for Rx processing. LRO also significantly reduces the number of ACK responses: because io-sock processes fewer ingress (batched data) packets, it generates fewer ACKs.
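Whether LRO is actually in effect depends on the NIC and driver. As a sketch (the ix0 interface name is only an example, and not every driver lets you toggle the option), you can turn LRO on or off through the interface's options flags with ifconfig:
# ifconfig ix0 lro
# ifconfig ix0 -lro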
ACK threshold sysctl variables
If your hardware or driver doesn't support LRO, you can use sysctl variables to reduce the number of ACK responses. (If LRO is supported, these variables have no effect because the ACK rate is already throttled.)
qnx.net.inet.tcp.tcp_ack_thres
- Normally, a TCP receiver sends an acknowledgment packet (ACK) after two Rx packets: the first one sets a delayed ACK timer, and the second one sends an ACK (if the previous ACK is delayed). The qnx.net.inet.tcp.tcp_ack_thres variable delays the second ACK until the outstanding data exceeds the specified size, expressed as a multiple of the maximum segment size (MSS). For example, if set to 20, the second ACK is not sent until the size of the outstanding data is more than 20 times the MSS. However, if the threshold exceeds the space remaining in the Rx socket buffer, the receiver resumes sending ACKs and qnx.net.inet.tcp.tcp_ack_thres is ignored. The delayed ACK is still sent if a timeout occurs before the threshold is reached.
qnx.net.inet.tcp.tcp_win_update_thres
- Adjusts when a window update packet is sent according to the rate at which the receiving application is reading data. If the read buffer is emptied by the specified multiple of the maximum segment size (MSS) and is within a certain range of the socket buffer size, a window update packet is sent. The value can be equal to or smaller than the value set by qnx.net.inet.tcp.tcp_ack_thres.
qnx.net.inet.tcp.tcp_ack_on_push
- Enables an event-based ACK. It is designed for use with sockets that send less data and thus generate fewer ACKs. When the sender sends a TCP data packet with the PUSH flag set (to report that it has no further data to send in its sending socket buffer), the receiver sends the ACK immediately instead of relying on the delayed ACK.
These sysctl variables let you set ACK thresholds to reduce the number of ACKs sent in best-case conditions (no packet loss). Because the delayed ACK timer is still armed, that ACK is sent at timeout. In addition, because algorithms such as fast retransmit are still engaged and missed segments still generate ACK responses, the thresholds are only in effect under ideal conditions (without detected packet loss).
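As a sketch, you could set the three thresholds together. The values are only illustrative (20 is the example multiple used above, the window update threshold is chosen to be no larger than the ACK threshold, and tcp_ack_on_push is assumed to be a boolean flag); tune them for your workload:
# sysctl -w qnx.net.inet.tcp.tcp_ack_thres=20
# sysctl -w qnx.net.inet.tcp.tcp_win_update_thres=10
# sysctl -w qnx.net.inet.tcp.tcp_ack_on_push=1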