Improving the network throughput
Increasing the iflib Rx budget
Many of the PCI network drivers use the iflib framework. This framework has a number of sysctl and tunable parameters that you can set, including rx_budget. The rx_budget tunable determines the maximum number of packets that the Rx task queue thread can process in a batch.
Under heavy load, increasing this value can improve the Rx performance.
For example:
# sysctl -w dev.ix.0.iflib.rx_budget=100
A value of 0 means the default of 16 is used.
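You can also check the current setting before changing it. The dev.ix.0 instance is taken from the example above; substitute the driver name and unit number for your own interface, and note that the value shown here is only illustrative:
# sysctl dev.ix.0.iflib.rx_budget
dev.ix.0.iflib.rx_budget: 0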
Considerations
Increasing the number of packets processed in a batch increases the latency on other packet flows in the network stack.
Changing the thread priorities
Generally speaking, there are three threads involved in the transmission and reception of a packet:
- A resource manager thread talking to the client application
- A task queue thread performing most of the work
- An interrupt service thread (IST)
The qnx.taskq_prio and qnx.ist_prio parameters control the priorities of the task queue thread and the interrupt service thread:
# sysctl -d qnx.taskq_prio
qnx.taskq_prio: taskq thread priority
# sysctl -d qnx.ist_prio
qnx.ist_prio: interrupt thread priority
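For example, you could adjust the two priorities relative to each other. This sketch assumes the parameters are writable at run time via sysctl -w; the values shown are only illustrative and must fit your system's overall priority scheme:
# sysctl -w qnx.taskq_prio=21
# sysctl -w qnx.ist_prio=22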
Considerations
Changing the relative priorities may increase Rx performance at the expense of Tx performance and vice versa. Take care when you set priorities to make sure you do not starve other functionality of CPU time.
Changing the network dispatch service (netisr) policies
The netisr code is invoked as part of packet processing. It has three different dispatch policies (descriptions taken from the code):
- NETISR_DISPATCH_DEFERRED
- All work is deferred for a netisr, regardless of context (may be overridden by protocols).
- NETISR_DISPATCH_HYBRID
- If the executing context allows direct dispatch, and we're running on the CPU the work would be performed on, then direct dispatch it if it wouldn't violate ordering constraints on the workstream.
- NETISR_DISPATCH_DIRECT
- If the executing context allows direct dispatch, always direct dispatch. (The default.)
Also from the code: Notice that changing the global policy could lead to short periods of misordered processing, but this is considered acceptable as compared to the complexity of enforcing ordering during policy changes. Protocols can override the global policy (when they're not doing that, they select NETISR_DISPATCH_DEFAULT).
The NETISR_DISPATCH_DIRECT policy guarantees that all packet ordering constraints are met, but means that the netisr code is invoked directly by the task queue thread.
When the policy is NETISR_DISPATCH_DEFERRED, the netisr occurs in its own thread. This adjustment divides the work between the task queue and netisr threads, which potentially increases the throughput.
You set the netisr (kernel network dispatch service) dispatch policy via a sysctl variable. For more information, go to the sysctl entry in the Utilities Reference.
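For example, in the FreeBSD code that io-sock is derived from, the policy is exposed through the net.isr.dispatch sysctl, which accepts the values direct, hybrid, and deferred. Assuming that name applies to your io-sock version, switching to deferred dispatch would look like this:
# sysctl net.isr.dispatch
net.isr.dispatch: direct
# sysctl -w net.isr.dispatch=deferred
net.isr.dispatch: direct -> deferred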
Considerations
The NETISR_DISPATCH_DEFERRED policy may violate ordering constraints.
Additional work needs to be performed when passing the work between the threads in a deferred dispatch scenario. If the task queue thread isn't already saturated, deferring may result in worse throughput or increased CPU usage.
Increasing the UDS stream buffer sizes
Unix Domain Sockets (UDS) have sysctl parameters that define their socket buffer sizes. For SOCK_STREAM types, the transmit and receive buffers default to 8 kB. For example:
# sysctl net.local.stream
net.local.stream.sendspace: 8192
net.local.stream.recvspace: 8192
If large amounts of data are to be transferred over a stream UDS using reads and writes of more than 8 kB, increasing the sysctl parameter values can reduce the number of message passes and improve throughput. The following example increases them to 64 kB:
# sysctl -w net.local.stream.sendspace=65536
net.local.stream.sendspace: 8192 -> 65536
# sysctl -w net.local.stream.recvspace=65536
net.local.stream.recvspace: 8192 -> 65536
Considerations
Increasing the socket buffer sizes increases the memory consumed by each UDS. In a system with a large number of UDS, this additional memory usage can be significant.
Disabling the ICMP redirects for improved routing performance
By default, io-sock sends ICMP redirects when it acts as a router. The redirect functionality in io-sock requires the packets to go via a slower, full-functionality path in the code. If you don't need to send redirects when routing, disable them; this lets the packets be forwarded via a fast-forwarding path. For example:
# sysctl net.inet.ip.redirect=0
net.inet.ip.redirect: 1 -> 0
# sysctl net.inet6.ip6.redirect=0
net.inet6.ip6.redirect: 1 -> 0
You can use protocol statistics to find out the number of packets forwarded and fast forwarded and the number of redirects sent. For example:
# netstat -s -p ip
ip:
...
0 packets forwarded (0 packets fast forwarded)
...
0 redirects sent
Considerations
ICMP redirects are required by RFC 792, but they're rarely needed in a modern network with a single router on a network segment.
Enabling the TCP ACK threshold settings
LRO (large receive offload)
The io-sock networking manager supports the large receive offload (LRO) feature, which reassembles incoming network packets into larger buffers for Rx processing. LRO also significantly reduces the number of ACK responses: because io-sock processes fewer ingress (batched data) packets, it generates fewer ACKs.
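Whether LRO is actually in effect depends on the NIC and driver. As a sketch (the ix0 interface name is only an example, and not every driver lets you toggle the option), you can turn LRO on or off through the interface's options flags with ifconfig:
# ifconfig ix0 lro
# ifconfig ix0 -lro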
ACK threshold sysctl variables
If your hardware or driver doesn't support LRO, you can use sysctl variables to reduce the number of ACK responses. (If LRO is supported, these variables have no effect because the ACK rate is already throttled.)
qnx.net.inet.tcp.tcp_ack_thres
- Normally, a TCP receiver sends an acknowledgment packet (ACK) after two Rx packets: the first one sets a delayed ACK timer, and the second one sends an ACK (if the previous ACK is delayed). The qnx.net.inet.tcp.tcp_ack_thres variable delays the second ACK until the outstanding data exceeds the specified size, expressed as a multiple of the maximum segment size (MSS). For example, if set to 20, the second ACK is not sent until the size of the outstanding data is more than 20 times the MSS. However, if the threshold exceeds the space remaining in the Rx socket buffer, the receiver resumes sending ACKs and qnx.net.inet.tcp.tcp_ack_thres is ignored. The delayed ACK is still sent if a timeout occurs before the threshold is reached.
qnx.net.inet.tcp.tcp_win_update_thres
- Adjusts when a window update packet is sent according to the rate at which the receiving application is reading data. If the read buffer is emptied by the specified multiple of the maximum segment size (MSS) and is within a certain range of the socket buffer size, a window update packet is sent. The value can be equal to or smaller than the value set by qnx.net.inet.tcp.tcp_ack_thres.
qnx.net.inet.tcp.tcp_ack_on_push
- Enables an event-based ACK. It is designed for use with sockets that send less data and thus generate fewer ACKs. When the sender sends a TCP data packet with the PUSH flag set (to report that it has no further data to send in its sending socket buffer), the receiver sends the ACK immediately instead of relying on the delayed ACK.
These sysctl variables let you set ACK thresholds to reduce the number of ACKs sent in best-case conditions (no packet loss). Because the delayed ACK timer is still armed, that ACK is sent at timeout. In addition, because algorithms such as fast retransmit are still engaged and missed segments still generate ACK responses, the thresholds are only in effect under ideal conditions (without detected packet loss).
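As a sketch, you could set the three thresholds together. The values are only illustrative (20 is the example multiple used above, the window update threshold is chosen to be no larger than the ACK threshold, and tcp_ack_on_push is assumed to be a boolean flag); tune them for your workload:
# sysctl -w qnx.net.inet.tcp.tcp_ack_thres=20
# sysctl -w qnx.net.inet.tcp.tcp_win_update_thres=10
# sysctl -w qnx.net.inet.tcp.tcp_ack_on_push=1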