Authored by: Xiaohan Qin
(This blog is written in response to a customer’s request for performance tuning of 10Ge on POWER8. Most of the defaults and tuning recommendations are applicable to POWER7 as well.)
Typically network performance tuning involves setting network stack, interface, and NIC device parameters in addition to provisioning CPU and memory appropriately. In the case of virtual networking, the back-end device settings as well as virtual IO server resources are also relevant.
Table 1 displays key network interface tuning parameters for 10Ge and their default values in the AIX environment. Most default values are sufficient to support 10Ge throughput, with a few exceptions noted in the “Remarks” column.
| Parameters | Component | Default | Remarks |
| --- | --- | --- | --- |
| rfc1323 | stack | 1 | Default based on adapter speed and MTU; see Table 2 |
| tcp_sendspace | stack | 262144 | Default based on adapter speed and MTU; see Table 2 |
| tcp_recvspace | stack | 262144 | Default based on adapter speed and MTU; see Table 2 |
| udp_sendspace | stack | 9216 | Consider increasing to 64K if the workloads stress the UDP protocol. |
| udp_recvspace | stack | 42080 | Consider increasing to 640K if the workloads stress the UDP protocol. |
| mtu | interface | 1500 | Consider setting to 9000 and enabling jumbo_frame. Not defaulted to 9000 because it requires the same setting on the other nodes and switches on the LAN. |
| jumbo_frame | device | no | Consider changing to yes. Requires the other nodes and switches on the LAN to support the configuration. |
| flow_ctrl | device | yes | |
| chksum_offload | device | yes | |
| large_send | device | yes | |
| large_receive | device | yes | |
Table 1: Network Interface Tuning Parameters for 10Ge (v)NIC
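To see what is currently in effect, the stack (“no”) tunables and the interface and device attributes can be listed as follows (en0 and ent0 are example device names; substitute your own):
no -a | grep -E "tcp_|udp_|rfc"
lsattr -El en0
lsattr -El ent0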
Note that the “interface” attributes/parameters can be changed as follows:
chdev -l enX -a <attribute>=<value>
Similarly, the “device” attributes/parameters can be changed as below:
chdev -l entX -a <attribute>=<value>
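For example, assuming the physical adapter is ent0 with interface en0, jumbo frames and a 9000-byte MTU could be enabled roughly as follows. The exact attribute name (jumbo_frame vs. jumbo_frames) varies by adapter driver, and device attribute changes generally require the interface to be detached first or the -P flag to defer the change until the next boot:
chdev -l ent0 -a jumbo_frame=yes -P
chdev -l en0 -a mtu=9000 -P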
More on TCP/UDP send and receive space
For the TCP protocol, AIX by default does not use the system-wide “no” attributes for rfc1323, tcp_sendspace, and tcp_recvspace. Instead it uses interface-specific network options (ISNO) so that these parameters can be configured according to the underlying adapter. Table 2 shows the AIX default values for the three TCP parameters. With rfc1323 on, 256K of TCP send and receive space is sufficient to support 10Ge adapters.
| Adapter/MTU | tcp_sendspace | tcp_recvspace | rfc1323 |
| --- | --- | --- | --- |
| 1G/1500 | 131072 | 65536 | 0 |
| 1G/9000 | 262144 | 131072 | 1 |
| 10G/1500 | 262144 | 262144 | 1 |
| 10G/9000 | 262144 | 262144 | 1 |
Table 2: AIX defaults for TCP send and receive space and rfc1323
The ISNOs are part of the network interface attributes. The output below displays the ISNOs of interest. If any of the ISNO attributes is not set, the interface configuration takes the default from Table 2. The chdev command (chdev -l enX -a <attribute>=<value>) can be used to change the settings if necessary.
# lsattr -El en0|grep -E "rfc|space"
rfc1323 Enable/Disable TCP RFC 1323 Window Scaling True
tcp_recvspace Set Socket Buffer Space for Receiving True
tcp_sendspace Set Socket Buffer Space for Sending True
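If the ISNO values need to be set explicitly, for example to pin the 10Ge defaults on en0 (an example interface name), a single chdev invocation can set all three:
chdev -l en0 -a rfc1323=1 -a tcp_sendspace=262144 -a tcp_recvspace=262144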
Currently, there are no ISNO attributes for UDP. In other words, the UDP protocol still relies on the global “no” attributes udp_sendspace and udp_recvspace, whose defaults may be too low for 10Ge adapters. They should be increased significantly (cf. Table 1) if the workloads stress the UDP protocol.
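For example, the UDP buffers can be raised to the values suggested in Table 1 with the no command; the -p option makes the change persistent across reboots:
no -p -o udp_sendspace=65536 -o udp_recvspace=655360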
Additional 10Ge performance parameters
Table 3 lists additional tuning parameters common to many 10Ge NIC adapters. The exact attribute names and their defaults may vary from device to device. The out-of-box defaults have been tuned in the development labs to achieve the peak throughput of the adapters. No adjustment is needed except perhaps for very strenuous workloads, whose tuning might require the assistance of performance engineers.
| Attributes | Remarks |
| --- | --- |
| ipv6 offload | IPv6 offloads are configured independently from the IPv4 offloads |
| num of tx queues | The number of transmit queues |
| sz of tx queue | The size of the transmit queues |
| num of rx queues | The number of receive queues |
| sz of rx queue | The size of the receive queues |
| sz of sw tx queue | The size of the software transmit queue |
| intr_cnt | Interrupt coalescing counter |
| intr_time | Interrupt coalescing timer (microseconds) |
Table 3: Additional 10Ge tuning parameters
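Because the exact attribute names differ from driver to driver, the quickest way to see what a particular adapter exposes is to list its attributes and look for the queue and interrupt coalescing entries; ent0 below is only an example:
lsattr -El ent0 | grep -Ei "queue|intr"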
Virtual Ethernet (VETH) backed by SEA
The VETH configuration attributes are unique and quite different from the attributes of physical NICs (compare Table 4 and Table 1).
First of all, VETH lacks several attributes that are common on physical NICs, namely jumbo_frame, largesend, and large_receive. The reason is that AIX VETH is capable of sending and receiving “super” packets of up to ~64K with no configuration required. As a result, it is possible to set the MTU over VETH as large as 60K.
That said, we caution against such practice because it is more likely to cause confusion and problems than to bring benefits. For traffic that is routed outside the server, large packets must be segmented or fragmented. Instead of sending large packets under a huge MTU, the client VETH is better off employing largesend. The difference is that “largesend packets” convey the MSS (Maximum Segment Size). Upon receiving “largesend packets”, SEA passes that information to the physical NIC for TCP segmentation offload. Without the MSS, SEA would have to perform IP fragmentation in software (slower). And because fragmented IP packets can be a security risk, they are blocked by firewall rules in some environments. In reality, with path MTU discovery (on by default in AIX), the MSS negotiated between two endpoints across servers reflects the network path MTU rather than the huge MTU set on the interface.
How does one enable largesend over VETH when the device itself does not have the attribute? The largesend configuration over VETH is done via the network interface attribute mtu_bypass. In AIX 7.1, the default for mtu_bypass is “off”. In AIX 7.2, the default has been changed to “on” (see Table 4).
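For example, on AIX 7.1 the current setting can be checked and largesend turned on for a VETH interface (en0 assumed here) as follows:
lsattr -El en0 -a mtu_bypass
chdev -l en0 -a mtu_bypass=on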
Secondly, AIX VETH supports multiple receive buffer sizes, which is uncommon for physical NICs. The device attributes include a set of receive buffer tuning parameters. Most of the time, the default values work fine. However, for heavy workloads, particularly workloads consisting predominantly of small messages, the defaults for the tiny and small buffers prove inadequate. We recommend increasing them to their maximums (see Table 4 and the example that follows it).
Another frequently asked question is: how fast is the VETH device? A number of sources have cited VETH as equivalent to 1Ge; that is not accurate. The bandwidth of VETH is workload dependent. For streaming workloads, VETH can easily achieve 20G or even higher. For transactional workloads, the VETH rate is less impressive, performing a little better than 1Ge. The AIX network interface configuration treats VETH as 10Ge when setting the defaults for TCP send and receive space and rfc1323.
| Parameters | Component | Default | Remarks |
| --- | --- | --- | --- |
| rfc1323 | stack | 1 | |
| tcp_sendspace | stack | 262144 | |
| tcp_recvspace | stack | 262144 | |
| udp_sendspace | stack | 9216 | Consider increasing to 64K if the workloads stress the UDP protocol. |
| udp_recvspace | stack | 42080 | Consider increasing to 640K if the workloads stress the UDP protocol. |
| mtu | interface | 1500 | Not defaulted to 9000 because it requires the same setting on the other nodes on the LAN. |
| mtu_bypass | interface | on (AIX 7.2), off (AIX 7.1) | Applies to VETH only. Default is on in AIX 7.2; off in AIX 7.1 (needs to be turned on). |
| chksum_offload | device | yes | |
| max_buf_huge / min_buf_huge | device | 64 / 24 | |
| max_buf_large / min_buf_large | device | 64 / 24 | |
| max_buf_medium / min_buf_medium | device | 256 / 128 | |
| max_buf_small / min_buf_small | device | 2048 / 512 | Recommend 4096 for both. |
| max_buf_tiny / min_buf_tiny | device | 2048 / 512 | Recommend 4096 for both. |
Table 4: Tuning parameters for VETH
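Following the buffer recommendation above, a hedged example of raising the tiny and small buffers on the VETH device (ent0 assumed) is shown below; because the adapter is typically in use, the -P flag records the change for the next reboot:
chdev -l ent0 -a max_buf_tiny=4096 -a min_buf_tiny=4096 -a max_buf_small=4096 -a min_buf_small=4096 -P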
SEA configuration
As mentioned at the beginning, for VETH to perform well, the SEA adapter must be configured accordingly. Table 5 lists the SEA attributes that impact VETH throughput and their defaults in recent versions of VIOS (2.2.4.0 and later). Please see the “Remarks” column for tuning recommendations.
| Parameters | Default | Remarks |
| --- | --- | --- |
| jumbo_frame | no | Consider changing to yes for high performance. Requires the other nodes and switches on the LAN to support the configuration. |
| largesend | 1 | |
| large_receive | 0 | Enable large_receive. It has been disabled by default because previously Linux and IBM i could not handle large_receive packets. This has since been resolved in VIOS, Linux, and IBM i, and we plan to enable large_receive by default in the near future. |
| thread | 1 | |
| nthreads | 7 | |
| realin_threads | 0 | Number of threads dedicated to processing packets received on the physical NIC. “0” means packets received on the virtual and physical sides share the threads. Note realin_threads < nthreads. |
| queue_size | 8192 | |
Table 5: Performance tuning parameters for SEA
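As an example, on the VIOS (as padmin, with ent5 assumed to be the SEA device), the current SEA settings can be inspected and large_receive enabled roughly as follows; note that the accepted value syntax (yes/no vs. 1/0) can vary with the VIOS level, so check the lsdev output first:
lsdev -dev ent5 -attr
chdev -dev ent5 -attr large_receive=yes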
For the configuration parameters of the physical NIC under SEA, please refer to the “device” attributes in Table 1. Likewise, for the configuration parameters of the VETH under SEA, please refer to the “device” attributes in Table 4.
SEA is a dual-function device: on one hand it serves as a bridge for client LPARs; on the other, it can be used as a network interface device for the VIOS partition. For the bridging function, client LPAR traffic does not involve the VIOS networking stack. Hence VIOS/SEA network stack or interface parameters such as TCP send and receive space and MTU are irrelevant to client VETH performance. If the SEA is used as a NIC for the VIOS partition, the network stack and interface parameters over SEA are determined based on the physical adapter underneath the SEA (cf. Table 1 and Table 2).
SR-IOV backed vNIC
Although the SR-IOV backed vNIC adapter is a virtual device and capable of LPM, configuration-wise it resembles a physical NIC more than a VETH (use Table 1 and Table 2). The recommendation is to turn on largesend, large_receive, and jumbo_frame if possible. Note that to enable jumbo_frame on a vNIC, one must enable jumbo_frame on the physical port of the SR-IOV adapter, which can be done through the HMC GUI. This requirement is similar to enabling jumbo_frame for an HEA logical port (on POWER7 systems).
CPU and memory resources
The rule of thumb for CPU sizing for 10Ge adapters is 1-2 processor cores per 10Ge of throughput, depending on the processor: more for POWER7 (~2) and less for POWER8 (~1.5).
As for memory, the bulk of device memory consumption comes from the packet buffers used for Tx/Rx. If jumbo_frame is not enabled, the packet buffer size is 2 KB; otherwise, it is 16 KB. For a device configured with 3 Rx queues of 1K elements, 2 Tx queues of 1K elements, a software Tx queue of 8K elements, and jumbo_frame not enabled, the memory footprint can be as large as ~26 MB, calculated as ~(2*1K + 3*1K + 8K)*2 KB. If jumbo_frame is enabled, the memory footprint may increase eightfold.
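Worked out in full, that is (2*1024 + 3*1024 + 8192) buffers * 2 KB per buffer ≈ 13,312 * 2 KB ≈ 26 MB; with 16 KB jumbo buffers the same queue configuration would consume roughly 208 MB.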