PowerVM

AIX Network Tuning for 10Ge and Virtual Network

By Jim Cunningham posted Mon June 22, 2020 12:47 PM

  

Authored By : Xiaohan Qin

(This blog was written in response to a customer's request for performance tuning of 10Ge on POWER8. Most of the defaults and tuning recommendations are applicable to POWER7 as well.)

 

Typically network performance tuning involves setting network stack, interface, and NIC device parameters in addition to provisioning CPU and memory appropriately. In the case of virtual networking, the back-end device settings as well as virtual IO server resources are also relevant.

 
Table 1 displays key network interface tuning parameters for 10Ge and their default values in the AIX environment. Most default values are sufficient to support 10Ge throughput, with a few exceptions noted in the "Remarks" column.

 

Parameters       Component   Default   Remarks
rfc1323          stack       1         Default based on adapter speed and MTU; see Table 2
tcp_sendspace    stack       262144    Default based on adapter speed and MTU; see Table 2
tcp_recvspace    stack       262144    Default based on adapter speed and MTU; see Table 2
udp_sendspace    stack       9216      Consider increasing to 64K if the workloads stress the UDP protocol.
udp_recvspace    stack       42080     Consider increasing to 640K if the workloads stress the UDP protocol.
mtu              interface   1500      Consider setting to 9000 and enabling jumbo_frame. Not defaulted to 9000 because it requires the same setting on the other nodes and switches on the LAN.
jumbo_frame      device      no        Consider changing to yes. Requires other nodes and switches on the LAN to support the configuration.
flow_ctrl        device      yes
chksum_offload   device      yes
large_send       device      yes
large_receive    device      yes

                             Table 1 : Network Interface Tuning Parameters for 10Ge (v)NIC

 

Note that the “interface” attributes/parameters can be changed as follows:

            chdev -l enX -a <attribute>=<value>

 

Similarly, the “device” attributes/parameters can be changed as below:

            chdev -l entX -a <attribute>=<value>
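As a concrete sketch of the two command forms, the following enables jumbo frames on a hypothetical adapter (ent0/en0 are example device names; verify yours with lsdev, and note the interface must be detached before the device attribute can change):

```shell
# Sketch only: ent0/en0 are placeholder names; run as root on AIX.
chdev -l en0  -a state=detach           # detach the interface first
chdev -l ent0 -a jumbo_frame=yes        # "device" attribute on the adapter
chdev -l en0  -a mtu=9000 -a state=up   # "interface" attribute, then bring it back up
```

The same change must be made on the other nodes and switches on the LAN, as noted in Table 1.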

 

More on TCP/UDP send and receive space

 

For the TCP protocol, AIX does not by default use the system-wide "no" attributes for rfc1323, tcp_sendspace, and tcp_recvspace. Instead it uses interface-specific network options (ISNO) so that these parameters can be configured according to the underlying adapter. Table 2 shows the AIX default values for the three TCP parameters. With rfc1323 on, 256K of TCP send and receive space is sufficient to support 10Ge adapters.

 

Adapter/MTU   tcp_sendspace   tcp_recvspace   rfc1323
1G/1500       131072          65536           0
1G/9000       262144          131072          1
10G/1500      262144          262144          1
10G/9000      262144          262144          1

                                Table 2 : AIX defaults for TCP send and receive space and rfc1323

 

The ISNOs are part of the network interface attributes. The output below displays the ISNOs of interest. If an ISNO attribute is not set, the interface configuration takes the default from Table 2. The chdev command (chdev -l enX -a <attribute>=<value>) can be used to change the settings if necessary.

 

# lsattr -El en0|grep -E "rfc|space"

rfc1323                     Enable/Disable TCP RFC 1323 Window Scaling    True

tcp_recvspace               Set Socket Buffer Space for Receiving         True

tcp_sendspace               Set Socket Buffer Space for Sending           True
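If the ISNO values need to be pinned explicitly rather than left to the adapter-based defaults, a hedged example (en0 is a placeholder interface name; the values are the 10Ge defaults from Table 2):

```shell
# Sketch: set the ISNO values explicitly on en0 (example interface)
chdev -l en0 -a rfc1323=1 -a tcp_sendspace=262144 -a tcp_recvspace=262144
lsattr -El en0 -a rfc1323 -a tcp_sendspace -a tcp_recvspace   # verify
```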

 

Currently, there are no ISNO attributes for UDP. In other words, the UDP protocol still relies on the global "no" attributes udp_sendspace and udp_recvspace, whose defaults may be too low for 10Ge adapters. They should be increased significantly (cf. Table 1) if the workloads stress the UDP protocol.
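For example, the global UDP buffers could be raised to the values suggested in Table 1 with the no command (a sketch; -p makes the change persistent across reboots):

```shell
# Sketch: raise the global UDP send/receive space (64K and 640K per Table 1)
no -p -o udp_sendspace=65536 -o udp_recvspace=655360
```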

 

Additional 10Ge performance parameters

Table 3 lists additional tuning parameters common to many 10Ge NIC adapters. The exact attribute names and their defaults may vary from device to device. The out-of-box defaults have been tuned in the development labs to achieve the peak throughput of the adapters. No adjustment is needed except perhaps for very strenuous workloads, whose tuning might require the assistance of performance engineers.

 

Attributes          Remarks
ipv6 offload        IPv6 offloads are configured independently from IPv4 offloads
num of tx queues    The number of transmit queues
sz of tx queue      The size of each transmit queue
num of rx queues    The number of receive queues
sz of rx queue      The size of each receive queue
sz of sw tx queue   The size of the software transmit queue
intr_cnt            Interrupt coalescing counter
intr_time           Interrupt coalescing timer (microseconds)

Table 3: Additional 10Ge tuning parameters

Virtual Ethernet (VETH) backed by SEA

The VETH configuration attributes are unique and quite different from those of physical NICs (compare Table 4 with Table 1).

 

First of all, VETH lacks several attributes that are common to physical NICs, namely jumbo_frame, large_send, and large_receive. The reason is that AIX VETH is capable of sending and receiving "super" packets of up to ~64K without any configuration. As a result, it is possible to set the MTU over VETH as large as 60K.

 

With that said, we caution against such a practice because it is more likely to cause confusion and problems than to bring benefits. For traffic that is routed outside the server, large packets must be segmented or fragmented. Instead of sending large packets that assume a huge MTU, the client VETH is better off employing largesend. The difference is that "largesend" packets convey the MSS (Maximum Segment Size). Upon receiving "largesend" packets, SEA passes the information to the physical NIC for TCP segmentation offload. Without the MSS, SEA would have to perform IP fragmentation in software (slower). And because fragmented IP packets can be a security risk, they are blocked by firewall rules in some environments. In reality, with path MTU discovery (on by default in AIX), the MSS negotiated between two endpoints across servers reflects the network path MTU rather than the huge MTU set on the interface.

 

How does one enable largesend over VETH when the device itself does not have the attribute? The largesend configuration over VETH is done via the network interface attribute mtu_bypass. In AIX 7.1, the default for mtu_bypass is "off". In AIX 7.2, the default has been changed to "on" (see Table 4).
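On AIX 7.1, where the default is still "off", enabling it takes one command (a sketch; en0 is an example interface name):

```shell
# Sketch: enable largesend over VETH on AIX 7.1 (en0 is a placeholder)
chdev -l en0 -a mtu_bypass=on
lsattr -El en0 -a mtu_bypass    # verify the setting took effect
```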

 

Secondly, AIX VETH supports multiple receive buffer sizes, which is uncommon for physical NICs. The device attributes include a set of receive buffer tuning parameters. Most of the time the default values work fine. However, for heavy workloads, particularly workloads consisting predominantly of small messages, the defaults for the tiny and small buffers prove inadequate. We recommend increasing them to their maximums (see Table 4).
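A sketch of that recommendation (ent0 is an example VETH device name; -P defers the change to the next reboot when the device is busy):

```shell
# Sketch: raise tiny/small receive buffers to their maximums (values per Table 4)
chdev -l ent0 -a max_buf_tiny=4096  -a min_buf_tiny=4096 \
              -a max_buf_small=4096 -a min_buf_small=4096 -P
```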

 

Another frequently asked question is: how fast is the VETH device? A number of sources have cited VETH as equivalent to 1Ge. That is not accurate. The bandwidth of VETH is workload dependent. For streaming workloads, VETH can easily achieve 20G or even higher. For transactional workloads, the VETH rate is less impressive, performing a little better than 1Ge. The AIX network interface configuration treats VETH as 10Ge when setting the defaults for TCP send and receive space and rfc1323.

 

Parameters                        Component   Default      Remarks
rfc1323                           stack       1
tcp_sendspace                     stack       262144
tcp_recvspace                     stack       262144
udp_sendspace                     stack       9216         Consider increasing to 64K if the workloads stress the UDP protocol.
udp_recvspace                     stack       42080        Consider increasing to 640K if the workloads stress the UDP protocol.
mtu                               interface   1500         Not defaulted to 9000 because it requires the same setting on the other nodes on the LAN.
mtu_bypass                        interface   on / off     Applies to VETH only. Default is on in AIX 7.2 and off in AIX 7.1 (needs to be turned on).
chksum_offload                    device      yes
max_buf_huge / min_buf_huge       device      64 / 24
max_buf_large / min_buf_large     device      64 / 24
max_buf_medium / min_buf_medium   device      256 / 128
max_buf_small / min_buf_small     device      2048 / 512   Recommend 4096 for both.
max_buf_tiny / min_buf_tiny       device      2048 / 512   Recommend 4096 for both.

                                           Table 4 : Tuning parameters for VETH

SEA configuration 

As mentioned at the beginning, for VETH to perform well, the SEA adapter must be configured accordingly. Table 5 lists the SEA attributes that impact VETH throughput and their defaults in recent versions of VIOS (2.2.4.0 and later). Please see the "Remarks" column for tuning recommendations.

Parameters       Default   Remarks
jumbo_frame      no        Consider changing to yes for high performance. Requires other nodes and switches on the LAN to support the configuration.
largesend        1
large_receive    0         Enable large_receive. It has been disabled by default because previously Linux and IBM i were not able to handle large_receive packets. This issue has recently been resolved in VIOS, Linux, and IBM i, and we plan to enable large_receive by default in the near future.
thread           1
nthreads         7
realin_threads   0         Number of threads dedicated to processing packets received on the physical NIC. "0" means packets received on the virtual and physical sides share the threads. Note realin_threads < nthreads.
queue_size       8192

                                       Table 5 : Performance tuning parameters for SEA
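Following the large_receive recommendation above, a hedged example using the VIOS command form (ent5 is a placeholder for the SEA device name; find yours with lsdev on the VIOS):

```shell
# Sketch: enable large_receive on the SEA (ent5 is an example name)
# Run on the VIOS as padmin; chdev -dev/-attr is the VIOS (ioscli) form.
chdev -dev ent5 -attr large_receive=yes
lsdev -dev ent5 -attr large_receive    # verify
```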

 

For the configuration parameters of the physical NIC under SEA, please refer to the "device" attributes in Table 1. Likewise, for the configuration parameters of the VETH under SEA, please refer to the "device" attributes in Table 4.

 

SEA is a dual-function device: on one hand it serves as a bridge for client LPARs; on the other hand it can be used as a network interface device for the VIOS partition. For the bridging function, client LPAR traffic does not involve the VIOS networking stack. Hence VIOS/SEA network stack and interface related parameters, such as TCP send and receive space and MTU, are irrelevant to client VETH performance. In case the SEA is used as a NIC for the VIOS partition, the network stack and interface parameters over SEA are determined based on the physical adapter underneath the SEA (cf. Table 1 and Table 2).

SR-IOV backed vNIC

Although the SR-IOV backed vNIC adapter is a virtual device and capable of LPM, configuration-wise it resembles a physical NIC more than a VETH (use Table 1 and Table 2). The recommendation is to turn on largesend, large_receive, and jumbo_frame if possible. Note that to enable jumbo_frame on a vNIC, one must enable jumbo_frame on the physical port of the SR-IOV adapter, which can be done through the HMC GUI. This requirement is similar to enabling jumbo_frame for an HEA logical port (on POWER7 systems).

CPU and memory resources 

The rule of thumb for CPU sizing for 10Ge adapters is 1-2 processor cores per 10Ge of throughput, depending on the processor: more for POWER7 (~2) and less for POWER8 (~1.5).

 

As for memory, the bulk of the device memory consumption comes from the packet buffers used for Tx/Rx. If jumbo_frame is not enabled, the packet buffer size is 2K; otherwise it is 16K. For a device configured with 3 Rx queues of 1K elements, 2 Tx queues of 1K elements, a software Tx queue of 8K elements, and jumbo_frame not enabled, the memory footprint can be as large as ~26MB, calculated as ~(2*1K + 3*1K + 8K)*2K. If jumbo_frame is enabled, the memory footprint may increase eightfold.
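That footprint arithmetic can be reproduced directly in the shell, using the example queue sizes from the text:

```shell
# Buffer memory estimate: 2 Tx queues + 3 Rx queues of 1K (1024) elements each,
# an 8K-element software Tx queue, and 2KB buffers (jumbo_frame disabled)
tx=$((2*1024)); rx=$((3*1024)); swtx=8192; buf_kb=2
total_kb=$(( (tx + rx + swtx) * buf_kb ))
echo "${total_kb} KB"    # 26624 KB, i.e. ~26 MB
```

With jumbo_frame enabled the buffer size grows from 2KB to 16KB, which is where the eightfold increase comes from.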


Contacting the PowerVM Team

Have questions for the PowerVM team or want to learn more?  Follow our discussion group on LinkedIn IBM PowerVM or IBM Community Discussions

