Common Network Recommendations:
sysctl Parameter | Recommended Value | Comments | Description |
kernel.sysrq | | | |
kernel.shmmax | | | |
net.core.netdev_max_backlog | | | |
net.core.optmem_max | | | |
net.core.rmem_default | | | |
net.core.wmem_default | | | |
net.core.rmem_max | | | |
net.core.wmem_max | | | |
net.ipv4.conf.all.arp_filter | | | |
net.ipv4.conf.all.arp_ignore | | | |
net.ipv4.neigh.ib0.ucast_solicit | 9 | 18 (9 preferred, though some large systems set 18) | |
| 30000 | 30000 is sufficient for the current largest system-X clusters. The minimum value is the total number of interfaces in the cluster which may require ARP entries - generally num_nodes*interfaces | |
| 32000 | 32000 is sufficient for the current largest system-X clusters. The minimum value is an extra buffer (2000) plus the total number of interfaces in the cluster which may require ARP entries - generally num_nodes*interfaces | |
| 32768 | 32768 is sufficient for the current largest system-X clusters. The minimum value is a larger extra buffer (2768) plus the total number of interfaces in the cluster which may require ARP entries - generally num_nodes*interfaces | |
| 2000000 | | |
net.ipv4.tcp_adv_win_scale | | | defines how much socket buffer space is used for the TCP window vs. how much is reserved for an application buffer; 2 = 1/4 of the space is the application buffer |
net.ipv4.tcp_low_latency | | | intended to give preference to low latency over higher throughput; setting this to 1 disables IPv4 TCP prequeue processing, which Mellanox has recommended for large clusters |
net.ipv4.tcp_mem | | | IPv4 TCP memory usage values: min, pressure, max (in pages). min: no constraints below this value; pressure: threshold for moderating memory consumption; max: hard maximum |
net.ipv4.tcp_reordering | | | |
net.ipv4.tcp_rmem | | | IPv4 TCP receive socket buffer memory: min, default, max. min: minimal size of the TCP receive buffer; default: initial size of the TCP receive buffer (overrides the mem value used for other protocols); max: maximum size of the receive buffer allowed (limited by ) |
net.ipv4.tcp_wmem | | | |
net.ipv4.tcp_sack | | | |
net.ipv4.tcp_timestamps | | | |
net.ipv4.tcp_window_scaling | | | |
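These recommendations are typically persisted in /etc/sysctl.conf (or a file under /etc/sysctl.d/) and applied with 'sysctl -p'. A minimal sketch of such a fragment follows; the table above appears to recommend 9 for net.ipv4.neigh.ib0.ucast_solicit, while all other numbers below are illustrative placeholders, not values from this document:

```
# /etc/sysctl.conf fragment -- apply with 'sysctl -p'
# Value taken from the table above:
net.ipv4.neigh.ib0.ucast_solicit = 9
# Illustrative placeholder values only -- substitute the site-approved
# numbers for your cluster size and workload:
net.core.netdev_max_backlog = 250000
net.core.rmem_max = 4194304
net.core.wmem_max = 4194304
```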
All 'ib0' tuning recommendations should be applied to ALL interfaces for which IP performance and reliability are important. Here ib0 is just an example: every interface on which critical subsystems (e.g. GPFS, LSF) depend should be tuned as in the ib0 examples, except where the tuning recommendations under (2) Less Reliable/Lower Bandwidth Networks are being followed and those recommendations conflict.
IPoIB
As per the sysctl recommendations, the 'ib0' tuning recommendations should be applied to ALL interfaces for which IP performance and reliability are important. Here ib0 is just the example interface that most clusters are concerned with.
Unless there is a strong need for optimal IPoIB performance, we currently recommend using datagram mode on clusters. Mellanox has agreed with this recommendation and points out that datagram mode in OFED 2 should be very close to the performance of connected mode.
We recommend:
/sys/class/net/ib0/mode = datagram
(again if there's an ib1 interface, /sys/class/net/ib1/mode should be set to datagram, etc.)
This is typically achieved by one of the following two approaches, which must be applied to every IB interface (e.g. ib0, ib1, etc.):
(1) For QLogic/Intel, in the appropriate
(2) For Mellanox adapters, in /etc/:
SET_IPOIB_CM=no
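To confirm the mode actually took effect across interfaces, a small check script can help. The following is a hypothetical helper, not from the source document; it assumes the /sys/class/net/ib*/mode layout described above, and the base directory is overridable so the function can be exercised off-cluster:

```shell
# Hypothetical helper: report any IB interface not in datagram mode.
check_ipoib_modes() {
    base="${1:-/sys/class/net}"   # default to the real sysfs tree
    rc=0
    for f in "$base"/ib*/mode; do
        [ -e "$f" ] || continue   # no IB interfaces present
        mode=$(cat "$f")
        if [ "$mode" != "datagram" ]; then
            echo "WARNING: $(dirname "$f") is in '$mode' mode"
            rc=1
        fi
    done
    return $rc
}
```

Running check_ipoib_modes with no argument inspects /sys/class/net directly; a non-zero exit status means at least one interface is not in datagram mode.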
IP Interface Tuning:
Mellanox Adapter Interface Tuning
We should verify that all IB IP interfaces match the recommended tuning (example below for ib0, but ALL interfaces, e.g. ib1, ib2, etc., need to be verified).
'ip -s link' example output for an ib0 interface:
"ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4092 qdisc pfifo_fast state UP qlen 16384"
Things to verify:
The IPoIB MTU should be: 4092 (set in /etc/modprobe.d/mlx4k.conf: options mlx4_core set_4k_mtu=1)
Also for cases with large GPFS page pools, /etc/modprobe.d/mlx4k.conf should also set:
options mlx4_core log_num_mtt=20 log_mtts_per_seg=3
The IPoIB flags should be: BROADCAST,MULTICAST,UP,LOWER_UP
and the state of any working interface of course should be "UP"
The IPoIB interface QDISC has been tested on large clusters with the pfifo_fast setting; however, the mq (multi-queue) setting may be advisable on later adapters.
The IPoIB QLEN=16384
This has been tested on a cluster with more than 4000 nodes (smaller clusters may work well with smaller QLEN values, but the overhead of increasing QLEN is believed not to be significant). The QLEN reported by ifconfig will be twice the queue sizes defined in the /etc/modprobe.d/ib_ipoib.conf file, e.g.:
options ib_ipoib lro=1 send_queue_size=8192 recv_queue_size=8192
The txqlen and rxqlen values reported by ifconfig will be twice the values loaded by the driver. The actual values that have been configured to the IB module can be determined by running:
cat /sys/module/ib_ipoib/parameters/
cat /sys/module/ib_ipoib/parameters/
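The verification steps above can be scripted. The following sketch is mine, not from the source; it parses a single 'ip -s link' header line and checks the recommended MTU (4092), the LOWER_UP flag, and the recommended qlen (16384):

```shell
# Hypothetical verifier for one 'ip -s link' interface line.
verify_ib_link_line() {
    line="$1"
    mtu=$(printf '%s\n' "$line" | sed -n 's/.*mtu \([0-9]*\).*/\1/p')
    qlen=$(printf '%s\n' "$line" | sed -n 's/.*qlen \([0-9]*\).*/\1/p')
    [ "$mtu" = "4092" ]   || { echo "bad mtu: $mtu"; return 1; }
    [ "$qlen" = "16384" ] || { echo "bad qlen: $qlen"; return 1; }
    case "$line" in
        *LOWER_UP*) ;;                          # link is up
        *) echo "link not LOWER_UP"; return 1 ;;
    esac
    echo "ok"
}

# Example usage on a node (one check per IB interface):
# for i in /sys/class/net/ib*; do
#     verify_ib_link_line "$(ip -s link show "$(basename "$i")" | head -n 1)"
# done
```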
Intel Adapter Interface Tuning
Define IP over IB receive queue length (on some IBM systems, we've set receive queue tuning in the /etc/modprobe.d/ib_ipoib.conf file)
options ib_ipoib recv_queue_size=1024 send_queue_size=512
Also, for ethernet adapters for which performance or reliability is important, it is recommended that the length of the ethernet IP transmit and receive queues be increased to 2048:
in /etc/rc.local
:
ifconfig eth0 txqueuelen 2048 # will set the transmit queue to 2048 (if the adapter supports this length)
ethtool -G eth0 rx 2048 # will set the receive queue to 2048 (if the adapter supports this length)
(repeat for other ethernet devices, e.g. eth1, that support higher transmit and receive queue lengths).
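The per-device commands above can be generated in a loop. A hypothetical dry-run helper follows (the function name and structure are mine); it only prints the commands, so the output can be reviewed before being piped to sh:

```shell
# Print (do not run) the queue-tuning commands for each named device.
print_eth_queue_tuning() {
    for dev in "$@"; do
        echo "ifconfig $dev txqueuelen 2048"
        echo "ethtool -G $dev rx 2048"
    done
}

# Usage: print_eth_queue_tuning eth0 eth1        # review the output
#        print_eth_queue_tuning eth0 eth1 | sh   # apply it
```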
Settings to avoid Linux Out of Memory Issues:
/proc Parameter | Recommended Value | Comments | Description |
/proc/sys/vm/oom_kill_allocating_task | 0 | | 0 = when the OOM killer is invoked, it employs heuristics to select a process making intensive memory allocations; 1 = when the OOM killer is invoked, it kills the last process to allocate memory |
/proc/sys/vm/overcommit_memory | 2 | (some IBM clusters set this value to 0, which may help workloads that malloc() much more memory than they touch) | 0 = heuristic memory over-commit allowed; 1 = allocations always succeed; 2 = allocations succeed up to swap + (RAM * overcommit_ratio / 100) |
/proc/sys/vm/overcommit_ratio | 99 | Maximum -- 110 (the extent of memory over-commit depends on the discrepancy between memory malloc'ed and touched; when running sparse matrix applications, higher overcommit_ratio values may be more appropriate) | this value is only relevant when overcommit_memory is set to 2 |
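With overcommit_memory=2, the commit limit works out to swap + RAM * overcommit_ratio / 100. A small arithmetic illustration follows; the RAM and swap figures below are assumed examples, not recommendations from this document:

```shell
# CommitLimit arithmetic for /proc/sys/vm/overcommit_memory = 2
ram_kb=263847936     # assumed: roughly a 256 GB node
swap_kb=4194304      # assumed: 4 GB swap
ratio=99             # overcommit_ratio from the table above
commit_limit_kb=$(( swap_kb + ram_kb * ratio / 100 ))
echo "$commit_limit_kb"   # compare against CommitLimit in /proc/meminfo
```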
Ulimits Tuning:
The following limits are recommended as default user limits on large clusters. Note that ulimits are not a reliable method of enforcing memory limitations, so it is recommended that ulimits be defined to effectively set unlimited memory limits and that cgroup definitions be used to enforce memory limits.
Set these values in /etc/security/limits.conf:
* soft memlock -1