Common Network Recommendations:
sysctl Parameter | Recommended Value | Comments | Description |
kernel.sysrq | | | |
kernel.shmmax | | | |
net.core.netdev_max_backlog | | | |
net.core.optmem_max | | | |
net.core.rmem_default | | | |
net.core.wmem_default | | | |
net.core.rmem_max | | | |
net.core.wmem_max | | | |
net.ipv4.conf.all.arp_filter | | | |
net.ipv4.conf.all.arp_ignore | | | |
net.ipv4.neigh.ib0.ucast_solicit | 9 | 18 (9 preferred, though some large systems set 18) | |
| 30000 | 30000 is sufficient for the current largest system-X clusters. The minimum value is the total number of interfaces in the cluster which may require ARP entries - generally num_nodes*interfaces | |
| 32000 | 32000 is sufficient for the current largest system-X clusters. The minimum value is an extra buffer (2000) plus the total number of interfaces in the cluster which may require ARP entries - generally num_nodes*interfaces | |
| 32768 | 32768 is sufficient for the current largest system-X clusters. The minimum value is a larger extra buffer (2768) plus the total number of interfaces in the cluster which may require ARP entries - generally num_nodes*interfaces | |
| 2000000 | | |
net.ipv4.tcp_adv_win_scale | | | defines how much socket buffer space is used for the TCP window vs. how much is reserved for an application buffer; 2 = 1/4 of the space is the application buffer |
net.ipv4.tcp_low_latency | | | intended to give preference to low latency over higher throughput; setting this to 1 disables IPv4 TCP prequeue processing, which Mellanox has recommended for large clusters |
net.ipv4.tcp_mem | | | IPv4 TCP memory usage values: min, pressure, max (in pages). min: no constraints below this value; pressure: threshold for moderating memory consumption; max: hard maximum |
net.ipv4.tcp_reordering | | | |
net.ipv4.tcp_rmem | | | IPv4 TCP receive socket buffer memory: min, default, max. min: minimal size of the TCP receive buffer; default: initial size of the TCP receive buffer (overrides the mem value used for other protocols); max: maximum size of the receive buffer allowed (limited by ) |
net.ipv4.tcp_wmem | | | |
net.ipv4.tcp_sack | | | |
net.ipv4.tcp_timestamps | | | |
net.ipv4.tcp_window_scaling | | | |
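These recommendations are typically persisted in /etc/sysctl.conf (or a file under /etc/sysctl.d/) and applied with 'sysctl -p'. A minimal sketch of such a fragment follows; the table above appears to recommend 9 for net.ipv4.neigh.ib0.ucast_solicit, while all other numbers below are illustrative placeholders, not values from this document:

```
# /etc/sysctl.conf fragment -- apply with 'sysctl -p'
# Value taken from the table above:
net.ipv4.neigh.ib0.ucast_solicit = 9
# Illustrative placeholder values only -- substitute the site-approved
# numbers for your cluster size and workload:
net.core.netdev_max_backlog = 250000
net.core.rmem_max = 4194304
net.core.wmem_max = 4194304
```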
All 'ib0' tuning recommendations should be applied to ALL interfaces for which IP performance and reliability are important. Here ib0 is just an example: every interface on which critical subsystems (e.g. GPFS, LSF) depend should be tuned as in the ib0 examples, except where the tuning recommendations under (2) Less Reliable/Lower Bandwidth Networks are being followed and those recommendations conflict.
IPoIB
As per the sysctl recommendations, the 'ib0' tuning recommendations should be applied to ALL interfaces for which IP performance and reliability are important. Here ib0 is just the example interface that most clusters are concerned with.
Unless there is a strong need for optimal IPoIB performance, we currently recommend using datagram mode on clusters. Mellanox has agreed with this recommendation and points out that datagram mode in OFED 2 should be very close to the performance of connected mode.
We recommend:
/sys/class/net/ib0/mode = datagram
(again if there's an ib1 interface, /sys/class/net/ib1/mode should be set to datagram, etc.)
This is typically achieved by one of the following two approaches, which must be applied to every IB interface (e.g. ib0, ib1, etc.):
(1) For QLogic/Intel, in the appropriate
(2) For Mellanox adapters, in /etc/:
SET_IPOIB_CM=no
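To confirm the mode actually took effect across interfaces, a small check script can help. The following is a hypothetical helper, not from the source document; it assumes the /sys/class/net/ib*/mode layout described above, and the base directory is overridable so the function can be exercised off-cluster:

```shell
# Hypothetical helper: report any IB interface not in datagram mode.
check_ipoib_modes() {
    base="${1:-/sys/class/net}"   # default to the real sysfs tree
    rc=0
    for f in "$base"/ib*/mode; do
        [ -e "$f" ] || continue   # no IB interfaces present
        mode=$(cat "$f")
        if [ "$mode" != "datagram" ]; then
            echo "WARNING: $(dirname "$f") is in '$mode' mode"
            rc=1
        fi
    done
    return $rc
}
```

Running check_ipoib_modes with no argument inspects /sys/class/net directly; a non-zero exit status means at least one interface is not in datagram mode.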
IP Interface Tuning:
Mellanox Adapter Interface Tuning
We should verify that all IB IP interfaces match the recommended tuning (example below for ib0, but ALL interfaces, e.g. ib1, ib2, etc., need to be verified).
'ip -s link' example output for an ib0 interface:
"ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4092 qdisc pfifo_fast state UP qlen 16384"
Things to verify:
The IPoIB MTU should be: 4092 (set in /etc/modprobe.d/mlx4k.conf: options mlx4_core set_4k_mtu=1)
Also for cases with large GPFS page pools, /etc/modprobe.d/mlx4k.conf should also set:
options mlx4_core log_num_mtt=20 log_mtts_per_seg=3
The IPoIB flags should be: BROADCAST,MULTICAST,UP,LOWER_UP
and the state of any working interface of course should be "UP"
The IPoIB interface QDISC has been tested on large clusters with the pfifo_fast setting; however, the mq (multi-queue) setting may be advisable on later adapters.
The IPoIB QLEN=16384
This has been tested on a cluster with more than 4000 nodes (smaller clusters may work well with smaller QLEN values, but the overhead of increasing QLEN is believed not to be significant). The QLEN reported by ifconfig will be twice the queue sizes defined in the /etc/modprobe.d/ib_ipoib.conf file, e.g.:
options ib_ipoib lro=1 send_queue_size=8192 recv_queue_size=8192
The txqlen and rxqlen values reported by ifconfig will be twice the values loaded by the driver. The actual values that have been configured to the IB module can be determined by running:
cat /sys/module/ib_ipoib/parameters/
cat /sys/module/ib_ipoib/parameters/
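The verification steps above can be scripted. The following sketch is mine, not from the source; it parses a single 'ip -s link' header line and checks the recommended MTU (4092), the LOWER_UP flag, and the recommended qlen (16384):

```shell
# Hypothetical verifier for one 'ip -s link' interface line.
verify_ib_link_line() {
    line="$1"
    mtu=$(printf '%s\n' "$line" | sed -n 's/.*mtu \([0-9]*\).*/\1/p')
    qlen=$(printf '%s\n' "$line" | sed -n 's/.*qlen \([0-9]*\).*/\1/p')
    [ "$mtu" = "4092" ]   || { echo "bad mtu: $mtu"; return 1; }
    [ "$qlen" = "16384" ] || { echo "bad qlen: $qlen"; return 1; }
    case "$line" in
        *LOWER_UP*) ;;                          # link is up
        *) echo "link not LOWER_UP"; return 1 ;;
    esac
    echo "ok"
}

# Example usage on a node (one check per IB interface):
# for i in /sys/class/net/ib*; do
#     verify_ib_link_line "$(ip -s link show "$(basename "$i")" | head -n 1)"
# done
```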
Intel Adapter Interface Tuning
Define IP over IB receive queue length (on some IBM systems, we've set receive queue tuning in the /etc/modprobe.d/ib_ipoib.conf file)
options ib_ipoib recv_queue_size=1024 send_queue_size=512
Also, for ethernet adapters for which performance or reliability is important, it is recommended that the length of the ethernet IP transmit and receive queues be increased to 2048:
in /etc/rc.local
:
ifconfig eth0 txqueuelen 2048 # will set the transmit queue to 2048 (if the adapter supports this length)
ethtool -G eth0 rx 2048 # will set the receive queue to 2048 (if the adapter supports this length)
(repeat for other ethernet devices, e.g. eth1, that support higher transmit and receive queue lengths).
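The per-device commands above can be generated in a loop. A hypothetical dry-run helper follows (the function name and structure are mine); it only prints the commands, so the output can be reviewed before being piped to sh:

```shell
# Print (do not run) the queue-tuning commands for each named device.
print_eth_queue_tuning() {
    for dev in "$@"; do
        echo "ifconfig $dev txqueuelen 2048"
        echo "ethtool -G $dev rx 2048"
    done
}

# Usage: print_eth_queue_tuning eth0 eth1        # review the output
#        print_eth_queue_tuning eth0 eth1 | sh   # apply it
```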
Settings to avoid Linux Out of Memory Issues:
/proc Parameter | Recommended Value | Comments | Description |
/proc/sys/vm/oom_kill_allocating_task | 0 | | 0 = when the OOM killer is invoked, it employs heuristics to select a process making intensive memory allocations; 1 = when the OOM killer is invoked, it kills the last process to allocate memory |
/proc/sys/vm/overcommit_memory | 2 | (some IBM clusters set this value to 0, which may help workloads that malloc() much more memory than they touch) | 0 = heuristic memory over-commit allowed; 1 = allocations always succeed; 2 = allocations succeed up to swap + (RAM * overcommit_ratio / 100) |
/proc/sys/vm/overcommit_ratio | 99 | Maximum -- 110 (the extent of memory over-commit depends on the discrepancy between memory malloc'ed and touched; when running sparse matrix applications, higher overcommit_ratio values may be more appropriate) | this value is only relevant when overcommit_memory is set to 2 |
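With overcommit_memory=2, the commit limit works out to swap + RAM * overcommit_ratio / 100. A small arithmetic illustration follows; the RAM and swap figures below are assumed examples, not recommendations from this document:

```shell
# CommitLimit arithmetic for /proc/sys/vm/overcommit_memory = 2
ram_kb=263847936     # assumed: roughly a 256 GB node
swap_kb=4194304      # assumed: 4 GB swap
ratio=99             # overcommit_ratio from the table above
commit_limit_kb=$(( swap_kb + ram_kb * ratio / 100 ))
echo "$commit_limit_kb"   # compare against CommitLimit in /proc/meminfo
```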
Ulimits Tuning:
The following limits are recommended as default user limits on large clusters. Note that ulimits are not a reliable method of enforcing memory limitations, so it is recommended that ulimits be defined to effectively set unlimited memory limits and that cgroup definitions be used to enforce memory limits.
Set these values in /etc/security/limits.conf:
* soft memlock -1