Spectrum Scale operating system and network tuning

By Fred Stock posted Fri March 27, 2020 09:16 PM

  
NOTE: This content was originally published on the IBM developerWorks site. Since the location where it was published is being taken offline, the content is being copied here so it can continue to be accessed.

Common Network Recommendations:




kernel.sysrq









kernel.shmmax









net.core.netdev_max_backlog









net.core.optmem_max









net.core.rmem_default









net.core.wmem_default









net.core.rmem_max









net.core.wmem_max









net.ipv4.conf.all.arp_filter









net.ipv4.conf.all.arp_ignore













9



18


(9 preferred, though some large systems set 18)





net.ipv4.neigh.ib0.ucast_solicit

















30000


(30000 is sufficient for the largest current system-X clusters. The minimum value is the total number of interfaces in the cluster that may require ARP entries, generally num_nodes*interfaces)










32000


(32000 is sufficient for the largest current system-X clusters. The minimum value is an extra buffer (2000) plus the total number of interfaces in the cluster that may require ARP entries, generally *interfaces)










32768


(32768 is sufficient for the largest current system-X clusters. The minimum value is a larger extra buffer (2768) plus the total number of interfaces in the cluster that may require ARP entries, generally *interfaces)










2000000


 

















net.ipv4.tcp_adv_win_scale







Defines how much socket buffer space is used for the TCP window versus how much is reserved as an application buffer.

2 = 1/4 of the space is application buffer



  net.ipv4.tcp_low_latency







Intended to give preference to low latency over higher throughput. Setting it to 1 disables IPv4 TCP prequeue processing, which Mellanox has recommended for large clusters.



net.ipv4.tcp_mem







IPv4 TCP memory usage values: min, pressure, max (in pages)

    min: no constraints below this value

    pressure: threshold for moderating memory consumption

    max: hard maximum



net.ipv4.tcp_reordering









net.ipv4.tcp_rmem







IPv4 TCP receive socket buffer memory: min, default, max

    min: minimum size of the TCP receive buffer

    default: initial size of the TCP receive buffer (overrides mem value used for other protocols)

    max: maximum size of the receive buffer allowed (limited by )



net.ipv4.tcp_wmem









net.ipv4.tcp_sack









net.ipv4.tcp_timestamps









net.ipv4.tcp_window_scaling
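
The sizing rule quoted in the comments above (a minimum of roughly num_nodes*interfaces ARP entries, plus an extra buffer of 2000 or 2768 for the two higher values) can be sketched as a small calculation. This is only an illustration: the function name and the "lowest", "middle" and "highest" labels are hypothetical, and it assumes that every interface on every node may need an ARP entry.

```python
def arp_threshold_minimums(num_nodes, interfaces_per_node):
    """Return the three minimum values implied by the sizing rule.

    Assumption: each interface on each node may require one ARP entry.
    The dictionary keys are illustrative labels, not sysctl names.
    """
    entries = num_nodes * interfaces_per_node
    return {
        "lowest": entries,          # total interfaces in the cluster
        "middle": entries + 2000,   # extra buffer of 2000
        "highest": entries + 2768,  # larger extra buffer of 2768
    }
```

For a hypothetical cluster of 15000 nodes with 2 interfaces each, this gives 30000, 32000 and 32768, which are the values quoted above.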












All 'ib0' tuning recommendations should be applied to ALL interfaces for which IP performance and reliability are important. Here ib0 is just an example: every interface that critical subsystems (e.g. GPFS, LSF) depend on should be tuned as per the ib0 examples, except where the tuning recommendations under (2) Less Reliable/Lower Bandwidth Networks are being followed and conflict with these recommendations.
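
One way to avoid missing an interface is to treat the ib0 entries as a template and expand them mechanically. The sketch below is a minimal illustration, not part of the original recommendations: the function name is hypothetical, the value shown in the test is a placeholder rather than a recommended setting, and it assumes interface names contain no dots.

```python
def expand_interface_settings(template, interfaces, placeholder="ib0"):
    """Copy every setting whose key names `placeholder` to each interface.

    Keys that do not mention the placeholder interface are kept unchanged.
    Assumption: interface names contain no '.' characters.
    """
    expanded = {}
    for key, value in template.items():
        parts = key.split(".")
        if placeholder in parts:
            for iface in interfaces:
                new_key = ".".join(iface if p == placeholder else p for p in parts)
                expanded[new_key] = value
        else:
            expanded[key] = value
    return expanded
```

Passing the interface list ["ib0", "ib1"] turns a single net.ipv4.neigh.ib0.ucast_solicit entry into one entry per interface, while settings such as net.core.rmem_max pass through untouched.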



IPoIB



As with the sysctl recommendations, 'ib0' tuning recommendations should be applied to ALL interfaces for which IP performance and reliability are important. Here ib0 is just the example interface that most clusters are concerned with.



Unless there is a strong need for optimal IPoIB performance, we currently recommend using datagram mode on clusters. Mellanox has agreed with this recommendation and points out that datagram mode in OFED 2 should be very close to the performance of connected mode.



We recommend:



/sys/class/net/ib0/mode = datagram



(again, if there is an ib1 interface, /sys/class/net/ib1/mode should be set to datagram, and so on)



This is typically achieved by one of these two approaches, which must be applied to every IB interface (e.g. ib0, ib1, etc.):



(1)  For qlogic/Intel, In the appropriate





(2)  For Mellanox adapters, in  /etc/:

SET_IPOIB_CM=no
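
Whichever approach is used, the result can be checked by reading the mode file for each interface. The following is a minimal sketch under the assumption that the mode is exposed at /sys/class/net/<interface>/mode as described above; the function names are hypothetical, and an interface with no readable mode file is reported as not being in datagram mode.

```python
from pathlib import Path


def ipoib_mode(iface, sys_net="/sys/class/net"):
    """Return the IPoIB mode string for iface, or None if it cannot be read."""
    mode_file = Path(sys_net) / iface / "mode"
    try:
        return mode_file.read_text().strip()
    except OSError:
        return None


def interfaces_not_in_datagram(ifaces, sys_net="/sys/class/net"):
    """List the interfaces whose mode is anything other than 'datagram'."""
    return [i for i in ifaces if ipoib_mode(i, sys_net) != "datagram"]
```

Calling interfaces_not_in_datagram(["ib0", "ib1"]) on a node returns the interfaces that still need attention; an empty list means every listed interface reports datagram mode.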



IP Interface Tuning:



Mellanox Adapter Interface Tuning



Verify that all IB IP interfaces match the recommended tuning. 'ip -s link' returns output like the following (the example is for ib0, but ALL interfaces, e.g. ib1, ib2, etc., need to be verified):



'ip -s link' example output for an (example ib0) interface:

"ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4092 qdisc pfifo_fast state UP qlen 16384"

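
A line like the one above can also be checked programmatically. The sketch below parses one such line and compares it with the values this article recommends (MTU 4092, qdisc pfifo_fast, state UP, qlen 16384, and the BROADCAST, MULTICAST, UP, LOWER_UP flags). The function names are hypothetical, and the parser assumes the fields after the flags come in name/value pairs, as they do in the example.

```python
import re

RECOMMENDED = {"mtu": "4092", "qdisc": "pfifo_fast", "state": "UP", "qlen": "16384"}
REQUIRED_FLAGS = {"BROADCAST", "MULTICAST", "UP", "LOWER_UP"}


def parse_link_line(line):
    """Split one link header line into (name, flags, fields)."""
    # An optional leading index such as "4: " is tolerated.
    match = re.match(r"\s*(?:\d+:\s*)?(\S+?):\s*<([^>]*)>\s*(.*)", line)
    if match is None:
        raise ValueError("unrecognised link line: %r" % line)
    name, flags, rest = match.groups()
    tokens = rest.split()
    fields = dict(zip(tokens[0::2], tokens[1::2]))
    return name, set(flags.split(",")), fields


def link_problems(line):
    """Return a list of deviations from the recommended settings."""
    name, flags, fields = parse_link_line(line)
    problems = []
    missing = REQUIRED_FLAGS - flags
    if missing:
        problems.append("%s: missing flags %s" % (name, sorted(missing)))
    for key, wanted in RECOMMENDED.items():
        actual = fields.get(key)
        if actual != wanted:
            problems.append("%s: %s is %s, expected %s" % (name, key, actual, wanted))
    return problems
```

An empty list from link_problems means the line matches every recommended value; each entry otherwise names the interface and the field that differs.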


Things to verify:

The IPoIB MTU should be 4092 (in /etc/modprobe.d/mlx4k.conf, set: options mlx4_core set_4k_mtu=1)



For cases with large GPFS page pools, /etc/modprobe.d/mlx4k.conf should also set:

options mlx4_core log_num_mtt=20 log_mtts_per_seg=3





The IPoIB flags should be: BROADCAST,MULTICAST,UP,LOWER_UP



and the state of any working interface of course should be "UP"



The IPoIB interface QDISC has been tested on large clusters with the pfifo_fast setting; however, the mq (multi-queue) setting may be advisable on later adapters.



The IPoIB QLEN=16384



This QLEN value has been tested on a cluster with more than 4000 nodes (smaller clusters may work well with smaller QLEN values, but the overhead of increasing QLEN is believed not to be significant). The QLEN reported by ifconfig will be twice the queue sizes defined in the /etc/modprobe.d/ib_ipoib.conf file, e.g.:



options ib_ipoib lro=1 send_queue_size=8192 recv_queue_size=8192


The txqlen and rxqlen values reported by ifconfig will be twice the values loaded by the driver. The actual values configured in the IB module can be determined by running:



cat /sys/module/ib_ipoib/parameters/

cat /sys/module/ib_ipoib/parameters/
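
The relationship between the driver queue sizes and the reported QLEN can be sketched as follows. This is an illustration only: it assumes the parameter files carry the same names as the send_queue_size and recv_queue_size module options shown earlier, and the function names are hypothetical.

```python
from pathlib import Path


def ipoib_queue_sizes(params_dir="/sys/module/ib_ipoib/parameters"):
    """Read the send and receive queue sizes loaded by the ib_ipoib module.

    Assumption: the parameter files are named after the module options.
    A value of None means the file was missing or not an integer.
    """
    sizes = {}
    for name in ("send_queue_size", "recv_queue_size"):
        try:
            sizes[name] = int((Path(params_dir) / name).read_text().strip())
        except (OSError, ValueError):
            sizes[name] = None
    return sizes


def expected_reported_qlen(queue_size):
    """QLEN reported for the interface is twice the driver queue size."""
    return 2 * queue_size
```

With a queue size of 8192, expected_reported_qlen returns 16384, which is the QLEN recommended above.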



Intel Adapter Interface Tuning



Define the IP over IB receive queue length (on some IBM systems, receive queue tuning has been set in the /etc/modprobe.d/ib_ipoib.conf file):



options ib_ipoib recv_queue_size=1024 send_queue_size=512






It is also recommended that, for Ethernet adapters for which performance or reliability matters, the length of the Ethernet IP transmit and receive queues be increased to 2048:



in /etc/rc.local:



ifconfig eth0 txqueuelen 2048   # will set the transmit queue to 2048 (if the adapter supports this length)



ethtool -G eth0 rx 2048               # will set the receive queue to 2048 (if the adapter supports this length)



(repeat for other Ethernet devices, e.g. eth1, that support higher transmit and receive queue lengths)
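
Repeating the two commands for each device is easy to get wrong by hand, so the lines for /etc/rc.local could be generated instead. The sketch below only builds the command strings and does not run them; the function name is hypothetical, and it assumes every listed adapter supports the requested length.

```python
def queue_tuning_commands(devices, length=2048):
    """Build the ifconfig and ethtool command lines for each device."""
    commands = []
    for dev in devices:
        # Transmit queue length
        commands.append("ifconfig %s txqueuelen %d" % (dev, length))
        # Receive ring size
        commands.append("ethtool -G %s rx %d" % (dev, length))
    return commands
```

Printing the result of queue_tuning_commands(["eth0", "eth1"]) gives four lines that can be pasted into /etc/rc.local.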





The state of any working interface of course should be "UP"





Settings to avoid Linux Out of Memory Issues:




/proc Parameter: /proc/sys/vm/oom_kill_allocating_task

Recommended Value: 0

Description:

0 = when the OOM killer is invoked, it employs heuristics to select a process making intensive memory allocations

1 = when the OOM killer is invoked, it kills the last process to allocate memory



/proc Parameter: /proc/sys/vm/overcommit_memory

Recommended Value: 2

Comments: (some IBM clusters set this value to 0, which may be workloads malloc() much more memory than they touch as long as a

Description:

0 = heuristic memory overcommit allowed

1 = allocations always succeed

2 = allocations succeed up to swap + (RAM * overcommit_ratio/100)



/proc Parameter: /proc/sys/vm/overcommit_ratio

Recommended Value: 99

Comments: Maximum: 110 (the extent of memory overcommit depends on the discrepancy between memory malloc'ed and memory touched; when running sparse matrix applications, higher overcommit_ratio values may be more appropriate)

Description: this value is only relevant in cases in which sys/
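
The effect of overcommit_ratio when overcommit_memory is 2 can be illustrated with a short calculation. This is a simplified sketch: the function name is hypothetical, the ratio is treated as a percentage of RAM, and any memory reserved for huge pages is ignored.

```python
def commit_limit_kib(ram_kib, swap_kib, overcommit_ratio):
    """Approximate the allocation ceiling in KiB for overcommit_memory=2.

    Simplification: swap plus overcommit_ratio percent of RAM,
    ignoring memory reserved for huge pages.
    """
    return swap_kib + ram_kib * overcommit_ratio // 100
```

With the recommended ratio of 99, a node can commit slightly less than RAM plus swap; with the maximum of 110 mentioned above, it can commit 10% more than RAM, plus swap.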







Ulimits Tuning:



The following limits are recommended as default user limits on large clusters. Note that ulimits are not a reliable method of enforcing memory limits, so it is recommended that ulimits be defined to effectively allow unlimited memory and that cgroups definitions be used to enforce memory limits.

Set these values in /etc/security/limits.conf:



  *    soft    memlock      -1
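
A value of -1 in limits.conf means no limit. Whether a session actually received that setting can be checked from inside the session; the sketch below uses Python's standard resource module for this, and the function names are hypothetical.

```python
import resource


def describe_limit(value):
    """Render a limit value, showing the no-limit marker as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return describe_limit(soft), describe_limit(hard)
```

If memlock_limits() does not report "unlimited" for the soft limit in a fresh login session, the limits.conf entry has not taken effect for that session.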











































#Softwaredefinedstorage
#Workloadandresourceoptimization
#SpectrumScale