AIX Support for Fabric Congestion Notification

By Jim Allen posted Wed November 11, 2020 03:46 PM

  

The T11 Fibre Channel standards group has introduced a new fabric congestion notification mechanism in FC-LS-5, which adds a new Extended Link Service (ELS) called Fabric Performance Impact Notification (FPIN). When the fabric detects congestion or link issues, it sends an FPIN ELS to all N_Ports that have registered to receive FPINs, so N_Ports (such as HBAs) must register in order to receive them. The FPIN ELS provides three categories of notifications (events) from the fabric:

  • Congestion – indicates a link is overused. This event may be generated repeatedly until congestion subsides.

  • Link Incident – indicates a threshold has been exceeded for the link, such as too many CRC errors.

  • Discarded FC frame(s) – indicates the fabric has dropped frame(s) to specific targets.

In October 2020, Brocade added support for FPIN ELS in its switches with FOS 9.0 and higher. AIX 7.2 TL 5 and VIOS 3.1.2 also add support for FPIN ELS on all 16Gb (and faster) FC adapters. This support includes AIX 7.2 TL 5 NPIV clients, provided that the client is attached to VIOS 3.1.2.

SETUP

AIX/VIOS automatically checks for FPIN ELS support in the fabric and, if it is available, registers to receive FPINs. No AIX settings need to be changed to enable FPIN support; it is automatic.

MPIO (Multi-path I/O) support

FPIN ELS notifications for congestion and link incident events are passed to the AIX MPIO (Multi-path I/O) layer's Active/Active PCM (Path Control Module), which is shipped in base AIX. In general, the Active/Active PCM treats impacted paths as “Degraded” paths, meaning that it selects other paths for I/O whenever there are other paths that are not degraded. Furthermore, the lsmpio command has been enhanced to display the following new values in the extended path_status field to indicate these events/states:

  • LCn – Link between HBA and switch is congested

  • PCn – Link between switch and storage target port is congested

  • PDg – A link experienced a link incident event (e.g. too many CRC errors)
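As an illustration of how a monitoring script might interpret these values, here is a minimal Python sketch that maps the three flags above to their meanings. The helper name decode_fpin_flags is hypothetical, and the sketch assumes the path_status field is a comma-separated list of flags (for example "Sel,LCn"); check the actual lsmpio output on your system before relying on that layout.

```python
# Meanings of the FPIN-related path_status values listed in the article.
FPIN_FLAGS = {
    "LCn": "link between HBA and switch is congested",
    "PCn": "link between switch and storage target port is congested",
    "PDg": "a link experienced a link incident event",
}


def decode_fpin_flags(path_status):
    """Return (flag, meaning) pairs for FPIN flags found in path_status.

    Assumption: flags are comma-separated, e.g. "Sel,LCn". This layout is
    not confirmed by the article; adjust the split if your output differs.
    """
    flags = [f.strip() for f in path_status.split(",") if f.strip()]
    return [(f, FPIN_FLAGS[f]) for f in flags if f in FPIN_FLAGS]
```

A script could call decode_fpin_flags on each path_status value it collects and raise an alert whenever the returned list is not empty.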



For congestion events, the Active/Active PCM automatically clears the congestion indication on a path if congestion notifications for that path are not reported after a certain interval. A link incident event is cleared by the Active/Active PCM when a link bounce for that link has been detected.
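The behavior described above can be modeled roughly as follows. This is a simplified Python sketch for illustration only, not the Active/Active PCM's actual implementation: the Path class, the candidate_paths function, and the 60-second CONGESTION_CLEAR_SECS value are all assumptions, since the article does not state the real interval.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical interval; the article only says "a certain interval".
CONGESTION_CLEAR_SECS = 60.0


@dataclass
class Path:
    name: str
    last_congestion: Optional[float] = None  # time of last congestion FPIN
    link_incident: bool = False              # set until a link bounce is seen

    def on_congestion(self, now: float) -> None:
        self.last_congestion = now

    def on_link_incident(self) -> None:
        self.link_incident = True

    def on_link_bounce(self) -> None:
        # A link bounce clears the link incident indication.
        self.link_incident = False

    def is_degraded(self, now: float) -> bool:
        # Congestion ages out if no new notification arrives in time.
        congested = (
            self.last_congestion is not None
            and now - self.last_congestion < CONGESTION_CLEAR_SECS
        )
        return congested or self.link_incident


def candidate_paths(paths: List[Path], now: float) -> List[Path]:
    """Prefer non-degraded paths; fall back to all paths if none are healthy."""
    healthy = [p for p in paths if not p.is_degraded(now)]
    return healthy or list(paths)
```

With two paths where only the first has a recent congestion notification, candidate_paths returns just the second path; once the interval passes with no further notifications, both paths become candidates again.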

Here is a link to a YouTube video demo of FPIN from Brocade using AIX 7.2 TL 5: https://www.youtube.com/watch?v=RNoMMfviJ-Q&feature=youtu.be



Comments

Sun July 24, 2022 08:49 PM

Thanks Jim. So I understand multipath will mark paths as degraded. Are there any events written to errpt that one can monitor for, e.g. when monitoring a VIO Server? Or is it reliant on the guest (such as RHEL 8.3) performing actions, making it almost a 'silent' event to the user? What about event support in the HBAAPI library: will that be added as a 'listen' event (or should I RFE it)?

Sun April 17, 2022 10:28 AM

Thanks Jim.

Wed April 06, 2022 10:18 AM

Hi Gerry,
AIX is not throttling I/O link speed for congestion issues. So HBAs should not be doing that, unless there is an issue with link training.

Wed April 06, 2022 02:22 AM

Oversubscription always occurs in real fabric environments. Can the HBA driver throttle the I/O speed? I found that the new generation Emulex and QLogic HBAs have an I/O throttling function for SAN congestion. Many thanks.

Tue April 05, 2022 10:48 AM

Hi Gerry,
If the full read oversubscribes the fabric, then Fabric Notification would send events to AIX, which would try to use a different storage target port (assuming one is accessible from the host). If the oversubscription is not occurring in the fabric itself, but only at the target port of the storage, then it depends on how the storage target port handles this. Typically a storage target port should respond with Queue Full status, which will cause AIX to back off and look for other paths. Optionally, a storage target port could also support Fabric Notification and send these events to the AIX host, which would also cause AIX to attempt alternate paths.

Mon April 04, 2022 05:58 AM

How is the congestion issue handled (for example, full read throughput that then impacts other application systems' I/O access on a shared storage port)? Many thanks.

Sun December 13, 2020 10:36 PM

This is very cool. Thank you!