Displaying Transceiver Statistics for Network Adapters

By Poudapelly Ravi Kumar posted 11 hours ago

  

Contributors: Ravi Kumar Poudapelly, Srikanth Kondapaneni

Introduction

Traditionally on AIX, transceiver statistics were collected and displayed using Advanced Diagnostics. To enter Advanced Diagnostics mode, all IP addresses configured on the affected devices must be torn down, which causes an application outage.

With this feature, the device driver:

A.       Collects and displays transceiver statistics as part of the detailed Ethernet statistics (collected with the command entstat -d entX, where entX is the network device).

B.       Runs checks on the collected transceiver statistics to determine the health of the transceiver (health check). If any health check fails, a corresponding entry is recorded in the error log. This helps customers and IBM support isolate transceiver issues early.

C.      Makes transceiver statistics available to the AIX snap command (used to gather system configuration and statistics). Because snap already collects detailed Ethernet statistics, it now collects transceiver statistics as well, which reduces the time needed to isolate transceiver issues seen by customers.

D.      Collects transceiver statistics only when the adapter is running in “Dedicated” (non-SRIOV) mode.

Transceiver Statistics

The adapter driver now gathers transceiver statistics and displays them in a new section named “QSFP Transceiver Statistics”, added at the end of the detailed Ethernet statistics output. The following information is displayed in this section:

           i.    Transceiver Vital Product Data (VPD) Information:

           a.        Vendor Name
           b.        Vendor Part Number
           c.        Vendor Serial Number
           d.        Vendor Organizationally Unique Identifier (OUI)

           ii.    Transceiver Statistics:

           a.        Wavelength
           b.        Module Speed
           c.        Media Type
           d.        Temperature
           e.        Voltage
           f.        Rx Power
           g.        Tx Power

Detailed explanation for each field:

Field Name            Field Description
--------------------  ------------------------------------------------------------
Vendor Name           Identifies the transceiver manufacturer.
Vendor Part Number    Identifies the transceiver model (part type).
Vendor Serial Number  Identifies a specific transceiver module.
Vendor OUI            Identifies the transceiver manufacturer (Organizationally Unique Identifier).
Wavelength            Optical wavelength (in nanometers) at which the transceiver transmits light over fiber.
Module Speed          Maximum data rate that the transceiver supports (how fast it can transmit and receive data).
Media Type            Physical transmission medium. Transceivers are designed to use either fiber (optical) or copper to transfer data.
Temperature           Current operating temperature of the transceiver, in degrees Celsius (°C).
Voltage               Real-time voltage level supplied to the transceiver, in volts (V).
Rx Power              Received optical signal strength from the remote device, in milliwatts (mW).
Tx Power              Transmitted optical signal strength to the remote device, in milliwatts (mW).

 

Currently, this feature is enabled only for the following adapters:

a)        PCIe4 2-port 100 GbE RoCE x16 adapter (FC EC66 and EC67; CCIN 2CF3)

b)        PCIe4 2-port 100 GbE RoCE x16 adapter (FC EC75 and EC76; CCIN 2CFB)

c)        PCIe5 x16 2-port 200 GbE RoCE adapter (FC EC85 and EC86; CCIN EC2C)

Non-separable cables do not report wavelength, Rx Power, or Tx Power. When the adapter is connected with a non-separable cable, the driver does not display these fields.

Because this feature is not supported when the adapter is running in SRIOV (Shared) mode, transceiver statistics are not displayed for native VFs, vNICs, or HNV.

Sample Transceiver Statistics:

To collect transceiver statistics, run the detailed Ethernet statistics command:

entstat -d entX
      entX is the device name

The following is a snippet of the transceiver statistics collected for a QSFP-28 transceiver module:

QSFP Transceiver Statistics
-------------------------------------
Vendor Name : CISCO-AVAGO
Vendor Part Number : SFBR-89CDDZ-CS5
Vendor Serial Number : AVF2240S24W
Vendor oui : 00:17:6a
Wavelength : 850.000000
Module Speed : 104G
Media Type: 0XC  (MPO 1x12 (Multifiber Parallel Optic))
Temperature : 33.375000 C [Range:-5 c - 75 c]
Voltage : 3.292000 V [Range : 2.97 v - 3.63 v]
rx1_power :    0.691300 mW [Range: 0.037200 – 3.467400]
rx2_power :    0.682100 mW [Range: 0.037200 – 3.467400]
rx3_power :    0.598800 mW [Range: 0.037200 – 3.467400]
rx4_power :    0.489400 mW [Range: 0.037200 – 3.467400]
tx1_power :    0.971000 mW [Range: 0.037200 – 3.467400]
tx2_power :    0.913200 mW [Range: 0.037200 – 3.467400]
tx3_power :    0.915800 mW [Range: 0.037200 – 3.467400]
tx4_power :    0.867200 mW [Range: 0.037200 – 3.467400]
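
For scripted monitoring, the section shown above could be pulled out of the entstat output and turned into name/value pairs. The following is a minimal Python sketch, not part of the feature itself. It assumes the layout matches the sample exactly (a title line, a dashed rule, then "name : value" lines), and the function name parse_transceiver_section is hypothetical.

```python
SECTION_TITLE = "QSFP Transceiver Statistics"


def parse_transceiver_section(entstat_output: str) -> dict:
    """Return {field name: raw value string} for the transceiver section.

    Assumption: the section looks like the sample in this post, i.e. the
    title line, a dashed rule, then one "name : value" pair per line.
    """
    fields = {}
    in_section = False
    for line in entstat_output.splitlines():
        stripped = line.strip()
        if stripped == SECTION_TITLE:
            in_section = True
            continue
        # Skip everything before the section, blank lines and the dashed rule.
        if not in_section or not stripped or set(stripped) == {"-"}:
            continue
        # Split on the first colon only: values such as the OUI contain colons.
        name, sep, value = stripped.partition(":")
        if not sep:
            break  # assumption: a line without a colon ends the section
        fields[name.strip()] = value.strip()
    return fields


# A few lines copied from the sample output in this post.
sample = """\
QSFP Transceiver Statistics
-------------------------------------
Vendor Name : CISCO-AVAGO
Vendor oui : 00:17:6a
Wavelength : 850.000000
Temperature : 33.375000 C [Range:-5 c - 75 c]
"""
stats = parse_transceiver_section(sample)
print(stats["Vendor Name"], stats["Vendor oui"])
```

The values are kept as raw strings on purpose, since units and range annotations differ from field to field.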


Health Check:

The health check generates an error when a transceiver parameter falls outside its defined limits.
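
The rule can be illustrated with a small Python sketch that compares a reading with the range printed next to it in the entstat output. This is not the driver's code, which the post does not show. The helper name check_reading is hypothetical, the text layout is assumed to match the sample output, and treating the bounds as inclusive is also an assumption.

```python
import re

# Matches an optionally signed decimal number such as -5, 75 or 0.037200.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def check_reading(value_text: str) -> str:
    """Classify a value such as '33.375000 C [Range:-5 c - 75 c]'.

    Returns "high", "low", "ok", or "unknown" when no reading or range
    can be found. Bounds are treated as inclusive (an assumption).
    """
    reading_part, bracket, range_part = value_text.partition("[")
    reading = NUMBER.search(reading_part)
    bounds = NUMBER.findall(range_part)
    if not bracket or reading is None or len(bounds) < 2:
        return "unknown"
    low, high = float(bounds[0]), float(bounds[1])
    value = float(reading.group())
    if value > high:
        return "high"
    if value < low:
        return "low"
    return "ok"


print(check_reading("33.375000 C [Range:-5 c - 75 c]"))
print(check_reading("0.010000 mW [Range: 0.037200 – 3.467400]"))
```

The second call uses an invented reading (0.010000 mW) purely to show the low case; the range is the one from the sample output.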

The health check only logs errors to alert customers; no corrective action is taken. The driver continues to send and receive traffic even after a transceiver error is logged. Such an error does not necessarily indicate a transceiver hardware failure. Please contact IBM Customer Support for further assistance and a detailed diagnosis.

The driver collects transceiver statistics only when the entstat command is executed; consequently, health checks occur only when this command is run and are not performed periodically.
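
Since the checks run only when entstat is executed, one possible way to get regular checks is to run the command on a schedule. The post does not prescribe this; the Python sketch below is only an illustration under that assumption. The helpers run_entstat and poll are hypothetical names, and only the entstat -d entX invocation comes from this post.

```python
import subprocess
import time


def run_entstat(adapter: str) -> str:
    """Run `entstat -d <adapter>` and return its output (AIX/VIOS only)."""
    result = subprocess.run(
        ["entstat", "-d", adapter],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def poll(adapters, rounds, interval_seconds, runner=run_entstat, sleep=time.sleep):
    """Call runner once per adapter in each round, pausing between rounds.

    Returns the number of calls made. runner and sleep are injectable so
    the loop can be exercised without an AIX system.
    """
    calls = 0
    for round_index in range(rounds):
        for adapter in adapters:
            runner(adapter)
            calls += 1
        if round_index < rounds - 1:
            sleep(interval_seconds)
    return calls


# Example on an AIX system (not executed here):
#   poll(["ent1"], rounds=24, interval_seconds=3600)
```

A cron entry that runs entstat -d for each adapter would be an equally reasonable alternative; the loop is shown only because it is easy to follow.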

Transceiver health-check errors are recorded in the error log with the label “MLXCENT_TRANSCEIVER_ERR” and the description “Transceiver Error”. The following is a sample error recorded by the driver when the transceiver temperature exceeded the specified range (high temperature):

# errpt
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
453B5292 0512051325 P H ent1 Transceiver error
453B5292 0512051225 P H ent1 Transceiver error

# errpt -aj 453B5292
LABEL:          MLXCENT_TRANSCEIVER
IDENTIFIER:     453B5292
Date/Time:      Mon May 12 07:05:15 CDT 2025
Sequence Number: 2642
Machine Id:     06C75A64R0A
Node Id:        p10ndd1lp10
Class:          H
Type:           PERM
WPAR:           Global
Resource Name:  ent1
Resource Class: adapter
Resource Type:  131519103596
Location:       U78D8.ND0.FG0B0AM-P0-C3-C0-T0

VPD:
  2-Port PCIe4 100Gb RoCE Adapter x16:
    Part Number.................01T740
    EC Level....................P14618
    FRU Number..................01T742
    Serial Number...............YAS6Y874043Z
    Feature Code/Marketing ID...EC66
    Customer Card ID Number.....2CF3
    Network Address.............903B052886
    ROM Level.(alterable).......001600352000

Description
  Transceiver error

Recommended Actions
  PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
  File Name
    Line: 1006 file: entcore_ioctl.c
  MAC ADDRESS
    903B052886
DEVICE DRIVER INTERNAL STATE
  0030 0000 0002 0000 0000 0000 0001 0000 0000 0000 239B
PCI ETHERNET STATISTICS
  0061 0818 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
TRACE RECORD SEQUENCE NUMBER
  e:1:726 f:mlxcent_check_teciver r:0x0 s:0:0
  e:1:175 f:mlxcent_check_teciver r:0x0 s:0:0
  e:2:1106 f:entcore_ioctl r:0x9 s:0:0
NUMBER OF BYTES
  160
SENSE DATA
  061 0818 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
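
The summary rows printed by errpt in the sample above can be filtered in a script to find transceiver errors per adapter. The following Python sketch assumes the six-column layout shown in the sample and the usual errpt MMDDhhmmYY timestamp format. The identifier 453B5292 is taken from the sample and may differ on other systems, and the function names are hypothetical.

```python
from datetime import datetime


def parse_errpt_row(row: str) -> dict:
    """Split one errpt summary row into its six columns."""
    identifier, stamp, err_type, err_class, resource, description = row.split(None, 5)
    return {
        "identifier": identifier,
        # errpt summary timestamps are printed as MMDDhhmmYY.
        "timestamp": datetime.strptime(stamp, "%m%d%H%M%y"),
        "type": err_type,
        "class": err_class,
        "resource": resource,
        "description": description.strip(),
    }


def transceiver_errors(errpt_output: str, identifier: str = "453B5292") -> list:
    """Return parsed rows whose first column equals the given identifier.

    Assumption: 453B5292 is the identifier seen in this post's sample.
    """
    rows = []
    for line in errpt_output.splitlines():
        if line.split(None, 1)[:1] == [identifier]:
            rows.append(parse_errpt_row(line))
    return rows


# Summary rows copied from the sample in this post.
sample = """\
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
453B5292 0512051325 P H ent1 Transceiver error
453B5292 0512051225 P H ent1 Transceiver error
"""
errors = transceiver_errors(sample)
print(len(errors), errors[0]["resource"], errors[0]["timestamp"])
```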

List of different transceiver errors:

Transceiver errors are embedded in the error log as follows:

DEVICE DRIVER INTERNAL STATE

XXXX XXXX XXXX XXXX YYYY YYYY YYYY YYYY ZZZZ ZZZZ ZZZZ ZZZZ

YYYY YYYY YYYY YYYY – This field represents the type of transceiver error.

The different types of transceiver errors are listed in the table below:

Error Type              Value of field YYYY YYYY YYYY YYYY  Details
----------------------  ----------------------------------  ----------------------------------------
High Temperature Error  0000 0000 0000 0001                 If transceiver’s current temperature exceeds its operating range.
Low Temperature Error   0000 0000 0000 0002                 If transceiver’s current temperature is below its operating range.
High Voltage Error      0000 0000 0000 0003                 If transceiver’s current voltage exceeds its operating range.
Low Voltage Error       0000 0000 0000 0004                 If transceiver’s current voltage is below its operating range.
High RX1 Power Error    0000 0000 0000 0005                 If transceiver RX-1 (First Receive Channel) power exceeds its operating range.
Low RX1 Power Error     0000 0000 0000 0006                 If transceiver RX-1 (First Receive Channel) power is below its operating range.
High RX2 Power Error    0000 0000 0000 0007                 If transceiver RX-2 (Second Receive Channel) power exceeds its operating range.
Low RX2 Power Error     0000 0000 0000 0008                 If transceiver RX-2 (Second Receive Channel) power is below its operating range.
High RX3 Power Error    0000 0000 0000 0009                 If transceiver RX-3 (Third Receive Channel) power exceeds its operating range.
Low RX3 Power Error     0000 0000 0000 0010                 If transceiver RX-3 (Third Receive Channel) power is below its operating range.
High RX4 Power Error    0000 0000 0000 0011                 If transceiver RX-4 (Fourth Receive Channel) power exceeds its operating range.
Low RX4 Power Error     0000 0000 0000 0012                 If transceiver RX-4 (Fourth Receive Channel) power is below its operating range.
High TX1 Power Error    0000 0000 0000 0013                 If transceiver TX-1 (First Transmit Channel) power exceeds its operating range.
Low TX1 Power Error     0000 0000 0000 0014                 If transceiver TX-1 (First Transmit Channel) power is below its operating range.
High TX2 Power Error    0000 0000 0000 0015                 If transceiver TX-2 (Second Transmit Channel) power exceeds its operating range.
Low TX2 Power Error     0000 0000 0000 0016                 If transceiver TX-2 (Second Transmit Channel) power is below its operating range.
High TX3 Power Error    0000 0000 0000 0017                 If transceiver TX-3 (Third Transmit Channel) power exceeds its operating range.
Low TX3 Power Error     0000 0000 0000 0018                 If transceiver TX-3 (Third Transmit Channel) power is below its operating range.
High TX4 Power Error    0000 0000 0000 0019                 If transceiver TX-4 (Fourth Transmit Channel) power exceeds its operating range.
Low TX4 Power Error     0000 0000 0000 0020                 If transceiver TX-4 (Fourth Transmit Channel) power is below its operating range.
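
The error-type values listed above follow a regular pattern, so a lookup table can be generated rather than typed by hand. The Python sketch below assumes the values are exactly the strings shown in the table (0001 through 0020, written with decimal digits) and that the caller has already extracted the YYYY YYYY YYYY YYYY field from the error log entry. The names build_error_table and describe_error are hypothetical.

```python
def build_error_table() -> dict:
    """Map each 'YYYY YYYY YYYY YYYY' value from the table to its error type."""
    names = [
        "High Temperature Error", "Low Temperature Error",
        "High Voltage Error", "Low Voltage Error",
    ]
    # Receive channels 1-4 come first, then transmit channels 1-4,
    # each with a high entry followed by a low entry.
    for direction in ("RX", "TX"):
        for channel in range(1, 5):
            names.append(f"High {direction}{channel} Power Error")
            names.append(f"Low {direction}{channel} Power Error")
    # Assumption: codes are written with decimal digits, as in the table
    # (0009 is followed by 0010).
    return {
        f"0000 0000 0000 {code:04d}": name
        for code, name in enumerate(names, start=1)
    }


ERROR_TYPES = build_error_table()


def describe_error(field: str) -> str:
    """Return the error type for a YYYY field, tolerating extra whitespace."""
    normalized = " ".join(field.split())
    return ERROR_TYPES.get(normalized, "Unknown transceiver error")


print(describe_error("0000 0000 0000 0001"))
print(describe_error("0000 0000 0000 0010"))
```

Unknown values fall through to a generic message instead of raising, so the sketch stays usable if more error types are added later.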

This feature is available from AIX 7300-04 and VIOS 4.1.2.0 onwards.
