Network requirements in an Elastic Storage Server Setup

By Archive User posted Wed December 13, 2017 07:10 AM

To operate effectively, Elastic Storage Server (ESS) requires three main networks, plus a fourth that connects it to the customer campus network. Of the three main networks, the service network and the management and provisioning network are internal to the ESS rack. They are non-routable, that is, they do not exist on the customer local area network (LAN).
1. Service network: Each IBM Power System server node (the Executive Management Server (EMS) and the I/O server nodes) has a flexible service processor (FSP), and the service network connects these FSPs together. When the Power server is configured with RHEL Big Endian (BE), the systems are managed by the Hardware Management Console (HMC). The HMC acts as the Dynamic Host Configuration Protocol (DHCP) server and assigns IP addresses to the FSPs.
When the Power server is configured in OPAL mode, that is, with RHEL Little Endian (LE), the systems are managed through the second connection on the management server, and the management server assigns the IP addresses.
2. Management and provisioning network: This network connects the management server (EMS) and the I/O server nodes for OS provisioning and xCAT management of the cluster. The management server is the DHCP server for this network. All the ESS code provisioning happens on this network. This network is common for both big-endian and little-endian servers.
3. Clustering network: This is a high-speed network based on 10GbE, 40GbE, or 100GbE Ethernet, on InfiniBand, or on both. The cluster is created and managed over this network. This network is also used to connect IBM Spectrum Scale clients to the ESS storage cluster.
4. External and campus: The fourth network connects the Executive Management Server (EMS) and HMC (when required) to the client’s public network for management purposes, as required by the organization.

The following illustrations are pictorial representations of these networks, taken from the ESS Quick Deployment Guide. The HMC and the management and provisioning network are internal to the ESS rack. The 10/40/100 GbE or InfiniBand networks are customer-provided networks used for ESS cluster networking.

Figure 1. The management and provisioning network and the service network: a logical view (on PPC64BE)

Figure 2. The management and provisioning network and the service network: a logical view (on PPC64LE)
In these illustrations, each of these networks is implemented as a non-overlapping network, usually configured as an individual virtual local area network (VLAN).
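The non-overlap requirement can be checked before deployment. The following is a minimal sketch using Python's standard ipaddress module; the network names and every subnet in PLAN are hypothetical placeholders, not values from the ESS documentation, and should be replaced with the site's own address plan.

```python
import ipaddress
from itertools import combinations

# Hypothetical address plan: these subnets are illustrative assumptions only.
PLAN = {
    "service": "10.0.0.0/24",
    "management": "192.168.45.0/24",
    "clustering": "172.16.0.0/16",
    "campus": "198.51.100.0/24",
}


def find_overlaps(plan):
    """Return the pairs of network names whose subnets overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


if __name__ == "__main__":
    clashes = find_overlaps(PLAN)
    if clashes:
        for a, b in clashes:
            print(f"overlap: {a} ({PLAN[a]}) and {b} ({PLAN[b]})")
    else:
        print("all planned networks are non-overlapping")
```

An empty result means no two planned networks share address space; any pair returned needs a different subnet before the networks are configured.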
By default, the ESS solution order includes a 1GbE Ethernet switch that carries the service network and the management network (xCAT), so in most cases customers do not have to configure these two networks explicitly. The solution ships with the cable connections for these two networks preinstalled. Configuration for the external campus network is minimal: connect the ports to the external network and configure the IP addresses according to the campus network configuration rules. For cluster networking, customers can either use an existing high-speed network infrastructure or procure a new switch from IBM.
During ESS deployment, all the console logs and /var/log/messages entries from the ESS I/O nodes are transferred to the EMS over the management and provisioning network. As a result, the /var/log/messages file on each I/O server node is empty. Always include the EMS node in the snap when sending gpfs.snap logs for any issue.

Connectivity between ESS I/O node and Disk enclosure

Each I/O node has three SAS HBA cards with four ports each. Each enclosure has two environmental service modules (ESMs), A and B, each with two SAS ports. To ensure proper multipathing and redundancy, every I/O server is connected to ESM A and ESM B through different HBAs. This provides full redundancy at the HBA, I/O node, and ESM levels. All SAS cabling is done at manufacturing.
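The cabling rule above (each I/O node reaches both ESMs of an enclosure, through different HBAs) can be expressed as a small check. This is only an illustrative sketch: the node, HBA, and enclosure names in CABLES are hypothetical, and the tuple layout is an assumption made for this example rather than the output of any ESS tool.

```python
from collections import defaultdict

# Hypothetical cabling for one enclosure: (io_node, hba, enclosure, esm).
CABLES = [
    ("io1", "hba1", "encl1", "A"),
    ("io1", "hba2", "encl1", "B"),
    ("io2", "hba1", "encl1", "A"),
    ("io2", "hba2", "encl1", "B"),
]


def redundancy_problems(cables):
    """List the (node, enclosure) pairs that break the redundancy rule."""
    # (node, enclosure) -> {esm: set of HBAs used to reach that ESM}
    paths = defaultdict(dict)
    for node, hba, encl, esm in cables:
        paths[(node, encl)].setdefault(esm, set()).add(hba)

    problems = []
    for (node, encl), by_esm in sorted(paths.items()):
        missing = {"A", "B"} - set(by_esm)
        if missing:
            problems.append(f"{node} -> {encl}: no path to ESM {', '.join(sorted(missing))}")
            continue
        shared = by_esm["A"] & by_esm["B"]
        if shared:
            problems.append(f"{node} -> {encl}: ESM A and B share {', '.join(sorted(shared))}")
    return problems


if __name__ == "__main__":
    for line in redundancy_problems(CABLES) or ["cabling satisfies the redundancy rule"]:
        print(line)
```

Because the cabling is done at manufacturing, a check like this is mainly useful for reasoning about the design or for validating a recabling plan.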

Clustering network

The IBM Spectrum Scale cluster is created and managed over the clustering network. This network can be logically divided into two types of networks: the admin network and the daemon network. These two logical networks can run on the same physical network or on different networks, depending on client requirements.

Admin Network –

Most admin commands that require cluster-wide information, or information from a node other than the one on which the command is executed, use socket communications to process GPFS administration commands. Depending on the nature of the command, GPFS may process it either on the node issuing the command or on the file system manager. Communication related to GPFS administrative commands can take place over a separate network, which is configured using the AdminNodeName field. For more information, see the Network communication and GPFS administration commands topic in the IBM Spectrum Scale Knowledge Center.

Daemon Network –

All IBM Spectrum Scale clients (nodes running IBM Spectrum Scale software) talk to all NSD servers (IBM Spectrum Scale nodes serving file system data) over the daemon network.

In a typical GPFS environment there are many Spectrum Scale clients and few Spectrum Scale NSD servers; for networking, this means fan-in and fan-out traffic patterns. In the ESS setup, the NSD servers are the ESS I/O nodes, which serve a recovery group and the vdisks in that recovery group. There is a one-to-one mapping between vdisks in ESS and IBM Spectrum Scale NSDs.
Everything discussed so far about the daemon network is common to both Spectrum Scale and ESS. ESS also has a special network requirement that is not present with a regular SAN-based NSD server. In a SAN-based setup there is very little traffic from one NSD server to another; in ESS, however, traffic is heavy between the two ESS I/O nodes in the same building block. This communication uses the same daemon network. The amount of traffic between the two ESS I/O nodes in a building block depends on the write I/O pattern of the IBM Spectrum Scale clients.

Which write pattern causes more traffic between the ESS I/O nodes in a building block?

ESS uses a sophisticated mechanism for write operations. This mechanism uses the various media types in ESS, such as NVRAM, SSD, and rotating disk, to give the best performance for all types of writes. When the writes are small (generally < 256K, although this can be changed with the nsdRAIDSmallBufferSize configuration parameter), they are written to a small NVRAM buffer on one ESS I/O node and mirrored to the second ESS I/O node of the building block, which protects the data if one I/O node fails.
If the write pattern is dominated by small writes, there will be heavy traffic between the ESS I/O nodes of the building block.
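To get a rough feel for how much of a workload is subject to mirroring, the size rule can be applied to a list of write sizes. This is a simplified sketch, not ESS code: it assumes that "256K" means 256 KiB, treats every write below that limit as mirrored, and ignores protocol overhead and acknowledgments. The workload list is hypothetical.

```python
# Assumed from the "< 256K" figure in the text; the real limit is set by
# the nsdRAIDSmallBufferSize parameter and may differ on a given system.
SMALL_WRITE_LIMIT = 256 * 1024  # bytes


def mirrored_share(write_sizes, limit=SMALL_WRITE_LIMIT):
    """Return the fraction of written bytes that arrive as small writes."""
    total = sum(write_sizes)
    if total == 0:
        return 0.0
    small = sum(size for size in write_sizes if size < limit)
    return small / total


if __name__ == "__main__":
    # Hypothetical workload: many 4 KiB writes plus a few 8 MiB writes.
    workload = [4096] * 1000 + [8 * 1024 * 1024] * 10
    print(f"share of bytes in small writes: {mirrored_share(workload):.1%}")
```

A share close to 1.0 indicates a workload in which almost every byte is also sent to the partner I/O node, which is the case where the network between the two I/O nodes matters most.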
To improve write performance, a dedicated (preferably InfiniBand) network between the ESS I/O nodes is recommended. IBM Spectrum Scale supports communication across multiple daemon networks using GPFS subnets. If an InfiniBand network is used, you can use either TCP/IP over InfiniBand or the RDMA InfiniBand protocol based on the VERBS API.

How to determine write pattern

The viostat command under the sample vdisk directory helps you identify reads, writes, and the various types of writes as implemented in the ESS solution.
The main columns in the viostat command output are:

  • The total number of read operations.

  • The total number of short write operations.

  • The total number of medium write operations.

  • The total number of promoted full track write operations.

  • The total number of full track write operations.

  • The total number of flushed update write operations.

  • The total number of flushed promoted full track write operations.

  • The total number of migrate operations.

A short write received from the IBM Spectrum Scale client is first written to the NVRAM of one ESS I/O node and synchronously mirrored to the second ESS I/O node in the building block. The write is acknowledged only after it has been written on both servers. This I/O pattern creates a lot of network traffic between the two nodes.

The following example shows the write pattern with a small write:

In this example, tcpdump was run on one of the ESS I/O nodes of the building block while a client performed a small write on the file system served by those I/O nodes. Because the write must be mirrored to the second ESS I/O node, the traffic sent and received between the I/O nodes is almost double the traffic from the client.

In the second example, the client performs a full track write. In this case, the traffic between the two I/O nodes is light because the writes are not mirrored but written directly to disk.

The tcpdump output for this case shows the conversation between the client and the ESS I/O node; there is not much traffic between the ESS I/O server nodes.
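A comparison like the two above can be repeated on any capture by summing the payload bytes per pair of hosts. The following sketch parses the text output of tcpdump run with -n (numeric addresses) and adds up the "length" field for each conversation. The sample lines and all IP addresses are hypothetical and are not taken from the original captures.

```python
import re
from collections import Counter

# Matches IPv4 lines such as:
#   10:15:01.000001 IP 192.0.2.10.40112 > 192.0.2.21.1191: Flags [P.], ... length 4096
LINE_RE = re.compile(
    r"IP (?P<src>\d+\.\d+\.\d+\.\d+)\.\d+ > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.\d+:.*\blength (?P<len>\d+)"
)


def bytes_per_pair(lines):
    """Sum payload bytes per unordered host pair, in both directions."""
    totals = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            pair = tuple(sorted((match["src"], match["dst"])))
            totals[pair] += int(match["len"])
    return totals


if __name__ == "__main__":
    # Hypothetical capture: .10 is a client, .21 and .22 are the two I/O nodes.
    sample = [
        "10:15:01.000001 IP 192.0.2.10.40112 > 192.0.2.21.1191: Flags [P.], seq 1:4097, ack 1, win 512, length 4096",
        "10:15:01.000050 IP 192.0.2.21.51000 > 192.0.2.22.1191: Flags [P.], seq 1:4097, ack 1, win 512, length 4096",
        "10:15:01.000090 IP 192.0.2.22.1191 > 192.0.2.21.51000: Flags [P.], seq 1:101, ack 4097, win 512, length 100",
    ]
    for (a, b), total in sorted(bytes_per_pair(sample).items()):
        print(f"{a} <-> {b}: {total} bytes")
```

Comparing the client-to-I/O-node total with the I/O-node-to-I/O-node total shows whether the workload behaves like the small-write case or the full track write case.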


Networking is an integral part of any Spectrum Scale/ESS deployment. With proper network planning, customers can realize the full performance gains that a Spectrum Scale/ESS system can bring. If you need help planning or tuning a Spectrum Scale deployment, or any technical support, please reach out to your IBM representative.