File and Object Storage

A Complete Guide to - Protocol Problem Determination Guide for IBM Spectrum Scale™ - Monitoring

By Archive User posted Tue December 19, 2017 09:27 AM

  
Hello everyone,
In this article I discuss the events and causes behind authentication and protocol issues in IBM Spectrum Scale™, and how to monitor the system and determine the root cause of these issues.

I have divided this topic into three parts, each covered in its own blog article:
1. Monitoring IBM Spectrum Scale™ protocol and authentication components
2. Log collection for the issue using the available methods
3. Some known use cases

This post covers Part 1, Monitoring. Part 2 continues with log collection, and Part 3 covers the known use cases.

Monitoring IBM Spectrum Scale™:


Now, to monitor the different components we can use the CLI command:
# mmces state show
This command will display the state of each component in the following format:

NODE | AUTH | BLOCK | NETWORK | AUTH_OBJ | NFS | OBJ | SMB | CES
cesnode1 | HEALTHY | DISABLED | HEALTHY | DISABLED | HEALTHY | DISABLED | HEALTHY | STARTING
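
When many nodes are involved, it can help to filter this output down to the components that need attention. The following is a minimal sketch, not part of the product: the function name ces_attention is hypothetical, and it assumes the output has a header row starting with NODE followed by one row per node, as in the sample above. The "|" separators are treated as whitespace, so plain column-aligned output should parse the same way.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every component that is neither HEALTHY nor DISABLED.
# Reads "mmces state show" style output on stdin.
ces_attention() {
  awk '
    { gsub(/\|/, " ") }                 # treat "|" separators as whitespace
    $1 == "NODE" {                      # header row: remember the column names
      for (i = 2; i <= NF; i++) name[i] = $i
      next
    }
    NF > 1 {                            # node row: report states needing attention
      for (i = 2; i <= NF; i++)
        if ($i != "HEALTHY" && $i != "DISABLED")
          print $1, name[i], $i
    }
  '
}
```

On a CES node this could be used as "mmces state show | ces_attention". For the sample output above it prints the single line "cesnode1 CES STARTING".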


The columns correspond to the following CES components and their monitoring tasks. AUTH and AUTH_OBJ are the ones directly related to authentication:

AUTH - Tasks: Monitors LDAP-, AD-, and NIS-based authentication services.
AUTH_OBJ - Tasks: Monitors the OpenStack identity service functionality.
NETWORK - Tasks: Monitors CES network-related adapters and IP addresses.
BLOCK - Tasks: Checks whether the iSCSI daemon is functioning properly.
NFS - Tasks: Monitors NFS-related functionality.
OBJ - Tasks: Monitors the IBM Spectrum Scale object functionality.
SMB - Tasks: Monitors SMB-related functionality such as the smbd process, the ports, and the ctdb processes.


Only a few of the above components are affected when there is an issue in the FILE protocol stack of IBM Spectrum Scale™.
Each component can be in any of these states:

HEALTHY - The component is working as expected.
DISABLED - The component has not been enabled.
SUSPENDED - When a CES node is in the suspended state, most components also report SUSPENDED.
STARTING - The component (or monitor) recently started. This state is a transient state that is updated after the startup is complete.
UNKNOWN - Something is preventing the monitoring from determining the state of the component.
STOPPED - The component was intentionally stopped. This situation might happen briefly if a service is being restarted due to a configuration change. It might also happen because a user ran the mmces service stop protocol command for a node.
DEGRADED - There is a problem with the component but not a complete failure. This state does not cause the CES addresses to be reassigned.
FAILED - The monitoring detected a significant problem with the component that means it is unable to function correctly. This state causes the CES addresses of the node to be reassigned.
DEPENDENCY_FAILED - This state implies that a component has a dependency that is in a failed state. An example would be NFS or SMB reporting DEPENDENCY_FAILED because the authentication failed.
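
The state descriptions above can be condensed into a small lookup for use in scripts. This is a sketch only: the function name ces_state_action and the wording of the returned hints are my own, and the hints simply paraphrase the list above rather than any official severity mapping.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a component state to a short hint taken from the
# state descriptions in this article.
ces_state_action() {
  case "$1" in
    HEALTHY)            echo "ok" ;;
    DISABLED)           echo "ignore: component not enabled" ;;
    STARTING)           echo "wait: transient state" ;;
    SUSPENDED|STOPPED)  echo "check: intentional stop or suspend" ;;
    DEGRADED)           echo "investigate: CES addresses stay on the node" ;;
    FAILED)             echo "fix: CES addresses are reassigned" ;;
    DEPENDENCY_FAILED)  echo "fix the dependency first" ;;
    *)                  echo "unknown: monitoring cannot determine state" ;;
  esac
}
```

For example, "ces_state_action DEGRADED" prints "investigate: CES addresses stay on the node". Any value not listed, including UNKNOWN, falls through to the last branch.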


If a component is in the FAILED state, an issue has been detected that caused the node to be marked as failed. To return the node to service, resolve the issue that caused the failure. When you run the mmces commands, ensure that they are executed on a CES node and that IBM Spectrum Scale is started on that node.

Now let's look at the components under the FILE protocol that could fail:

Authentication


SSSD process not running (sssd_down)
YPBIND process not running (yp_down)
Cause: The SSSD or YPBIND process is not running.
Determination: "mmces state show auth" to see the current state of the auth component
"mmces events active auth" to see the active events for auth
"mmuserauth service list" to see the current authentication configuration
"mmuserauth service check -N cesNodes --server-reachability" to check the state of the authentication configuration across the cluster
Solution: "mmuserauth service check -N cesNodes --rectify" to rectify the configuration.
Note: Server reachability cannot be rectified using the --rectify flag.
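
The determination steps above can be collected into one script so that they always run in the same order. This is a sketch under assumptions: the run and auth_triage function names and the DRY_RUN variable are hypothetical conveniences, and the commands are exactly the four quoted above. With DRY_RUN=1 the commands are only printed, which is useful for reviewing them before running on a CES node.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Run the authentication determination steps from this article in order.
auth_triage() {
  run mmces state show auth
  run mmces events active auth
  run mmuserauth service list
  run mmuserauth service check -N cesNodes --server-reachability
}
```

The --rectify step is deliberately left out of the function, because it changes the configuration; it seems safer to run it by hand once the checks have been reviewed.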


SMB


Winbind process not running (wnbd_down)
Cause: Winbind process is not running.
Determination: Same as above
Solution: In addition to the above steps one is required to run "mmces service stop smb -N


NFS (Error events)


NFS is not active (nfs_alive_down)
Cause: A statistics query indicates that Ganesha is not responding.
Determination: Investigate the NFS logs to determine the cause (TODO: Add link to guide)
Solution: Restart Ganesha on the local CES node using the following commands:
a) # mmces service stop nfs
b) # mmces service start nfs


Ganesha NFSD process not running (nfsd_down)
Cause: The Ganesha server process is no longer running.
Determination: Investigate the NFS logs to determine the cause (TODO: Add link to guide)
Solution: Restart Ganesha on the local CES node using the following commands:
a) # mmces service stop nfs
b) # mmces service start nfs
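
Both NFS events above share the same stop and start sequence, so it can be wrapped in a small function. This is only a sketch: the names run and restart_nfs and the DRY_RUN variable are hypothetical, and the final "mmces state show nfs" check is my addition, following the "mmces state show auth" form used earlier in this article.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Restart Ganesha on the local CES node, then show the NFS component state.
# Each step runs only if the previous one succeeded.
restart_nfs() {
  run mmces service stop nfs &&
  run mmces service start nfs &&
  run mmces state show nfs
}
```

Note that the component can report STARTING for a short time after the restart, so the state may need to be checked again a little later.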


Portmapper port 111 is not active (portmapper_down)
Cause: An RPC call to port 111 failed or timed out.
Solution: Check whether portmapper (rpcbind) is running.
Check whether portmapper (rpcbind) is configured to start automatically at system startup.
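
One way to check whether portmapper is answering is to look for it in the output of "rpcinfo -p". The following sketch assumes the usual five-column rpcinfo layout (program, vers, proto, port, service); the function name portmapper_registered is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if "rpcinfo -p" style input on stdin lists
# the portmapper service on port 111, fail otherwise.
portmapper_registered() {
  awk '$4 == 111 && $5 == "portmapper" { found = 1 } END { exit !found }'
}
```

A possible use is "rpcinfo -p localhost | portmapper_registered && echo registered". On systemd-based distributions, and assuming the service unit is named rpcbind, "systemctl is-enabled rpcbind" shows whether it is set to start at system startup.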

