Global Mailbox: Monitoring

By Scott Guminy posted Fri December 06, 2024 12:06 PM

  

What is Global Mailbox?

IBM Sterling Global Mailbox helps companies address demands for high availability operations and redundancy with a robust and reliable data storage solution available across geographically distributed locations.  It is an add-on to Sterling B2B Integrator and Sterling File Gateway.

How does Global Mailbox work?

Global Mailbox uses several key concepts to provide a highly resilient B2Bi or SFG deployment.

Redundancy

Each deployment includes multiple instances of each component within the datacenter to ensure that services are always available.

The solution is deployed across multiple data centers to ensure that if there is a full data center outage, there’s another data center to accept requests and provide business continuity.

Data Replication

Mailbox data is replicated within and across data centers to reduce the risk of losing data. The system always stores multiple copies of mailbox data across multiple servers.

What components are used in Global Mailbox?

A typical Global Mailbox production environment includes several components that facilitate redundancy and data replication.  Below is the recommended topology for a 2 Data Center deployment.

See Global Mailbox:  Components and deployments for more details on each component and the role they play.  This article will discuss how to monitor the various components.

What is IBM Control Center Monitor?

IBM® Sterling® Control Center Monitor tracks the critical events across your B2B and managed file transfer (MFT) infrastructure for improved operations, customer service and B2B governance.

Global Mailbox integrates with IBM Control Center Monitor.  The system sends events to Control Center to share information about the health of the Global Mailbox system. Administrators can set up alerts for various situations that may occur. 

It’s very important to react to outages in a timely manner.  If outages are not resolved within certain timeframes, data such as deleted or processed files can be resurrected and cause difficulties for your business.

Out-of-box Monitoring of external Global Mailbox components

IBM Control Center can monitor components external to Global Mailbox.  If Global Mailbox detects that an external component is not working, it will send an event to Control Center so that someone can be alerted.

Apache Cassandra

Cassandra is a replicated, fault-tolerant NoSQL database.  Global Mailbox metadata is stored in Cassandra.  A Cassandra cluster includes many nodes in each data center (see deployment diagram).

Global Mailbox does not monitor each individual Cassandra node.  Instead, Global Mailbox sends events to Control Center if the connections or queries fail for any reason. Because of this, Control Center represents Cassandra as a single Service, rather than a collection of nodes.

Each Global Mailbox node reports Cassandra status to Control Center.  If the Global Mailbox node indicates that Cassandra isn’t working, Control Center will indicate which Global Mailbox node is having problems working with the Cassandra service.

A node could report a problem with Cassandra for various reasons:

  • Network issues preventing Global Mailbox Nodes from connecting to Cassandra
  • Improper SSL configuration preventing connection
  • Cassandra nodes being down
  • Queries failing due to consistency problems
  •  etc.

Here’s a view of how Control Center shows connections to Cassandra.  The black lines from the 2 Global Mailbox nodes indicate that the connections between those nodes and the Cassandra cluster are healthy.  A red line between a node and a service indicates that the node has a problem using the service, or that the status is not known:

As shown above, both Global Mailbox nodes have healthy connections to the Cassandra service within their data center. 

Note: the Global Mailbox node is showing a red line to the other data center.  This line represents the Aspera FASP connections which are needed for replication.  Since this node has not replicated any files after startup, the status is “unknown”.

If enough Cassandra servers go down, the Global Mailbox Client Adapter (GMCA) will have problems querying Cassandra:

You can “drill down” on a connection to reveal more information by clicking on the line.  Here’s what you see when you click on the red line between the Global Mailbox node and the Cassandra service:

The reason for the failure is that not enough replicas are available to achieve LOCAL_QUORUM consistency.  This is shown as the Gm/Reason in the More Information table.
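
To see why losing servers leads to this error, it helps to work through the quorum arithmetic.  Cassandra defines a quorum as floor(replication factor / 2) + 1, and LOCAL_QUORUM applies that rule to the replicas in the local data center.  The sketch below assumes a replication factor of 3 per data center purely for illustration; check your keyspace definition for the real value.

```python
def local_quorum(replication_factor: int) -> int:
    """Replicas that must respond for LOCAL_QUORUM in one data center."""
    return replication_factor // 2 + 1


def can_reach_local_quorum(replication_factor: int, replicas_up: int) -> bool:
    """True if enough local replicas are alive to satisfy LOCAL_QUORUM."""
    return replicas_up >= local_quorum(replication_factor)


if __name__ == "__main__":
    rf = 3  # assumed replication factor per data center (illustrative only)
    for up in range(rf, -1, -1):
        ok = can_reach_local_quorum(rf, up)
        print(f"RF={rf}, replicas up={up}: LOCAL_QUORUM {'ok' if ok else 'fails'}")
```

Under that assumption, one unavailable replica is tolerated, but a second one causes LOCAL_QUORUM queries for the affected data to fail.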

You can be notified specifically of a Cassandra problem by setting up an alert for the message ID GMCAS0001E:

Cassandra is fundamental to the operation of any Global Mailbox node.  If Cassandra (or any other required service) is not working, the node is also marked in ERROR state.  This is shown with the yellow warning overlay on the node's icon:

You can drill down to see additional details on the node.  Here you see that Global Mailbox sent a GMMGT0001E event to indicate that the node is in ERROR state:

Instead of monitoring for a Cassandra error, you can monitor for GMMGT0001E which indicates the Global Mailbox node is in error.  This would cover all reasons why this node is not working.

Apache ZooKeeper

ZooKeeper is a distributed co-ordination service.  In a distributed computing environment ZooKeeper helps co-ordinate activities across nodes and data centers. Each data center requires multiple ZooKeeper servers to ensure the system is resilient. 

Global Mailbox does not monitor each individual ZooKeeper node.  Instead, Global Mailbox sends events to Control Center if the ZooKeeper actions fail for any reason. Because of this, Control Center represents ZooKeeper as a single Service, rather than a collection of nodes.

Each Global Mailbox node reports ZooKeeper status to Control Center.  If the Global Mailbox node indicates that ZooKeeper isn’t working, Control Center will indicate which node is having problems working with the ZooKeeper service.

Like the Cassandra service, if a node reports that ZooKeeper is not working, Control Center will show a red line between that node and the ZooKeeper service:

When you drill down by clicking on the red line, you can see the reason why ZooKeeper operations are failing:

You can create an alert for ZooKeeper problems by monitoring for the message ID GMZOO0001E.   

ZooKeeper is a required service for all GM Nodes.  If ZooKeeper is not working, Global Mailbox will send events to Control Center indicating that the Global Mailbox nodes are in ERROR state.
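
Alerts themselves are configured in Control Center, but the three message IDs mentioned so far are also handy when searching exported event text or logs.  The following is a minimal sketch of such a search.  The message IDs come from this article; the sample line format and the function name are hypothetical and do not reflect a Control Center export format.

```python
# Message IDs named in this article and what they signal.
ALERT_MESSAGE_IDS = {
    "GMCAS0001E": "Cassandra service problem reported by a node",
    "GMZOO0001E": "ZooKeeper service problem reported by a node",
    "GMMGT0001E": "Global Mailbox node is in ERROR state",
}


def find_alerts(lines):
    """Return (message_id, description, line) for each line containing a known ID."""
    hits = []
    for line in lines:
        for msg_id, description in ALERT_MESSAGE_IDS.items():
            if msg_id in line:
                hits.append((msg_id, description, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical event lines, for illustration only.
    sample = [
        "2024-12-06 12:00:01 host1 (GMCA) GMCAS0001E query failed",
        "2024-12-06 12:00:02 host1 (GMCA) heartbeat",
        "2024-12-06 12:00:03 host1 (GMCA) GMMGT0001E node in ERROR state",
    ]
    for msg_id, description, line in find_alerts(sample):
        print(f"{msg_id}: {description} <- {line}")
```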

Shared disk

The shared disk is used to store payloads for files that are uploaded to Global Mailbox.

If the shared disk is not readable or writable, a Global Mailbox node will send events to Control Center indicating the shared disk has problems for that node.  The Global Mailbox node will then go into ERROR state until the problem is resolved.
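
If you want an independent check of the shared disk from each host, a small read/write probe is one option.  The sketch below is not how Global Mailbox performs its own check; it simply writes a temporary file under a directory, reads it back, and lets it be deleted.  The mount path shown in the usage is a placeholder.

```python
import os
import tempfile


def shared_disk_usable(path: str) -> bool:
    """Write, read back, and remove a small probe file under `path`."""
    try:
        # The probe file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=path, prefix=".gm_probe_") as probe:
            probe.write(b"probe")
            probe.flush()
            os.fsync(probe.fileno())  # push the write through to the disk
            probe.seek(0)
            return probe.read() == b"probe"
    except OSError:
        # Missing mount, permission problem, full disk, I/O error, ...
        return False


if __name__ == "__main__":
    mount = "/path/to/shared/disk"  # placeholder; use your actual mount point
    print("usable" if shared_disk_usable(mount) else "NOT usable")
```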

IBM MQ

Global Mailbox uses IBM MQ to tell B2Bi/SFG that a file has arrived and needs processing/routing.  It does this by putting an MQ Event on a queue.  If IBM MQ is down, these events will not be sent and files will not be processed.

If a Global Mailbox node cannot put events on the queue, that node will send an event to Control Center indicating the IBM MQ service is not working. The Global Mailbox node will then go into ERROR state until the problem is resolved.

Data Centers

Control Center provides an overall view of the status of Global Mailbox data centers. 

The Control Center dashboard shows a circle for each Data Center.  The circle is divided into sections for each node in the Data Center.  If a node is down or the node cannot use one of the required services, that section will be yellow or red.  If a node is up and working fine, the section will be green.

Here's an example of a 2 Data Center deployment.  In Data Center DC1, one node is having problems with a required service.

Clicking on the warning section of the DC1 circle allows you to drill down to the details.

Global Mailbox Nodes

Control Center shows 2 Global Mailbox "Nodes" for each SFG/B2Bi installation:

  1. The Global Mailbox Admin Node

    This is the WebSphere Liberty process that runs the Global Mailbox UI, Scheduler and Replicator.

  2. The Global Mailbox Client Adapter

    This is the adapter inside B2Bi/SFG that protocols and the business process engine use to interact with Global Mailbox.  If this adapter is not running, B2Bi/SFG cannot perform any Global Mailbox actions.

Every SFG/B2Bi installation has these two nodes.  Both nodes will be indicated with the SFG/B2Bi hostname.  The Global Mailbox Client Adapter is indicated with the text "(GMCA)", and the GM Admin node is indicated with the text "(GM)".

Each of these Nodes sends heartbeat events to Control Center.  If Control Center stops receiving heartbeat events from a node, it will mark that node as DOWN.  Additionally, if a required service (Cassandra, ZooKeeper, etc.) is not working for that node, the node will be put in ERROR state until the problem is resolved.
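
The rule described above can be sketched as a small state function.  This is an illustration of the logic only, not Control Center's implementation: the timeout value, the "OK" state name, and the `NodeView` type are assumptions made for the example.

```python
from dataclasses import dataclass, field

HEARTBEAT_TIMEOUT_SECONDS = 120.0  # hypothetical value, not a product default


@dataclass
class NodeView:
    name: str                      # e.g. "host1 (GMCA)" or "host1 (GM)"
    last_heartbeat: float          # epoch seconds of the latest heartbeat
    failing_services: set = field(default_factory=set)


def node_state(node: NodeView, now: float,
               timeout: float = HEARTBEAT_TIMEOUT_SECONDS) -> str:
    """DOWN if heartbeats stopped, ERROR if a required service fails, else OK."""
    if now - node.last_heartbeat > timeout:
        return "DOWN"
    if node.failing_services:
        return "ERROR"
    return "OK"


if __name__ == "__main__":
    now = 1_000.0
    nodes = [
        NodeView("host1 (GM)", last_heartbeat=990.0),
        NodeView("host1 (GMCA)", last_heartbeat=990.0, failing_services={"Cassandra"}),
        NodeView("host2 (GMCA)", last_heartbeat=100.0),
    ]
    for n in nodes:
        print(n.name, node_state(n, now))
```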

Out-of-box Monitoring of Global Mailbox activities

Global Mailbox sends events to Control Center about specific activities you may want to monitor.

Message creation

This tracks the creation of a message (file) in Global Mailbox.  You can use this to track all the files uploaded to the system. 

In Control Center a “Process” is created for each message creation.  To find all Processes, open the Control Center UI, choose Monitor, then choose Completed Processes on the right.  All Processes will be displayed in a table.

The process name for Global Mailbox message creation is “CreateMessage”.  You can use the filtering capability to add a filter to show only these processes.

You can drill down on the CreateMessage process by clicking the Process ID to see any additional details.

Payload replication

This tracks the replication of payloads between data centers.

The process name for Global Mailbox payload replication is “ReplicatePayload”.  You can use the filtering capability to add a filter to show only these processes.

You can drill down on the payload replication to view how many segments are part of the payload and other details.

If replication is important to you, you may want to monitor these processes to ensure that files received are replicated to all data centers.
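
One way to reason about this check is as a reconciliation between the two process types: every message with a CreateMessage process should eventually have a matching ReplicatePayload process.  The sketch below assumes you have process records available as dictionaries with `process_name` and `message_id` keys.  Both keys are hypothetical; only the process names come from this article, so adapt the field names to whatever your data actually contains.

```python
def unreplicated_messages(processes):
    """Return message IDs that were created but have no replication record."""
    created = set()
    replicated = set()
    for p in processes:
        if p["process_name"] == "CreateMessage":
            created.add(p["message_id"])
        elif p["process_name"] == "ReplicatePayload":
            replicated.add(p["message_id"])
    return sorted(created - replicated)


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    records = [
        {"process_name": "CreateMessage", "message_id": "m1"},
        {"process_name": "ReplicatePayload", "message_id": "m1"},
        {"process_name": "CreateMessage", "message_id": "m2"},
    ]
    print(unreplicated_messages(records))
```

Replication takes some time, so a recently created message can legitimately appear in the result; in practice you would only flag messages older than a grace period of your choosing.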

Event publishing

This tracks the sending of events to trigger file routing/processing.

The process name for Global Mailbox event publishing is “PublishEvent”.  You can use the filtering capability to add a filter to show only these processes.

What components do not have out-of-box monitoring?

Cassandra Reaper

Cassandra Reaper is an open source software package which constantly performs repairs to resynchronize nodes.  The reaper starts when you start Cassandra.

It is important that the Reaper is running at all times to prevent data from being resurrected.

If data is resurrected, files may reappear and be processed again.  This could have significant business impact. It can be difficult to clean up resurrected messages since Cassandra doesn’t provide any indication of resurrected data.

The Reaper has a web UI  and a command line tool to show the status of repairs of the Cassandra cluster.  See the Global Mailbox documentation on the Reaper for more information.

Control Center DOES NOT monitor the Reaper.

It’s very important to ensure that the Reaper is running and that it is doing repairs.  There are several ways to monitor the Reaper:

The Web UI

This UI will indicate if repairs are happening.  However, relying on a person to check the Reaper UI regularly is not a great solution.

Process Monitoring

You could monitor whether the Reaper process is running.  This approach has a flaw: the process may be running without actually performing repairs because of errors.

Log File Monitoring

Monitoring the log file for errors is a good way to detect repairs that are failing.

Recommendation

Use a combination of Process Monitoring and Log File Monitoring to ensure the Reaper is always running and not having any errors during the repair process.
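
A minimal sketch of that combination follows.  It assumes a Unix-like host where the `pgrep` command is available, and it treats any log line containing the word ERROR or FATAL as a problem.  The process pattern and log path in the usage are placeholders, and the log format is an assumption, so adjust both to match your installation.

```python
import re
import subprocess

# Assumed convention: problem lines contain the word ERROR or FATAL.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL)\b")


def process_running(pattern: str) -> bool:
    """True if `pgrep -f` finds a process whose command line matches `pattern`."""
    result = subprocess.run(
        ["pgrep", "-f", pattern],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0  # pgrep exits with 0 only when something matched


def error_lines(log_lines):
    """Return the log lines that look like errors."""
    return [line.rstrip("\n") for line in log_lines if ERROR_PATTERN.search(line)]


def check(pattern: str, log_path: str) -> list:
    """Collect problems from both the process check and the log check."""
    problems = []
    if not process_running(pattern):
        problems.append(f"no process matching {pattern!r}")
    with open(log_path, encoding="utf-8", errors="replace") as log:
        problems.extend(f"log error: {line}" for line in error_lines(log))
    return problems


if __name__ == "__main__":
    # Placeholders: substitute the real process pattern and log location.
    for problem in check("reaper", "/path/to/reaper.log"):
        print(problem)
```

This version rereads the whole log on every run.  A scheduled job would normally remember how far it has already read, or hand the log to an existing log monitoring tool.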

ZooKeeper Watchdog

The ZooKeeper watchdog is specific to Global Mailbox and not part of standard ZooKeeper installations.  The watchdog monitors the ZooKeeper ensemble.  When it detects that quorum is lost, it temporarily removes unreachable nodes from the ensemble so that quorum can still be achieved. Once the nodes come back, they are re-added to the ensemble.
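
The effect of the watchdog is easier to see with numbers.  ZooKeeper needs a strict majority of the configured ensemble to be reachable.  The sketch below is a simplified model of the behavior described above, not the watchdog's actual code, and the ensemble size of 6 (3 servers in each of 2 data centers) is an assumption chosen for the example.

```python
def has_quorum(ensemble_size: int, reachable: int) -> bool:
    """ZooKeeper needs a strict majority of the configured ensemble."""
    return reachable > ensemble_size // 2


def after_watchdog(ensemble_size: int, reachable: int):
    """Simplified model: if quorum is lost, drop the unreachable members."""
    if has_quorum(ensemble_size, reachable):
        return ensemble_size, reachable   # nothing to do
    return reachable, reachable           # ensemble shrinks to reachable servers


if __name__ == "__main__":
    ensemble, reachable = 6, 3            # a partition isolates one data center
    print("before:", has_quorum(ensemble, reachable))
    ensemble, reachable = after_watchdog(ensemble, reachable)
    print("after: ", has_quorum(ensemble, reachable))
```

In this model, 3 reachable servers out of 6 is not a majority, so quorum is lost until the ensemble is reduced to the 3 reachable servers.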

The watchdog must always be running.  It is required in cases of network partition to keep the system functioning. 

Control Center does not monitor the ZooKeeper Watchdog.

Use process monitoring and log file monitoring to ensure the ZooKeeper Watchdog is running and does not have any errors.

Enabling out-of-box Monitoring with IBM Control Center

By default, monitoring of Global Mailbox is not enabled.

After you have installed and configured IBM Control Center, you can configure Global Mailbox to start sending events.

See Monitoring with IBM Control Center in the Global Mailbox documentation for details on how to enable monitoring.

You can enable the following monitoring:

System heartbeats

Heartbeat events tell Control Center that the internal Global Mailbox components (Global Mailbox Admin Server, Global Mailbox Client Adapter) are running.  Events are sent periodically so that Control Center knows these components are running.  If Control Center does not receive an event, it could mean these components are not running.

System components

Component events tell Control Center if the external components (Cassandra, ZooKeeper, etc) are operational from a Global Mailbox perspective.

System activity

Global Mailbox will send events about Global Mailbox activities such as MessageCreation and PayloadReplication.

Additional considerations for Apache Cassandra and ZooKeeper

The approach Global Mailbox takes to monitoring is to send events to Control Center when the Cassandra service or ZooKeeper service has stopped working.

This means that a problem must occur before Control Center is aware of it.  In normal situations another data center will take over and that data center may not have any problems with Cassandra or ZooKeeper.

It is very important to react to Cassandra node failures early.  The primary reason is that if a node is down too long, it could result in data being resurrected when that node is brought back. 
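
In stock Cassandra, the relevant window is the table-level gc_grace_seconds setting, which defaults to 864000 seconds (10 days).  Delete markers (tombstones) can be purged after that period, so a node that missed a delete and was not repaired in time can reintroduce the deleted data.  The sketch below assumes that default; a Global Mailbox deployment may use a different value, so confirm it for your tables.  The state names and the warning threshold are illustrative choices.

```python
from datetime import datetime, timedelta, timezone

# Cassandra's default gc_grace_seconds (10 days); verify the value for your tables.
GC_GRACE = timedelta(seconds=864_000)


def resurrection_risk(down_since: datetime, now: datetime,
                      grace: timedelta = GC_GRACE,
                      warn_fraction: float = 0.5) -> str:
    """Classify how close a down node is to the end of the grace window."""
    elapsed = now - down_since
    if elapsed >= grace:
        return "EXCEEDED"   # bringing the node back as-is risks resurrected data
    if elapsed >= grace * warn_fraction:
        return "WARNING"    # act soon
    return "OK"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for days in (1, 6, 11):
        print(f"down {days} day(s):", resurrection_risk(now - timedelta(days=days), now))
```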

With the out-of-box monitoring, a single Cassandra node failure won’t trigger a “Cassandra down” event in  Control Center.  This is because of Cassandra’s redundancy. Global Mailbox nodes would continue to work and you would never know you’re in a situation where data could be resurrected.

It might be advantageous to increase the robustness of your monitoring of Cassandra and ZooKeeper by monitoring individual nodes.  This monitoring should ensure that the nodes are running (the process is running) and that there are no errors in the log files.
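
For Cassandra, one possible starting point is the output of `nodetool status`, which lists each node with a two-letter code: the first letter is U (up) or D (down), and the second is the state (N, L, J or M for normal, leaving, joining or moving).  The sketch below parses that output and returns the addresses of down nodes.  The sample text uses made-up addresses and shortened host IDs, and exact column layout can vary between Cassandra versions.

```python
import re

# Matches node rows such as "UN  10.0.0.1 ..." or "DN  10.0.0.2 ...".
NODE_LINE = re.compile(r"^([UD])([NLJM])\s+(\S+)")

# Illustrative sample only; addresses and host IDs are made up.
SAMPLE = """\
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load       Tokens  Owns (effective)  Host ID   Rack
UN  10.0.0.1  103.5 KiB  256     100.0%            aaaa1111  rack1
DN  10.0.0.2  98.2 KiB   256     100.0%            bbbb2222  rack1
UN  10.0.0.3  101.7 KiB  256     100.0%            cccc3333  rack1
"""


def down_nodes(nodetool_status_output: str):
    """Return the addresses of nodes that nodetool reports as down."""
    down = []
    for line in nodetool_status_output.splitlines():
        match = NODE_LINE.match(line.strip())
        if match and match.group(1) == "D":
            down.append(match.group(3))
    return down


if __name__ == "__main__":
    print(down_nodes(SAMPLE))
```

In a scheduled check you would feed the function the captured output of `nodetool status` rather than a sample string, and raise an alert as soon as the list is non-empty.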

Summary

Monitoring of any system is important to ensure the system is healthy and functioning properly.  Global Mailbox integrates with IBM Control Center Monitor to allow you to set up alerts for various failures that may impact your business continuity.

However, not all components are monitored, and it’s important to increase the robustness of the monitoring with custom approaches to ensure that you react quickly to problems and avoid data resurrection.
