B2B Integration


Global Mailbox: All about Cassandra

By Scott Guminy posted Wed January 05, 2022 08:00 AM

  

This is part of a multi-part series on IBM Sterling Global Mailbox.

What is Global Mailbox?

IBM Sterling Global Mailbox helps companies address demands for high availability operations and redundancy with a robust and reliable data storage solution available across geographically distributed locations.  It is an add-on to Sterling B2B Integrator and Sterling File Gateway.

What is Cassandra?

Cassandra is an open source, fault-tolerant, distributed (replicated), NoSQL database.  It is the database used by Global Mailbox to store metadata.

For Global Mailbox, multiple Cassandra nodes are deployed in multiple data centers.  The Cassandra nodes replicate data within and across data centers. If one Cassandra node goes down, there is another node which can fulfill the request.

How is data organized in Cassandra?

Cassandra organizes data first by keyspaces.  Each keyspace has a “replica count” (replication factor) which determines how many copies of the data should be stored in each data center of the cluster. For example, the mailbox keyspace might be configured with 3 copies of the data in DC1 but 2 copies in DC2.  Keyspaces are similar to schemas in a relational database.

Each keyspace holds a set of tables, indices, etc., similar to a relational database.  Tables are organized into columns.  Each table has one or more columns that comprise the primary key.
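
For illustration, a keyspace with the replication layout described above could be created in CQL along these lines. The keyspace, table, and column names here are made up for the example; they are not the actual Global Mailbox schema.

    CREATE KEYSPACE example_mailbox
      WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'DC1': 3,
        'DC2': 2
      };

    -- A simple table whose primary key is (mailbox_id, message_id)
    CREATE TABLE example_mailbox.messages (
      mailbox_id uuid,
      message_id timeuuid,
      name       text,
      PRIMARY KEY (mailbox_id, message_id)
    );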

All data is assigned a timestamp.  Due to the distributed, fault-tolerant nature of Cassandra, different nodes can have different copies of the data. When this happens, the timestamp is used to resolve the conflict: the data with the latest timestamp wins.  Therefore, in a Global Mailbox deployment it is mandatory that the clocks on all nodes (B2Bi and Cassandra) be synchronized as closely as possible to avoid problems with conflict resolution.  If the nodes do not have synchronized clocks, unexpected behaviour can happen.
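
If you are curious about the timestamp Cassandra has recorded for a value, cqlsh can show it with the WRITETIME function. The table and column names below are the illustrative ones from the sketch above, not real Global Mailbox tables.

    -- Returns each value alongside its write timestamp (microseconds since the epoch)
    SELECT name, WRITETIME(name) FROM example_mailbox.messages;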

While Cassandra has tables like a relational database, it has many differences.  For more information about Cassandra see the Cassandra Documentation on the Apache website.

What data is stored in Cassandra?

Mailbox data can be broken into two categories:

  • Payload: the payload is the file content itself, for example the contents of a payment file.  Payloads are stored on disk and referenced by the metadata.
  • Metadata: metadata is data that describes the mailboxes, mailbox hierarchy, mailbox permissions, messages, etc.  Essentially, it’s everything but the payload.

Metadata is stored in Cassandra and Cassandra ensures it’s replicated across all data centers.

What is the CAP Theorem?

Global Mailbox is a distributed datastore which stores copies of mailbox data across nodes in many data centers.  The CAP theorem describes the properties and limitations of a distributed datastore.  This theorem drives the capability and trade-offs in Global Mailbox.

The CAP theorem says you can only guarantee two of the following at any given time:

Consistency

Every read must return the most recent data or an error.

Availability

Every request to read or store data receives a response, even if it may not have the most recent data. Sometimes referred to as “fault-tolerance”.

Partition Tolerance

The datastore continues to function despite any arbitrary number of messages being lost or delayed.  A good example of this is when two data centers cannot talk to each other due to a networking problem.  This is called a network partition.

How to choose two of the three guarantees?

Many would argue that network communication is inherently flaky: there can be delays, and networks can go down.  In most distributed deployments, you must tolerate network partitions in your environment.

This leaves a choice between guaranteeing Consistency (always reading the most up to date data) or Availability (always getting a response from the datastore but maybe with outdated information).

With Cassandra, you don’t have to choose one or the other.  Cassandra offers tunable or configurable consistency levels.  There are multiple levels of consistency, each with benefits and limitations.
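
As a small illustration, cqlsh lets you inspect and change the consistency level used for ad-hoc queries in your session. This only affects your cqlsh session; Global Mailbox sets its own consistency levels internally, as described later in this post.

    CONSISTENCY;               -- show the consistency level currently in effect
    CONSISTENCY LOCAL_QUORUM;  -- use LOCAL_QUORUM for subsequent queries in this session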

What are Consistency Levels?

Cassandra supports several different consistency levels.  Consistency levels determine how “synchronized” the data should be across the cluster.  Or in other terms:  how many nodes need to be running for a query to work.

Consistency levels for Cassandra are based on the replication factor and the number of data centers. For a Global Mailbox production environment, the typical deployment involves the following:

  • 2 data centers (minimum)
  • 3 Cassandra nodes in each data center (minimum), for a total of 6 across all data centers
  • Replication factor of 3 in every data center

This results in every node in every data center having a copy of all data.

What consistency levels does Cassandra support?

Here are some consistency levels that Cassandra offers and how they would work.  The deployment used in the recommended Global Mailbox production configuration is called out in each example. All examples assume the replication factor is equal to the number of nodes.

ONE

With this consistency level, only one node must be operational for requests to be successful.

For read requests, Cassandra will read data from any ONE node.  Typically, this is whichever node responds quickest. This node may not have the most up to date copy of the data being asked for.

For write requests, Cassandra will send the write request to ALL nodes but wait for only ONE node in any data center to confirm it has stored the data.

A consistency level of ONE favours availability over consistency.  Read requests could easily get stale data but as long as one server is operational, the query will work.

Examples for consistency level ONE

Total nodes across all data centers | Nodes required for consistency level ONE | Max nodes that can go down without impacting queries
2 | 1 | 1
3 | 1 | 2
4 | 1 | 3
5 | 1 | 4
6 (recommended production deployment) | 1 | 5

QUORUM

With this consistency level, Cassandra wants to ensure that a majority of nodes across all data centers have the data.

For read requests, Cassandra will read data from a QUORUM of nodes.  Typically, whichever nodes respond quickest. Each node may have a different view of the data.  The data from all nodes which responded is consolidated and conflicts are resolved by timestamp. The data with the latest timestamp is returned.

For write requests, Cassandra will send the write request to ALL nodes but wait for only a QUORUM of nodes from all data centers to confirm they have stored the data.

Assuming the replication factor is equal to the number of nodes, you can use this formula to calculate QUORUM:

(Total # of nodes across all data centers/2) + 1 then round down to the nearest whole number
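
For example, with the recommended 6 nodes across all data centers, quorum is 6 / 2 + 1 = 4.  With 5 nodes, 5 / 2 + 1 = 3.5, which rounds down to a quorum of 3.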

Examples for QUORUM

Total nodes across all data centers | Nodes required for consistency level QUORUM | Max nodes that can go down without impacting queries
2 | 2 | 0 (no fault-tolerance)
3 | 2 | 1
4 | 3 | 1
5 | 3 | 2
6 (recommended production deployment) | 4 | 2

Note that if you have 2 nodes across all data centers, quorum is 2.  In this case you have zero fault tolerance.

A consistency level of QUORUM is a good balance between consistency and availability. 

From a consistency perspective, with the recommended production deployment for Global Mailbox, Cassandra ensures that copies of the data are always stored in more than one data center.  Cassandra would guarantee 4 copies of the data.  Either 3 in DC1 and 1 in DC2, or 2 in DC1 and 2 in DC2.

From an availability perspective, with the recommended production deployment for Global Mailbox, you can lose 2 Cassandra nodes across all data centers and still satisfy a quorum of 4.  However, in a recommended 2 data center deployment, you cannot lose an entire data center and continue to be operational.  This is because quorum is 4 but there will be only 3 nodes available in the remaining data center.  If you expanded to 3 data centers with 3 Cassandra nodes each, quorum becomes 5 (9/2 +1, rounded down).  With this configuration you can lose an entire data center and still have 6 nodes available to achieve a quorum of 5.

 

LOCAL_QUORUM

This consistency level is similar to QUORUM, however it is calculated using only the replication factor and number of nodes in the LOCAL data center.  The local data center is the one where the query is initiated.  Nodes in other data centers are not required to be running for LOCAL_QUORUM to be achieved.

For read requests, Cassandra will read data from a QUORUM of nodes in the LOCAL data center.  Typically, whichever nodes respond quickest. Each node may have a different view of the data.  The data from all nodes which responded is consolidated and conflicts are resolved by timestamp. The data with the latest timestamp is returned.

For write requests, Cassandra will send the write request to ALL nodes but wait for only a QUORUM of nodes in the LOCAL data center to confirm they have stored the data.

Examples for LOCAL_QUORUM

Nodes in the LOCAL data center | Local nodes required for LOCAL_QUORUM | Max local nodes that can go down without impacting queries | Max nodes in OTHER data centers that can go down without impacting queries
2 | 2 | 0 (no fault-tolerance) | ALL nodes
3 (recommended deployment of 3 nodes per data center) | 2 | 1 | ALL nodes
4 | 3 | 1 | ALL nodes

Note that if you have 2 nodes in your local data center, local quorum is 2.  In this case you have zero fault tolerance.

A consistency level of LOCAL_QUORUM is a good balance between consistency and availability.  It offers more availability than QUORUM because only nodes in the local data center are required.  It offers better consistency than ONE because it guarantees more copies of the data.  It offers less consistency than QUORUM because data is only guaranteed to be stored in the local data center.

 

EACH_QUORUM

This consistency level is similar to LOCAL_QUORUM.  The difference is that Cassandra must achieve LOCAL_QUORUM in EACH data center.  No data centers can be down.

For read requests, Cassandra will read data from a QUORUM of nodes within EACH data center. The data from the nodes is consolidated and conflicts are resolved by timestamp. The data with the latest timestamp is returned.

For write requests, Cassandra will send the write request to ALL nodes but wait for a QUORUM of nodes in EACH data center to confirm they have stored the data.

Examples for EACH_QUORUM

Nodes in DC1 | Nodes in DC2 | Nodes required for LOCAL_QUORUM in DC1 | Nodes required for LOCAL_QUORUM in DC2
2 | 2 | 2 | 2
3 (recommended deployment of 3 nodes per data center) | 3 | 2 | 2
4 | 4 | 3 | 3
4 | 3 | 3 | 2

A consistency level of EACH_QUORUM favours consistency over availability.  Data is guaranteed to be stored in all data centers but the system cannot survive the full outage of a single data center.

 

What Consistency Level does Global Mailbox use?

Global Mailbox uses several Cassandra Consistency Levels depending on configuration and the function being used.

Global Mailbox is divided into different areas of functionality, and each area uses the consistency level that suits it best.

Protocol Adapters

Protocol adapters handle the actions that trading partners perform to upload and download files.  For example, using FTP GET to download a file or using the My File Gateway UI to upload a file for routing.

By default, the system is configured for “asynchronous” or “delayed” replication. In this mode, the system uses LOCAL_QUORUM to ensure high availability and some data consistency within the data center.  But requests are not dependent on any other data center.

The system can be configured to use “synchronous” or “immediate” replication.  In this mode, the system guarantees the data is in a majority of data centers.  This is achieved using QUORUM consistency in Cassandra.

Command Line Utilities

Command line utilities such as the appConfigUtility use LOCAL_QUORUM.

Global Mailbox UI

The Global Mailbox User interface for managing mailboxes uses LOCAL_QUORUM.

Scheduled Jobs

Scheduled jobs need to operate on the most up to date data, so they use EACH_QUORUM to ensure the most recent data is always returned.

 

What happens when a Cassandra node is down?

If a Cassandra node is down, it cannot count toward the number of nodes required for the configured consistency level, which may determine whether a query succeeds or fails.

This node will not receive data inserts, updates, or deletes and will become out of sync with the other nodes.
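
To see which nodes Cassandra currently considers up (UN) or down (DN), you can run nodetool status. The path below assumes the bundled Cassandra install layout used elsewhere in this post; adjust it for your environment.

    GM_dir/apache-cassandra/bin/nodetool status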

Cassandra uses two approaches to synchronize a node that has been down.

Hinted handoff

When a query runs, one node acts as the coordinator, responsible for coordinating that request across the other nodes in the cluster; a different node can be the coordinator for each query.  The coordinator keeps track of which nodes are up or down.  For nodes that are down, the coordinator keeps a cache of the data changes that could not be delivered to them.  These cached changes are called “hints”. When a node comes back up, the stored hints are replayed to it so it can apply the changes it missed to its local copy of the data.

Hints are cached for 3 hours by default.  If the node is down for longer, the changes cached by the coordinator will be purged and not applied when the node comes back up.  The node will need to be “repaired” instead.

For more information on hinted handoff, see the Cassandra documentation.

Repairs

Repair is an administrative process which resynchronizes data to a node that has been down for a longer period.

Global Mailbox includes an open source software package called the Cassandra Reaper which constantly performs repairs.  The Reaper starts when you start Cassandra.

The reaper divides the data up into segments and resynchronizes the data segment by segment.  The reaper has a web UI and a command line tool to show the status of repairs of the Cassandra cluster.  See the Global Mailbox documentation on the Reaper for more information.

How do deletes work in Cassandra?

Because of the distributed nature of Cassandra, deletes don’t happen immediately.  Instead, a special marker is created to mark the data for deletion.  This marker is called a tombstone.  Tombstones are filtered out of result sets so the deleted data does not appear.  This filtering has a performance cost.

To prevent performance issues, Cassandra puts a threshold on the number of tombstones.  If there are more tombstones than this threshold, the query will return an error because it will take too much overhead to remove the tombstones.  The default threshold is 100,000 tombstones. The tombstone threshold can be configured in cassandra.yaml.

By default, tombstones remain in Cassandra for 10 days before being removed by the compaction process.  Each table can be configured to have a different time to keep tombstones. This configuration is called gc_grace_seconds.
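
For reference, gc_grace_seconds is a regular table property that can be inspected with DESCRIBE TABLE and changed with ALTER TABLE in cqlsh. The table name below is the illustrative one from earlier in this post; do not change settings on Global Mailbox tables unless directed by IBM support.

    -- Illustrative only; 864000 seconds is the Cassandra default of 10 days
    ALTER TABLE example_mailbox.messages WITH gc_grace_seconds = 864000;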

If the application’s interaction pattern has a lot of deletes, this can be a problem. Global Mailbox uses special data structures to avoid tombstone problems.  However, there is a limitation on the number of messages which can be deleted from a mailbox.  If you delete more than 100,000 messages from a mailbox within a 2 day period (the gc_grace period for the message tables), the mailbox will not be functional due to too many tombstones.

How to avoid Zombies?

Conflicting changes can cause deleted items to come back to life so it’s important to keep the system healthy to avoid this.

Consider a situation where Cassandra is deployed across 2 data centers.  There is a network partition which prevents the data centers from talking to each other.  At 1:00pm, someone updates some data in DC1.  At 1:30pm, someone deletes the same data in DC2.  The delete causes a tombstone.

These changes cannot be replicated due to the network partition, and a conflict is created.  Since conflicts are resolved by timestamps, the delete should win in this case and the data should be deleted.  However, if a compaction happens in DC2 and the tombstone is removed before a repair happens, the data will re-appear.  This is called a Zombie:  data that should be deleted yet re-appears in the system.

To avoid this problem, you must run repairs at least once every gc_grace_seconds.  This will ensure that all data is synchronized before tombstones are removed.

Fortunately, Global Mailbox includes the Cassandra Reaper which is constantly running repairs to minimize the chances of having zombies.
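
If you ever need to trigger a repair by hand rather than waiting for the Reaper, nodetool provides a repair command. Again, the path assumes the bundled install layout, and the Global Mailbox documentation should be your guide for when and where to run it.

    GM_dir/apache-cassandra/bin/nodetool repair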

What do I do if a Cassandra node is corrupted and won’t start?

In a production deployment it is recommended to have 3 Cassandra nodes in each data center.  By default, the replica count is the same as the number of nodes.  This means that every node has a copy of all data.

Due to the high redundancy of data and the default consistency level of LOCAL_QUORUM, fixing a corrupted node is relatively easy.   You can follow the instructions on synchronizing a corrupted node, and the data from the other nodes in the cluster will be copied to the corrupted node.

What is the Cassandra Reaper?

Global Mailbox includes an open source software package called the Cassandra Reaper which constantly performs repairs to resynchronize nodes.  The reaper starts when you start Cassandra.

The reaper divides the data up into segments and resynchronizes the data segment by segment.  The reaper has a web UI and a command line tool to show the status of repairs of the Cassandra cluster.  See the Global Mailbox documentation on the Reaper for more information.

How do I download and install Cassandra?

Cassandra is included with the media for B2Bi/SFG.  When you download the product you’ll see two Installation Manager repositories:  b2birepo and gmrepo.  gmrepo contains Cassandra, ZooKeeper, and other add-ons for B2Bi/SFG.  Use IBM Installation Manager to install Cassandra from gmrepo.

Before installing, ensure you plan out your installation.  The Cassandra installer will ask you several questions.  Ensure you have the following information available at installation time:

  • List of hostnames of each Cassandra node
  • Data Center names: Cassandra needs to identify which data center each node is in.  Data center names are logical names and don’t need to correspond to any physical network.  Often the data center name is the city the data center is located in.
  • Rack names: Cassandra needs to identify which rack a node is located on within the data center.  To increase data availability, Cassandra ensures that copies of data are spread across multiple racks within the data center.  Rack names are also logical.  If you don’t have a naming convention for your racks, a typical convention is rack1, rack2, rack3, with each node on its own rack.
  • Ports: which port numbers to use for the various Cassandra protocols.

Be sure to follow the firewall considerations to open ports needed for Cassandra within and across data centers.

How do I uninstall Cassandra?

Use IBM Installation Manager to uninstall Cassandra.  Installation Manager will remove all product binaries from disk and remove Cassandra from the list of installed applications.  User-generated files, such as the Cassandra files that contain the database data, will remain on the file system.  If you wish to reinstall, clear out the user-generated data from the install location.

How do I query the Cassandra database?

Cassandra has a query language similar to SQL called CQL.  Due to Cassandra’s distributed nature, there are many limitations on the queries you can perform.  Some key limitations:

  • No joins
  • No ‘like’ queries
  • Limitations on which columns can be in your where clause

For details on CQL, see the Cassandra documentation.

Cassandra provides a command line utility called cqlsh to perform queries.  You can start the utility by running GM_dir/apache-cassandra/bin/cqlsh <cassandra hostname> <cassandra native transport port>.  The default port is 9042.

Once cqlsh is started, you can run queries.  Ensure each query ends with a semicolon.  An example query to view replication servers is select * from gatekeeper.replication_servers;

In this query, gatekeeper is the keyspace and replication_servers is the table.
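
A few other read-only cqlsh commands are handy when exploring. The gatekeeper keyspace is the one from the example above; the keyspaces and tables available will vary by installation and version.

    DESCRIBE KEYSPACES;                             -- list all keyspaces
    DESCRIBE TABLE gatekeeper.replication_servers;  -- show the table definition
    SELECT * FROM gatekeeper.replication_servers LIMIT 10;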

For details on using CQLSH see the Cassandra documentation.

Can I install a different version of Cassandra?

No, you must use the version of Cassandra that’s included with the product.

Who provides support for Cassandra when used with Global Mailbox? 

IBM supports Cassandra when used with Global Mailbox.  If you have a problem with Cassandra in your Global Mailbox deployment, contact IBM support.

How do I get fixes for Cassandra? 

IBM provides fixes for Cassandra via Fix Central.

Why did my deleted data come back? 

Deleted data can be resurrected if you had a Cassandra outage and the Cassandra Reaper wasn't running correctly. 

The Reaper performs repairs which will synchronize the delete request.  If that delete request is not synchronized within a certain period of time, the data will not be deleted as expected.

It's very important that the Reaper runs and performs repairs regularly.


#SupplyChain
#B2BIntegration