Global Mailbox: Purging of data
This is part of a multi-part series on IBM Sterling Global Mailbox.
What is Global Mailbox?
IBM Sterling Global Mailbox helps companies address demands for high availability operations and redundancy with a robust and reliable data storage solution available across geographically distributed locations. It is an add-on to Sterling B2B Integrator and Sterling File Gateway.
Global Mailbox leverages IBM and open source technology to replicate data across multiple data centers (DCs), helping to minimize downtime during planned and unplanned outages.
This article explains, at a high level, how data is automatically purged from Global Mailbox.
What data is stored in Global Mailbox?
Global Mailbox stores the following objects in Cassandra:
- Virtual Roots
- Messages
- Payloads
- Event Rules (Routing Rules)
- Events (raised by an Event Rule)
- Other miscellaneous data
Other data, such as trading partner configurations, remains in the B2Bi/SFG relational database and is not replicated or directly accessed by Global Mailbox.
Overview of purging
As explained in a previous blog post, Global Mailbox uses Apache Cassandra to store meta-data related to mailboxes. Cassandra replicates this data across data centers to ensure a consistent view of mailboxes no matter which data center a partner connects to. The replicated mailbox data links together multiple, separate B2Bi/SFG clusters.
Cassandra's distributed nature comes with some limitations, one of which is query flexibility: Cassandra cannot query by arbitrary columns, because queries must be designed around each table's partition keys. How does this relate to purge? The first step of any purge is to identify which objects need purging. For example, the Purger might need to search for objects by creation date, and depending on how the Global Mailbox data structures are designed, that query may not be possible.
To solve this problem, Global Mailbox implements a special queue-like data structure in Cassandra. Objects are added to a purge queue, and the Purger uses this queue to identify which objects need purging.
Each purge queue is represented by two tables in Cassandra:
The purge queue table
When an object becomes a candidate for purge, the object’s ID is added to this table. For example, Message IDs will be added to the Message purge queue table when they become eligible for purge.
The purge queue is broken up into shards. This allows the Purger to start multiple threads to purge more objects at the same time. Each shard is broken into buckets identified by a time window. For instance, there is a bucket for Jan 1, 2023, 7:58 am containing all the candidates added to the queue during that 1-minute window.
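To make this concrete, here is a minimal sketch of how a candidate might map to a shard and a 1-minute bucket. The shard count, the hashing, and the key layout are my own assumptions for illustration, not the actual Global Mailbox schema.

```python
from datetime import datetime, timezone

SHARD_COUNT = 8  # hypothetical number of shards

def queue_key(object_id: str, event_time: datetime) -> tuple[int, datetime]:
    # A real system would use a stable hash; Python's hash() is salted per run.
    shard = hash(object_id) % SHARD_COUNT
    bucket = event_time.replace(second=0, microsecond=0)  # truncate to the minute
    return shard, bucket

# A message that became purge-eligible at 07:58:42 lands in the 07:58 bucket.
print(queue_key("msg-123", datetime(2023, 1, 1, 7, 58, 42, tzinfo=timezone.utc)))
```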
The queue pointer table
This table keeps track of the position in the queue for each shard. It’s essentially a pointer in time. Each time the Purger runs, it starts at this position and works its way forward in time in the queue.
To avoid missing objects in the queue, the Purger peeks backwards in time during each run of the purge job. The length of this look-back window is configurable.
Each object type has a Scheduled Job that runs the purging process at a regular interval. The job starts one thread per shard, and each thread starts purging at the point in time identified by its shard's pointer.
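Putting the pieces together, here is a small in-memory sketch of one shard's purge pass. The names, the dictionary standing in for Cassandra, and the five-minute look-back are assumptions; the real job reads buckets from Cassandra and persists the pointer to the queue pointer table.

```python
from datetime import datetime, timedelta, timezone

LOOK_BACK = timedelta(minutes=5)          # assumed peek-back window (configurable)
buckets: dict[datetime, list[str]] = {}   # bucket start time -> candidate IDs

def purge_shard(pointer: datetime, now: datetime) -> datetime:
    # Start slightly behind the stored pointer so late-arriving candidates in
    # already-visited buckets are not missed.
    bucket = (pointer - LOOK_BACK).replace(second=0, microsecond=0)
    while bucket <= now:
        for candidate_id in buckets.get(bucket, []):
            print("double-check and purge", candidate_id)  # stand-in for real work
        bucket += timedelta(minutes=1)
    return now  # new pointer position, persisted to the queue pointer table

new_pointer = purge_shard(datetime(2023, 1, 1, 7, 0, tzinfo=timezone.utc),
                          datetime(2023, 1, 1, 8, 0, tzinfo=timezone.utc))
```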
Purging of Expired Messages
An Expired Message is a Message that can no longer be extracted. This can happen for one of two reasons (see the sketch after this list):
- The Message is using an extraction counter and the counter is 0
- The Message has an “extract until date” and the date is in the past
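In sketch form, the two checks might look like this; the field names are my own assumptions, not the product's actual API:

```python
from datetime import datetime
from typing import Optional

def is_extractable(extract_count: Optional[int],
                   extract_until: Optional[datetime],
                   now: datetime) -> bool:
    if extract_count is not None:
        return extract_count > 0      # counter-based: extractable while above zero
    if extract_until is not None:
        return now < extract_until    # date-based: extractable until the date passes
    return True                       # no limit configured on this message
```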
By default, messages are kept for 30 days after they become unextractable, then they are purged. This value is configurable by setting the com.ibm.mailbox.message.expired.ttl property in the global.properties file.
Messages using an extraction counter
Candidates for purge are added to a purge queue called the Expired Message Purge Queue when the message's extraction counter reaches zero. Remember that the purge queue is broken up into time-based buckets of 1-minute windows; in this case, the message is added to the bucket for the time at which the extraction counter became zero.
If the extraction counter is changed after it reaches zero (perhaps an administrator changes it from 0 back to 1 to allow re-extraction), the message is NOT removed from the queue. As a result, the purge queue can contain messages that should no longer be purged, and a message can even appear in the queue multiple times if its counter reaches zero more than once. The purge logic is designed to handle both situations by doing more detailed checks at purge time.
The ExpiredMessagePurgeJob runs every 10 minutes to process the Expired Message purge queue.
First, the purge job starts one Purger for each shard of the queue. The Purgers work simultaneously, each one focusing on its own shard. Each Purger starts at the stored purge pointer and moves forward in time, updating the pointer as it moves from bucket to bucket. Remember that by default expired messages are kept for 30 days, so each Purger only considers candidates in buckets with timestamps older than 30 days (see the configuration parameter above).
As a Purger finds candidate messages, it double-checks that each message really is no longer extractable. For messages with extraction counters, it verifies that the counter is actually zero. If so, the message is deleted from the system; if not, the counter was changed from zero back to a positive value, and the message is left alone.
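Here is a minimal sketch of that purge-time double-check, with an in-memory dictionary standing in for Cassandra and hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class Message:
    message_id: str
    extract_count: int

def process_candidate(store: dict[str, Message], message_id: str) -> None:
    message = store.get(message_id)
    if message is None:
        return                    # duplicate queue entry: already purged earlier
    if message.extract_count == 0:
        del store[message_id]     # still unextractable, so purge it
    # otherwise the counter was raised after enqueueing; leave the message alone
```

Because eligibility is verified at purge time, stale or duplicate queue entries are harmless: they simply result in a no-op.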
Messages using an “extract until date”
Candidates for purge first pass through an intermediary queue called the Expiring Message Purge Queue, then enter the main Expired Message Purge Queue.
The intermediary queue is responsible for looking at the “extract until date” to determine if the message is no longer extractable.
Candidate messages are added to the Expiring Message Purge Queue when their “extract until date” is set or changed. Remember that the purge queue is broken up into time-based buckets of 1-minute windows; in this case, the message is added to the bucket for the time of the “extract until date”. This typically means the bucket will be in the future. For example, if today is Jan 1, 2023 and I set the “extract until date” on a message to Dec 31, 2023 at 7am, it is added as a candidate for purge in the Dec 31, 2023 07:00 bucket.
The ExpiringMessageProcessJob runs every 10 minutes to process this intermediary queue. It processes buckets from the previous day or earlier, starting at the point in time indicated by the queue pointer. Each candidate is checked to ensure that it is no longer extractable.
If the message is still extractable, it is not purged.
If the message is no longer extractable based on its dates, it is added to the Expired Message Purge Queue and processed as described above by the ExpiredMessagePurgeJob, which double-checks that the message is still unextractable before purging it from the system.
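A sketch of the two-stage flow, again with hypothetical names: the intermediary job promotes candidates whose date has passed into the main queue, where the ExpiredMessagePurgeJob later applies the retention period and the final double-check.

```python
from datetime import datetime

def promote_expired(expiring: list[tuple[str, datetime]],
                    expired: list[tuple[str, datetime]],
                    now: datetime) -> None:
    still_waiting = []
    for message_id, extract_until in expiring:
        if extract_until <= now:
            expired.append((message_id, now))   # enters the main purge queue
        else:
            still_waiting.append((message_id, extract_until))
    expiring[:] = still_waiting                 # future-dated candidates stay queued
```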
Note: The payloads associated with the message are not purged at the same time. See below for details.
Purging of Incomplete Messages
An Incomplete Message is a message whose upload was interrupted and never finished.
These messages must be purged so that the system is not overloaded with incomplete messages.
By default, incomplete messages are kept for 2 days. This is configurable by setting the com.ibm.mailbox.message.incomplete.ttl property in the global.properties file.
Similar to expired message purging, candidates are added to a queue named the Message Purge Queue and processed in the same way by the MessagePurgeJob.
Purging of Payloads
Remember that the payload is the file content associated with a Message. There are two types of payloads:
- Small payloads (less than 10 KB), which are stored in Cassandra. These are called inline payloads.
- Larger payloads (greater than 10 KB), which are stored on disk and replicated to each data center via the Aspera FASP protocol.
See the blog post Global Mailbox: An introduction to replication for more details.
Also note that one payload can be associated with multiple messages. For example, you might want the same file content associated with multiple mailboxes. This could be done by creating a single Payload and having a message in each mailbox link to this shared Payload.
Payloads are only purged when no messages refer to them.
Purging Inline Payloads
Inline payloads are stored in the same Cassandra row as the message itself. Therefore, when the message is purged or deleted, the payload is also deleted. This means there is no purge process for inline payloads.
Purging Payloads stored on disk
Payloads on disk are kept for 12 hours by default. You can configure this value by running the following statements in CQLSH, replacing '12' with your preferred number of hours:

```
USE scheduler;
UPDATE trigger SET criteria_map={'PAYLOAD_PURGE_WAIT_TIME_IN_HOURS':'12'} WHERE name = 'PayloadPurgeTrigger';
```
Payloads stored on disk are replicated to each data center, and each data center is responsible for purging its own copies. To support this, the purge queue is broken up into smaller queues, one per data center. Each of these smaller queues is also sharded to increase the number of payloads that can be purged simultaneously.
Candidate payloads are added to a purge queue called the Payload Purge Queue when an associated message is deleted (even if other messages still refer to the payload). Candidates are added to the queue of every data center, ensuring the payload is evaluated for purge everywhere.
Remember that the purge queue is broken up into time-based buckets of 1-minute windows. In this case, the candidate payload is added to the bucket for the time at which the message was deleted.
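As a rough sketch, assuming a hypothetical two-data-center deployment: when a message is deleted, its payload's ID is enqueued once per data center, in the bucket for the deletion time.

```python
from datetime import datetime

DATA_CENTERS = ["DC1", "DC2"]   # hypothetical deployment

def enqueue_payload(queues: dict[tuple[str, datetime], list[str]],
                    payload_id: str, deleted_at: datetime) -> None:
    bucket = deleted_at.replace(second=0, microsecond=0)        # 1-minute bucket
    for dc in DATA_CENTERS:
        queues.setdefault((dc, bucket), []).append(payload_id)  # one entry per DC
```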
Within each data center, the PayloadPurgeJob runs every 10 minutes to process the payload purge queue for the local data center.
First, the purge job starts one Purger for each shard of the queue. The Purgers work simultaneously, each one focusing on its own shard. Each Purger starts at the recorded purge pointer and moves forward in time, updating the pointer as it moves from bucket to bucket. Remember that by default payloads are kept for 12 hours, so each Purger only considers candidates in buckets with timestamps older than 12 hours (see the configuration parameter above).
As a Purger finds candidate payloads, it double-checks whether any other messages are still using the payload. If not, the payload is deleted from the local disk. If other messages are still using it, the payload cannot be purged at this time.
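In sketch form, with a hypothetical reference count standing in for the real check against Cassandra:

```python
def process_payload_candidate(payload_id: str,
                              reference_counts: dict[str, int]) -> bool:
    if reference_counts.get(payload_id, 0) == 0:
        print("deleting", payload_id, "from local disk")  # stand-in for file delete
        return True
    return False    # another message still shares this payload; skip for now
```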
Purging of Events
Global Mailbox uses Event Rules to trigger processing of files in mailboxes. When Global Mailbox raises an event for a message, it tracks the status of that event in Cassandra tables. Events can have a status of UNPROCESSED, PROCESSING, COMPLETE, or FAILED.
To ensure the database does not grow unbounded, there is one purge job for each status: UnprocessedEventPurgeJob, ProcessingEventPurgeJob, CompletedEventPurgeJob, and FailedEventPurgeJob.
Similar to purging of other objects, events have a purge queue. Event purge jobs will purge events using the same technique as the other purge jobs described earlier.
Purging of Virtual Roots and Mailbox Permissions
A virtual root defines which mailbox a user sees as their main mailbox when logging in. Virtual roots are stored in a table in the Cassandra database. Additionally, each user can be assigned a set of permissions on any mailbox; these permissions are also stored in a Cassandra table.
Users themselves are not stored in Global Mailbox and are managed outside the scope of Global Mailbox.
When a user is created in B2Bi/SFG, that user is not replicated to the other DCs. User information must be created or deleted in each data center separately to ensure consistency of partner information across the entire deployment.
If you remove a user from one data center, it’s not removed from other data centers. Since the user could still exist in other data centers, removing a user does not directly remove the user’s virtual root in Global Mailbox. Instead, there is a scheduled job named UserPurgeJob which ensures that data related to a user is purged.
The UserPurgeJob works differently than the other jobs: it does not have a queue of object IDs to purge. Instead, it makes a REST API call to B2Bi/SFG to get a list of all users in the system and compares that list to the Virtual Roots in Cassandra. If a Virtual Root belongs to a user that is not in the list returned from the REST API, the Purger removes that virtual root. It then checks all of the mailbox permission records and removes any record belonging to a user that is not in the list.
This job runs every 7 days by default.
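In sketch form, with simple in-memory structures standing in for the Cassandra tables and the REST API response:

```python
def purge_orphaned_user_data(api_users: set[str],
                             virtual_roots: dict[str, str],
                             permissions: list[tuple[str, str]]) -> None:
    for user in list(virtual_roots):
        if user not in api_users:
            del virtual_roots[user]             # orphaned virtual root
    permissions[:] = [(user, mailbox) for user, mailbox in permissions
                      if user in api_users]     # drop orphaned permission records
```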
Occasionally, IBM Support receives cases about Virtual Roots disappearing from systems. This can happen when the user/partner has not been created in all data centers. For example, if you create a partner in DC1, but not in DC2, this will cause the UserPurgeJob to purge the virtual root for that partner.
It is important that partner data is manually synchronized across data centers to prevent the accidental purging of user-related data in Global Mailbox.
Summary
Purging in Global Mailbox works a bit differently than in B2Bi/SFG: specialized purge queues in Cassandra identify purge candidates, and scheduled purge jobs run regularly to process those queues and remove the data.