
Using Mirror Maker 2 with IBM Event Streams to create a failover cluster

By Dale Lane posted Mon April 08, 2024 05:30 PM

This is the fourth in a series of blog posts sharing examples of ways to use Mirror Maker 2 with IBM Event Streams.

Mirror Maker 2 is a powerful and flexible tool for moving Kafka events between Kafka clusters.

For this fourth post, I’ll look at using Mirror Maker to create an active/passive topology, with a backup cluster ready to fail over to.

This is a different type of demo from the last three. In previous posts, I’ve shown how to use MM2 as a part of your topology, illustrating the benefits of mirroring to optimize normal day-to-day operation.

In this post, I’ll show how to use MM2 to create a passive backup environment, only to be used in the event of the loss of the active primary environment.

Demo

For a demonstration of this, I created a three-region version of this pattern:

Three Kubernetes namespaces (“north-america”, “south-america”, “europe”) represent three different regions.

The “North America region” represents the primary, active environment for the Kafka cluster.

Applications run in the “South America region” and produce to and consume from topics in the Kafka cluster.

As with previous posts, the producer application is regularly producing randomly generated events, themed around a fictional clothing retailer, Loosehanger Jeans.

In the background, Mirror Maker 2 is maintaining a passive mirror of the Kafka cluster in the “Europe region”.

After simulating the loss of the “North America region”, applications switch to using the “Europe region”.

The applications are able to resume more or less where they left off before the failover, enabling our fictional Loosehanger Jeans business to continue.

To create the demo for yourself

Setting everything up

There is an Ansible playbook here which creates the first stage of this:
github.com/dalelane/eventstreams-mirrormaker2-demos/blob/master/05-active-passive/setup.yaml
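To give a flavour of what such a playbook looks like, here is a minimal, hypothetical sketch of the kind of task it might contain. The kubernetes.core.k8s module is standard Ansible; the task shown is illustrative and not copied from the demo.

# illustrative only: create the three namespace "regions"
- name: Create namespaces to represent the regions
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: v1
      kind: Namespace
      metadata:
        name: "{{ item }}"
  loop:
    - north-america
    - south-america
    - europe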

An example of how to run it can be found in the script at:
setup-05-active-passive.sh

This gets you to this state:

This script will also display the URL and username/password for the Event Streams web UIs for the “North America” and “Europe” regions, to make it easier to log in and see the events.

Once you’ve created the demo, you can run the consumer-southamerica.sh script to see the events being received by the consumer application in the “South America region”.

If you log in to the Event Streams web UI for the active cluster in the “North America region”, you will see information about the consumer application listed there.

You should see that the offset lag for every partition is 0 (or close to it) as the consumer is running, and keeping up with the events on each of the LH topics.

If you log in to the Event Streams web UI for the passive cluster in the “Europe region”, you will see the same consumer application listed there as well.

This isn’t a separate application: there is no consumer actually running against the “Europe region”.

For this scenario, Mirror Maker is mirroring the state of consumer applications as well as topics – this is a mirrored record of the same application consuming from the “North America region”.

Notice that the offset lag shown for each partition will likely be anywhere up to 25 or so.

This is because, for this demo, Mirror Maker is configured to wait until the application’s offset has increased by 25 (since the last time it was sync’ed) before sending an update.
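In Mirror Maker 2 terms, this behaviour corresponds to the source connector’s offset.lag.max setting, combined with the checkpoint connector syncing consumer-group offsets to the target cluster. As a hedged sketch, in the style of the demo’s mm2.yaml rather than the file itself:

mirrors:
  - sourceCluster: north-america
    targetCluster: europe
    sourceConnector:
      config:
        # how far replication can lag before a new offset-sync
        # record is emitted (the Kafka default is 100)
        offset.lag.max: 25
    checkpointConnector:
      config:
        # write translated consumer-group offsets to the
        # passive cluster, ready for a failover
        sync.group.offsets.enabled: "true"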

Simulating a disaster

In this demo, Kubernetes namespaces are being used to represent regions. To represent the total loss of the “North America region”, you can delete the north-america namespace.

oc delete project north-america

Failing over to the “Europe region”

There is an Ansible playbook which performs the failover.

This playbook reconfigures both the producer and consumer applications to resume what they were doing, using the Kafka cluster in the “Europe region”: github.com/dalelane/eventstreams-mirrormaker2-demos/blob/master/05-active-passive/failover.yaml
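At its heart, the failover is about repointing the applications at the passive cluster. A hypothetical fragment of the kind of change that gets applied to each application’s Deployment (the variable name and address are illustrative, not the demo’s actual resources):

# illustrative only: switch the clients' bootstrap address
# from the lost north-america cluster to the europe mirror
env:
  - name: KAFKA_BOOTSTRAP_SERVERS
    value: europe-kafka-bootstrap.europe.svc:9092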

An example of how to run it can be found in the script at:
setup-05-failover-to-passive.sh

This gets you to this state:

Now that there really is an application consuming from the Kafka cluster in the “Europe region”, the Event Streams UI for the “Europe region” will show a live view of the consumer.

After the failover, you can run the consumer-southamerica.sh script again to see the events received by the consumer application.

Observing what happened

You can use the output from the consumer application before and after the failover to see what happened for yourself.

Filtering the logs for a specific topic name can help make it easier to follow. For example, when I grep’ped my logs to show only lines with LH.ORDERS, I got this for the end of the consumer log immediately before the failover:

And I got this when I grep’ped the start of the consumer log immediately after the failover:

I’ve colour-coded the rows to make it easier to see what happened.

Green : messages that were produced to the “North America region”, and consumed from the “North America region”

Red : messages that were produced to the “North America region”, and consumed twice – from the “North America region”, and then again from the “Europe region”

Blue : messages that were produced to the “Europe region”, and consumed from the “Europe region”.

Roughly 25 events (29, in this case) were processed twice: once when they were consumed from the “North America region”, and then again when they were consumed from the “Europe region”.

However, if you check the topic itself, you should see that there are no duplicate events on it, as Mirror Maker has been configured to avoid this:

# avoid mirroring duplicate events on restarts or errors
exactly.once.source.support: enabled

(from the demo’s mm2.yaml)


Understanding what happened

Here is a simplified illustration of what the logs show. To keep the diagrams simple, I’ve reduced the number of events.

At the point where the disaster happens and the “North America region” is lost, imagine that this was the state:

The producer has sent events about these orders to the LH.ORDERS topic:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10

There are two orders that the consumer hasn’t yet processed. It has so far only processed orders:
1, 2, 3, 4, 5, 6, 7, 8

It has committed an offset, so the Kafka cluster in the “North America region” has a record that it has processed these orders.

Mirror Maker is mirroring the order events to the “Europe region”. It has mirrored the events for orders:
1, 2, 3, 4, 5, 6, 7, 8, 9

Mirror Maker hasn’t mirrored the event for order 10 yet. It was just about to, before the region was lost.

Mirror Maker mirrored the consumer application’s consumer offset when it was up to the offset for order 5. It was waiting for the lag to be higher before it updated this again.

Then the region was lost, and everything failed over to the “Europe region”.

Everything started running again, and after a brief time, things looked like this:

The producer started sending events about new orders:
11, 12, 13, 14

The consumer started consuming from the offset that was mirrored before the failover, so it began by re-processing orders 6, 7 and 8, and then processed order 9 (which it hadn’t got to before the failover). Then it carried on with the new orders: 11, 12, 13, 14.

What is this meant to illustrate?

Impact on the producer: None
It was able to resume producing events to the topic on the failover cluster.

Impact on the consumer: Some events were re-processed
Consumer offsets are mirrored periodically. Events since the last time the offset was mirrored will be re-processed, because the application doesn’t have a record of having already processed them. The number of events that are likely to be re-processed is related to the offset-lag threshold discussed above.

Impact on the topic: An order event was lost.
The event for order 10 was produced just before the active region was lost, so there wasn’t time for it to be mirrored.
(I didn’t see this happen in my logs – as far as I can see, I got lucky this time and everything was mirrored in time. But I’m mentioning it as it is possible.)

Overall, my aim here is to highlight the limitations of asynchronous, background mirroring of a constant stream of events. There will always be some data from around the time of the loss of an active region that will not yet have been mirrored.

However, it does show that mirroring is an effective way to enable business continuity in the event of a disaster, where applications can be designed to tolerate re-processing of events.

How the demo is configured

The Mirror Maker config can be found here: mm2.yaml.

The spec is commented, so that is the main file to read if you want to see how to configure Mirror Maker for this kind of scenario.
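For orientation before you read it, the overall shape of a KafkaMirrorMaker2 resource for this kind of active/passive scenario is roughly as follows. This is a hedged sketch assuming the Strimzi-style API that Event Streams builds on; the names, addresses and patterns are illustrative rather than taken from the demo.

apiVersion: eventstreams.ibm.com/v1beta2   # assumed API group for Event Streams
kind: KafkaMirrorMaker2
metadata:
  name: mm2
  namespace: europe
spec:
  # run the Connect workers alongside the passive (target) cluster
  connectCluster: europe
  clusters:
    - alias: north-america
      bootstrapServers: north-america-kafka-bootstrap.north-america.svc:9092
    - alias: europe
      bootstrapServers: europe-kafka-bootstrap.europe.svc:9092
      config:
        # avoid mirroring duplicate events on restarts or errors
        exactly.once.source.support: enabled
  mirrors:
    - sourceCluster: north-america
      targetCluster: europe
      topicsPattern: "LH.*"   # mirror the Loosehanger topics
      groupsPattern: ".*"     # mirror all consumer groups
      sourceConnector:
        config:
          offset.lag.max: 25
      checkpointConnector:
        config:
          sync.group.offsets.enabled: "true"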

More scenarios to come

I’ve still got some more ideas of scenarios where Mirror Maker is useful, so more posts will come soon.
