IBM Event Streams and IBM Event Automation


Running an Enterprise-wide Event Backbone

By Benjamin Cornwell posted 11 days ago

  

Introduction

Event-driven architecture is a different way of thinking about your internal data flow. Rather than an application sending its data directly to another application that might be interested in it, it publishes each message to a topic, and any number of other applications can subscribe to that topic without involving the originating application.  Apache Kafka is an ideal platform to support this design for a number of reasons we won’t go into here – I’ll assume you’ve already chosen Kafka.

IBM’s Kafka offering is called Event Streams and is part of Cloud Pak for Integration, which runs on Red Hat OpenShift.  Together with the Event Endpoint Management and Event Automation platforms, it offers a number of features that make life easier and better for anyone running an event backbone, but much of this article applies to basic Kafka as well.

An event backbone isn’t just an installation of messaging software; it’s a critical piece of infrastructure, like your data platform, your network or your storage platform.  Just as you have a database architect, a storage architect and a network architect, you need a messaging architect – much as, in the old days, you would have had an ESB architect.  This role intersects with the other platform owners on an equal footing.

Design

If you’re running an event backbone in even a medium-sized business, this could be a pretty large undertaking, with many hundreds or thousands of topics, producers and consumers, and potentially high message throughput.  Kafka is not a particularly complex product in terms of how it works, but there are a few points that need to be considered here.

Thick-ish Client

A ‘thick’ client has most of the functionality on the client side and only gets basic information from the server; a modern web application is a good example. The server simply sends HTML, images and JavaScript files to the browser along with user information such as account details; the browser runs all the code, does all the work and keeps track of the user.  Twenty years ago, however, web browsers were thin clients that just displayed HTML: all the work was done on the server, which handled the user’s login context and knew who all the users were so it could keep track of what to display.

A Kafka consumer, however, exists on both the client and the server side.  A consumer is stateful, and some of that state is kept on the client and synchronised with the server.  For example, if a consumer loses its connection it needs to synchronise with the server when it reconnects to make sure it doesn’t consume the same message twice (if you’ve configured it that way).   You can have multiple consumers in the same consumer group, which means that messages from a given topic are distributed amongst the consumers and can be processed in parallel. The topic is broken up into partitions, and these partitions are shared among the consumers in the group. Here’s where this matters: if you add another consumer, remove one, or one goes down for more than a certain time, the partitions have to be rebalanced across the consumers, and that is an expensive operation. If this happens frequently, your overall performance can suffer.
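
To make that client-side state concrete, here is a minimal sketch of a consumer joining a consumer group with the standard Java client. The bootstrap address, group id and topic name are placeholders, not anything prescribed by Event Streams:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PaymentsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-bootstrap:9092"); // placeholder address
        props.put("group.id", "payments-processor");            // all instances sharing this id split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");               // commit offsets only after processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));            // placeholder topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // process the message here
                }
                consumer.commitSync();  // the committed offset is the state shared with the brokers
            }
        }
    }
}
```

Every instance started with the same group.id joins the same group; adding or removing an instance is exactly the rebalance described above.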

Topic configuration affects performance

A topic is divided into partitions – this allows multiple consumers to consume messages in parallel.  A partition can have multiple replicas, which are distributed across the brokers in the cluster. One replica is the leader and the others are followers. A producer sends its messages to the leader, and the followers must synchronise with the leader.

If you need to process a high volume of messages, or if your messages take a long time to process, then you might need a lot of parallel consumers and hence a lot of partitions.  If you don’t, then you don’t.  The key point here is that each partition has to be managed by the cluster, and if you have too many partitions you are placing extra load on it.

The upshot of this is that each topic should have its own configuration that is suitable for its workload, and your solution should support that.
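
As a rough illustration of per-topic configuration, here is how two very different topics might be created with the standard Kafka Admin API. The topic names, bootstrap address, partition counts and retention value are only illustrative, and on IBM Event Streams you would more likely declare the same settings in a KafkaTopic custom resource:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-bootstrap:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // High-throughput topic: many partitions for parallel consumers, three replicas for availability
            NewTopic payments = new NewTopic("payments", 24, (short) 3);

            // Low-volume topic: one partition and one replica is enough and costs the cluster far less
            NewTopic stockLevels = new NewTopic("stock-levels", 1, (short) 1)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "3600000")); // keep messages for one hour

            admin.createTopics(List.of(payments, stockLevels)).all().get();
        }
    }
}
```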

Consumer Performance Affects Cluster Performance

Producers produce at their own rate; consumers consume at theirs. This means that consumers can lag behind producers. This isn’t necessarily a problem in the short term – Kafka can handle it – but in the long term there could clearly be an issue if consumers never catch up. However, there is another caveat.

When a message is produced and a broker receives it, the broker writes it to disk, but this doesn’t mean it ends up on disk straight away. The operating system caches disk writes in memory, in the page cache, and flushes them to disk later.  Therefore, if a message was written only recently, it will still be in memory when a consumer reads it. After some time, however, the message will be flushed to disk and eventually evicted from the cache; from that point on, when a consumer wants to read it, the broker has to load it from disk. Clearly this is a lot slower for the consumer, but it also means the broker’s operating system has to do a lot more work.

Therefore, if consumers are slow and lag behind significantly it can end up slowing the entire system down.
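
One way to spot lagging consumers before they become a problem is to compare a group’s committed offsets with the latest offsets on each partition. A rough sketch using the Kafka Admin API, with a placeholder group id and bootstrap address:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-bootstrap:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the group has committed so far
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("payments-processor") // placeholder group id
                         .partitionsToOffsetAndMetadata().get();

            // Latest offsets actually written to each of those partitions
            Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
            committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(latestSpec).all().get();

            committed.forEach((tp, meta) -> {
                if (meta == null) return; // partition with no committed offset yet
                long lag = latest.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```

In practice you would more likely watch the same lag figures through the Event Streams UI or your monitoring stack, but the underlying numbers are these.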

Different messages have different requirements

In all message-driven systems, we must bear in mind that different messages need to be handled differently.  Some messages must be preserved, processed in order, and processed exactly once – financial transactions, for example.  If one transaction is missing, the account balance will be wrong from that point on.  If such a message goes missing and is recovered later, it still has value.  On the other hand, a message describing the current stock level of a commodity is only valid at the time it is produced and is invalidated by the next message.  If one is lost, clients will have incorrect information, but only until the next message arrives, at which point the system corrects itself.  If an old message is subsequently recovered, it has no value.

Each Kafka topic has its own message retention period.  This needs to be chosen specifically for the requirements of its messages.  If retention is too long you risk wasting storage, and consider also that a consumer that is disconnected for some time and then reconnects may start consuming from an old offset that has to be reloaded from disk, possibly wasting its time on old messages. Or it may not be wasted time at all, depending on what the messages are!
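
Retention is an ordinary per-topic setting, so it can also be tuned after a topic exists. A sketch using the Admin API’s incremental config update, with illustrative topic names and retention values (on Event Streams the same setting usually lives in the KafkaTopic custom resource):

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-bootstrap:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource payments = new ConfigResource(ConfigResource.Type.TOPIC, "payments");
            ConfigResource stock = new ConfigResource(ConfigResource.Type.TOPIC, "stock-levels");

            // Keep financial transactions for 30 days; stock-level snapshots for 15 minutes
            Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(
                payments, List.of(new AlterConfigOp(
                    new ConfigEntry(TopicConfig.RETENTION_MS_CONFIG, Long.toString(30L * 24 * 60 * 60 * 1000)),
                    AlterConfigOp.OpType.SET)),
                stock, List.of(new AlterConfigOp(
                    new ConfigEntry(TopicConfig.RETENTION_MS_CONFIG, Long.toString(15L * 60 * 1000)),
                    AlterConfigOp.OpType.SET)));

            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
```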

Storage

This one’s quite straightforward: Kafka replicates messages to different brokers, so you don’t need replicated storage. In fact, storage latency is quite critical, so having replicated storage at the cost of latency could cause issues.

Configuration Management

Environments

It’s tempting to install Kafka clusters for development, UAT and production, to match your traditional environments, and this will absolutely work.  But we can do better than that.  Remember that your event backbone is a critical service, and even your developers are depending on it. If your dev cluster is down, or slow, it could prevent developers from getting their work done, which could impact targets, deadlines and delivery capability.

Development

Development environments can be pretty chaotic.  If you have dozens of teams writing apps that need to produce and consume messages, they will need a cluster to run against.  But in dev, people often mess up their config or make mistakes that can break or overload something.  One of the benefits of running on containers is that developers can easily create their own instance of a platform, and with IBM Event Streams, Kafka is no different. You can set up a lightweight single-node cluster with ephemeral storage to test your application, then take the cluster down when you’re done.  That way, if your code goes wrong and overloads the cluster, no one else is affected.

Testing

A key part of the modern DevOps driven environment is declarative configuration – you define how you want an environment to be and how it’s built, then you run a DevOps pipeline that builds it.  Because this is an automated process, you can create a clean environment, run integration or performance tests, then take it down.  Even a UAT environment can be created this way and taken down when the testers are finished with it.  This means that you are always testing on a new clean environment, and if you overload the environment during say performance testing, you are not affecting anyone else’s work.  Likewise if your code fails and breaks something.

Because IBM Event Streams is configured purely using YAML files, these can be held in a source control repository and treated as source code.  With other Kafka platforms this can still be done, using Helm, Ansible or some other tool.

Production

In production, you are probably expecting to have one Kafka cluster for everything.  If you do this, you will need some kind of traffic management to protect the cluster from overloading, or to reduce the “blast radius” when there is a problem – see the Isolation section below for a number of things that can help.  On the other hand, it could be a great idea to have multiple clusters.  You can segregate workloads onto different hardware to increase isolation, or configure a cluster on different hardware or bare-metal nodes to reduce latency for certain critical apps; see Service Levels below for more discussion.

Topic Setup

As discussed under “Topic configuration affects performance” above, topic configuration determines message behaviour, which affects how the application behaves and the demands it places on the cluster. In other words, it’s equivalent to a part of your code.  That means that the application designers need to own the topic specification, but it needs to be validated by the messaging architect.

Client Setup

It’s more obvious that the client configuration is part of the application, but it should still be validated by the messaging architect.  Producer and consumer configuration must be appropriate for the workload.

It is important that each client connects with its own set of credentials.  Clearly this provides better security, since the credentials can be revoked individually, but it also provides the ability to restrict each application’s access to specific topics and transactional IDs.
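
On IBM Event Streams these restrictions are normally declared in a KafkaUser custom resource, but the underlying ACLs look roughly like this through the plain Kafka Admin API. The principal, topic and transactional-id prefix are invented for illustration:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class GrantProducerAccess {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-bootstrap:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // Allow the 'payments-gateway' service account to write to the 'payments' topic only
            AclBinding writePayments = new AclBinding(
                new ResourcePattern(ResourceType.TOPIC, "payments", PatternType.LITERAL),
                new AccessControlEntry("User:payments-gateway", "*",
                        AclOperation.WRITE, AclPermissionType.ALLOW));

            // Allow it to use only transactional ids starting with its own prefix
            AclBinding useTxnId = new AclBinding(
                new ResourcePattern(ResourceType.TRANSACTIONAL_ID, "payments-gateway-", PatternType.PREFIXED),
                new AccessControlEntry("User:payments-gateway", "*",
                        AclOperation.WRITE, AclPermissionType.ALLOW));

            admin.createAcls(List.of(writePayments, useTxnId)).all().get();
        }
    }
}
```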

Isolation

As mentioned under Environments above, there should be different clusters for different environments, and even for different purposes within the same environment level.  Traditional Kafka installations can be segregated on different hardware and different networks; Kubernetes-based installations can use dedicated worker nodes and separate ingress routers.

Kafka also supports throughput and request-rate quotas for individual service accounts (which should be linked to applications) so that no single application can flood the cluster.  In development we safeguard against this with dedicated ephemeral clusters, but quotas can also be used in production. If, for example, a client application is driven by an external interaction, say with an application owned by a business partner, it could still produce too many messages and overload the cluster.  Quotas can prevent this from happening.
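
As a sketch of what such a quota looks like through the standard Admin API (on Event Streams you would typically put the same limits in the KafkaUser spec), with an invented service account name and arbitrary byte rates:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

public class SetProducerQuota {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-bootstrap:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // Cap the 'partner-orders' service account at roughly 1 MB/s produced and 2 MB/s consumed
            ClientQuotaEntity entity = new ClientQuotaEntity(
                    Map.of(ClientQuotaEntity.USER, "partner-orders")); // invented service account
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(entity, List.of(
                    new ClientQuotaAlteration.Op("producer_byte_rate", 1_048_576.0),
                    new ClientQuotaAlteration.Op("consumer_byte_rate", 2_097_152.0)));

            admin.alterClientQuotas(List.of(alteration)).all().get();
        }
    }
}
```

The broker then throttles that principal rather than letting it starve everyone else, which is exactly the "blast radius" control described above.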

Service Levels

Besides isolation as a defensive tactic, the same management techniques can be used to provide different service levels.  If for example certain messages need to be processed with very low latency, they can be sent to a different cluster running on faster hardware or with a faster network.  Note that a Kubernetes or OpenShift cluster can support multiple Kafka clusters, and they can be allocated to different machine sets so they run on different or dedicated hardware.

As far as topic configuration is concerned, topic owners could create their own if they are able and you trust them to do so, but the messaging architects can provide templates for different service levels.  Topic, producer and consumer configuration can all be chosen to reflect the service level, and the service level should reflect the requirements of the application.  For example, topics could be configured for any of these purposes:

Highly parallel consumers:  Producing messages is always quick, but a consumer may need to do something relatively complex with a message that could involve interacting with another system. For example, if financial transactions are posted to a topic, a consumer might have to invoke a downstream fraud detection system for each one, which could take some time.  In that situation you need a larger number of consumer instances in the same consumer group, and to enable this you need a larger number of partitions.

The other reason for a larger number of partitions is to enable very high-speed producing, since there is a limit to how fast data can be written to a single partition.

Reduced resource consumption: If a topic does not need to be highly available and/or few messages are being produced and consumed, you may need only one partition with one replica.  This avoids committing resources to a topic when it’s not necessary.

Long lived messages: Some messages may need to be preserved for a long time, some may not.  Do not waste resources keeping messages you don’t need.  Set the message retention period appropriately for each topic.

Exactly once delivery: Kafka producers and consumers can be configured for “exactly once” semantics – meaning that each message is produced only once and consumed only once.  That sounds great, but it reduces throughput, so consider whether you really need it. As in the earlier example, if you are processing a financial transaction and debiting someone’s account, you do need exactly once. But if you’re sending each transaction to, say, a fraud detection system, then you might not – after all, does it matter if you send the same message twice to check for a red flag?  (There is a sketch of the client settings involved just after this list.)

This list is not exhaustive, but you need to have this discussion for every client application and topic.  There are many parameters that allow you to configure topics, producers and consumers for the workload; you should be familiar with them.
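
As an illustration of the exactly-once point above, this is roughly what the client side involves with the standard Java producer: idempotence plus a transactional id, with consumers reading at isolation.level=read_committed. The topic name, transactional id and bootstrap address are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ExactlyOnceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-bootstrap:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");             // no duplicates on retry
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payments-gateway-1"); // placeholder transactional id

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("payments", "acct-42", "debit:100.00"));
            producer.commitTransaction(); // consumers configured with isolation.level=read_committed
                                          // only ever see messages from committed transactions
        }
    }
}
```

The extra round trips for the transaction coordinator are where the throughput cost comes from, which is why this should be reserved for topics that genuinely need it.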

Disaster Recovery

Common practice for disaster recovery is to use geo-replication to copy messages to a different site after they have been produced to a topic on the main site.  This is done on a topic-by-topic basis. Given that there is a resource cost associated with this, you should consider which topics actually need to be replicated to the DR environment.

High availability covers availability of service and availability of data.  If a data centre becomes unavailable, your clients could fail over to a second data centre.  They’d continue to produce messages to the new DC and the consumers would consume them. If the replication was lagging behind by, say, five seconds, they would lose five seconds of data. Is this important?  Like everything else, it depends on the messages. If they are financial transactions being sent to a fraud detection system, then losing some isn’t ideal, but because losing an entire DC is very rare, the risk of losing five seconds of transactions might not be worth the cost of replicating the topic. If, however, you’re legally required to check every transaction, then you do need to replicate to the DR site.

Governance

Kafka Credentials

Also known as Kafka Users, these are the credentials that client apps use to connect to the cluster.  Make sure that each client application has its own credentials with the minimum required privileges, as you would for any system.

OpenShift RBAC

To implement ephemeral clusters in OpenShift or Kubernetes as part of your DevOps process, you will need to create users with appropriate access rights.  For IBM Event Streams you will create clusters and KafkaUser objects by creating custom resources on the cluster – which means access to the oc/kubectl command or direct access to the API server.  A build that includes a user's ephemeral cluster will have to run under that user's identity and credentials, to prevent them from deploying anything in an inappropriate namespace.

For non-container deployments a build tool such as Ansible will be required, with user access along similar lines.

Topic Discovery and Self Service

As part of Cloud Pak for Integration, alongside Event Streams IBM offers Event Endpoint Management.  This provides capabilities similar to API management but for messaging topics, which are described as AsyncAPIs (the AsyncAPI specification is modelled on OpenAPI).  Event providers can expose their topics in the portal; consumers can browse for the topics they want and subscribe.  It also offers other management features such as versioning, isolation, access control and more.

