
Splitting up the ESB: Grouping integrations in a containerized environment

Thu July 23, 2020 10:50 AM

In previous articles and posts on agile integration architecture (AIA) we have discussed the benefits of breaking up the centralized enterprise service bus (ESB) pattern. Indeed there is a webcast specifically on that at the beginning of our webcast series. This post is a more detailed treatment of a later webcast in the series that looked at how to choose the right granularity and grouping for the integrations within containers.


The agile integration architecture approach ideally (although not necessarily) involves moving to a container-based infrastructure, allowing integrations to be deployed more independently. At the extreme end of this approach we could imagine a separate container for each and every integration. Whilst it's technically possible to do this, it would likely be a step too far, resulting in a far more numerous and complex collection of components to administer. A more measured approach is to place groups of integrations in each container. The container would still have only one integration runtime process, in keeping with good practice, but that integration runtime would have a group of integrations loaded into it. So, we clearly have some decisions to make around how we want to group our integrations together, and that is the topic of this post.

To put this in practical terms, it is not uncommon to hear of ESBs that run hundreds of discrete integrations across a highly-available pair of servers. It is easy to see the issues this creates: every deployment or update of an integration poses a risk to all the other integrations currently running alongside it. Furthermore, any upgrades to the underlying integration runtime in order to take advantage of new features require at least some regression testing across all the integrations present, and potentially some downtime in order to perform the upgrade.


In a more decentralized and containerized architecture, we might instead place a group of integrations in each container. These integrations could then be maintained independently, and the underlying integration runtimes patched independently, with no risk to integrations in other containers. This offers the benefits associated with agile integration architecture: agility, productivity, elastic and dynamically optimized scalability, and more fine-grained resilience models. Ultimately, this brings faster time to market, reduced system downtime, and more cost-effective use of resources.

So, what criteria should we use to decide which ones could/should stay together, and which ones must be separated from one another?

What grouping do you have today?

As noted in the App Connect Enterprise adoption post, many customers have been grouping their integrations into separate Integration Servers (execution groups as they used to be called prior to v9 of the product). It may well be that you have already evolved a good strategy for grouping of integrations into suitably independent groups. It’s certainly wise to take that as your starting point. If it has been providing sufficient decoupling of deployments to date, this may be all you need. As they say, “if it ain’t broke, don’t fix it”!

However, for many, either this type of separation has not yet occurred, or it has occurred as the result of hastier, tactical decisions, and re-visiting the grouping would be wise.

Splitting by business domain

The simplest place to start is business domains. Integrations owned (created and maintained) by radically different parts of the business would be better separated from one another such that they have less chance of affecting one another whether at deployment or runtime.


An obvious example of a very coarse-grained split of this type would be an insurance company which has completely different business domains handling general insurance, life insurance/pensions, and health insurance. There would be little advantage in integrations built by these very separate domains sharing infrastructure. Indeed, it is likely that such a coarse-grained split will already be evident, with each business domain having a separate enterprise service bus infrastructure.

If we plan to move to containers, we can retain this strong separation between business domains using, for example, Kubernetes namespaces and network policies to separate them at a network level. This might be supplemented by container-based software-defined networking such as Calico. At the extreme end we could of course have different Kubernetes clusters for complete separation even at the infrastructural level. There are clearly many options, and a full treatment is well beyond the scope of this post.
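That said, a minimal sketch gives a flavour of what the namespace-level separation involves. This is an illustrative example only, assuming hypothetical domain names; real policies would also need explicit allow rules for the cross-domain traffic discussed later.

```yaml
# Illustrative sketch only: one namespace per business domain.
apiVersion: v1
kind: Namespace
metadata:
  name: general-insurance
---
apiVersion: v1
kind: Namespace
metadata:
  name: life-and-pensions
---
# Deny all ingress into the general-insurance namespace by default;
# explicit NetworkPolicies would then allow only the traffic we intend.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: general-insurance
spec:
  podSelector: {}      # an empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
```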

The same concept could then be followed within a domain to group integrations into sub-domains (3a, 3b, 3c). So, continuing with the insurance analogy, within the general insurance domain we might see a natural split across motor insurance, house insurance, travel insurance and so on. Splitting by sub-domain is useful if the integrations are genuinely owned (created and administered) by very different business groups, and as such would benefit from being handled independently. There may, however, be little benefit in splitting buildings insurance from contents insurance if they are in fact looked after by the same team of people. We would need to look for another reason to sub-divide them, as we will discuss next.

What about integrations that span business domains?

Whilst business domains provide a convenient way to sub-divide our integrations, we should recognize that integration by its very nature often crosses business domains.


In our example above, integrations a, b, c and d are contained within a domain. However, an integration such as e is an API implementation which aggregates calls across multiple domains. For example, an API that aggregates all the insurance contracts owned by a single individual, across all business domains. So, it might collate details about their travel, car, building, life, and health insurance to provide a single view of our relationship with the customer.

Another common example is f, an integration that propagates events from one domain into another. A common scenario is synchronizing business data such as a customer’s address that might be duplicated in applications in different domains.

Ideally, we would want to allocate cross-domain integrations like e and f in their entirety to a specific domain even though they have contact with other domains.


In some cases the right approach may even be to split the integration into several parts each of which have a more clear ownership within a domain. In the example above, Application Y uses an integration to formally expose an API for general re-use. Domain 1 then has a listener integration that receives events from Application X, and it then propagates the event data to Application Y via its new API.

Forcing ownership decisions for these crosscutting integrations is not such a bad thing. All components in an architecture need an owner if they are to be maintained effectively over the long term.

Grouping within a domain

Having performed an initial split of the integrations based on business domains, we now need to look for reasons why integrations might need to be grouped together within those domains. In the following diagram there are some possible reasons we might group integrations together. In the remainder of the post we’ll explore the pros and cons of each.

Stable requirements and performance

Maybe the most obvious reason to group integrations would be if they are stable from a requirements and performance point of view. Perhaps few if any changes have been required on these integrations for several years, and the workload they serve is well known and predictable. There would seem little point in separating these integrations into individual containers.

The only thing against keeping them together might be robustness: if one of the integrations were to suffer a failure that affected the overall runtime, it might bring the rest of the integrations down with it. However, given these integrations have matured over several years, most of their usage permutations have already been explored. In other words, if they were going to have a catastrophic failure, it would probably already have happened. Furthermore, if they’ve been living together on the same server until now, we can assume we’re comfortable with the availability they provide. As such we can be reasonably confident that they can live alongside each other and retain the same level of service as we currently have.

Technical dependencies

There may be sets of integrations that all rely on a key runtime dependency. The most obvious example is those that need a local MQ Queue Manager to be present within the container, but we’ll look at some other examples too.

Local MQ server dependency

Although the hard dependency on a local MQ server was removed in v10, many interfaces were written using MQ server bindings, since it could be assumed that a default local queue manager was present. Many of these could be refactored to use client bindings and thereby not have a dependency on a local queue manager, but some interfaces will continue to require a server binding due to the nodes used in the flow, or their transactional requirements.


Those requiring a local MQ server typically also require a persistent volume. As a result they will be more restricted in terms of how dynamically they can scale up and down. We should at the very least separate out the integrations that do not need a local MQ server such that they can enjoy elastic scalability and faster startup times.
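As a rough sketch of how that separation might look on Kubernetes (the image name, replica count and volume size are placeholders, not a prescribed App Connect Enterprise deployment), the MQ-dependent group could run as a StatefulSet so that each replica keeps a stable identity and its own persistent volume, while the MQ-free group remains an ordinary, freely scalable Deployment:

```yaml
# Illustrative sketch: integrations needing a local queue manager run as a
# StatefulSet, giving each replica its own persistent volume for MQ state.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: payments-integrations-mq   # hypothetical group name
spec:
  serviceName: payments-integrations-mq
  replicas: 2
  selector:
    matchLabels:
      app: payments-integrations-mq
  template:
    metadata:
      labels:
        app: payments-integrations-mq
    spec:
      containers:
        - name: integration-server
          image: example.com/ace-with-mq:latest   # placeholder image
          volumeMounts:
            - name: mq-data
              mountPath: /var/mqm                 # queue manager state on disk
  volumeClaimTemplates:
    - metadata:
        name: mq-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 2Gi                          # placeholder size
```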

There is an existing article detailing when you do and do not need a local MQ queue manager for App Connect Enterprise.

Synchronous vs. asynchronous patterns

There are a huge number of different integration patterns, but at a very high level they can almost always be split into two core types.

  1. Request-response (blocking): The caller waits for the integration to occur since they need the final result to continue. This applies to most web services or REST APIs. Minimizing “response time” is critical for these interactions as the caller is blocked until the integration completes.
  2. Fire-forget (non-blocking): Those that react to asynchronous fire/forget events. The initiator of the event does not block waiting for a response. The focus for these interactions is maximizing “throughput” – the rate at which events are processed.

Note that the terms above refer to the overall interaction pattern, not to the transport being used. For example, request/response calls can be made over transports such as HTTP, but equally over messaging transports such as IBM MQ. Equally, you could perform a fire-forget style interaction over either transport.

We should aim to separate these two core types, since they will likely require very different configuration with regard to aspects such as availability and scaling.
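To make the difference concrete, here is a hedged sketch of a Deployment for a request-response group. The readiness probe ensures a replica only receives traffic once it can answer promptly; a fire-forget group would typically drop the HTTP probe and be sized for sustained throughput instead. The port, health endpoint and resource figures are assumptions for illustration only:

```yaml
# Illustrative sketch for a request-response (blocking) group.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sync-enquiry-apis            # hypothetical group name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sync-enquiry-apis
  template:
    metadata:
      labels:
        app: sync-enquiry-apis
    spec:
      containers:
        - name: integration-server
          image: example.com/ace-sync-group:latest  # placeholder image
          ports:
            - containerPort: 7800                   # assumed HTTP listener port
          readinessProbe:
            httpGet:
              path: /healthz                        # hypothetical health endpoint
              port: 7800
            periodSeconds: 5                        # only route traffic to responsive replicas
          resources:
            requests:
              cpu: "500m"                           # headroom to keep response times low
              memory: "512Mi"
```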

Cross dependencies

If an integration is completely dependent on the availability of other integrations, that may make a case for them to be deployed together. There are of course other ways to handle this situation.


The related integrations may be better combined using sub-flows within the same integration. However, there are times when the flows need to be separate, perhaps because they not only call one another but are also independently callable.

Scalability

With modern internet-facing applications it is often very hard to predict future workload. If a mobile application is successful it could reach enormous numbers of users. Container orchestration platforms enable elastically increasing the number of replicas and then reducing them once the load decreases. Integrations that are likely to need to scale together could be placed in the same container so that they scale together via the same replication policy.


This may help with pre-emptive, efficient scaling; rather than scaling based on the throughput of just one interface, we could react to the sum of the throughput on all related interfaces.
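A minimal sketch of such a shared replication policy, assuming the grouped integrations all live in one Deployment and that CPU utilization is an acceptable proxy for their combined throughput:

```yaml
# Illustrative sketch: one autoscaling policy covers the whole group, so all
# the integrations in the pod scale up and down together.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sync-enquiry-apis
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sync-enquiry-apis     # the Deployment hosting the grouped integrations
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out before the replicas saturate
```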

However, there are downsides too. We can no longer maintain the integrations independently, nor can we provide differing scaling policies if we find that some of the integrations react differently to load.

Resilience

To achieve very high availability requirements such as, for example, “five nines” (99.999% availability), where only around 5 minutes of downtime a year is acceptable, we may need to ensure that a significant number of replicas are always running. The more replicas available, the less likely a requestor is to experience downtime should any individual instance fail.


We should note of course that the number of replicas is only part of any high availability story; we also need to ensure there are no single points of failure in any of the underlying resources. For example, the replication policy would also have to ensure replicas were spread across multiple physical nodes, and of course the underlying systems behind the integration must have appropriate availability themselves.


It certainly makes sense to isolate those integrations that have significantly higher availability requirements such that they can have a separate replication policy. However, we should recognize that there are diminishing returns on grouping too many integrations together in this situation. Let’s assume each integration has some probability that it could cause an outage of the whole container and all the integrations within it. The overall probability of an outage is then potentially greater than any integration would have faced if it were deployed on its own. The more integrations in the group, the worse the availability. Ultimately, for very stringent high availability requirements, a separate runtime (and therefore container, and indeed pod) for each integration is probably required.
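To put a rough number on this intuition: if each of n co-located integrations independently (a simplifying assumption) has probability p of causing a container outage in a given period, the container stays up only if every one of them does:

```latex
P(\text{container outage}) = 1 - (1 - p)^n
```

For example, with p = 0.001 and n = 20 we get 1 - 0.999^20 ≈ 0.02, so the grouped container is roughly twenty times more likely to suffer an outage than a container running a single such integration.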

Shared lifecycle

Do we have a set of integrations that always get maintained together, and released into production in synchronization with one another? These might be good candidates to be grouped together in the same container, which would indeed make ensuring a consistent release across the set easier.

A common situation where this may occur is where integrations use shared data models. By this we mean that they all use a single data model that must remain in step across these integrations. This sometimes happens because the data model implements a particular versioned industry standard. Integrations locked in step on the same data model will often have the same release cycle for major changes that accompany changes to the underlying data model version.


If the integrations are deployed to the same runtime, it makes it easier to deploy them consistently. Furthermore, we can also consider using a shared library within the integration runtime so we only have to update one copy and we can ensure consistency across the integrations.

Another reason for synchronized data models across integrations might be that they are tied to the release cycle of the application whose data they expose. For example, changes to an integration to incorporate extra data fields from a system of record might require changes on the system of record itself as well as within the integration. Multiple integrations may need to change in the same release cycle as the system of record, so it may make sense for them to be deployed together.

However, the existence of a shared data model doesn’t guarantee a shared lifecycle for the integrations. Although changes to data models are a very common cause of changes to integrations, there are certainly plenty of other reasons integrations might need to be updated. We need to look at the past history of changes to see whether there is a clear trend of the integrations being released together.

A worked example

Although we didn’t announce it at the start, we’ve actually deliberately walked through the criteria for grouping integrations in roughly priority order. Let’s walk through what that might look like with a fictional, but vaguely realistic example.

Let’s take our insurance company example from earlier and imagine they have a centralized enterprise service bus that currently contains 100 integrations. We want to decide how best to group them into containers.

Initially we look at the core business domains, who we assume would prefer to have full ownership of the integrations that pertain solely to their aspect of the business. We find that 20 of the services relate to the Life and Pensions business domain, 40 relate to the various General Insurance policies (car insurance, building insurance etc.), and 25 relate to Healthcare Insurance. We group these integrations by their business domains. The remaining 15 are cross-domain services that relate to providing a single view of the customer by calling the services of the other domains. We decide to bring these integrations together into a new “Customer” domain, in some cases re-factoring them where a portion really belonged to one of the other business domains.

We then look at each domain in turn for further opportunities to group services. Within the General Insurance domain, for example, we see that there are a number of synchronous integrations that retrieve data in real time to enable status inquiries on customers’ insurance products, the policies people have taken out, the claims they have made, and so on. These will all need to be tuned such that they provide a rapid response time for customers using the web portal and mobile application, requiring highly elastic scaling. Then there are a number of asynchronous integrations that, for example, process the regular payment transactions for the policies. These need to be tuned for overall throughput, since they must complete before the end-of-day batch jobs begin. They are carefully scaled to ensure they make best use of the back-end systems’ processing capabilities, but never overload them, as that would reduce efficiency and lower the throughput rate.

Next we notice that of all the synchronous integrations, “get quote” has a particularly critical response time. If it doesn’t respond within 2 seconds then its results will not even be included by the insurance broker sites. We therefore decide to bring that integration out into its own separate container such that we can ensure it is as lightweight as possible and configure it for rapid elastic scaling.

We also notice that when customers are buying a new policy, they are very sensitive to outages during the application process. They will very likely go to a competitor insurer if any type of outage occurs whilst they are in the middle of a purchasing decision. We therefore choose to put all the synchronous integrations relating to the buyer journey into a separate container so we can configure it differently. We set a minimum of 6 replicas, spread across 3 availability zones, to make them appear as robust as possible even in the face of individual outages. The most critical of the integrations we break out into a container of its own to further reduce the probability of it being affected by an outage of any sort.
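A hedged sketch of what that buyer-journey configuration might look like (all names are illustrative); the topology spread constraint asks the scheduler to keep the six replicas evenly balanced across the three availability zones:

```yaml
# Illustrative sketch: six replicas spread across availability zones so that
# a zone outage still leaves most replicas serving the buyer journey.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: buyer-journey-integrations   # hypothetical group name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: buyer-journey-integrations
  template:
    metadata:
      labels:
        app: buyer-journey-integrations
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                   # zones stay within one replica of each other
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: buyer-journey-integrations
      containers:
        - name: integration-server
          image: example.com/ace-buyer-journey:latest  # placeholder image
```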

Finally, we look at the remaining large group of integrations and spot that there are a number of them that are bound to a specific version of a payments related data model. The model is dictated by the payments partner, so if it changes, it will change for all related integrations. Looking back at the history of changes for these integrations, it seems the only times they have been changed in the last few years have been because of changes to the payments data model. We choose to group them together in a container because we can see we will almost always maintain and deploy them at the same time.

Of the remaining integrations, we leave them grouped together except for a few that are related to a pilot project we are working on, which is forcing very regular changes. We separate these out so they can be independently amended at a rapid pace.

Conclusion

There are many different criteria we might choose to help us break up a large number of integrations on a traditional enterprise service bus into a collection of more lightweight, decoupled containers.

We have discussed a variety of common criteria based on functional (domains) vs non-functional (scaling, availability), volatility vs. stability, and various forms of dependencies (MQ, data models etc.).

What’s clear is that some of these criteria overlap, so we’ve discussed the importance of establishing priorities, beginning with high-level business ownership and working down to more pragmatic concerns.

We’ve provided a simple framework for making your own decisions about grouping of integrations as you move on your containerization journey. However, we recognize that every enterprise is functionally different, with a differently evolved landscape, and different business priorities. We welcome comments on the approach such that we can improve the advice we provide.

We should of course always aim to design our integrations such that they are strongly independent of one another. This way we can easily change our minds regarding the grouping decisions and implement that change with minimal risk.

Acknowledgements

Sincere thanks to Andy Garratt for important contributions and review on this article.
