Cloud Pak for Integration

Single Point of Failure Considerations of GitOps in CP4I

By Aiden Gallagher posted Thu February 02, 2023 02:51 AM



As integration architectures become more sophisticated, using automation to complete the build, configuration and deployment of integration runtimes, we see greater use of Development Operations (DevOps) and, specifically, of Git-driven operations (GitOps).

It is important to be aware of where these operational procedures can go wrong so that the risks can be mitigated.

Why the reliance on Git?

We store lots of important information on code repository systems like Git or SVN, such as:

  • Integration infrastructure configuration
  • Application configuration
  • Build instructions
  • Pipelines

By having a central store of information, we have a single version of the truth. Anyone with the right access can see what is (or at least what should be) deployed in each environment. A clear line can be drawn: if it isn’t in Git, it isn’t part of the production code. This ethos can reduce environmental drift and the need for manual interventions.

We also see that Git can be used as a springboard for automation, either through a webhook mechanism, where a change triggers an automatic build and deployment into a cluster, or through a manual mechanism, where users trigger the build and deployment themselves. This has the benefit of being repeatable, producing generally clear outcomes, and allowing user admin access to be reduced, since direct pushes of changes by users are a human-error and bad-actor risk.
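As a sketch of the webhook mechanism described above, a Git push can drive a pipeline via Tekton Triggers (assuming Tekton is installed on the cluster); the binding, template and service account names here are illustrative, not part of CP4I itself:

```yaml
# Minimal sketch: an EventListener that turns Git webhook events into
# PipelineRuns. "git-push-binding" and "ace-build-template" are assumed
# to exist and are named for illustration only.
apiVersion: triggers.tekton.dev/v1beta1
kind: EventListener
metadata:
  name: git-webhook-listener
spec:
  serviceAccountName: pipeline        # assumed SA with pipeline permissions
  triggers:
    - name: on-git-push
      bindings:
        - ref: git-push-binding       # maps webhook payload fields to params
      template:
        ref: ace-build-template       # creates a PipelineRun per push
```

The Git server’s webhook is then pointed at the route exposing this listener, so every push triggers a repeatable build and deployment without users needing direct cluster access.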

In CP4I, we can reference various elements in the operator YAML in order to pull artefacts from Git or from a URL in some form of off-cluster repository: for example, an IBM ACE barURL, an image repository, or a secret management tool.
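For instance, an App Connect Enterprise IntegrationServer can pull its BAR file from an off-cluster endpoint at deployment time. This is a sketch only: the URL, names and licence values are illustrative placeholders, though the field names follow the IBM App Connect operator’s CR:

```yaml
# Sketch of an IntegrationServer whose BAR lives off-cluster, making that
# endpoint a runtime dependency of the pod.
apiVersion: appconnect.ibm.com/v1beta1
kind: IntegrationServer
metadata:
  name: customer-api
  namespace: cp4i
spec:
  version: "12.0"
  license:
    accept: true
    license: L-XXXX-XXXXXX        # replace with your entitled licence ID
    use: CloudPakForIntegrationNonProduction
  barURL: https://artefacts.example.com/bars/customer-api.bar  # off-cluster pull
  configurations:
    - customer-api-barauth        # credentials for the BAR endpoint
```

Every pod (re)start pulls the BAR from that URL, which is exactly the dependency discussed in the next section.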

What does this mean?

With this reliance on external systems come concerns about availability: we need to consider the impact of outages in those systems and of potential network issues. What resources are being stored, and when they need to be accessed, determines the potential impact:

  • Deployment custom resources – Custom resources are extensions of the Kubernetes API (expressed as YAML) and within OpenShift are used to deploy instances of the CP4I applications or to push changes to existing services. They are usually applied by pipelines.
  • Artefacts – Application resources such as a BAR file, an mqsc script, or an environment-specific API YAML. These can be stored in an artefact library, in Git, or within the cluster itself as ConfigMaps or Secrets.
  • Images – Container images can be stored in image registries. IBM images are available from the IBM Entitled Registry, but customers can make use of other external storage locations such as DockerHub and Quay, or of an internal registry such as OpenShift’s built-in image registry.

Where these are stored externally and there is an issue, several knock-on effects can be observed:

Pods won’t restart – Where pods reference an artefact on an external endpoint, such as IBM ACE referencing an off-cluster barURL, the artefact (e.g. the BAR file) is pulled as part of the deployment; if the endpoint is unreachable, the pull fails and the deployment errors.

Additionally, automation might trigger a redeployment despite a disconnect between Kubernetes and a dependent system, e.g. Git triggers a push/pull but the artefact repository is inaccessible. More importantly, Kubernetes can evict and redeploy pods at any time, which poses a risk of applications not being able to restart.

Inability to deploy new integrations – Whilst this would generally not impact consumers or cause a regulatory concern, there is a risk that you won’t be able to deploy new integrations during the outage. Given Git’s importance, this could impact a large part of the organisation at the same time, with the cost of being unable to complete planned work.

Potential loss of all versions of truth – If Git is where all data is stored, you are susceptible to losing everything at once, including your infrastructure and application data and configuration. Recovery would then rely on good documentation and employee knowledge to get the system back to where it should be.


There are several mitigations that can be employed to de-risk an outage.

To prevent an error when a pod restarts, you can opt to build all the information a container needs into its image. For example, add the BAR file to a custom ACE image which is deployed instead. This can be applied to other deployments such as DataPower and MQ too, for instance by embedding certificates into the image to reduce reliance on secret management tooling.
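The “bake it into the image” mitigation can be sketched as below: the server is deployed from a custom image that already contains the application, so no off-cluster pull happens at (re)start. The image name is illustrative, and the pod/containers override is an assumption about the App Connect operator’s CR schema:

```yaml
# Sketch: IntegrationServer deployed from a pre-built custom image
# (BAR already inside), so restarts need nothing off-cluster.
apiVersion: appconnect.ibm.com/v1beta1
kind: IntegrationServer
metadata:
  name: customer-api
spec:
  version: "12.0"
  license:
    accept: true
    license: L-XXXX-XXXXXX
    use: CloudPakForIntegrationNonProduction
  pod:
    containers:
      runtime:
        # custom image built by the pipeline; held in the on-cluster registry
        image: image-registry.openshift-image-registry.svc:5000/cp4i/customer-api:1.0.0
  # note: no barURL here - the BAR is already part of the image
```

The trade-off is that every application change now requires an image build and push, so the pipeline does more work, but the runtime dependency on the artefact endpoint disappears.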

Additionally, it is possible to store runtime dependencies locally on the OpenShift cluster (a local BAR store, ConfigMaps) or in something more reliable like a highly available S3 bucket. In the event of an outage, a pod restart will then succeed because the artefacts can be fetched close to the deployment. This does require pipelines to manage and synchronise the local storage location.
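As an example of an on-cluster runtime dependency, an mqsc script can be held in a ConfigMap and referenced from an MQ QueueManager CR, so a pod restart needs nothing outside OpenShift. Field names follow the IBM MQ operator; the queue manager, ConfigMap and licence values are illustrative:

```yaml
# Sketch: MQSC configuration stored on-cluster and applied at start-up.
apiVersion: v1
kind: ConfigMap
metadata:
  name: qm1-mqsc
data:
  queues.mqsc: |
    DEFINE QLOCAL('APP.REQUEST') REPLACE
---
apiVersion: mq.ibm.com/v1beta1
kind: QueueManager
metadata:
  name: qm1
spec:
  license:
    accept: true
    license: L-XXXX-XXXXXX      # replace with your entitled licence ID
    use: NonProduction
  version: 9.3.0.0-r1
  queueManager:
    mqsc:
      - configMap:
          name: qm1-mqsc        # pulled from the cluster, not from Git
          items:
            - queues.mqsc
```

The pipeline remains responsible for keeping the ConfigMap in sync with the version held in Git, but a Git outage no longer blocks a restart.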

When planning your deployment strategy and failure scenarios, determine whether the source code repository should be functioning as a runtime dependency. In most cases, the answer should be no, but this becomes more subjective when we consider external artefact and image repositories.

If the code repository is central to the deployment strategy, then consider what options are available during an outage of your code repository (Git). One option is a manual deployment strategy using the core technology, to deploy objects by hand. This can be a documented runbook referencing local objects, or the deployment of a standard CP4I image (MQ / ACE) with the application objects deployed on top. Whilst this is laborious, it does provide a disaster recovery path that allows you to deploy new integrations.

Many source code repositories are distributed; as such, they can be cloned to multiple locations, with the artefacts (e.g. BAR files) held in a local copy and the local (OpenShift) clone synced to the remote/central clone.
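One way to keep such a local clone in sync is a scheduled mirror job on the cluster. This sketch assumes an on-cluster Git service (e.g. a self-hosted Gitea) and embedded or injected credentials; all names, URLs and the schedule are illustrative:

```yaml
# Sketch: periodically mirror the central repository into an on-cluster clone,
# so artefacts remain available locally if the central Git service is down.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: git-mirror-sync
spec:
  schedule: "*/15 * * * *"            # every 15 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mirror
              image: alpine/git:latest
              command: ["/bin/sh", "-c"]
              args:
                - |
                  git clone --mirror https://git.example.com/integration/config.git repo \
                    && cd repo \
                    && git push --mirror https://gitea.cp4i.svc/integration/config.git
```

If the central repository becomes unavailable, pipelines can be pointed at the on-cluster mirror until service is restored.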

Finally, ensure that your code repository (Git) is backed up, replicated and recoverable across multiple locations, to reduce the potential loss of all versions of truth and ensure continued service for your organisation.