Co-authored by @Carlos Chivardi and @Brian Hall
“Success in life is not determined by what happens to you, but how you respond to what happens to you.”
Change is an inevitable part of life. Sometimes we can control when change happens, and other times we cannot. Either way, the key to returning to a normal, controlled situation is to respond adequately and effectively.
The need for stability is especially acute for mission-critical applications, like IBM Sterling Integrator, that support an organization’s day-to-day operations. A faulty server (pod) or a disk crash can cause an unplanned outage, which may disrupt a company’s supply chain operations.
This blog series describes the “self-healing” feature of IBM Sterling Integrator Certified Containers (SICC) running on Red Hat OpenShift (RHOS). The series consists of the following three parts:
- SICC Liveness Overview
- How SICC self-healing works
- Benefits of self-healing feature in SICC
In this blog we will cover the SICC liveness overview.
Mission-critical supply chain applications, like SICC, run for long periods of time. They can eventually transition to a broken state from which they cannot recover except by being restarted. Container orchestration technologies, such as Kubernetes, provide liveness probes to quickly identify and correct such outages, minimizing disruption to operations.
How do SICC and RHOS decide when to re-use or restart a pod? They use health checks. There are two types of health checks for SICC running on RHOS:
- Readiness probe: Verifies whether a newly created container is ready to service requests. If the container fails the readiness test, it is removed from the list of endpoints.
- Liveness probe: Checks whether a running container is still alive and working. If the container fails the liveness test, the kubelet on its node kills it and then carries out the next steps based on the pod’s restart policy.
This blog series focuses on how the second health check, the liveness probe, is executed. Application health can be verified with one of three kinds of check:
- HTTP check: The check succeeds if the HTTP response code is between 200 and 399. This is a typical use case for web applications.
- Container execution check: A command is executed inside the container, and the check succeeds if the command exits with code 0. SICC uses this approach, running the script /ibm/b2bi/install/bin/b2biLivelinessCheck.sh at a configured interval in seconds (see Figure 2 below).
- TCP socket check: The kubelet tries to open a TCP socket to the container, and the check succeeds if the connection is established. This is a typical use case for non-HTTP applications.
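The success criteria of the three check types can be sketched in Python. This is a minimal illustration only, not how Kubernetes or SICC implements its probes; the function names `http_check_ok`, `exec_check_ok` and `tcp_check_ok`, and the default timeouts, are hypothetical.

```python
import socket
import subprocess
import sys
from typing import Sequence


def http_check_ok(status_code: int) -> bool:
    """HTTP check: any response code from 200 through 399 counts as success."""
    return 200 <= status_code <= 399


def exec_check_ok(command: Sequence[str], timeout_s: float = 5.0) -> bool:
    """Container execution check: success means the command exits with code 0."""
    try:
        completed = subprocess.run(command, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        # A command that hangs or cannot be started is treated as a failed check.
        return False
    return completed.returncode == 0


def tcp_check_ok(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """TCP socket check: success means a connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Stand-in for a health script: a command that exits 0 passes the check.
    print(exec_check_ok([sys.executable, "-c", "raise SystemExit(0)"]))
    print(http_check_ok(302))
```

In the execution check, only the exit code matters: a health script signals "healthy" by exiting with 0 and "unhealthy" with any other code.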
Figure 1 below describes how the liveness probe is executed. The probe and its frequency are defined in the container’s StatefulSet YAML manifest.
Figure 2 below shows the StatefulSet YAML details for SICC’s my-release-b2bi-asi-server container. These settings include how often the health check script is executed, along with the success and failure thresholds, i.e. how many chances the pod is given before it is terminated.
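To make the threshold idea concrete, here is a small Python sketch of how consecutive probe failures could be counted against a failure threshold. It is a simplified model, not Kubernetes or SICC code. The function name `first_restart_index` is hypothetical, and the threshold of 3 used in the example below is an illustrative assumption rather than a value taken from the SICC manifest.

```python
from typing import Iterable, Optional


def first_restart_index(results: Iterable[bool], failure_threshold: int) -> Optional[int]:
    """Return the index of the probe run that would trigger a restart, or None.

    `results` holds one entry per probe run (True = passed, False = failed).
    A restart is triggered once `failure_threshold` failures occur in a row;
    any passing run resets the count.
    """
    if failure_threshold < 1:
        raise ValueError("failure_threshold must be at least 1")
    consecutive_failures = 0
    for index, passed in enumerate(results):
        if passed:
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            return index
    return None
```

With an assumed threshold of 3, the sequence pass, fail, fail, pass, fail never triggers a restart because the passing run resets the count, whereas pass, fail, fail, fail triggers one on the fourth run. This is the sense in which the threshold gives the pod a fixed number of chances.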
In the second part of the blog post series, we’ll explain how the SICC liveness probe is executed and triggers the replacement of a faulty container with a healthy one.
Read the next blog of the Self-healing series here: IBM Sterling Integrator Certified Containers: Self-healing Scenario (Part 2)