Supercharging Resiliency with Ingress on IBM Cloud Kubernetes Service

By Lucas Copi posted Mon January 15, 2024 02:57 PM

In the era of distributed systems, it is critical that applications are fault tolerant, highly available, and resilient to environment changes.

By using Ingress on IBM Cloud, you can focus on application availability and trust IBM Cloud to provide the most resilient and secure environment for those applications to run in.

In this blog, we explain the basic concepts and default settings of Ingress in IBM Cloud Kubernetes Service. We also cover scenarios for updating the Ingress resources in your cluster, walk through more advanced use cases such as autoscaling and edge networking, and finally show how to monitor and visualize your Ingress setup using IBM Cloud observability services like Activity Tracker and IBM Cloud Monitoring.

Ingress overview

Every IBM Cloud Kubernetes Service (IKS) cluster is created with a public in-cluster Ingress controller (ALB) in each zone, a default Ingress domain, and a TLS certificate for exposing applications in the cluster. These components are configured to provide high availability and resiliency for the downstream applications.

A public, in-cluster Ingress controller (ALB) in each zone
Each ALB is a multi-replica Kubernetes deployment with anti-affinity rules that prevent replicas from scheduling on the same node. The ALBs are deployed with a pod disruption budget so that Kubernetes always attempts to maintain the availability of the deployment, and the upgrade strategy ensures that replicas of an older version are not removed before the new replicas are healthy. This makes the ALBs resilient to cluster maintenance, node events, updates, and cluster scaling.
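Conceptually, these resiliency settings map to standard Kubernetes constructs. The manifest below is an illustrative sketch only; the names, labels, and image are hypothetical, and the real ALB deployment is managed by IBM Cloud and may differ:

```yaml
# Hypothetical sketch of the resiliency settings described above;
# not the actual ALB deployment spec.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-alb
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # keep old replicas until new ones are healthy
      maxSurge: 1
  selector:
    matchLabels:
      app: example-alb
  template:
    metadata:
      labels:
        app: example-alb
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: example-alb
              topologyKey: kubernetes.io/hostname  # one replica per node
      containers:
        - name: nginx-ingress
          image: example/nginx-ingress:latest
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-alb-pdb
spec:
  minAvailable: 1   # voluntary disruptions must leave at least one replica
  selector:
    matchLabels:
      app: example-alb
```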

A default Ingress domain
The default Ingress domain is registered with the load balancer addresses of all public ALBs, and is automatically updated whenever there is a change to the ALBs in the cluster, such as enabling or disabling an ALB or creating a new ALB. The domain is configured with health checks so that traffic is directed only to zones in the cluster that are marked as healthy. In IBM Cloud classic infrastructure, the health check is performed within DNS and uses the default albhealth endpoint to determine the readiness of each zone; zones that are not healthy do not have their IPs included in the DNS lookup response. In IBM Cloud VPC infrastructure, the health check is carried out on the load balancer members as well as the backend node pools. Cluster nodes that are not marked as healthy do not have traffic forwarded to them from the load balancer, and unhealthy load balancer members are not included in the lookup for the load balancer hostname. You can see the different traffic routing flows for Classic and VPC below.

(Classic Network Flow)
(VPC Network Flow)
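As a simplified illustration of the classic-infrastructure behavior, the sketch below shows how a DNS response would include only the IPs of zones whose health check passes. The zone names, IPs, and function are hypothetical, not IBM Cloud's implementation:

```python
# Illustrative sketch: filter ALB IPs by zone health, the way the
# classic-infrastructure DNS health check excludes unhealthy zones.
# Zone names and IPs are made up for the example.

def dns_answer(zone_albs: dict, zone_health: dict) -> list:
    """Return the ALB IPs that a DNS lookup would include."""
    return sorted(
        ip
        for zone, ips in zone_albs.items()
        if zone_health.get(zone, False)  # skip zones failing albhealth
        for ip in ips
    )

zone_albs = {
    "us-south-1": ["169.60.0.10"],
    "us-south-2": ["169.60.0.20"],
    "us-south-3": ["169.60.0.30"],
}
zone_health = {"us-south-1": True, "us-south-2": False, "us-south-3": True}

# us-south-2 is unhealthy, so its IP is excluded from the answer
print(dns_answer(zone_albs, zone_health))
```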

Updating Ingress components in your cluster

By default, version upgrades and security fixes are automatically applied to the ALBs to ensure clusters are always running the most secure and up-to-date code. The in-cluster ALB update process supports time-of-day maintenance windows and cluster concurrency rules. These advanced configurations allow ALB updates to be applied to the cluster during known periods of low traffic, or at a time when there will be the least impact to application end users. For more details about configuring maintenance windows, see this setup guide.

For users with complex ingress configurations, or users who have a preprod cluster pipeline and prefer to validate and update ALB versions manually, the ALB auto-update can be disabled. See the ALB update guide for the steps on manually managing ALB versions.

The ALB is based on the open source Kubernetes Ingress-Nginx project, and the IBM Cloud Kubernetes Service ALB versions align with the community releases. The IKS ALB versions use the format <major>_<minor>_<patch>_<build>_iks, with the build number specific to the IKS release process.
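For example, a small helper (illustrative only; the version number below is made up) can split an IKS ALB version string into the community release it is based on and the IKS-specific build number:

```python
# Illustrative parser for the IKS ALB version format
# <major>_<minor>_<patch>_<build>_iks described above.

def parse_alb_version(version: str) -> dict:
    parts = version.split("_")
    if len(parts) != 5 or parts[4] != "iks":
        raise ValueError(f"not an IKS ALB version: {version!r}")
    major, minor, patch, build = (int(p) for p in parts[:4])
    return {
        # Community Ingress-Nginx release the ALB is based on
        "community": f"{major}.{minor}.{patch}",
        # Build number specific to the IKS release process
        "build": build,
    }

print(parse_alb_version("1_9_5_3497_iks"))
```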

Each new ALB version is vetted in a weekly pipeline that runs over ten thousand tests against all supported IKS cluster versions and infrastructure combinations. These tests include the full suite from the upstream open source project as well as IKS-specific behavioral tests. This rigorous testing prevents known regressions from reaching new versions. When a new version is released in the upstream project, a new pipeline is created to validate the release before it is made available to users.

Our testing pipeline includes:
  • Manual review of the upstream change logs and code changes to identify any potential risks.
  • Successful test runs of the entire end-to-end suite including both Kubernetes Ingress project and IBM Cloud tests.
  • Performance validation for request handling.
  • Beta testing in our internal environments.

For both major and minor releases, the general cadence is to release the version to users for several weeks before marking it as the default version and updating all ALBs. This enables users to validate the behavior of the new version in their own pipelines before it is applied to their production clusters.

All new ALB versions (major, minor, patch, or build) have corresponding details published in the ALB changelog, and also have the release notes published under the Kubernetes Service component on the status page. The status page includes an RSS feed which can be added to an RSS watcher to provide notifications for new releases. See the Slack RSS integration as an example.

For additional resiliency, the ALB deployment in the cluster is regularly reconciled using a Kubernetes addon-manager to prevent unexpected or unwanted changes from affecting the ALBs.

Edge network and autoscaling scenarios

Handling high traffic load and securely sandboxing ingress traffic is critical for high-performing applications. IKS ALBs support edge node scheduling to ensure the ALBs are scheduled to cluster nodes dedicated to handling ingress traffic. For each ALB operation (enable, create, update), the cluster is inspected for edge nodes; if enough edge nodes exist in the cluster to handle the ALB replicas, the ALB deployment is automatically updated to require scheduling the pods to the corresponding edge nodes in the zone. This isolates the ALB pods from the application pods, improving performance by giving the ALBs and the applications dedicated bandwidth and CPU. For an advanced setup, application worker pools can be private-only, ensuring that all external traffic flows only through the edge nodes. See the IKS edge node documentation for more details.
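Under the hood, restricting a workload to edge nodes uses standard Kubernetes scheduling constructs. The fragment below is an illustrative sketch assuming the dedicated=edge worker-pool label described in the IKS edge node documentation; it is not the actual ALB deployment spec:

```yaml
# Illustrative sketch: schedule pods only to nodes labeled
# dedicated=edge, and tolerate the matching taint if one is set.
# Not the actual ALB deployment spec.
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: dedicated
                    operator: In
                    values: ["edge"]
      tolerations:
        - key: dedicated
          value: edge
          effect: NoSchedule
```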

To ensure your ALBs are configured to handle application traffic load, the ALBs support horizontal pod autoscaling. This allows the ALBs to be scaled up and down based on actual usage metrics: conserving resources during periods of low traffic, and scaling out to meet demand during periods of high traffic. For more details, see the ALB autoscaling documentation, or check out the previous Ingress blog with a step-by-step solution guide.
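For orientation, a horizontal pod autoscaler targeting an ingress controller deployment generally looks like the sketch below. The names and thresholds are hypothetical; refer to the ALB autoscaling documentation for the supported IKS configuration:

```yaml
# Illustrative HPA for an ingress controller deployment.
# Names and thresholds are hypothetical examples.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-alb-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-alb
  minReplicas: 2        # keep redundancy during low traffic
  maxReplicas: 6        # cap growth during traffic spikes
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```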

Visualizing and monitoring your Ingress setup

IKS provides the ability to connect an IBM Cloud Monitoring instance and an IBM Cloud Activity Tracker instance to a cluster to visualize and monitor ALB behavior. The steps below provide a guide for creating Activity Tracker event visualizations in the UI and connecting them to an IBM Cloud Monitoring instance.

For any change made to the ALB deployment, including customizations, or for any ALB version update applied to the cluster, an IBM Cloud Activity Tracker event is emitted. The events can be found by searching for the following keywords in the Activity Tracker instance: cluster-alb-version.update, cluster-alb-deployment.update, and cluster-alb-deployment.create. The steps below provide details on how to configure event routing for ALB events.

1. Create a view in the Activity Tracker instance for ALB version or deployment updates. For more details on configuring views for an Activity Tracker instance, see this user guide.

2. Create an alert based on the new ALB view, and select Sysdig as the destination. For more details on creating alerts in Activity Tracker, see this user guide.

3. Add the Sysdig details into the alert configuration. For more details on the Activity Tracker and Sysdig integration, see this user guide.

4. Once the view and alert have been created, issue an update for an ALB and see the event visualization in Sysdig for your cluster and ALB.
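As a simple illustration, filtering a stream of events down to the three ALB actions listed above might look like the sketch below. The event dictionaries are simplified stand-ins, not the exact Activity Tracker event schema:

```python
# Illustrative filter for ALB-related Activity Tracker events.
# The event dictionaries here are simplified stand-ins, not the
# exact Activity Tracker event schema.

ALB_ACTIONS = {
    "cluster-alb-version.update",
    "cluster-alb-deployment.update",
    "cluster-alb-deployment.create",
}

def alb_events(events: list) -> list:
    """Keep only events whose action is one of the ALB actions."""
    return [e for e in events if e.get("action") in ALB_ACTIONS]

events = [
    {"action": "cluster-alb-version.update", "cluster": "my-cluster"},
    {"action": "cluster.create", "cluster": "my-cluster"},
]
print(alb_events(events))
```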

With IBM Cloud Monitoring, you can also use the pre-created nginx-ingress dashboard to visualize the behavior of your ALBs and upstream applications. See the Ingress monitoring documentation for details on how to configure the monitoring instance and load the dashboard. Once loaded, the dashboard provides valuable information on the health of both the ALBs and the upstream applications.
The annotated sections of the dashboard below highlight important information:

1. The configuration section highlights whether the ALBs have successfully processed the Ingress resources in the cluster and constructed an nginx configuration. Errors in the Config Errors section indicate there may be a bad Ingress resource in the cluster. The nginx configuration is reloaded any time a resource the ALB is watching (Secret, ConfigMap, or Ingress resource) is modified in a way that changes the nginx configuration. A high number of configuration reloads could indicate a rogue process that is constantly updating resources in the cluster.

2. The controller error rate indicates the rate of 4XX and 5XX responses for the ALB. These responses can have multiple causes, and a sustained error rate is a good indicator that something is unhealthy with the backend application. Pairing this metric with an alert will help you recognize when end users may be seeing instability.

3. The latency and response time section provides details about the time it takes for requests to complete. High upstream latency can indicate an issue connecting to the upstream application pod. This could be caused by a networking error or an unhealthy node, but it is also often caused by an upstream pod being resource starved. Checking the resource usage (CPU, memory, network, and disk I/O) of the application may indicate the need for a scale event.

4. The traffic and connections section highlights the number of active connections to the ALB, the request volume, and the corresponding network I/O. A sudden change in this section can indicate an issue: if the connection count suddenly decreases, there may be an issue with the load balancer or cluster nodes preventing clients from connecting to the ALB; if a large spike in network traffic occurs, the ALBs or nodes may need to be scaled to handle the load. A spike may also indicate a rogue process or bad actor making large numbers of requests.

5. The CPU and memory section highlights the amount of resources consumed by the ALB. CPU is the limiting factor for ALB performance; sustained high CPU usage could require scaling to keep up with traffic demand. Correlating resource usage with 4xx/5xx responses and request times can help narrow down request latencies caused by resource constraints.

When combined with other cluster monitoring for Kubernetes resources, nodes, and upstream applications, the Ingress Nginx dashboard can help users get an overall snapshot of the health of their environment. Combining the dashboard with threshold alerting can help call out engineering teams when there is an issue that could affect the end user.



Conclusion

In this blog, we provided a quick overview of the default Ingress components that you get when you create an IBM Cloud Kubernetes Service cluster.
We also covered strategies for updating Ingress in your cluster including leveraging automatic updates. Finally, we explained how to maintain observability into your Ingress setup with IBM Cloud Logging and Monitoring tools.

The scenarios and strategies covered in this blog help you keep your Ingress setup up-to-date, healthy, and running, while staying confident that your apps are running in a resilient and secure environment on IBM Cloud Kubernetes Service.



Contact Us

For more information about Ingress components and getting started with Ingress on IBM Cloud, check out the official getting started documentation. If you have questions, engage our team via Slack by registering here and join the discussion in the #general channel on our public IBM Cloud Kubernetes Service Slack.