In our ongoing efforts to improve performance and scalability within our IBM Instana Self-Hosted offerings, we recently transitioned from an Nginx-based gateway to an Envoy-based reverse proxy. This blog post walks you through the motivations behind this change, the architectural improvements it brings, and how we implemented the solution. The goal is to keep things simple while covering all the key topics relevant to our customers.
The Problem
Our legacy system was built around an Nginx-based gateway, which presented a few notable limitations:
- Configuration Size Limit:
The Nginx configuration was stored in a Kubernetes ConfigMap, which is capped at roughly 1MB. Because every TU (Tenant Unit) adds its own routing configuration, this cap put a hard upper limit on how many TUs we could support, which became a significant constraint as customers began creating more TUs in our new SaaS regions.
- TLS and HTTP/2 Limitations:
We could not use the h2c protocol (HTTP/2 over cleartext) between the gateway and the acceptor, because Nginx's proxy_pass does not support HTTP/2 to upstreams, so the acceptor could not be placed behind the gateway. This led us to deploy two separate loadbalancers: one for the gateway and one for the acceptor. As a consequence, the acceptor had to handle TLS termination itself, which increased its CPU utilization.
- Domain and Port Limitations:
Our architecture did not support using a single domain for all types of ingress traffic across multiple ports—a requirement for many of our internal and customer use cases.
Why Envoy?
Switching to Envoy brings significant benefits that go well beyond overcoming the limitations of our previous Nginx architecture. Envoy’s modern design and advanced features make it the ideal choice for our Instana Self-Hosted offerings. Here’s how Envoy’s performance and flexibility translate into real-world improvements:
- Dynamic Configuration and Scalability:
Unlike the static configuration pattern used with Nginx (limited by Kubernetes ConfigMap sizes), Envoy supports dynamic configuration through gRPC and REST APIs (the xDS protocol). This means we can update listeners and routing rules on the fly, letting us scale past our previous tenant creation limit in SaaS; a short sketch of such an update follows this list.
- High Performance and Low Latency:
Envoy is written in C++ and designed for high efficiency. Its highly optimized, event-driven architecture allows it to handle thousands of connections concurrently while maintaining low latency. This makes it perfect for high-traffic production environments.
- Enhanced Loadbalancer Architecture:
Envoy’s support for multiple listeners allows us to differentiate between various types of ingress traffic by using different ports instead of multiple subdomains. This flexibility simplifies our network architecture, making it easier to fine-tune exposure based on customer requirements. For SaaS customers, while we still support subdomains to avoid agent reconfiguration, we can eliminate the extra acceptor loadbalancer—streamlining traffic management at the cluster level.
- Simplified TLS Handling:
All TLS termination is now managed directly by Envoy. By offloading TLS responsibilities from the acceptor, we reduce CPU overhead and free up resources to focus on processing request bodies. This results in a more efficient and secure traffic handling process.
- Observability and Operational Excellence:
With built-in Prometheus metrics and distributed tracing support, Envoy offers robust observability. We can gain deep insights into traffic patterns and performance bottlenecks, which are critical for proactive monitoring and rapid troubleshooting. Additionally, the Instana Agent can easily tap into these metrics, especially for self-monitoring scenarios.
- Modern Proxy Features and Community Support:
Envoy is built to meet modern network requirements, offering fine-grained load balancing, advanced routing capabilities (through listeners, route configurations, virtual hosts, and clusters), and support for multiple protocols. Its design aligns with best practices for microservices and cloud-native architectures, which has led to widespread adoption in both open-source and enterprise settings.
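To make the dynamic-configuration point above concrete, here is a minimal sketch of what an xDS update looks like from the control-plane side. It uses the open-source go-control-plane library (recent v3 APIs) rather than our production code, and the node ID "gateway-v2" and the route configuration name are purely illustrative. The key idea: publishing a new snapshot version is enough to reconfigure a connected Envoy, with no reload and no ConfigMap involved.

```go
package main

import (
    "context"
    "log"

    routev3 "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
    "github.com/envoyproxy/go-control-plane/pkg/cache/types"
    cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
    "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
)

func main() {
    // The snapshot cache holds the desired configuration per Envoy node.
    snapshotCache := cachev3.NewSnapshotCache(true, cachev3.IDHash{}, nil)

    // A trivial route table; in practice this is generated from Core/Unit resources.
    routes := &routev3.RouteConfiguration{Name: "ingress_routes"}

    // Publishing a new snapshot version reconfigures Envoy on the fly:
    // connected proxies pick it up over the xDS gRPC stream.
    snap, err := cachev3.NewSnapshot("v2", map[resource.Type][]types.Resource{
        resource.RouteType: {routes},
    })
    if err != nil {
        log.Fatalf("building snapshot: %v", err)
    }
    if err := snapshotCache.SetSnapshot(context.Background(), "gateway-v2", snap); err != nil {
        log.Fatalf("publishing snapshot: %v", err)
    }
}
```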
Introducing the Gateway Controller
A major part of our new solution is the gateway-controller, a lightweight component deployed by the Instana Enterprise Operator when a Core resource is created. It acts as the control plane for our Envoy-based gateway-v2 reverse proxy, delivering real-time route configurations and dynamic upstream discovery.
Here’s how it fits into our new architecture:
Enhancing our Gateway Architecture
The new gateway solution with Envoy brings several improvements:
- Reduced Use of Subdomains:
Instead of relying on multiple subdomains, we now differentiate traffic types by port. Envoy supports multiple listeners on different ports, each configured with its own route configuration and virtual hosts.
- Streamlined Traffic Routing:
For our SaaS customers, although subdomains are still in use (to avoid reconfiguring the DNS entries for Instana Agent traffic), we’ve eliminated the need for a separate acceptor loadbalancer. Now, a single loadbalancer can handle all requests at the region level.
- Improved Security and Isolation:
With Envoy, we can selectively expose specific ports—for example, exposing only essential Ingress traffic while keeping others internal—providing flexible and fine-grained control over traffic flow.
- Unlimited Configuration Capacity:
Moving away from the ConfigMap-based configuration removes the upper limit on the number of TUs deployable in any given region, paving the way for better scalability.
- Future-Proofing:
Recent Instana releases have introduced support for tenant UI endpoints that listen on subpaths rather than subdomains. Our new gateway-v2 architecture supports this transition with a single toggle in our Core custom resource, letting customers further reduce their dependency on subdomains for different tenant UIs; the routing sketch after this list shows what subpath matching looks like in Envoy terms.
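As an illustration of the subpath-based routing mentioned above, the following sketch shows what such a route looks like when expressed with the go-control-plane v3 protos. The domain, path prefix, and cluster name are illustrative placeholders, not actual Instana values.

```go
package main

import (
    "fmt"

    routev3 "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
)

// tenantRoutes sketches subpath-based tenant routing in Envoy terms: one
// virtual host per domain, with path prefixes selecting the tenant unit's
// upstream cluster. All names below are illustrative.
func tenantRoutes() *routev3.RouteConfiguration {
    return &routev3.RouteConfiguration{
        Name: "ui_routes",
        VirtualHosts: []*routev3.VirtualHost{{
            Name:    "tenant-ui",
            Domains: []string{"tenant.instana.example.com"},
            Routes: []*routev3.Route{{
                // Subpath matching: /unit0/... is sent to the unit0 UI backend,
                // so a dedicated subdomain per tenant unit is no longer required.
                Match: &routev3.RouteMatch{
                    PathSpecifier: &routev3.RouteMatch_Prefix{Prefix: "/unit0/"},
                },
                Action: &routev3.Route_Route{
                    Route: &routev3.RouteAction{
                        ClusterSpecifier: &routev3.RouteAction_Cluster{Cluster: "ui-client-unit0"},
                    },
                },
            }},
        }},
    }
}

func main() {
    fmt.Println(tenantRoutes().GetName())
}
```

Because the match is a path prefix on a shared domain, adding a tenant unit means adding a route entry rather than provisioning another subdomain.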
Before vs. After
Old Gateway architecture
New Gateway-v2 architecture
Code Implementation Overview
Here’s a brief look at how the new system is implemented:
- Gateway Controller as a Kubernetes Controller:
Written in Golang, the gateway-controller runs in the Core namespace and responds to Envoy DiscoveryRequests with DiscoveryResponses.
- Lifecycle and Operation:
  - When a Core custom resource is created, the Instana Enterprise Operator deploys both the gateway-v2 (Envoy) and the gateway-controller Pods.
  - The gateway-v2 Pod is bootstrapped with an initial configuration from a ConfigMap.
  - The gateway-controller initializes an empty snapshot cache and starts an xDS server handling multiple services (EDS, LDS, RDS, CDS); see the first listing after this list.
  - It then begins a reconciliation loop: fetching and parsing the Core and Unit resources, building Envoy configuration snapshots, and pushing them to Envoy.
  - Every time the Core or Unit is updated, the controller rebuilds and re-applies the configuration snapshot.
- Envoy Configuration Snapshot:
The snapshot includes the following resource types (the second listing after this list shows how they are assembled):
  - Clusters: Represent upstream services (e.g., butler, UI client, acceptors).
  - Listeners: Define which ports Envoy listens on.
  - RouteConfigurations: Map listeners to routing rules.
  - VirtualHosts: Use domain and subpath matching to route requests to the appropriate upstream clusters.
- Integration with the Instana Enterprise Operator:
The configuration templates for the gateway-v2 and gateway-controller Deployments have been moved out of the backend and are now embedded in the operator codebase, making deployment and updates more streamlined.
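To ground the lifecycle described above, here is a simplified sketch of the controller's startup and reconciliation path, again built with the open-source go-control-plane library (recent v3 APIs). The xDS port, the node ID "gateway-v2", the polling interval, and the buildSnapshot helper are illustrative stand-ins for this sketch only; the actual gateway-controller reacts to Core and Unit changes rather than polling on a timer.

```go
package main

import (
    "context"
    "fmt"
    "log"
    "net"
    "time"

    clusterservice "github.com/envoyproxy/go-control-plane/envoy/service/cluster/v3"
    discoverygrpc "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
    endpointservice "github.com/envoyproxy/go-control-plane/envoy/service/endpoint/v3"
    listenerservice "github.com/envoyproxy/go-control-plane/envoy/service/listener/v3"
    routeservice "github.com/envoyproxy/go-control-plane/envoy/service/route/v3"
    cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
    serverv3 "github.com/envoyproxy/go-control-plane/pkg/server/v3"
    "google.golang.org/grpc"
)

func main() {
    ctx := context.Background()

    // 1. Empty snapshot cache: the source of truth for Envoy's configuration.
    snapshotCache := cachev3.NewSnapshotCache(true, cachev3.IDHash{}, nil)

    // 2. xDS server answering DiscoveryRequests (ADS, CDS, LDS, RDS, EDS)
    //    from the gateway-v2 (Envoy) Pod. No callbacks needed for this sketch.
    xds := serverv3.NewServer(ctx, snapshotCache, nil)
    grpcServer := grpc.NewServer()
    discoverygrpc.RegisterAggregatedDiscoveryServiceServer(grpcServer, xds)
    clusterservice.RegisterClusterDiscoveryServiceServer(grpcServer, xds)
    listenerservice.RegisterListenerDiscoveryServiceServer(grpcServer, xds)
    routeservice.RegisterRouteDiscoveryServiceServer(grpcServer, xds)
    endpointservice.RegisterEndpointDiscoveryServiceServer(grpcServer, xds)

    lis, err := net.Listen("tcp", ":18000") // illustrative xDS port
    if err != nil {
        log.Fatal(err)
    }
    go func() { log.Fatal(grpcServer.Serve(lis)) }()

    // 3. Reconciliation loop: rebuild the snapshot from the Core and Unit
    //    resources and publish it whenever something changes.
    version := 0
    for {
        version++
        snap, err := buildSnapshot(fmt.Sprintf("v%d", version)) // sketched in the next listing
        if err != nil {
            log.Printf("building snapshot: %v", err)
        } else if err := snapshotCache.SetSnapshot(ctx, "gateway-v2", snap); err != nil {
            log.Printf("publishing snapshot: %v", err)
        }
        time.Sleep(30 * time.Second)
    }
}
```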
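The buildSnapshot helper is where Core and Unit resources are translated into Envoy resources. The companion listing below shows the shape of that translation for a single, hard-coded upstream cluster; the service DNS name and port are hypothetical, and a real snapshot would also carry the listeners, route configurations, and virtual hosts described earlier.

```go
package main

import (
    "time"

    clusterv3 "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
    corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
    endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
    "github.com/envoyproxy/go-control-plane/pkg/cache/types"
    cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
    "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
    "google.golang.org/protobuf/types/known/durationpb"
)

// buildSnapshot assembles one configuration snapshot. A real implementation
// derives clusters, listeners, route configurations, and virtual hosts from
// the parsed Core and Unit resources; here we hard-code one illustrative
// upstream cluster for the butler service.
func buildSnapshot(version string) (*cachev3.Snapshot, error) {
    butler := &clusterv3.Cluster{
        Name:                 "butler",
        ConnectTimeout:       durationpb.New(5 * time.Second),
        ClusterDiscoveryType: &clusterv3.Cluster_Type{Type: clusterv3.Cluster_STRICT_DNS},
        LbPolicy:             clusterv3.Cluster_ROUND_ROBIN,
        LoadAssignment: &endpointv3.ClusterLoadAssignment{
            ClusterName: "butler",
            Endpoints: []*endpointv3.LocalityLbEndpoints{{
                LbEndpoints: []*endpointv3.LbEndpoint{{
                    HostIdentifier: &endpointv3.LbEndpoint_Endpoint{
                        Endpoint: &endpointv3.Endpoint{
                            Address: &corev3.Address{
                                Address: &corev3.Address_SocketAddress{
                                    SocketAddress: &corev3.SocketAddress{
                                        Address: "butler.core.svc", // hypothetical service DNS name
                                        PortSpecifier: &corev3.SocketAddress_PortValue{
                                            PortValue: 8600, // hypothetical port
                                        },
                                    },
                                },
                            },
                        },
                    },
                }},
            }},
        },
    }

    // Listeners, route configurations (with their virtual hosts), and endpoints
    // would be added to this map under their respective resource types.
    return cachev3.NewSnapshot(version, map[resource.Type][]types.Resource{
        resource.ClusterType: {butler},
    })
}
```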
Conclusion
Switching from Nginx to Envoy has allowed us to overcome several key limitations:
- We no longer face configuration size restrictions.
- TLS termination and traffic routing have become more efficient.
- Our loadbalancer architecture is simplified and more flexible.
- We’re now positioned to support features like tenant UI endpoints on subpaths instead of subdomains.
This transition not only improves performance and scalability but also sets the stage for future innovations within our Instana Self-Hosted offerings.
By embracing Envoy, we’re ensuring that our infrastructure remains robust, efficient, and ready to meet evolving customer needs.
#Infrastructure
#Kubernetes
#Migration
#Self-Hosted
#Announcement