In case you haven’t heard, it’s very difficult for organizations to monitor their applications, many of which are now large, distributed, and running on any number of cloud platforms. This challenge is familiar to operators and admins of IBM Cloud Object Storage (COS) deployments. We’ve heard countless calls from our on-premises customers seeking help. They know we have many metrics and numerous logs that can provide deep insight. They are also aware that the observability market is worth billions of dollars, with many vendors offering solutions to handle all of that data.
After consideration, it became clear we needed to integrate the OpenTelemetry Collector into COS nodes. The OpenTelemetry Collector is a robust, scalable, and, most importantly, vendor-agnostic tool that can receive, process, and export our telemetry.
In this blog, I will share some of our experiences in the hope that they provide some insight into how the OpenTelemetry Collector, or other OpenTelemetry solutions, can be adopted into your product.
A Brief Primer
IBM COS can be deployed in various configurations. Primarily, a COS deployment consists of storage nodes, access nodes, and one or two manager nodes. These nodes are physical or virtual appliances. Manager and access nodes can also be deployed as Docker containers.
Like most large distributed systems, we produce metrics--lots of metrics. We used those metrics internally, but we found it difficult to deliver a solution for on-premises consumption that we could maintain at a reasonable development cost and that could still plug into any conceivable deployment of COS. We tried several options, but we would get stuck either trying to do too much and handle all of the complex aggregation and visualization or, if we didn’t do those things, trying to guess which tools our customers might use.
It wasn’t until we really focused on the broader observability space, and the OpenTelemetry Collector in particular, that we had our eureka moment. We could stop worrying about those use cases and focus on what we do best: ensuring our storage software emits metrics in a way that the collector can understand, and letting the collector do the rest (i.e., process/transform metrics and export them).
OpenTelemetry Collector Integration Strategy
At this point, we knew we wanted to use the OpenTelemetry Collector to handle the heavy lifting. We just had to figure out what that meant for us.
We decided to use the collector as an agent that can, optionally, run on all of the COS nodes or on a subset of them. This approach ensured that only the collector would export telemetry data off of a COS node, streamlining the data flow and reducing potential points of failure. The collector can also be used as a “gateway”, but we opted against that approach because it starts to bake in assumptions about the end-to-end pipeline that we were intentionally trying to avoid.
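To make the agent approach concrete, here is a minimal sketch of what a per-node collector configuration could look like. The scrape target, export endpoint, and component choices are illustrative assumptions, not our actual COS configuration; it simply shows a collector scraping a local Prometheus-style metrics endpoint and forwarding over OTLP.

```yaml
# Minimal per-node agent sketch (illustrative only; endpoints are assumptions).
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: cos-node                      # hypothetical local endpoint
          scrape_interval: 60s
          static_configs:
            - targets: ["localhost:9090"]

processors:
  batch:                                          # batch metrics before export

exporters:
  otlphttp:
    endpoint: https://observability.example.com:4318   # customer's backend

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlphttp]
```

Everything stays on the node except the final export step, which matches the “only the collector exports telemetry off the node” rule described above.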
Then, we created our own OpenTelemetry Collector distribution, because we knew we needed to be as lightweight as possible: we would be running this on nodes whose primary function is storage, not observability. This custom distribution has a minimal footprint on our COS nodes while maintaining the essential features required for effective COS metric collection and processing. The interesting thing about the collector is that it can handle all sorts of telemetry signals besides metrics. So, should COS ever expand beyond metrics, we should still be able to keep the footprint as small as possible.
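If you decide to build a similarly trimmed-down distribution, the OpenTelemetry Collector Builder (ocb) takes a small manifest that lists only the components you want compiled in. The component list and versions below are examples, not the exact contents of our distribution:

```yaml
# Illustrative ocb manifest for a minimal, metrics-only distribution.
# Components and versions are examples, not our exact build.
dist:
  name: cos-otelcol
  description: Minimal collector distribution for COS nodes
  output_path: ./cos-otelcol

receivers:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver v0.98.0

processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.98.0

exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlphttpexporter v0.98.0
```

Running the builder against a manifest like this produces a binary that contains only those components, which keeps both the binary size and the attack surface down.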
We also updated our management application, which runs on the manager node(s), to simplify the configuration process for our customers. By offering a reduced set of configuration options compared to the full capabilities of the components included in our distribution of the collector, we made it easier for users to set up and manage metric export without being overwhelmed by additional complexity.
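To give a sense of what that reduction can look like, here is a purely hypothetical sketch of the kind of small, user-facing option set a management layer might expose and then expand into a full collector configuration behind the scenes. This is not the actual COS manager schema:

```yaml
# Hypothetical user-facing options (not the real COS manager schema).
# The management application would expand these few fields into a
# complete collector configuration like the agent example above.
metrics_export:
  enabled: true
  endpoint: https://observability.example.com:4318   # where to send metrics
  protocol: otlphttp                                  # assumed export protocol
  collection_interval: 60s
  tls:
    ca_file: /etc/cos/certs/ca.pem
```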
It ended up looking like this: each COS node optionally runs our custom collector distribution as an agent, and the management application on the manager node(s) takes care of its configuration.
OpenTelemetry Collector Integration Challenges
No plan is perfect, and ours was no exception.
The documentation for certain elements of the OpenTelemetry Collector is still a work in progress. The open-source nature of the project allowed us to examine the code directly to understand how specific features worked, but this can be time-consuming. It can be challenging for teams to fully leverage the capabilities of the collector without diving into the source code, so allocate time accordingly.
Additionally, the stability of the collector’s components varies significantly. Some of the more interesting and potentially useful components, such as the filter processor, are still in the alpha stage. This early stage of development means that these components may not be fully stable or reliable, which can pose risks when integrating them into enterprise-grade solutions. However, some of that risk can be mitigated if you are willing to take a “do-it-yourself” approach.
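As an example of the kind of alpha-stage component we mean, here is a hedged sketch of a filter processor configuration that drops a hypothetical noisy metric before export; the metric name is made up, and the syntax may evolve while the component matures:

```yaml
# Sketch: dropping a hypothetical noisy metric with the filter processor.
# The metric name is invented, and alpha-stage syntax may change.
processors:
  filter/drop-noisy:
    metrics:
      exclude:
        match_type: strict
        metric_names:
          - cos.debug.cache_probe
```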
Speaking of components, the collector is designed to be very pluggable and extensible, and thus has a lot of components and features. It may seem overwhelming at first, even to experienced engineers. To reduce the cognitive load on our operators and admins, we found it critical to abstract away many of the inner workings of the collector. This approach ensures that operators don’t have to learn new concepts or configurations, making the whole configuration process smoother and more manageable.
I mentioned earlier that we already had thousands of metrics for internal use. These metrics were consumed by various tools/products across the organization. While these tools were high quality when introduced, they had become extremely outdated and deeply ingrained in various workflows. Instead of attempting to update everything, which would have led to scope creep, we chose to remove only the parts whose loss would have minimal impact on others. This approach allowed us to modernize our system without disrupting existing workflows, though it was still extremely time-consuming.
The most difficult part of this effort was communication. It is natural for people to become defensive if they feel their job may be disrupted. People like routine. So, if you find yourself in a situation where some disruption is necessary, make sure your lines of communication are open and be receptive to feedback. Be prepared to reiterate the plan repeatedly; reassurance is key.
Additionally, many of these metrics were hastily thought out and lacked meaningful names. We had to take the time to update these metrics to align with OpenTelemetry expectations, ensuring that they were well-defined and useful for monitoring purposes.
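We did that renaming at the source, in the storage software itself. If changing the emitting code is not immediately an option, the collector’s metricstransform processor can rename metrics in flight instead; here is a sketch with hypothetical metric names:

```yaml
# Sketch: renaming a legacy metric to a clearer name in the collector.
# Both metric names here are hypothetical.
processors:
  metricstransform:
    transforms:
      - include: vaultOpsPerSec               # hypothetical legacy name
        match_type: strict
        action: update
        new_name: cos.vault.operations.rate   # hypothetical updated name
```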
Lessons Learned
- Many teams find it difficult to transition to modern observability tools. This is especially true for teams with deeply ingrained traditional monitoring solutions. Make sure to allocate time and provide adequate training to bring everyone up to speed.
- Make sure you follow the recommendation of the OpenTelemetry team and create your own distribution. The OpenTelemetry Collector is extremely powerful. But, with great power comes great responsibility. In this case, that means only use what you need. Your engineers, and more importantly, your customers, will thank you.
- Do not be afraid to investigate the OpenTelemetry Collector code yourself. In fact, be prepared for it. You will end up saving time in the long run.
Wrapping up
Observing applications in today's complex, distributed environments is a significant challenge, especially for IBM Cloud Object Storage (COS) deployments. Our journey to improve in this area led us to integrate the OpenTelemetry Collector into COS. This decision was driven by the need to streamline metric data flows off of COS services running on COS nodes.
It was not easy and took a considerable amount of time and effort, as it displaced some existing tools/workflows. However, with careful planning and effective communication, we were able to overcome those obstacles.
Whether you're just starting out or have been working with OpenTelemetry for a while, we encourage you to share your own experiences or ask questions.