When I served as the IT Architect for a major retailer, I wasn't just responsible for designing systems and processes. I was accountable for the trust people placed in those systems. Customers don't care about your backend. They care that it works. Every second of slowness, every outage, and every unexplained issue chipped away at confidence, not just in the technology but in the teams behind it.
Building effective monitoring and observability wasn’t about flashy tools or endless dashboards. It was about creating a practice grounded in clarity, ownership, and accountability across the organization.
None of this would have been possible without a clear mandate from our executive leadership. They made it known that this was the process we were going to follow, and individual teams would need to work with us. Even with that support, we had to approach it politically. We spent a lot of time teaching, coaching, and collaborating so that people at every level of the organization could grow. That growth only happened because we earned trust. Teams needed to believe that what we were alerting on mattered. It was a slow and sometimes painful process, but it laid the foundation for our success.
1. The Non-Negotiables: Availability, Performance, and Fault Monitoring
Everything we monitored had to answer three fundamental questions:
- Is the system up? (Availability)
- Is it meeting performance expectations? (Performance)
- Is it functioning without errors? (Fault Monitoring)
This sounds simple, but many teams miss the mark. They build intricate dashboards that don’t actually tell them if something is broken or if customers are impacted. We enforced these three pillars across every environment. If a metric or alert couldn’t clearly tie back to one of them, it didn’t belong in the system.
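To make that filter concrete, here is a minimal sketch of the test we applied to every candidate metric. The check names and descriptions are made-up illustrations, not our actual configuration, but the rule is the one above: if it can't be tagged with one of the three pillars, it doesn't get in.

```python
# A minimal sketch of the "three questions" filter applied to every metric.
# The pillar names come from the practice described above; the check
# definitions below are hypothetical examples.

from dataclasses import dataclass

PILLARS = {"availability", "performance", "fault"}

@dataclass
class Check:
    name: str
    pillar: str          # must be one of PILLARS
    description: str

def rejected_checks(checks: list[Check]) -> list[str]:
    """Return the names of checks that don't tie back to a pillar.
    Anything in this list has no business being in the alerting system."""
    return [c.name for c in checks if c.pillar not in PILLARS]

checks = [
    Check("checkout_http_200", "availability", "Synthetic GET of /checkout returns 200"),
    Check("checkout_p95_latency", "performance", "95th percentile checkout latency under 2s"),
    Check("checkout_error_rate", "fault", "5xx rate below 0.5% over 5 minutes"),
    Check("cpu_temperature", "curiosity", "Interesting, but answers none of the three questions"),
]

print(rejected_checks(checks))   # -> ['cpu_temperature']
```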
2. Correlation Across Tools: A Single Source of Truth
We used a variety of tools for application monitoring, infrastructure metrics, logging, synthetic checks, and more. Each had value, but on their own, they only offered a slice of the picture.
The turning point came when we began tying those signals together. A spike in application latency could be traced to a database bottleneck, which might stem from a misconfigured network route or an over-utilized storage pool. With correlated observability, those links became visible. Our dashboards showed the entire path, and our alerts included context that shortened the time from detection to resolution.
Correlation transformed isolated metrics into operational awareness. It helped teams move from guesswork to action.
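A simplified sketch of the idea is below. The tool names, dependency map, and timestamps are illustrative assumptions rather than our actual topology, but the mechanics are the same: normalize signals from different tools onto shared keys, then walk the dependency chain around the symptom so the likely cause surfaces in order.

```python
# A simplified sketch of signal correlation: events from separate tools are
# normalized onto shared keys (service, time window) so a latency spike can be
# walked back to the infrastructure underneath it. All names and data here are
# illustrative.

from datetime import datetime, timedelta

# Normalized events from separate tools (APM, DB monitoring, infra metrics).
events = [
    {"source": "apm",   "service": "checkout-api",   "signal": "p95 latency spike", "ts": datetime(2024, 5, 1, 14, 2)},
    {"source": "dbmon", "service": "orders-db",      "signal": "lock contention",   "ts": datetime(2024, 5, 1, 14, 1)},
    {"source": "infra", "service": "storage-pool-3", "signal": "IOPS saturation",   "ts": datetime(2024, 5, 1, 13, 59)},
]

# Hypothetical dependency map: which service depends on which.
depends_on = {
    "checkout-api": ["orders-db"],
    "orders-db": ["storage-pool-3"],
}

def correlate(symptom_service: str, window: timedelta = timedelta(minutes=10)) -> list[dict]:
    """Collect events on the symptom service and everything beneath it within
    the time window, ordered oldest-first to show the likely causal chain."""
    in_scope = {symptom_service}
    stack = [symptom_service]
    while stack:
        for dep in depends_on.get(stack.pop(), []):
            if dep not in in_scope:
                in_scope.add(dep)
                stack.append(dep)
    symptom_times = [e["ts"] for e in events if e["service"] == symptom_service]
    if not symptom_times:
        return []
    anchor = max(symptom_times)
    related = [e for e in events
               if e["service"] in in_scope and anchor - window <= e["ts"] <= anchor]
    return sorted(related, key=lambda e: e["ts"])

for e in correlate("checkout-api"):
    print(f'{e["ts"]:%H:%M}  {e["service"]:<15} {e["signal"]}  ({e["source"]})')
```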
3. Alerts That Actually Mean Something
Too often, alerts are created without a plan. They're set up because someone thought it might be helpful, or worse, because a tool suggested it by default. The result is predictable: noise, fatigue, and apathy.
We enforced a simple rule. If an alert wasn’t actionable, assigned, and documented, it didn’t go into production. No exceptions. We were ruthless about this because we had to be. Every alert had to matter. Every alert had to drive a response. Anything else created clutter and wasted precious time during real incidents.
This shift reduced alert volume and improved team confidence. People paid attention because they knew what they saw was relevant.
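Here is a minimal sketch of that gate. The field names are illustrative rather than any specific tool's schema, but the rule is the one described above: no owner, no documented action, no linked runbook, no production.

```python
# A minimal sketch of the pre-production gate for alerts: anything that is not
# actionable, assigned, and documented is rejected. The AlertDefinition shape
# and field names are hypothetical.

from dataclasses import dataclass
from typing import Optional

@dataclass
class AlertDefinition:
    name: str
    owner_team: Optional[str]      # who gets paged
    action: Optional[str]          # what the responder is expected to do
    runbook_url: Optional[str]     # linked knowledge base article

def reasons_to_reject(alert: AlertDefinition) -> list[str]:
    reasons = []
    if not alert.owner_team:
        reasons.append("not assigned to a team")
    if not alert.action:
        reasons.append("no documented action")
    if not alert.runbook_url:
        reasons.append("no linked knowledge base article")
    return reasons

candidate = AlertDefinition("disk_temp_warning", owner_team=None, action=None, runbook_url=None)
problems = reasons_to_reject(candidate)
if problems:
    print(f"Rejected '{candidate.name}': " + "; ".join(problems))
```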
4. Automatic Incident Generation via the Service Management Platform
We connected our observability stack directly to our service management platform. When a high-severity issue was detected, an incident was created automatically. It included key details like logs, traces, impacted services, and related alerts.
This wasn't about replacing people. It was about accelerating response. By the time someone was paged, they already had the context they needed to take action.
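The sketch below shows the shape of that integration. The endpoint, payload fields, and token handling are placeholders, since every service management platform has its own API, but the principle holds: the incident is created already carrying the context a responder needs.

```python
# A simplified sketch of the observability-to-ITSM hook: when a high-severity
# alert fires, an incident is opened automatically and pre-populated with
# context. The endpoint, payload shape, and token handling are hypothetical
# placeholders, not a specific platform's API.

import json
import os
import urllib.request

ITSM_URL = "https://itsm.example.com/api/incidents"   # placeholder endpoint

def create_incident(alert: dict) -> None:
    """Open an incident carrying the context a responder needs on first page."""
    if alert["severity"] not in ("critical", "high"):
        return   # lower severities stay in the alerting tool

    payload = {
        "short_description": f'{alert["service"]}: {alert["title"]}',
        "severity": alert["severity"],
        "impacted_services": alert.get("impacted_services", []),
        "related_alerts": alert.get("related_alerts", []),
        "links": {
            "logs": alert.get("log_query_url"),
            "traces": alert.get("trace_url"),
            "runbook": alert.get("runbook_url"),
        },
    }
    req = urllib.request.Request(
        ITSM_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f'Bearer {os.environ.get("ITSM_TOKEN", "")}',
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print("Incident created:", resp.status)

# Example (would be triggered by the observability pipeline, not run by hand):
# create_incident({"service": "checkout-api", "title": "p95 latency spike",
#                  "severity": "critical", "impacted_services": ["checkout"]})
```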
I explain this approach further in my LinkedIn article, "The Marriage of ITSM and Observability." It's one of the most important integrations we made because it removed guesswork from the escalation process.
5. Smart Routing and Escalation
Not every issue is equal, and not every team should be involved in every incident. We built logic to route alerts and incidents based on domain and severity.
If a backend database was misbehaving, it went to the DBA team. If a third-party integration was slowing down checkout, the ecommerce engineers were paged. High-impact incidents triggered coordinated response plans that escalated immediately.
This approach reduced confusion and ensured that the right people were engaged at the right time. It also improved accountability by making ownership crystal clear.
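A simplified sketch of the routing logic follows, with illustrative team names and severity thresholds rather than our real ones.

```python
# A sketch of the routing rules: alerts are mapped to an owning team by domain,
# and severity decides whether a coordinated escalation kicks in. Team names
# and thresholds are illustrative examples.

ROUTES = {
    "database": "dba-oncall",
    "ecommerce": "ecommerce-engineers",
    "network": "network-operations",
    "payments": "payments-oncall",
}

def route(alert: dict) -> dict:
    team = ROUTES.get(alert["domain"], "operations-bridge")   # default catch-all
    decision = {"assign_to": team, "page": alert["severity"] in ("critical", "high")}
    # High-impact incidents pull in a coordinated response immediately.
    if alert["severity"] == "critical":
        decision["escalation"] = "major-incident-bridge"
    return decision

print(route({"domain": "database", "severity": "high", "title": "orders-db lock contention"}))
# -> {'assign_to': 'dba-oncall', 'page': True}
```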
6. Documentation as a Requirement
We made documentation part of the alerting process, not an afterthought.
No alert was allowed unless it came with a linked knowledge base article. If someone wanted to add an alert, they had to explain what the alert meant, why it mattered, and what action an operations engineer should take when it fired.
This requirement did two things. First, it stopped unnecessary alerts from ever getting into the system. Second, it sped up response time because on-call engineers had guidance ready the moment they were paged. It empowered junior staff and made senior engineers more efficient. Documentation turned our alerts into solutions instead of just notifications.
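For illustration, here is a minimal sketch of the structure every linked article had to cover, with fabricated example content. The three sections mirror the requirement above: what the alert means, why it matters, and what the on-call engineer should do.

```python
# A sketch of the knowledge base entry required for every alert. The fields and
# example text are illustrative, not our actual runbook content.

from dataclasses import dataclass, field

@dataclass
class RunbookEntry:
    alert_name: str
    meaning: str                                            # what the alert is telling you
    customer_impact: str                                    # why it matters
    first_actions: list[str] = field(default_factory=list)  # what to do when it fires

entry = RunbookEntry(
    alert_name="checkout_error_rate",
    meaning="5xx responses on the checkout API exceeded 0.5% for 5 minutes.",
    customer_impact="Customers may be unable to complete purchases.",
    first_actions=[
        "Check the auto-generated incident for the implicated downstream service.",
        "Review recent deployments to checkout-api.",
        "Escalate to the ecommerce engineers if the error rate is still climbing after 15 minutes.",
    ],
)

def is_complete(e: RunbookEntry) -> bool:
    """The alert is only allowed into production once every section is filled in."""
    return all([e.meaning, e.customer_impact, e.first_actions])

print(is_complete(entry))   # -> True
```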
7. Training Across the Organization
We didn’t isolate observability as an operations function. We involved developers, QA, support teams, and even product managers.
Everyone had a role to play. Developers were taught how to add meaningful telemetry to their code. Product managers learned how to trace performance issues back to specific services. Analysts could dig into dashboards and explore customer-impact metrics.
We ran training sessions, office hours, and recorded deep dives. We invited questions and created space for people to explore the tools. That investment paid off. It built a culture where observability wasn’t something you asked someone else about—it was something you could engage with directly.
8. Tuning Is a Discipline
Nothing in observability is static. Our systems evolved, and our alerts had to evolve too.
We scheduled quarterly reviews of all alerts, thresholds, dashboards, and runbooks. We analyzed alert fatigue, suppression logs, missed escalations, and resolution timelines. If an alert had stopped being useful, it was updated or removed. If new services were introduced, new baselines were created.
Tuning wasn’t reactive. It was proactive. It kept our system healthy and our people confident.
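Here is a sketch of the kind of question we asked during those reviews: which alerts fired frequently but rarely led to action? The history records below are fabricated for illustration, but the output is the sort of list that drove tuning decisions.

```python
# A sketch of a quarterly tuning review: flag alerts that fire often but are
# almost never acted on, as candidates for tuning or removal. The alert history
# below is fabricated example data.

from collections import Counter

# Each record: (alert_name, was_actioned) over the review period.
history = [
    ("checkout_error_rate", True), ("checkout_error_rate", True),
    ("disk_temp_warning", False), ("disk_temp_warning", False),
    ("disk_temp_warning", False), ("disk_temp_warning", False),
    ("orders_db_lock_contention", True), ("orders_db_lock_contention", False),
]

fired = Counter(name for name, _ in history)
actioned = Counter(name for name, acted in history if acted)

print(f"{'alert':<28}{'fired':>6}{'actioned':>10}{'action rate':>12}")
for name, count in fired.most_common():
    rate = actioned[name] / count
    flag = "  <- tune or remove" if rate < 0.25 else ""
    print(f"{name:<28}{count:>6}{actioned[name]:>10}{rate:>12.0%}{flag}")
```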
Final Thoughts
This entire approach was made possible because leadership mandated it. But even with that backing, we had to do the hard work of building trust across teams. We had to listen, educate, adapt, and prove that our way worked. Over time, teams started to believe in the process because they saw the results.
We didn’t just build a monitoring system. We built a culture where teams trusted the data, took ownership of their services, and felt empowered to solve problems. That’s what made it work.
The tools helped. But the process, the collaboration, and the shared commitment to doing it right—that was the real secret sauce.
#General
#Ideas