“Is it the app or the network?” IBM full stack observability and event management

By Basak Vogt posted Wed July 31, 2024 05:11 PM

  

Everybody in IT is talking about observability: why it is so crucial, what you can achieve with it and, more importantly, how it differs from simple monitoring. Observability, when done right, is a foundation of service performance management at every level, from applications to servers to the network.

In this blog post, we explain why and how network performance management and application performance management systems can act together to build an enterprise observability solution. But there is more to it than bringing two separate tools together to evaluate the performance of applications, systems, and networks. Enterprise observability, also known as full-stack observability, is an important data provider for an event and incident management solution based on AI tools. AIOps, which as the name implies applies AI algorithms to IT operations, takes network and application observability to the next level:

- to group correlated events from independent senders,
- to detect performance (metric) anomalies,
- to show dependencies between apps, systems, and networks in a topology,
- to identify an extended blast radius of apps and networks,
- to highlight the root cause, either graphically or textually,
- to point to easy-to-understand resolution actions.

In today's highly connected hybrid multicloud data centers, successful businesses rely on network infrastructure as the technological highway for essential applications. To ensure an optimal user experience, applications and networks must provide consistent service, reliable access, and continuous performance. Together, IBM SevOne for network observability and IBM Instana for application observability offer detailed application-centric insights that facilitate quicker identification and resolution of issues. Both products can be integrated with IBM Cloud Pak for AIOps via out-of-the-box connection technologies (webhook) and generic REST APIs.

SevOne can be configured to send issue and incident information as events to Cloud Pak for AIOps, and AIOps can be configured to request topology information from SevOne via a topology observer. Instana can be configured to send issue and incident information as events, and hundreds of performance KPIs as metric data, to AIOps. AIOps can likewise be configured to request topology information from Instana via a topology observer.

SevOne and Instana events can be forwarded as JSON key-value pairs to an AIOps webhook. The webhook receives the variables and inserts them into the alert text, so that the SRE can immediately identify a problematic device or failing object. In many cases, these variables already contain the root cause of an issue, such as a failing adapter or a defective port, and AIOps can use them to highlight the probable cause of an outage. The goal of AIOps here is to combine and further analyze all available information about an issue, which can come from not just one but various sources, called senders.
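As a rough illustration of what such a JSON event might look like, the following sketch assembles a payload in Python. The field names (`sender`, `resource`, `summary`) and the device name `app-server-vm` are assumptions made for this example; they are not the actual schema expected by the AIOps webhook, which is defined by the mapping you configure for the connection.

```python
import json


def build_event(sender, device_name, object_name, message):
    """Assemble a JSON-serialisable event.

    All field names are illustrative assumptions, not the AIOps webhook schema.
    """
    return {
        "sender": sender,
        "resource": {"device": device_name, "object": object_name},
        # The variables are repeated in the text so an SRE sees them at a glance.
        "summary": f"{message} (device={device_name}, object={object_name})",
    }


# "app-server-vm" is a hypothetical device name; enp7s0 is the interface used in the lab.
event = build_event("IBM SevOne", "app-server-vm", "enp7s0", "Network interface down")
payload = json.dumps(event, indent=2)
print(payload)
```

The sketch only builds the payload; sending it would additionally require the webhook URL and credentials of your own AIOps installation.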

In this blog we evaluate a common use case: an application performance problem caused by a network adapter failure, which we replicated in an IBM lab. A typical example is the connectivity between an application server and a database that run on different VMs or containers. Imagine there is a fast and a slower network connection between the two components. If the fast network fails and the slow network has to take over, the application might suffer from network performance problems due to high transaction rates to the database.

SREs want to detect this kind of issue very quickly. We will show how IBM Cloud Pak for AIOps correlates multiple issues from different senders into an incident and highlights the root cause of the issue in the alert console and topology view.

Figure : Architectural Overview of Daytrader App, SevOne, Instana, AIOps

Instana gathers event and metric data from servers and applications through a lightweight agent that runs locally on each server or VM to be observed. Instana requires only a single agent per host. The Instana agent runs as a separate process and discovers the technologies that are running in your environment. Based on the discovered technologies, the agent automatically deploys technology-specific sensors, which send the appropriate metrics back to the agent. The agent then sends these metrics to the Instana backend. Instana discovers the interacting components of an application as services and shows them in a dependency view.

Figure : Instana Application Perspectives and Services

Via the out-of-the-box integration of the Instana backend with the IBM AIOps platform, you can import Instana application perspectives into the AIOps applications view.

Figure : Applications in AIOps Resource Management View

SevOne Devices and Objects

SevOne gathers details about devices and objects, such as interface speed, link size, MTU and more, by using SNMP and ICMP polling. That information is refreshed every few minutes with the most current data from the device itself. Using that data set, you can create an alert policy (threshold) that looks for expected values and flags them when they are not met, to continually assure the network. This works for a quick check, for an audit, or to track the progress of an ongoing change effort.

Figure : SevOne Device Manager

Device Groups and Object Groups in SevOne

SevOne “scans” the network and identifies responding devices. If you have enabled the SNMP daemon on a Linux VM, you can do an “SNMP walk” in SevOne NMS and retrieve many helpful details about the VM, such as hardware resources, adapter information, disk usage and processes. SevOne identifies the devices and their corresponding objects. You can see the objects in the Object Manager.

Figure : SevOne Object Manager

The Device Group and Object Group information is important when defining policies, as the policy will check the devices and objects that are defined in the respective policy.

Figure : SevOne Policy Editor

Define a policy in SevOne

Using the SevOne policy browser, you define a policy that triggers an alert to IBM AIOps via the defined webhook. In this example, we have defined a policy for an Ethernet interface that fires when an adapter goes down. SevOne detects when an adapter's availability is below 100% for 1 minute and triggers an alert to IBM AIOps based on the defined condition.

Figure : SevOne Policy Editor

The policy condition specifies that the alert fires when the availability is below 100% for one minute. In the custom message, you can add variables that come from the SevOne Device and Object Manager.

Figure : SevOne Policy condition
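To make the logic of such a condition concrete, here is a minimal sketch of a "below threshold for a sustained period" check. It is not how SevOne evaluates policies internally (the post does not describe that); the function name, the sample format and the polling timestamps are assumptions chosen for illustration.

```python
def breaches_policy(samples, threshold=100.0, duration_s=60):
    """Return True if availability stayed below `threshold` for at least
    `duration_s` seconds.

    `samples` is a time-ordered list of (timestamp_s, availability_percent)
    tuples; this format is an assumption for the sketch.
    """
    breach_start = None
    for timestamp_s, availability in samples:
        if availability < threshold:
            if breach_start is None:
                breach_start = timestamp_s  # first sample of a new breach
            if timestamp_s - breach_start >= duration_s:
                return True
        else:
            breach_start = None  # recovery resets the timer
    return False


# Hypothetical polling results, one sample every 30 seconds.
healthy = [(0, 100.0), (30, 100.0), (60, 100.0)]
flapping = [(0, 100.0), (30, 0.0), (60, 100.0), (90, 0.0)]
down = [(0, 100.0), (30, 0.0), (60, 0.0), (90, 0.0)]

print(breaches_policy(healthy), breaches_policy(flapping), breaches_policy(down))
```

Requiring the condition to hold for a full minute is what keeps a briefly flapping interface from raising an alert, while an adapter that stays down does.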

Network interface adapter failure

In SevOne, you can detect an adapter failure in either SevOne Data Insight or the SevOne admin UI. In SevOne Data Insight you get a comprehensive overview of current issues and can quickly identify the respective components.

Figure : SevOne Data Insight report and alert overview

If an adapter fails, variables such as the device name ($deviceName) and object name ($objectName) can be passed to the custom alert message that is forwarded to IBM AIOps. The SevOne policy informs IBM AIOps by sending an alert that names the failing object, so the SRE can immediately see the root cause and the blast radius of this kind of adapter or network interface failure.

Figure : SevOne Alert Summary and Details
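The effect of these variables can be illustrated with Python's `string.Template`, which happens to use the same `$name` notation. This is only an analogy for what the substitution produces, not SevOne's own template engine; the message wording and the device name `app-server-vm` are hypothetical.

```python
from string import Template

# Hypothetical custom message using the variable names mentioned in the text.
custom_message = Template("Network interface $objectName on device $deviceName is down")

# safe_substitute leaves unknown placeholders untouched instead of raising an error.
alert_text = custom_message.safe_substitute(
    deviceName="app-server-vm",  # hypothetical device name
    objectName="enp7s0",         # interface name used in the lab
)
print(alert_text)
```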

The corresponding alerts show up in the AIOps alert console with “IBM SevOne” as the sender, together with the trigger message and the custom message of the SevOne policy (basak-georg-network-interface-policy). The alert includes the name of the failed object (enp7s0).

Figure : AIOps Alerts from SevOne

Instana also detects the issue with the failed adapter and reports it from an application perspective, for example as permanent TCP retransmissions, a drop in the number of requests, and an erroneous call rate that is too high. The corresponding issues are shown in the Instana Event console:

Figure : Issues in the Instana Event Console

When Instana and SevOne are used together to identify this kind of issue (network interface down, permanent TCP retransmissions, erroneous call rate too high), Instana creates an event and reports an application performance issue to AIOps with multiple alerts. Based on the performance metrics sent to AIOps, the AI Manager detects a performance metric anomaly.

SevOne itself detects that a networking component (adapter or network port) has failed and sends multiple alerts to AIOps, depending on the conditions defined in the SevOne policies, with the failed adapter included as an object variable. The correlation algorithms in AIOps then combine the various alerts from different event sources (senders) into a “grouped alert”, and the golden signals in the pre-defined AIOps policies create an AIOps incident, based on the policy that contains the conditions for incident creation. Alerts can be grouped in multiple ways. In our case we activated the following trainable and pre-trained AIOps algorithms in the AI model management view:

- Metric anomaly detection
- Temporal grouping
- Golden signal alert enrichment
- Probable cause
- Scope-based grouping
- Topological grouping

Figure : AIOps trainable and pre-trained AI Algorithms

The correlation algorithms in AIOps combine the various alerts from different event sources (senders) into a group of alerts based on the golden signals:
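To give a feel for what grouping alerts from different senders means, the sketch below groups alerts that arrive close together in time. It is a deliberately simplified stand-in: the temporal grouping in AIOps is a trained algorithm, whereas this example uses a fixed time window. The `Alert` class, the 120-second window, the timestamps and the fourth alert are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    sender: str
    summary: str
    timestamp_s: int


def group_by_time(alerts, window_s=120):
    """Put alerts into the same group when each follows the previous one
    within `window_s` seconds (a fixed-window simplification)."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp_s):
        if groups and alert.timestamp_s - groups[-1][-1].timestamp_s <= window_s:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


alerts = [
    Alert("IBM SevOne", "Network interface enp7s0 down", 0),
    Alert("Instana", "Permanent TCP retransmissions", 45),
    Alert("Instana", "Erroneous call rate is too high", 90),
    Alert("Instana", "Unrelated alert (hypothetical)", 3600),
]
groups = group_by_time(alerts)
for group in groups:
    print([f"{a.sender}: {a.summary}" for a in group])
```

With these sample timestamps, the SevOne alert and the two Instana alerts that follow it end up in one group, while the much later alert forms a group of its own.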

Figure : AIOps Alert Console

In the SevOne Alert Console and AIOps Alert Console you also get the corresponding network interface failure from the SevOne alert:

Figure : SevOne Alert Details

Figure : AIOps Alerts from SevOne

In the AIOps Topology Viewer, you can see that the failing VM and the failing network interface “enp7s0” are highlighted: the network interface went down and the number of erroneous calls on the VM increased.

Figure : Topology viewer showing the root cause and blast radius

With this view, SREs can easily identify the root cause of the issue. The key point here is that we have defined matchTokens in the AIOps Resource Manager, so that alerts are matched with topology resources and highlighted in the topology viewer. A more detailed overview of how to define matchTokens can be found in this blog:

https://community.ibm.com/community/user/aiops/blogs/zane-bray1/2024/03/11/cloud-pak-for-aiops-tips-setting-which-alert-field
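The idea behind matchTokens can be sketched as a simple lookup: each topology resource carries a set of tokens, and an alert is attached to every resource that shares a token with it. The code below is a conceptual illustration only, not the AIOps matching implementation; the function, the resource names, the token values and the case-insensitive comparison are assumptions.

```python
def match_resources(alert_tokens, resources):
    """Return the names of resources that share at least one token with the alert.

    Comparison is case-insensitive here; that is a choice made for this sketch.
    """
    wanted = {token.lower() for token in alert_tokens}
    return [
        name
        for name, tokens in resources.items()
        if wanted & {token.lower() for token in tokens}
    ]


# Hypothetical topology resources and their match tokens.
resources = {
    "app-server-vm": ["app-server-vm", "10.0.0.5"],
    "enp7s0": ["enp7s0", "app-server-vm:enp7s0"],
    "database-vm": ["database-vm"],
}

# The SevOne alert carries the failed object name as a token.
matched = match_resources(["enp7s0"], resources)
print(matched)
```

If an alert carries no token that any resource knows, it matches nothing and cannot be highlighted in the topology, which is why choosing the right alert field for matching matters.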

Summary

“Is it the app or the network?” Full-stack observability provides both a holistic view of your full IT stack, from applications to servers to networks, and a granular view, down to the code level, of your applications and the endpoints they operate within. A full-stack observability practice supports incident detection, investigation by AI algorithms, and response, while also giving you the information you need to proactively improve the state of operations of your IT stack, including networking. By combining the capabilities of SevOne, Instana and AIOps, IBM offers Enterprise Observability and Event Management for the entire IT stack, supported by trainable and pre-trained, industry-leading AI algorithms for Event and Incident Management.

For more information about IBM’s leading AI algorithms refer to this blog :

https://www.ibm.com/blogs/digitale-perspektive/2024/02/the-ai-in-ibm-aiops-algorithms-in-focus/

For more information about IBM’s leading SevOne Automated Network Observability refer to the blog posts :

https://community.ibm.com/community/user/aiops/communities/community-home/recent-community-blogs?communitykey=fe9d91df-352c-4846-9060-189fd98d00ca

