GPU Observability With Instana
Overview of Solution
Observing the performance and health of GPUs is crucial in modern computing environments, especially for applications that require heavy computational power, such as machine learning and scientific simulations. Instana, a leading enterprise observability platform, offers comprehensive monitoring capabilities that extend to GPUs, providing insights into their utilization, performance metrics, and potential issues. This blog will guide you through setting up GPU observability with Instana, ensuring your GPU resources are optimized and any problems are promptly identified and resolved.
Reference Architecture
In this section, we will discuss the reference architecture for integrating GPU observability with Instana. The architecture involves setting up the Instana agent on systems with GPUs, configuring the GPU Operator, running the OTel Collector and ensuring the data collected is visualized effectively in the Instana dashboard.
We use NVIDIA DCGM Exporter as the source of truth for GPU resources. The OpenTelemetry (OTel) Collector scrapes GPU metrics from NVIDIA DCGM Exporter and sends them to Instana.
Instana supports two patterns for collecting GPU data:
- Agent Mode: The GPU data is sent to the Instana agent first. The GPU sensor running on the agent aggregates the data and sends it to the Instana backend.
- Agentless Mode: The GPU data is sent directly to the otlp-acceptor component of the Instana backend without going through the agent.
Pre-requisites
Before setting up GPU observability with Instana, ensure you have the following:
- Instana Cluster
- GPU-Enabled Systems: Servers or workstations with GPUs. For this tutorial, we tested with an OCP cluster that contains some GPU servers.
GPU Monitoring
After you install the NVIDIA GPU Operator and create an OpenTelemetry-based data collector, you can view GPU-related metrics in the Instana UI.
Installing NVIDIA GPU Operator
You can install the NVIDIA GPU Operator in your GPU environment to help manage GPUs and collect GPU metrics. Enable components such as the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, and the NVIDIA Container Toolkit if you need them. For more information, see NVIDIA GPU Operator. The following command installs the operator with these three components disabled:
helm install gpu-operator \
  --repo https://helm.ngc.nvidia.com/nvidia \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=false \
  --set devicePlugin.enabled=false \
  --set mig.strategy=single \
  gpu-operator
The NVIDIA GPU Operator installs the NVIDIA Data Center GPU Manager (DCGM) by default. DCGM Exporter is a tool built on NVIDIA DCGM that lets you gather GPU metrics, understand workload behavior, and monitor GPUs in clusters.
DCGM Exporter exposes GPU metrics at an HTTP endpoint (/metrics) for monitoring solutions. For more information, see DCGM Exporter.
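Before wiring up a collector, it can help to check what the exporter actually returns. The following is a minimal sketch in Python that parses text in the Prometheus exposition format and keeps only the DCGM samples. The helper name parse_dcgm_metrics and the sample payload (node name, values) are illustrative assumptions, not output from a real cluster.

```python
import re

# Matches sample lines such as: DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 87
_SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# Matches one key="value" pair inside the braces.
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_dcgm_metrics(text, prefix="DCGM_"):
    """Return (name, labels, value) tuples for samples whose name starts with prefix."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_LINE.match(line)
        if not match or not match.group("name").startswith(prefix):
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Illustrative payload only; real output contains many more metrics and labels.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 87
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 64
"""

if __name__ == "__main__":
    for name, labels, value in parse_dcgm_metrics(SAMPLE):
        print(name, labels.get("gpu"), value)
```

In practice the text would come from an HTTP GET against the exporter's /metrics endpoint rather than from an inline string. The address to use depends on how the exporter is deployed in your cluster.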
OpenTelemetry-based Data Collectors
You can forward GPU telemetry to an Instana agent or to the Instana backend by using the OpenTelemetry Collector. The OpenTelemetry Collector is the core component of the OpenTelemetry ecosystem, offering vendor-independent functions for telemetry data collection, processing, and export.
Forwarding Telemetry data to an Instana agent (Agent Pattern)
- To enable OTLP ports for an Instana agent, add the following snippet to the configuration.yaml file of your Instana host agent. Make sure to save the changes, and restart the Instana host agent to apply the modifications.
com.instana.plugin.opentelemetry:
  grpc:
    enabled: true
  http:
    enabled: true
- The following snippet shows a typical configuration for the OpenTelemetry Collector to forward telemetry data to a local Instana host agent by using the OTLP/gRPC protocol.
Create a YAML file, such as config.yaml, as follows:
receivers:
  otlp:
    protocols:
      grpc:
  # Scrape GPU metrics from the DCGM Exporter /metrics endpoint.
  prometheus/nvidia-dcgm:
    config:
      scrape_configs:
        - job_name: 'nvidia-dcgm'
          scrape_interval: 10s
          static_configs:
            # targets must be a list of host:port strings.
            - targets: ["$(DCGM_EXPORTOR_ENDPOINT)"]
processors:
  batch:
  resource:
    attributes:
      - key: server.address
        from_attribute: net.host.name
        action: insert
      - key: server.port
        from_attribute: net.host.port
        action: insert
      - key: service.name
        value: nvidia-dcgm
        action: update
      - key: INSTANA_PLUGIN
        value: dcgm
        action: insert
exporters:
  # Send the metrics to the Instana host agent over OTLP/gRPC.
  otlp:
    endpoint: "$(INSTANA_AGENT_HOST):4317"
    tls:
      insecure: true
service:
  pipelines:
    metrics/nvidia-dcgm:
      receivers: [prometheus/nvidia-dcgm]
      processors: [batch, resource]
      exporters: [otlp]
The following example shows a typical configuration of the OpenTelemetry Collector for forwarding telemetry data to a local Instana host agent with the OTLP/HTTP protocol.
exporters:
  otlphttp:
    # The otlphttp exporter expects a URL, so the scheme is included.
    endpoint: "http://$(INSTANA_AGENT_HOST):4318"
    tls:
      insecure: true
Notes:
- Set the DCGM_EXPORTOR_ENDPOINT field to the DCGM Exporter endpoint.
- Set the INSTANA_AGENT_HOST field to the IP address or hostname of the Instana agent to connect to.
- Instana uses the standard OTLP port numbers: 4317 for OTLP/gRPC and 4318 for OTLP/HTTP.
- After you complete all configuration changes in the config.yaml file, run the following command to start the OpenTelemetry Collector:
docker run -d -p 4317:4317 -v $(pwd)/config.yaml:/etc/otelcol-contrib/config.yaml otel/opentelemetry-collector-contrib:latest
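The placeholders in the configuration above have to be replaced with real values before the collector starts. The following is a minimal sketch of that substitution in Python, assuming the $(NAME) placeholder style used in this post. The function name render_config, the template fragment, and the example addresses are hypothetical.

```python
import re

# Matches placeholders of the form $(NAME).
_PLACEHOLDER = re.compile(r"\$\(([A-Z0-9_]+)\)")


def render_config(template, values):
    """Replace every $(NAME) placeholder and fail loudly if a value is missing."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {name}")
        return values[name]

    return _PLACEHOLDER.sub(substitute, template)


# A fragment for illustration only, not a complete collector configuration.
TEMPLATE = '''\
static_configs:
  - targets: ["$(DCGM_EXPORTOR_ENDPOINT)"]
exporters:
  otlp:
    endpoint: "$(INSTANA_AGENT_HOST):4317"
'''

if __name__ == "__main__":
    print(render_config(TEMPLATE, {
        "DCGM_EXPORTOR_ENDPOINT": "dcgm-exporter.example:9400",  # hypothetical address
        "INSTANA_AGENT_HOST": "10.0.0.5",  # hypothetical address
    }))
```

Raising an error on a missing value is a deliberate choice here: a placeholder that is silently left in place would otherwise surface only later, as a scrape or export failure.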
Forwarding Telemetry data to the Instana backend (Agentless Pattern)
To forward OpenTelemetry data to the Instana backend by using the OpenTelemetry Collector, complete the following steps:
- Create a YAML file, such as the config.yaml file described above, and change the exporter endpoint from the Instana agent endpoint to the Instana backend endpoint. OpenTelemetry data is sent to the dedicated endpoints of the backend otlp-acceptor component. The Instana backend requires an Instana agent key for validation. It also requires the host.id, faas.id, or device.id resource attribute.
exporters:
  otlp:
    endpoint: INSTANA_OTLP_GRPC_BACKEND:4317
    headers:
      x-instana-key: xxxxxxx
      x-instana-host: xxxx
Notes:
- Set the INSTANA_OTLP_GRPC_BACKEND field to the correct domain name of the otlp-acceptor component of the Instana backend. For more information about the endpoint of the Instana backend otlp-acceptor, see Endpoints of Self-Hosted Instana backend otlp-acceptor or Endpoints of SaaS Instana backend otlp-acceptor.
- Set the x-instana-key field to the agent key of the Instana agent for targeting the Instana backend. To find your agent key, click More > Agents in the navigation bar of the Instana UI, and then click Install Agents > Windows.
- Set the x-instana-host field to the host ID if no host.id, faas.id, or device.id resource attribute is defined in your application or system.
- Instana uses the standard OTLP port numbers: 4317 for OTLP/gRPC and 4318 for OTLP/HTTP. Port 443 is also supported for OTLP/HTTP.
- After you complete all configuration changes in the config.yaml file, run the following command to start the OpenTelemetry Collector:
docker run -d -p 4317:4317 -v $(pwd)/config.yaml:/etc/otelcol-contrib/config.yaml otel/opentelemetry-collector-contrib:latest
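The rule for when x-instana-host is needed can be written down as a small helper. The following is a minimal sketch in Python, assuming the exporter headers are assembled by a script before the configuration is written. The function build_backend_headers and the sample values are hypothetical and are not part of any Instana API.

```python
# Resource attributes that, per this post, the Instana backend accepts as identity.
IDENTITY_ATTRIBUTES = ("host.id", "faas.id", "device.id")


def build_backend_headers(agent_key, resource_attributes, fallback_host_id=None):
    """Build exporter headers for sending data straight to the Instana backend.

    x-instana-key is always set. x-instana-host is added only when none of the
    identity resource attributes is present.
    """
    if not agent_key:
        raise ValueError("an Instana agent key is required")
    headers = {"x-instana-key": agent_key}
    has_identity = any(resource_attributes.get(key) for key in IDENTITY_ATTRIBUTES)
    if not has_identity:
        if not fallback_host_id:
            raise ValueError(
                "set host.id, faas.id, or device.id, or pass a fallback host ID"
            )
        headers["x-instana-host"] = fallback_host_id
    return headers


if __name__ == "__main__":
    # Hypothetical values for illustration.
    print(build_backend_headers("my-agent-key", {}, fallback_host_id="gpu-node-1"))
    print(build_backend_headers("my-agent-key", {"host.id": "gpu-node-1"}))
```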
Viewing Metrics
After you set up the OpenTelemetry (OTel) Collector, you can view the metrics in the Instana UI.
- Open the Instana UI, and click Infrastructure. Then, click Analyze Infrastructure.
- Select OTEL Dcgm from the list of entity types.
- Click an entity instance of the OTEL Dcgm entity type to open the associated dashboard.
You can view the following GPU metrics:
Alerts for GPU
A Custom Event enables you to create issues or incidents based on an individual GPU metric. For example, you can raise an alert when the GPU temperature is too high.
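Conceptually, such an event compares one metric against a threshold over some window of readings. The sketch below illustrates that idea in Python. It is not how Instana evaluates events, and the 85 °C threshold, the three-reading window, and the function name temperature_breach are assumptions chosen for the example.

```python
def temperature_breach(readings_c, threshold_c=85.0, min_consecutive=3):
    """Return True if at least min_consecutive readings in a row exceed threshold_c."""
    streak = 0
    for reading in readings_c:
        # Extend the streak on a hot reading, reset it otherwise.
        streak = streak + 1 if reading > threshold_c else 0
        if streak >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical temperature readings in degrees Celsius.
    print(temperature_breach([80, 86, 87, 88]))
    print(temperature_breach([86, 87, 80, 88]))
```

Requiring several consecutive readings above the threshold is one common way to avoid alerting on a single spike.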