watsonx.data OpenTelemetry : Overview and Configuration

By MIKA SHIMOGAWA posted 14 hours ago

  

In this article, I introduce the watsonx.data OpenTelemetry feature and how to configure it.
OpenTelemetry is a highly customizable serviceability framework that is designed to enhance monitoring and debugging capabilities. It facilitates the generation, collection, and export of telemetry data, such as traces and metrics, to observability dashboards.
In watsonx.data 2.2.0, OpenTelemetry is supported for the Presto (Java) engine and the Milvus service.

  • Trace: Represents the lifecycle of a single operation or request as it propagates through a system, capturing spans to detail its execution across services
  • Metrics: Provide numerical measurements that reflect the performance, health, or behavior of a system, such as request counts, error rates, or resource utilization over time


  

Here, I introduce OpenTelemetry in watsonx.data using Instana as the backend, focusing on the following points.

  • Technical overview
  • How to configure in watsonx.data
    • Configuration Verification
  • Resources required by the ibm-lh-otel container

  

Technical Overview

In releases after watsonx.data 2.1.2, the OpenTelemetry Collector is used to connect to Instana’s otlp-acceptor component. (The Instana agent is not used.)
When the OpenTelemetry function is enabled in watsonx.data, ibm-lh-otel is added as an init container to the Presto and Milvus pods, and it is responsible for the OpenTelemetry function of watsonx.data.
The next diagram illustrates how ibm-lh-otel is added to every Presto (Java) coordinator and worker pod and transfers the data to Instana.

  

[Diagram: ibm-lh-otel added to each Presto (Java) coordinator and worker pod, sending telemetry to Instana]
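Conceptually, the ibm-lh-otel container behaves like a standalone OpenTelemetry Collector that receives OTLP data from the engine and forwards it to Instana’s otlp-acceptor. As a rough, hypothetical sketch only (this is not the actual configuration shipped with watsonx.data; the endpoint and key are placeholders), a collector configuration for this topology could look like:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp:
    # Instana's otlp-acceptor; port 4317 = OTLP over gRPC
    endpoint: "otlp-acceptor.example.com:4317"
    headers:
      # These correspond to the Host Name and Password values
      # entered on the watsonx.data OpenTelemetry page (see below)
      x-instana-host: "my-watsonx-data-cluster"
      x-instana-key: "<agent key>"

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      exporters: [otlp]
```

The key point is that the collector runs next to each engine pod, so telemetry leaves the pod over a local OTLP connection and only the collector talks to the Instana backend.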

  

How to configure OpenTelemetry in watsonx.data

The following are the steps to configure OpenTelemetry in watsonx.data 2.1.3 and 2.2.0, using Instana as the backend.

  1. Log in to the watsonx.data console.
  2. From the navigation menu, select Configurations, and click the OpenTelemetry tile.
    [Screenshot: OpenTelemetry tile on the Configurations page]
  3. On the OpenTelemetry page, click Diagnostic +.

(3-1) "Available Telemetry Tools" : In this scenario, Select Instana
(3-2) "Telemetry Endpoint" : Enter the endpoint URL of the selected tool. Format of the endpoint: http://<host>:<port>/<path> or https://<host>:<port>/<path>. Use port 4317 for OTLP over GRPC and 4318 for OTLP over HTTP.

(3-3) Host Name : This value is assigned to x-instana-host and should be a meaningful identifier that helps link telemetry data to its source. (I use string which represents my watsonx.data cluster as Host Name.)
(3-4) Password : This value is assigned to x-instana-key, that is Instana agent key (used to authenticate with the Instana backend). Instructions for retrieving the Instana agent key can be found in the Instana documentation.
(3-5) TLS enabled : (skip because TLS verification is not functional on watsonx.data 2.1.3 and 2.2.0.)
(3-6) Associated diagnostics : check on diagnostic type.
  image.png
(3-7) Click "Add". OpenTelemetry panel shows "Enabled".

  

Configuration Verification

After a while, the Presto engine and Milvus service restart. Verify that they restart successfully.
The following shows the Presto and Milvus pod status, using the oc get pod command, after the pods restart successfully.
Some pods show 2/2, which means the pod has two containers and both are READY; that is, the ibm-lh-otel container has been added to the pod, and the pod restarted successfully.
In the next example, the pods to which the ibm-lh-otel container is added are:

  • Milvus: datacoord, datanode, indexnode, proxy, querycoord, querynode, rootcoord
  • Presto(Java): coordinator, worker
  • Presto(C++): coordinator

In watsonx.data 2.1.3 and 2.2.0, OpenTelemetry does not support Presto (C++), but the ibm-lh-otel container is still created in its coordinator pod.

$ oc get pod | grep -E "prest|milvus|NAME"
NAME                                                   READY STATUS   RESTARTS  AGE
ibm-lh-lakehouse-milvus838-datacoord-7b7d44bd6f-9twkh  2/2   Running  0         12m
ibm-lh-lakehouse-milvus838-datanode-7fcd765d79-67jl7   2/2   Running  0         12m
ibm-lh-lakehouse-milvus838-etcd-0                      1/1   Running  0         12m
ibm-lh-lakehouse-milvus838-etcd-1                      1/1   Running  0         12m
ibm-lh-lakehouse-milvus838-etcd-2                      1/1   Running  0         12m
ibm-lh-lakehouse-milvus838-indexnode-55f47ff48d-l96zf  2/2   Running  0         12m
ibm-lh-lakehouse-milvus838-indexnode-55f47ff48d-rr5vm  2/2   Running  0         12m
ibm-lh-lakehouse-milvus838-kafka-0                     1/1   Running  0         12m
ibm-lh-lakehouse-milvus838-kafka-1                     1/1   Running  0         12m
ibm-lh-lakehouse-milvus838-kafka-2                     1/1   Running  0         12m
ibm-lh-lakehouse-milvus838-proxy-6f4ccbbcc4-dtqkm      2/2   Running  0         12m
ibm-lh-lakehouse-milvus838-querycoord-7fb5cf48d9-2kdpr 2/2   Running  0         12m
ibm-lh-lakehouse-milvus838-querynode-785d78cbbb-5hnm7  2/2   Running  0         12m
ibm-lh-lakehouse-milvus838-querynode-785d78cbbb-9d467  2/2   Running  0         12m
ibm-lh-lakehouse-milvus838-querynode-785d78cbbb-cjthv  2/2   Running  0         12m
ibm-lh-lakehouse-milvus838-querynode-785d78cbbb-k9qb7  2/2   Running  0         12m
ibm-lh-lakehouse-milvus838-querynode-785d78cbbb-lsjwh  2/2   Running  0         12m
ibm-lh-lakehouse-milvus838-querynode-785d78cbbb-r2nch  2/2   Running  0         12m
ibm-lh-lakehouse-milvus838-querynode-785d78cbbb-r7pjd  2/2   Running  0         12m
ibm-lh-lakehouse-milvus838-querynode-785d78cbbb-rmr5r  2/2   Running  0         12m
ibm-lh-lakehouse-milvus838-querynode-785d78cbbb-w5gxc  2/2   Running  0         12m
ibm-lh-lakehouse-milvus838-querynode-785d78cbbb-zqj62  2/2   Running  0         12m
ibm-lh-lakehouse-milvus838-rootcoord-7688bd8c58-k4zrn  2/2   Running  0         12m
ibm-lh-lakehouse-prestissimo38-coordinator-blue-0      2/2   Running  0         13m
ibm-lh-lakehouse-prestissimo38-prestissimo-worker-0    1/1   Running  0         13m
ibm-lh-lakehouse-presto899-coordinator-blue-0          2/2   Running  0         14m
ibm-lh-lakehouse-presto899-presto-worker-0             2/2   Running  0         14m
ibm-lh-lakehouse-presto899-presto-worker-1             2/2   Running  0         14m
ibm-lh-lakehouse-presto899-presto-worker-2             2/2   Running  0         14m
$

If the connection to the telemetry endpoint fails, Init:CrashLoopBackOff is displayed as the pod status. In this case, you need to reconfigure the OpenTelemetry settings.
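A failing ibm-lh-otel container can be spotted by filtering the pod list for that status. A small sketch using sample output (the lines below are modeled on the listing above; in practice, pipe the real oc get pod command instead of the sample file):

```shell
# Sample 'oc get pod' lines; a failing ibm-lh-otel shows
# Init:CrashLoopBackOff in the STATUS column.
cat > /tmp/pods.txt <<'EOF'
ibm-lh-lakehouse-presto899-coordinator-blue-0  2/2  Running                0  14m
ibm-lh-lakehouse-presto899-presto-worker-0     0/2  Init:CrashLoopBackOff  3  14m
EOF

# List only the pods whose init container failed to start.
FAILED=$(grep 'Init:CrashLoopBackOff' /tmp/pods.txt | awk '{print $1}')
echo "$FAILED"
```

On a real cluster, oc logs <pod> -c ibm-lh-otel should then show the collector's error messages, such as a failed connection to the configured endpoint.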

Resources required by the ibm-lh-otel container

The OpenTelemetry feature requires additional resources.
The minimum resources required by the ibm-lh-otel container are 250m CPU (1/4 of a CPU) and 256 MB of memory.

$ oc describe pod ibm-lh-lakehouse-presto899-coordinator-blue-0
(skip)
Init Containers:
  ibm-lh-otel:
(skip)
    Limits:
      cpu:                1
      ephemeral-storage:  550Mi
      memory:             1024M
    Requests:
      cpu:                250m
      ephemeral-storage:  50Mi
      memory:             256M
(skip)

In the earlier oc get pod output, ibm-lh-otel was added to 22 pods.
In that case, you need at least 5.5 CPU and about 5.5 GB of memory just to cover the requests, and that is a very small environment. The required resources increase with the number of engines and pods.
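The 5.5 CPU / 5.5 GB figure is simply the per-container request multiplied by the pod count; a quick sketch of the arithmetic:

```shell
PODS=22        # pods that gained an ibm-lh-otel container in the example
CPU_REQ_M=250  # CPU request per container, in millicores
MEM_REQ_MB=256 # memory request per container, in MB

CPU_TOTAL_M=$((PODS * CPU_REQ_M))
MEM_TOTAL_MB=$((PODS * MEM_REQ_MB))
echo "cpu: ${CPU_TOTAL_M}m (= 5.5 CPU)"      # 5500 millicores
echo "memory: ${MEM_TOTAL_MB} MB (~5.5 GB)"  # 5632 MB
```

Note that this covers only the requests; if every container ran at its limit (1 CPU, 1024 MB), the totals would be roughly four times higher.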

Conclusion

In this article, I introduced OpenTelemetry in watsonx.data, covering:

  • Technical overview
  • How to configure in watsonx.data
    • Configuration Verification
  • Resources required by the ibm-lh-otel container

I have also posted articles on related watsonx.data subjects.

Environment

The examples in this article were run mainly in the following environment.

  • OCP Version : 4.16
  • CP4D 5.1.3 , watsonx.data 2.1.3
  • Presto engines
    • Presto (Java) v0.286 / Size : Starter
    • Presto (C++) v0.286 / Size : Starter
  • Milvus service : Version v2.5.0 / Size : Small


#watsonx.data
