Keeping an Eye on vLLM: Real-Time Monitoring with Instana

By Lakshmi Priya S posted 5 days ago

  

Co-authors: @Elina priyadarshinee @Madhu Tadiparthi @Guangya Liu

vLLM provides observability features that help in maintaining efficient and reliable serving of large language models. These features include detailed runtime metrics such as throughput, latency, and resource utilisation, empowering users to fine-tune performance.

vLLM also offers comprehensive logging and tracing, which help identify bottlenecks and debug issues. This allows for a deeper understanding of the system’s behaviour, enabling proactive issue resolution and performance optimization.

Furthermore, integration with modern monitoring tools such as Instana simplifies the creation of custom and real-time dashboards. Instana’s ability to automatically discover and map dependencies helps users understand how vLLM performs within the broader infrastructure. With Instana’s AI-powered insights, users can quickly pinpoint anomalies, identify root causes, and optimize vLLM’s operation for peak efficiency, ensuring a seamless and responsive experience for end-users.

Fig 1: Monitoring vLLM with Instana

Monitoring LLMs hosted with vLLM

vLLM is an open-source, high-performance serving engine for Large Language Models (LLMs), developed by researchers at UC Berkeley. It is designed to maximise the throughput and efficiency of LLM inference, especially for real-time applications like chatbots and APIs.

You can now monitor your vLLM integrations seamlessly with Instana. By exporting traces and metrics from your vLLM applications to Instana, you can analyze calls and gain insights into your LLMs' performance.

Configuring the environment

Configure your environment to export traces to Instana either through an agent or directly to the Instana backend (agentless mode). To find the domain names of the Instana backend otlp-acceptor for different Instana SaaS environments, see Endpoints of the Instana backend otlp-acceptor.

To export traces to Instana using an Instana agent:

export TRACELOOP_BASE_URL=<instana-agent-host>:4317
export TRACELOOP_HEADERS="api-key=DUMMY_KEY"

To export traces directly to the Instana backend (agentless mode):

export TRACELOOP_BASE_URL=<instana-otlp-endpoint>:4317
export TRACELOOP_HEADERS="x-instana-key=<agent-key>,x-instana-host=<instana-host>"

Additionally, if the endpoint of the Instana backend otlp-acceptor or agent is not TLS-enabled, set OTEL_EXPORTER_OTLP_INSECURE to true.

export OTEL_EXPORTER_OTLP_INSECURE=true

Exporting traces to Instana

To instrument the LLM application, complete the following steps:

  1. Verify that Python 3.10 or later is installed. To check the Python version, run the following command:
    python3 -V
  2. Optional: Create a virtual environment for your applications to keep your dependencies consistent and prevent conflicts with other applications. To create a virtual environment, run the following command:
    pip3 install virtualenv 
    virtualenv vllm-env
    source vllm-env/bin/activate
  3. Install vLLM and the Traceloop packages.
    a. To install vLLM, run the following command:
    pip3 install vllm==0.6.3.post1

    To learn about other ways to install vLLM, see the vLLM documentation.

    b. To install the Traceloop packages, run the following command:

    pip3 install traceloop-sdk==0.40.9

Verify the installation and configuration. A sample application is shown in the following example:

import requests
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow, task

# Initialize Traceloop
Traceloop.init(app_name="vllm-client")

@task(name="create_payload")
def create_payload(prompt):
    return {
        "model": "ibm-granite/granite-3.0-2b-instruct",
        "prompt": prompt,
        "max_tokens": 10,
        "n": 1,
        "best_of": 1,
        "use_beam_search": "false",
        "temperature": 0.0,
    }

@task(name="make_vllm_request")
def make_vllm_request(url, payload):
    return requests.post(url, json=payload)

@task(name="process_response")
def process_response(response):
    return response.json()

@workflow(name="vllm_client_workflow")
def run_chat():
    vllm_url = "http://<vllm-server-host>:8000/v1/completions"
    prompt = "San Francisco is a"
    
    payload = create_payload(prompt)
    response = make_vllm_request(vllm_url, payload)
    result = process_response(response)
    
    return result

if __name__ == "__main__":
    result = run_chat()
    print("Chat Response:", result)

The script sends a completion request to a vLLM server running the IBM Granite 3.0 2B model and tracks the entire process with distributed tracing for observability in Instana.

The engine argument --otlp-traces-endpoint configures the observability backend. You can serve the model by using the following command:

export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=<instana-agent-host/instana-otlp-endpoint>:4317
vllm serve ibm-granite/granite-3.0-2b-instruct --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"

Note: OTEL_EXPORTER_OTLP_TRACES_ENDPOINT must define the HTTP scheme, unlike TRACELOOP_BASE_URL.

For example:

export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317

You can also configure a service name for your vLLM server to group traces:

export OTEL_SERVICE_NAME=vllm-server
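
Putting these pieces together, an end-to-end run might look like the following sketch. The file name vllm_client.py is only an illustrative name for the sample application shown earlier, and the placeholders are the same ones used throughout this post:

# Serve the model with traces exported to Instana under a custom service name
export OTEL_SERVICE_NAME=vllm-server
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://<instana-agent-host>:4317
vllm serve ibm-granite/granite-3.0-2b-instruct --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"

# In another shell, with TRACELOOP_BASE_URL and TRACELOOP_HEADERS set as described earlier,
# run the instrumented client application (saved here as vllm_client.py)
python3 vllm_client.py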

The following is an example trace view in Instana from running the preceding sample application:

Fig 2: vLLM traces

Exporting metrics to Instana

vLLM exposes runtime metrics in Prometheus format at the /metrics endpoint of the vLLM server. You can collect these metrics by using the OpenTelemetry (OTel) Collector, which scrapes the endpoint and exports the data to Instana for monitoring and visualisation. To configure the collector, see the official Instana documentation.
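
As a reference point, the following is a minimal sketch of a collector configuration that scrapes the vLLM /metrics endpoint and forwards the data to Instana over OTLP. The host, key, and port placeholders are the same ones used earlier in this post; the job name and scrape interval are illustrative, so adapt the configuration to the Instana documentation for your environment. You can first confirm that the endpoint is reachable with curl http://<vllm-server-host>:8000/metrics.

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: vllm            # illustrative job name
          scrape_interval: 15s      # illustrative interval
          metrics_path: /metrics
          static_configs:
            - targets: ["<vllm-server-host>:8000"]

exporters:
  otlp:
    endpoint: <instana-otlp-endpoint>:4317
    headers:
      x-instana-key: <agent-key>
      x-instana-host: <instana-host>
    # tls:
    #   insecure: true              # uncomment if the endpoint is not TLS-enabled

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]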

Instana monitors key performance and usage metrics from vLLM, such as:

  • Token usage (input and output tokens per request)
  • Request latency
  • Cache hit and miss rate

Once you complete the integration, you can access a prebuilt Instana dashboard that shows an aggregated view of these metrics across all your vLLM instances. This view helps you identify performance trends, bottlenecks, and potential optimisation opportunities.

Fig 3: vLLM metrics


#BusinessObservability
#OpenTelemetry
#LLM
