vLLM provides observability features that help you maintain efficient and reliable serving of large language models. These features include detailed runtime metrics, such as throughput, latency, and resource utilisation, that empower users to fine-tune performance.
vLLM also offers comprehensive logging and tracing, which help identify bottlenecks and debug issues. This allows for a deeper understanding of the system’s behaviour, enabling proactive issue resolution and performance optimization.
Furthermore, integration with modern monitoring tools such as Instana simplifies the creation of custom, real-time dashboards. Instana’s ability to automatically discover and map dependencies helps users understand how vLLM performs within the broader infrastructure. With Instana’s AI-powered insights, users can quickly pinpoint anomalies, identify root causes, and optimize vLLM’s operation for peak efficiency, ensuring a seamless and responsive experience for end users.
Monitoring LLMs hosted with vLLM
vLLM is an open-source, high-performance serving engine for Large Language Models (LLMs), developed by researchers at UC Berkeley. It is designed to maximise the throughput and efficiency of LLM inference, especially for real-time applications like chatbots and APIs.
You can now monitor your vLLM integrations seamlessly with Instana. By exporting traces and metrics from your vLLM applications to Instana, you can analyze calls and gain insights into your LLMs' performance.
Configuring the environment
Configure your environment to export traces to Instana either through an agent or directly to the Instana backend (agentless mode). To find the domain names of the Instana backend otlp-acceptor for different Instana SaaS environments, see Endpoints of the Instana backend otlp-acceptor.
To export traces to Instana using an Instana agent:
export TRACELOOP_BASE_URL=<instana-agent-host>:4317
export TRACELOOP_HEADERS="api-key=DUMMY_KEY"
To export traces directly to the Instana backend (agentless mode):
export TRACELOOP_BASE_URL=<instana-otlp-endpoint>:4317
export TRACELOOP_HEADERS="x-instana-key=<agent-key>,x-instana-host=<instana-host>"
Additionally, if the endpoint of the Instana backend otlp-acceptor or agent is not TLS-enabled, set OTEL_EXPORTER_OTLP_INSECURE to true.
export OTEL_EXPORTER_OTLP_INSECURE=true
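If you prefer to keep the exporter configuration in code rather than in shell exports, you can set the same variables programmatically before you initialize the Traceloop SDK. The following is a minimal sketch, assuming the SDK reads these environment variables when Traceloop.init is called; the endpoint and key values are placeholders that you must replace with your own.
import os

from traceloop.sdk import Traceloop

# Placeholder values: replace with your Instana agent host or otlp-acceptor endpoint and key.
os.environ.setdefault("TRACELOOP_BASE_URL", "<instana-agent-host>:4317")
os.environ.setdefault("TRACELOOP_HEADERS", "api-key=DUMMY_KEY")
# Set this variable only if the endpoint is not TLS-enabled.
os.environ.setdefault("OTEL_EXPORTER_OTLP_INSECURE", "true")

# The SDK picks up the variables set above when the exporter is initialized.
Traceloop.init(app_name="vllm-client")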
Exporting traces to Instana
To instrument the LLM application, complete the following steps:
- Verify that Python 3.10 or later is installed. To check the Python version, run the following command:
python3 -V
- Optional: Create a virtual environment for your applications to keep your dependencies consistent and prevent conflicts with other applications. To create a virtual environment, run the following commands:
pip3 install virtualenv
virtualenv vllm-env
source vllm-env/bin/activate
- Install vLLM and the Traceloop packages.
a. To install vLLM, run the following command:
pip3 install vllm==0.6.3.post1
To learn about other ways to install vLLM, see the vLLM documentation.
b. To install the Traceloop packages, run the following command:
pip3 install traceloop-sdk==0.40.9
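Optionally, confirm that the expected package versions are installed in the active environment. The following is a small check, assuming both packages were installed with pip as shown in the preceding commands:
from importlib.metadata import version

# Print the installed versions of vLLM and the Traceloop SDK.
print("vllm:", version("vllm"))
print("traceloop-sdk:", version("traceloop-sdk"))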
Verify the installation and configuration. A sample application is shown in the following example:
import requests
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow, task

# Initialize Traceloop
Traceloop.init(app_name="vllm-client")

@task(name="create_payload")
def create_payload(prompt):
    # Build the completion request body for the vLLM server
    return {
        "model": "ibm-granite/granite-3.0-2b-instruct",
        "prompt": prompt,
        "max_tokens": 10,
        "n": 1,
        "best_of": 1,
        "use_beam_search": False,
        "temperature": 0.0,
    }

@task(name="make_vllm_request")
def make_vllm_request(url, payload):
    # Send the completion request to the vLLM server
    return requests.post(url, json=payload)

@task(name="process_response")
def process_response(response):
    # Parse the JSON body of the response
    return response.json()

@workflow(name="vllm_client_workflow")
def run_chat():
    vllm_url = "http://<vllm-server-host>:8000/v1/completions"
    prompt = "San Francisco is a"
    payload = create_payload(prompt)
    response = make_vllm_request(vllm_url, payload)
    result = process_response(response)
    return result

if __name__ == "__main__":
    result = run_chat()
    print("Chat Response:", result)
The script sends a completion request to a vLLM server that runs the IBM Granite 3.0 2B model and tracks the entire process with distributed tracing for observability in Instana.
You can use the engine argument --otlp-traces-endpoint to configure the endpoint of the observability backend. To serve the model, run the following commands:
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=<instana-agent-host/instana-otlp-endpoint>:4317
vllm serve ibm-granite/granite-3.0-2b-instruct --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
Note: Unlike TRACELOOP_BASE_URL, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT must include the HTTP scheme.
For example:
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317
You can also configure a service name for your vLLM server to group traces:
export OTEL_SERVICE_NAME=vllm-server
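Before you run the sample client against the instrumented server, you can optionally confirm that the server is reachable. The following is a minimal sketch, assuming the server listens on the default port 8000; replace <vllm-server-host> with the host of your vLLM server.
import requests

# List the models that are served by the vLLM OpenAI-compatible server.
response = requests.get("http://<vllm-server-host>:8000/v1/models")
print(response.status_code, response.json())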
The following is an example trace view in Instana that is generated by running the preceding sample application: