Demystifying Inferencing at Scale with llm-d on Red Hat OpenShift on IBM Cloud

By Tyler Lisowski

  

This blog illustrates how llm-d, running on IBM Cloud Red Hat OpenShift, delivers cost-effective inferencing of LLMs at scale. The reader will gain insight into the key technologies llm-d builds on to deliver a scalable inferencing solution and see how llm-d integrates seamlessly into existing AI toolsets like AnythingLLM. Going a level deeper, let's look at how llm-d enables inferencing at scale by analyzing a deployment of llm-d on an IBM Cloud Red Hat OpenShift cluster running the granite-3.3-8b-instruct and llama-3-1-8b-instruct models. We will dive into how each of the following llm-d innovations, running on top of IBM Cloud Red Hat OpenShift, enables cost-efficient, high-performance inferencing at scale:

·     vLLM: the de facto open source standard inference server

·     Prefill and Decode Disaggregation

·     KV (key-value) Cache Offloading, based on LMCache

·     AI-Aware Network Routing

vLLM

Let's start by looking at the vLLM-powered model server backends of the live llama and granite models:

Llama model

$ kubectl get pods -n llama-3-1-8b-instruct
NAME                                            READY   STATUS      RESTARTS   AGE
llama-3-1-8b-instruct-decode-d56d75c57-ndr8v    2/2     Running     0          3h26m
llama-3-1-8b-instruct-epp-85484474d-wpqrl       1/1     Running     0          3h26m
llama-3-1-8b-instruct-prefill-5d9796c8c-mqvvk   1/1     Running     0          3h26m

Granite model

$ kubectl get pods -n granite-3-3-8b-instruct
NAME                                           READY   STATUS      RESTARTS   AGE
granite-3-3-8b-instruct-decode-796dd6758d-zt57t    2/2     Running     0          12d
granite-3-3-8b-instruct-epp-597c886f78-jg5m5       1/1     Running     0          12d
granite-3-3-8b-instruct-prefill-54856c7898-vwb52   1/1     Running     0          12d

A couple of notes on the architecture: both the prefill and decode model server deployments are powered by vLLM, as can be seen in the startup logs of the containers in each of the pods:

$ kubectl logs -n llama-3-1-8b-instruct  llama-3-1-8b-instruct-decode-d56d75c57-ndr8v

Defaulted container "vllm" out of: vllm, routing-proxy (init)

INFO 05-15 23:44:17 [__init__.py:239] Automatically detected platform cuda.

INFO 05-15 23:44:19 [api_server.py:1043] vLLM API server version 0.1.dev1+g7a1f25f
…

INFO 05-15 23:44:26 [config.py:713] This model supports multiple tasks: {'reward', 'score', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
$ kubectl logs -n llama-3-1-8b-instruct  llama-3-1-8b-instruct-prefill-5d9796c8c-mqvvk

INFO 05-15 23:44:14 [__init__.py:239] Automatically detected platform cuda.

INFO 05-15 23:44:15 [api_server.py:1043] vLLM API server version 0.1.dev1+g7a1f25f

…

INFO 05-15 23:44:22 [config.py:713] This model supports multiple tasks: {'generate', 'score', 'reward', 'embed', 'classify'}. Defaulting to 'generate'.

vLLM employs state-of-the-art memory management techniques and continuous batching of incoming requests to deliver fast model response times while efficiently using the hardware and accelerators it runs on. In published benchmarks, vLLM achieves up to 24x higher throughput than other popular model serving libraries. That efficiency and performance make it an ideal choice for environments serving large volumes of requests across multiple models.
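
Each prefill and decode pod is ultimately running the same OpenAI-compatible vLLM server that you could launch standalone. A minimal sketch of such a launch is shown below; the Hugging Face model id and flag values are illustrative, and inside the pods llm-d wires up the real arguments for you:

# Illustrative standalone launch; llm-d configures the actual arguments inside each pod
$ vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.90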

Prefill and Decode Disaggregation

When looking at the live deployment of the llama model, notice that there are two independent pods distributed across the cluster and working together in a peer-to-peer fashion: the decode llama model pod and the prefill llama model pod. Giving the prefill and decode phases of inferencing their own dedicated worker pools allows a different parallel-processing strategy to be applied to each phase. Beyond processing strategies, the prefill and decode workloads for a given model can be run on separate machines tailored to that step of the overall inferencing process. Ultimately, this deployment strategy delivers both faster time to first token and lower inter-token latency, whereas typical single-instance deployments must make complex tradeoffs between those two targets.
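
Because the prefill and decode pools are managed as separate Deployments (as the pod names above suggest), they can be scaled and scheduled independently. A couple of quick ways to see that in the live cluster (output omitted):

# List the independent prefill and decode Deployments backing the llama model
$ kubectl get deployments -n llama-3-1-8b-instruct
# -o wide shows which worker node each prefill/decode pod landed on
$ kubectl get pods -n llama-3-1-8b-instruct -o wide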

KV (key-value) Cache Offloading, based on LMCache

The model deployment also uses LMCache to form a distributed KV cache that spreads segments of the cache across Redis and the prefill and decode instances serving the model, enabling efficient utilization of all accelerator memory across the cluster. These processes can be seen running within vLLM in the logs of the decode and prefill pods:

$ kubectl logs -n llama-3-1-8b-instruct  llama-3-1-8b-instruct-prefill-5d9796c8c-mqvvk | grep lmcache

[2025-05-15 23:44:31,318] LMCache INFO: Loading LMCache config file /vllm-workspace/lmcache-prefiller-config.yaml (utils.py:32:lmcache.integration.vllm.utils)

[2025-05-15 23:44:31,321] LMCache INFO: Creating LMCacheEngine instance vllm-instance (cache_engine.py:467:lmcache.experimental.cache_engine)

[2025-05-15 23:44:31,322] LMCache INFO: Creating LMCacheEngine with config: LMCacheEngineConfig(chunk_size=256, local_cpu=False, max_local_cpu_size=0, local_disk=None, max_local_disk_size=0, remote_url=None, remote_serde=None, save_decode_cache=False, enable_blending=False, blend_recompute_ratio=0.15, blend_min_tokens=256, blend_special_str=' # # ', enable_p2p=False, lookup_url=None, distributed_url=None, error_handling=False, enable_controller=False, lmcache_instance_id='lmcache_default_instance', controller_url=None, lmcache_worker_url=None, enable_nixl=True, nixl_role='sender', nixl_peer_host='llama-3-1-8b-instruct-service-decode.llama-3-1-8b-instruct.svc.cluster.local', nixl_peer_port=55555, nixl_buffer_size=524288, nixl_buffer_device='cuda', nixl_enable_gc=True) (cache_engine.py:73:lmcache.experimental.cache_engine)

[2025-05-15 23:44:35,435] LMCache INFO: Received remote transfer descriptors (nixl_connector_v2.py:212:lmcache.experimental.storage_backend.connector.nixl_connector_v2)

[2025-05-15 23:44:35,435] LMCache INFO: Initializing usage context. (usage_context.py:235:lmcache.usage_context)
$ kubectl logs -n llama-3-1-8b-instruct llama-3-1-8b-instruct-decode-d56d75c57-ndr8v | grep lmcache

[2025-05-15 23:44:35,277] LMCache INFO: Loading LMCache config file /vllm-workspace/lmcache-decoder-config.yaml (utils.py:32:lmcache.integration.vllm.utils)

[2025-05-15 23:44:35,280] LMCache INFO: Creating LMCacheEngine instance vllm-instance (cache_engine.py:467:lmcache.experimental.cache_engine)

[2025-05-15 23:44:35,280] LMCache INFO: Creating LMCacheEngine with config: LMCacheEngineConfig(chunk_size=256, local_cpu=False, max_local_cpu_size=0, local_disk=None, max_local_disk_size=0, remote_url=None, remote_serde=None, save_decode_cache=False, enable_blending=False, blend_recompute_ratio=0.15, blend_min_tokens=256, blend_special_str=' # # ', enable_p2p=False, lookup_url=None, distributed_url=None, error_handling=False, enable_controller=False, lmcache_instance_id='lmcache_default_instance', controller_url=None, lmcache_worker_url=None, enable_nixl=True, nixl_role='receiver', nixl_peer_host='0.0.0.0', nixl_peer_port=55555, nixl_buffer_size=524288, nixl_buffer_device='cuda', nixl_enable_gc=True) (cache_engine.py:73:lmcache.experimental.cache_engine)

[2025-05-15 23:44:35,429] LMCache INFO: Sent local transfer descriptors to sender (nixl_connector_v2.py:223:lmcache.experimental.storage_backend.connector.nixl_connector_v2)

[2025-05-15 23:44:35,430] LMCache INFO: Initializing usage context. (usage_context.py:235:lmcache.usage_context)
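
The LMCacheEngineConfig echoed in these logs is loaded from the config files mounted into each pod (for example /vllm-workspace/lmcache-prefiller-config.yaml). A rough reconstruction of the prefiller config, derived from the logged fields, might look like the following; the exact key names are assumptions and may differ in your deployment:

# Hypothetical reconstruction of lmcache-prefiller-config.yaml, based on the logged LMCacheEngineConfig
chunk_size: 256
local_cpu: false
max_local_cpu_size: 0
enable_nixl: true
nixl_role: "sender"
nixl_peer_host: "llama-3-1-8b-instruct-service-decode.llama-3-1-8b-instruct.svc.cluster.local"
nixl_peer_port: 55555
nixl_buffer_size: 524288
nixl_buffer_device: "cuda"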

IBM Cloud VPC's highly performant, unmetered private network backbone, in combination with the NVIDIA NIXL library, ensures fast and secure peer-to-peer communication between pods and Redis, further improving overall throughput and inference latency.
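
Note that the nixl_peer_host logged by the prefiller above is a cluster-internal Service DNS name, so you can confirm the decode-side endpoint that receives KV cache transfers exists with a quick lookup (output omitted):

$ kubectl get service -n llama-3-1-8b-instruct llama-3-1-8b-instruct-service-decode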

AI-Aware Network Routing

To optimally route requests across the distributed model network, llm-d employs controllers built on top of the Gateway API Inference Extension that intelligently route each request to the prefill or decode model pod holding the hottest KV cache, ensuring efficient request processing. The core pieces of this system are an inference API gateway and a model endpoint picker. These pods are shown below, along with the load balancer associated with the inference API gateway:

$ kubectl get service -n llm-d inference-gateway
NAME                TYPE           CLUSTER-IP      EXTERNAL-IP                           PORT(S)         AGE
inference-gateway   LoadBalancer   172.21.237.63   89ab7fc6-us-east.lb.appdomain.cloud   443:31297/TCP   14d


$ kubectl get pods -n llm-d -l app.kubernetes.io/instance=inference-gateway
NAME                                 READY   STATUS    RESTARTS   AGE
inference-gateway-74c676f884-4dcpf   1/1     Running   0          12d


$ kubectl get pods -n llama-3-1-8b-instruct | grep llama-3-1-8b-instruct-epp
llama-3-1-8b-instruct-epp-85484474d-wpqrl       1/1     Running     0          3h15m

The endpoint picker (epp) continuously monitors the state of the distributed model deployment along with the KV cache metrics of each instance. When the inference API gateway receives a request, it sends metadata about the request to the epp pod, which determines the optimal model instance in the prefill or decode pool to process it. This intelligent routing keeps overall request latency down and throughput high, and avoids resource spikes on individual vLLM instances that could degrade request serving for the llama model.
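
The routing behavior itself is declared through Kubernetes Gateway API resources, with the inference-specific pieces coming from the Gateway API Inference Extension CRDs (InferencePool and InferenceModel). Assuming those CRDs are installed as in this deployment, you can list them with commands like the following; exact resource names and API versions may vary with the extension release:

# Standard Gateway API objects backing the inference gateway
$ kubectl get gateways,httproutes -n llm-d
# Inference Extension objects that tie the gateway to the model server pools
$ kubectl get inferencepools,inferencemodels -A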

Inferencing with Deployed Models

Now that we have taken a dive into the backend architecture by looking at a live deployment of the llama model, let's show how we can run inference against both the granite and llama models running in the same cluster. We will interact with both models through curl requests and through the popular LLM chat application AnythingLLM. With the following curl request we will ask the llama model the question:

"Do you enjoy reading technical blog posts of state of the art AI solutions?"

and receive the answer:

"I can process and analyze technical blog posts about state-of-the-art AI solutions. I don't have personal preferences or emotions, but I can provide information and insights about various AI technologies.

If you're interested in discussing or summarizing technical blog posts about AI, I'd be happy to help. Please share the title or a brief summary of the blog post, and I'll do my best to provide a summary, analysis, or insights about the topic."

cat >"/tmp/payload.json" <<EOF
{
   "model": "Llama-3.1-8B-Instruct",
   "messages": [
    {
      "role": "user",
      "content": "Do you enjoy reading technical blog posts of state of the art AI solutions?"
    }
   ]
}
EOF

curl  -H "Content-Type: application/json" -H "Authorization: Bearer API_KEY" -X POST https://llama-3-1-8b-instruct.vllmd-test-cluster-80d128fecd199542426020c17e5e9430-0002.us-east.containers.appdomain.cloud/v1/chat/completions -d @/tmp/payload.json

{"choices":[{"finish_reason":"stop","index":0,"logprobs":null,"message":{"content":"I can process and analyze technical blog posts about state-of-the-art AI solutions. I don't have personal preferences or emotions, but I can provide information and insights about various AI technologies.\n\nIf you're interested in discussing or summarizing technical blog posts about AI, I'd be happy to help. Please share the title or a brief summary of the blog post, and I'll do my best to provide a summary, analysis, or insights about the topic.","reasoning_content":null,"role":"assistant","tool_calls":[]},"stop_reason":null}],"created":1747365656,"id":"chatcmpl-a1bd676c-3867-40ad-beb9-8c4f2b81952a","model":"Llama-3.1-8B-Instruct","object":"chat.completion","prompt_logprobs":null,"usage":{"completion_tokens":92,"prompt_tokens":52,"prompt_tokens_details":null,"total_tokens":144}}

The same question asked to the granite model receives the following response: 

"While I don't have personal feelings, I can certainly help you find and summarize information on state-of-the-art AI solutions from technical blog posts. I can provide insights, compare different approaches, and explain complex concepts in simpler terms."

cat >"/tmp/payload.json" <<EOF
{
   "model": "granite-3.3-8b-instruct",
   "messages": [
    {
      "role": "user",
      "content": "Do you enjoy reading technical blog posts of state of the art AI solutions?"
    }
   ]
}
EOF

curl  -H "Content-Type: application/json" -H "Authorization: Bearer API_KEY" -X POST https://granite-3-3-8b-instruct.vllmd-test-cluster-80d128fecd199542426020c17e5e9430-0001.us-east.containers.appdomain.cloud/v1/chat/completions -d @/tmp/payload.json

{"choices":[{"finish_reason":"stop","index":0,"logprobs":null,"message":{"content":"While I don't have personal feelings, I can certainly help you find and summarize information on state-of-the-art AI solutions from technical blog posts. I can provide insights, compare different approaches, and explain complex concepts in simpler terms.","reasoning_content":null,"role":"assistant","tool_calls":[]},"stop_reason":null}],"created":1747365786,"id":"chatcmpl-fbd2df65-0cb4-4b5e-aa2d-b35fce710354","model":"granite-3.3-8b-instruct","object":"chat.completion","prompt_logprobs":null,"usage":{"completion_tokens":51,"prompt_tokens":76,"prompt_tokens_details":null,"total_tokens":127}}

Now that we have successfully sent requests to both models with curl, let's show how AnythingLLM can provide a chat UI on top of the deployed llm-d models! First, we will create a new workspace in AnythingLLM called llm-d-llama-3-1-8b-instruct. Then we will open that workspace's settings, go to Chat Settings, and under Workspace LLM Provider select the "Generic OpenAI" provider. We then enter the following information for the llama model. Note that the URLs and generated API keys will vary for every cluster and model deployment; this guide shows examples from a temporary test environment.

Base URL: https://llama-3-1-8b-instruct.vllmd-test-cluster-80d128fecd199542426020c17e5e9430-0002.us-east.containers.appdomain.cloud/v1
API Key: <API_KEY_GENERATED_FOR_MODEL>
Chat Model Name: Llama-3.1-8B-Instruct
Token context window: 131072
Max Tokens: 1024
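
Before saving, it can help to sanity check the Base URL and API key from a terminal. vLLM's OpenAI-compatible server exposes a model listing endpoint, so a request like the following (same placeholder API_KEY as above) should list the Llama-3.1-8B-Instruct model id:

$ curl -H "Authorization: Bearer API_KEY" https://llama-3-1-8b-instruct.vllmd-test-cluster-80d128fecd199542426020c17e5e9430-0002.us-east.containers.appdomain.cloud/v1/models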

Then click Save Settings. With that done, we can now ask the same question and get a similar response from the model in the AnythingLLM UI (screenshot shown below):

We will follow the same steps with a new granite-3-3-8b-instruct workspace, using the following parameters:

Base URL: https://granite-3-3-8b-instruct.vllmd-test-cluster-80d128fecd199542426020c17e5e9430-0001.us-east.containers.appdomain.cloud/v1
API Key: <API_KEY_GENERATED_FOR_MODEL>
Chat Model Name: granite-3.3-8b-instruct
Token context window: 131072
Max Tokens: 1024

After initializing the workspace, we will see the granite model return a similar response.

Conclusion

We have walked through the internals of how llm-d enables efficient, high-scale inferencing for multiple models by examining a deployment of it on IBM Cloud Red Hat OpenShift. IBM is excited to continue working with the community to publish further performance metrics and benchmarks showing the power of llm-d within IBM Cloud. A video covering the same content as this blog is available here. Our next blog will dive into the specific steps to deploy llm-d on a Red Hat OpenShift on IBM Cloud cluster.

Special thanks to Kodie Glosser, who assisted in the deployment of llm-d on IBM Cloud Red Hat OpenShift.
