Deploying Distributed LLM Inference Service with IBM Storage Scale for KV Cache offloading


Authors: @Yue Zhu @Radu Stoica @Animesh Trivedi @Jonathan Terner @Frank Schmuck @Jeremy Cohn @Christof Schmitt @Anthony Hsu @Guy Margalit @Vasily Tarasov @Swaminathan Sundararaman @Vincent Hsu

Introduction

Large Language Models (LLMs) are increasingly deployed in production environments for use cases like chatbots, document summarization, and code generation. Most modern LLMs (like Meta Llama or IBM Granite) are built on the transformer architecture and therefore produce and maintain a large runtime state while performing inference. Imagine a prompt of multiple input tokens fed to a model by a user. For each token in the input prompt, the model produces multiple Key (K) and Value (V) tensors (known as the “KV Cache”) representing an intermediate computation state that is stored in GPU memory during inference. These tensors are then used by the model to compute the prediction for the next output token. Output tokens are generated one by one, and all previously computed KV tensors are continuously reused. In the simplest case, when all output tokens for a given prompt have been generated, the KV tensors can be discarded. If the same user later continues the conversation with the model, the KV tensors need to be recomputed from scratch - from the very beginning of the whole prompt. Experiments show that computing KVs for 128,000 tokens takes almost 20 seconds when running the Llama3-70b model on four H100 GPUs, which is not an acceptable response time for many inference use cases.

To reduce such long response times, modern inference engines and distributed frameworks (like vLLM and llm-d) support caching KV tensors in GPU memory, CPU memory, and storage. High-performance distributed storage systems like IBM Storage Scale are particularly appealing for storing KV Cache data: unlike volatile GPU or CPU DRAM, Scale provides capacity that is practically unlimited and persistent, enables easy KV Cache sharing between inference servers, and can save and return tensors with the latency and throughput that many inference use cases require. In this blog post we describe in detail (and reference the necessary external documentation) how to configure an inference stack based on llm-d to use IBM Storage Scale for offloading KV Cache data. We assume that the environment already has a Storage Scale storage cluster deployed, but that no setup has been done on the GPU servers except for a typical Kubernetes deployment (which is required by llm-d).

The following diagram illustrates our target distributed inference setup. In green are the components that we assume are already installed and configured (Kubernetes on the GPU servers and the Scale storage cluster). In blue are the components that we will describe how to deploy. At a high level, the GPU cluster runs Kubernetes, llm-d, and the IBM Storage Scale client that remotely mounts the file system and accesses data from the storage cluster. llm-d runs the vLLM engine on individual GPU servers and uses LMCache with its FS connector to offload KV Cache to the mounted Scale file system.

IBM Storage Scale Setup

IBM Storage Scale is a general-purpose, high-performance file system that can be deployed in many configurations depending on the environment and target use case. In this blog we focus on one popular setup - a disaggregated setup where the storage cluster is separated from the compute cluster where user applications (inference in our case) are running. The storage cluster may be implemented either using a high-performance IBM Storage Scale System appliance or by deploying software-defined IBM Storage Scale over standard storage-rich servers with direct-attached drives.

The compute cluster requires a separate Scale installation, which can be achieved in several ways. Since the compute cluster runs Kubernetes, one option is to deploy IBM Storage Scale Container Native edition - in this case Scale daemons run in pods, Scale management is done through Custom Resources (CRs), and an appropriate CSI driver is automatically deployed to enable dynamic Persistent Volume (PV) provisioning. To deploy IBM Storage Scale Container Native, follow this documentation. This is our recommended approach because it provides the most flexible and automated way to manage PVs backed by the Scale filesystem.

An alternative option is to deploy Scale directly on the hosts (“under Kubernetes”) and use it together with the IBM Storage Scale CSI driver. In this case, the Scale cluster is managed using the traditional, non-container-native approach, but you can still rely on Kubernetes' dynamic persistent volume provisioning. Follow this documentation to install Scale on the hosts and this documentation to then deploy the IBM Storage Scale CSI driver. An even simpler, but less flexible, option is to skip the CSI driver installation, manually mount the remote Scale file system on the hosts, and then use the `hostPath` functionality in Kubernetes to statically provision the appropriate Persistent Volumes (PVs).

No matter which of the three approaches you take for the Scale client cluster deployment, in the end a user needs to create a Persistent Volume Claim (PVC) that llm-d will use for storing and loading KV Cache data. To reap the benefits of a shared KV Cache, this PVC should support the ReadWriteMany (RWX) access mode. This mode enables a vLLM instance running in a pod on one server to access the KV Cache generated by a vLLM instance running in another pod on a different server.
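The following is a minimal sketch of such a PVC backed by the Scale CSI driver. The storage class name and the requested size are placeholders for illustration: use the storage class created during your CSI or Container Native deployment and size the claim for your expected KV Cache footprint. The claim name kvc-dir matches the name referenced later in the llm-d values.yaml.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kvc-dir                      # referenced by the llm-d values.yaml below
  namespace: ibm-scale-llmd-demo     # must match the namespace used for llm-d
spec:
  accessModes:
    - ReadWriteMany                  # RWX so vLLM pods on different servers share the KV Cache
  storageClassName: ibm-spectrum-scale-csi-fileset   # placeholder: your Scale CSI storage class
  resources:
    requests:
      storage: 1Ti                   # placeholder: size for your expected KV Cache volume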

We would like to note that even though the default parameter values are typically a good starting point for decent performance, Scale can often be fine-tuned for specific setups and workloads. For example, increasing the pagepool size can help improve throughput, and if your network has multiple paths between client and server, increasing the number of TCP connections per node can help utilize the additional bandwidth. In future posts we intend to expand on tuning Scale for AI inference workloads.
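As an illustration of what such tuning looks like, the two parameters mentioned above are set with mmchconfig on the client (compute) cluster. The values and the node class name below are placeholders, not recommendations; derive them from your node memory and network topology.

# Placeholder values: size the pagepool for your client nodes' memory
mmchconfig pagepool=16G -N computeNodeClass

# Placeholder value: allow more TCP connections per node pair when multiple
# network paths exist between the clients and the storage servers
mmchconfig maxTcpConnsPerNodeConn=4

# Most daemon parameters take effect after the Scale daemons are restarted
mmshutdown -N computeNodeClass && mmstartup -N computeNodeClass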

llm-d Setup

In this section, we describe how to deploy the llm-d inference framework and run the Llama-3.1-70B-Instruct model in it. We follow the deployment guides located in the official llm-d repository under guides/inference-scheduling/ and customize the configuration for our needs.

First, ensure you have addressed the necessary prerequisites. These include the infrastructure prerequisites, configuring the gateway control plane, setting up your Hugging Face secret, and possibly deploying the monitoring stack. Verify that your gateway is up and running with the following command:

kubectl api-resources --api-group=inference.networking.k8s.io
NAME             SHORTNAMES   APIVERSION                       NAMESPACED   KIND
inferencepools   infpool      inference.networking.k8s.io/v1   true         InferencePool

The configuration of llm-d is largely confined to a single YAML file - values.yaml - an example of which is located in guides/inference-scheduling/ms-inference-scheduling/values.yaml. After our customizations, the content of this file looks like this:

multinode: false

modelArtifacts:
  uri: "hf://meta-llama/Llama-3.1-70B-Instruct"
  name: "meta-llama/Llama-3.1-70B-Instruct"
  size: 256Gi
  authSecretName: "llm-d-hf-token"

routing:
  servicePort: 8000
  proxy:
    image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.3.0
    connector: nixlv2
    secure: false

  inferencePool:
    create: false

  httpRoute:
    create: false

  epp:
    create: false


decode:
  create: true
  replicas: 1
  parallelism:
    tensor: 4
  monitoring:
    podmonitor:
      enabled: true
      portName: "metrics"  # decode vLLM service port (from routing.proxy.targetPort)
      path: "/metrics"
      interval: "30s"
  containers:
  - name: "vllm"
    image: lmcache/vllm-openai:v0.3.9
    modelCommand: custom
    command:
      - /bin/sh
      - '-c'
    args:
      - |
        source /opt/venv/bin/activate
        vllm serve meta-llama/Llama-3.1-70B-Instruct \
        --host 0.0.0.0 \
        --tensor-parallel-size 4 \
        --port 8200 \
        --gpu-memory-utilization 0.8 \
        --prefix-caching-hash-algo sha256_cbor \
        --enforce-eager \
        --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
    env:
      - name: GAIE_RELEASE_NAME_POSTFIX
      - name: NAMESPACE
        valueFrom:
          fieldRef:
            fieldPath: metadata.namespace # assumed to be the same as the EPP's
      - name: PYTHONHASHSEED
        value: "42"
      - name: POD_IP
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: status.podIP
      - name: UCX_TLS
        value: "cuda_ipc,cuda_copy,tcp"
      - name: VLLM_NIXL_SIDE_CHANNEL_HOST
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
      - name: VLLM_NIXL_SIDE_CHANNEL_PORT
        value: "5557"
      - name: LMCACHE_USE_EXPERIMENTAL
        value: "True"
      - name: LMCACHE_CHUNK_SIZE
        value: "4092"
      - name: LMCACHE_LOCAL_CPU
        value: "False"
      - name: LMCACHE_MAX_LOCAL_CPU_SIZE
        value: "64"
      - name: LMCACHE_REMOTE_URL
        value: "fs://localhost:6379/kvc-dir/lmcache"
      - name: LMCACHE_REMOTE_SERDE
        value: "naive"
    ports:
      - containerPort: 5557
        protocol: TCP
      - containerPort: 8200
        name: metrics
        protocol: TCP
    resources:
      limits:
        nvidia.com/gpu: "4"
      requests:
        nvidia.com/gpu: "4"
    mountModelVolume: true
    volumeMounts:
    - name: metrics-volume
      mountPath: /.config
    - name: torch-compile-cache
      mountPath: /.cache
    - name: kvc-dir
      mountPath: /kvc-dir
  volumes:
  - name: metrics-volume
    emptyDir: {}
  - name: torch-compile-cache
    emptyDir: {}
  - name: kvc-dir
    persistentVolumeClaim:
      claimName: kvc-dir


# PD disabled
prefill:
  create: false

We will now go over the most relevant parts of this file and explain the rationale behind them.

Under the modelArtifacts section we declare a model named meta-llama/Llama-3.1-70B-Instruct and specify that it is downloaded from the hf://meta-llama/Llama-3.1-70B-Instruct Hugging Face repository. If the model is already pre-staged in a PVC, one can switch to a PVC-based URI using the pvc:// prefix.
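For illustration, a modelArtifacts section that points at a pre-staged copy of the model could look roughly like the snippet below; the PVC name and the path inside it are hypothetical and depend on how the model weights were staged.

modelArtifacts:
  # Hypothetical PVC name and path; adjust to where the model weights were staged
  uri: "pvc://model-storage-pvc/meta-llama/Llama-3.1-70B-Instruct"
  name: "meta-llama/Llama-3.1-70B-Instruct"
  size: 256Gi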

The inference process consists of two stages: 1) prefill, when the K and V tensors for the input tokens are computed, and 2) decode, when the next output token is predicted based on the K and V values. Recent research shows that disaggregating prefill workers from decode workers can be beneficial for inference efficiency. llm-d supports disaggregated mode as an experimental feature. We decided to start with the stable setup without disaggregation and leave the disaggregated setup for a later stage (by setting create to false under the prefill section at the end of the file). In this configuration, the decode worker is responsible for both prefill and decode.

To offload the KV Cache to IBM Storage Scale (i.e., a distributed high-performance file system), we use vLLM with the LMCache KV Cache connector. We use the vLLM container images that the LMCache community builds (lmcache/vllm-openai:v0.3.9 in the decode container spec), which include both vLLM and LMCache functionality. To enable LMCache, we modify the command line that llm-d uses when starting the vLLM container: we set modelCommand to custom and provide the appropriate command and args. Notably, we enable LMCache with --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'.

LMCache itself is configured by passing environment variables to the vLLM container (the env section above). Notably, we set 1) LMCACHE_REMOTE_URL to fs://localhost:6379/kvc-dir/lmcache - the path where LMCache stores the offloaded KV Cache data; 2) LMCACHE_CHUNK_SIZE to 4092, so that the resulting KV Cache file sizes fit the 16 MiB block size of our Scale file system for efficient I/O; and 3) PYTHONHASHSEED to 42, which ensures all vLLM instances use a consistent seed for generating filenames, enabling KV Cache reuse under tensor parallelism. We also need to mount the kvc-dir PVC in the vLLM pods, which is done via the kvc-dir entries in the volumeMounts and volumes sections.

The Llama-70B model requires significant compute and memory resources, so we recommend allocating four GPUs per decode pod (the nvidia.com/gpu requests and limits in the resources section). If you use the CUDA_VISIBLE_DEVICES environment variable in your deployment, ensure it is set to CUDA_VISIBLE_DEVICES=0,1,2,3 so that all four GPUs are available to vLLM.
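If your setup does pin GPUs this way, the variable can be added to the decode container's env list alongside the other variables shown above; this entry is only needed when explicit GPU pinning is required in your environment.

- name: CUDA_VISIBLE_DEVICES
  value: "0,1,2,3"   # expose all four requested GPUs to vLLM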

Starting and Verifying llm-d

From the llm-d repository, we start llm-d with the commands below. Before you start, ensure you have updated your setup in ms-inference-scheduling/values.yaml.

cd guides/inference-scheduling
export NAMESPACE=ibm-scale-llmd-demo
helmfile apply -n ${NAMESPACE}

To verify that the deployment is working, first list all Helm releases to confirm that the three charts were installed into your chosen namespace:

helm list -n ${NAMESPACE}
NAME           	NAMESPACE    	REVISION	UPDATED                                	STATUS  	CHART                     	APP VERSION
gaie-kv-events 	llm-d-precise	1       	2025-11-11 20:26:17.211764522 +0000 UTC	deployed	inferencepool-v1.0.1      	v1.0.1
infra-kv-events	llm-d-precise	1       	2025-11-11 20:26:16.405697947 +0000 UTC	deployed	llm-d-infra-v1.3.3        	v0.3.0
ms-kv-events   	llm-d-precise	1       	2025-11-11 20:26:18.66475051 +0000 UTC 	deployed	llm-d-modelservice-v0.2.16	v0.2.0

Then check the available resources under ${NAMESPACE}; example output is shown below:

kubectl get all -n ${NAMESPACE}
NAME                                                           READY   STATUS    RESTARTS   AGE
pod/gaie-kv-events-epp-84c54f549f-jm7gj                        1/1     Running   0          6m4s
pod/infra-kv-events-inference-gateway-istio-865d6f9f85-x6rg2   1/1     Running   0          6m6s
pod/ms-kv-events-llm-d-modelservice-decode-6c695c58f7-gtj8x    2/2     Running   0          6m3s

NAME                                              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
service/gaie-kv-events-epp                        ClusterIP   10.43.53.246    <none>        9002/TCP,9090/TCP,5557/TCP   6m4s
service/gaie-kv-events-ip-805c964d                ClusterIP   None            <none>        54321/TCP                    6m4s
service/infra-kv-events-inference-gateway-istio   ClusterIP   10.43.68.22     <none>        15021/TCP,80/TCP             6m6s

NAME                                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/gaie-kv-events-epp                        1/1     1            1           6m4s
deployment.apps/infra-kv-events-inference-gateway-istio   1/1     1            1           6m6s
deployment.apps/ms-kv-events-llm-d-modelservice-decode    1/1     1            1           6m3s

NAME                                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/gaie-kv-events-epp-84c54f549f                        1         1         1       6m4s
replicaset.apps/infra-kv-events-inference-gateway-istio-865d6f9f85   1         1         1       6m6s
replicaset.apps/ms-kv-events-llm-d-modelservice-decode-6c695c58f7    1         1         1       6m3s

Lastly, verify the success of the deployment by querying the available models from the llm-d service endpoint:

kubectl port-forward -n ${NAMESPACE} service/infra-kv-events-inference-gateway-istio 8000:80
curl http://localhost:8000/v1/models | jq .

Example output:

{
  "data": [
    {
      "created": 1762893014,
      "id": "meta-llama/Llama-3.1-70B-Instruct",
      "max_model_len": 131072,
      "object": "model",
      "owned_by": "vllm",
      "parent": null,
      "permission": [
        {
          "allow_create_engine": false,
          "allow_fine_tuning": false,
          "allow_logprobs": true,
          "allow_sampling": true,
          "allow_search_indices": false,
          "allow_view": true,
          "created": 1762893014,
          "group": null,
          "id": "modelperm-8382017e263443a890dc30f69c0677aa",
          "is_blocking": false,
          "object": "model_permission",
          "organization": "*"
        }
      ],
      "root": "meta-llama/Llama-3.1-70B-Instruct"
    }
  ],
  "object": "list"
}
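As an additional functional check beyond listing models, you can send a completion request through the same port-forwarded gateway; the prompt and parameters below are arbitrary.

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "prompt": "IBM Storage Scale is",
        "max_tokens": 32
      }' | jq .

This exercises the full request path through the gateway, the inference scheduler, and the decode pod; with longer prompts you should also start to see LMCache files appear under the kvc-dir mount on the Scale file system.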

If these requests produce no output, you may be missing the HTTPRoute resource that connects the gateway to the model service. You can check for it, and apply it if needed, using the commands below.

kubectl get httproute -n ${NAMESPACE}
NAME              HOSTNAMES   AGE
llm-d-kv-events               11d
kubectl apply -f httproute.yaml -n ${NAMESPACE}
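
For reference, an HTTPRoute for this setup looks roughly like the sketch below. The llm-d guide ships its own httproute.yaml; the route, gateway, and InferencePool names here are inferred from the release names in our example and may differ in your deployment.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-d-kv-events
spec:
  parentRefs:
    - name: infra-kv-events-inference-gateway   # gateway created by the infra chart
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - group: inference.networking.k8s.io    # route requests to the InferencePool
          kind: InferencePool
          name: gaie-kv-events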

Conclusion

In this post, we explored how IBM Storage Scale can serve as a high-performance distributed file system for llm-d by configuring Scale as a remote storage cluster. With this setup, llm-d can easily use Scale to stage KV Cache data, ensuring efficient access and reuse of the KV Cache across nodes. We also highlighted a couple of Scale parameters that can be tailored to specific workloads and cluster topologies. Stay tuned for future deeper dives on Scale performance tuning and advanced features optimized for large-scale inference workloads.
