Data and AI on Power


Run vLLM on ppc64le Architecture

By Manjunath Kumatagi posted 5 days ago


Introduction

Large Language Models (LLMs) are revolutionizing various fields, and vLLM has emerged as a powerful library for LLM inference and serving. Great news for users with ppc64le hardware! Recent developments (https://github.com/vllm-project/vllm/pull/5652) indicate that vLLM has added support for this architecture. This blog outlines the steps to get started with vLLM on ppc64le.

Build the container image:

$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm/
$ podman build --format docker -t vllm:ppc64le -f Dockerfile.ppc64le .

Run the built image:

# list the images that were built
$ podman images
REPOSITORY                     TAG         IMAGE ID      CREATED         SIZE
localhost/vllm                 ppc64le     9a44a2021b41  38 minutes ago  4.32 GB
<none>                         <none>      a80bc1de136c  43 minutes ago  2.51 GB
docker.io/mambaorg/micromamba  latest      358d7e727885  9 days ago      137 MB
$
# create a directory for caching models downloaded from Hugging Face
$ mkdir -p ~/.cache/huggingface
$ podman run -ti -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 --ipc=host localhost/vllm:ppc64le

The above command starts the server with the default model (facebook/opt-125m) pulled from Hugging Face. The server can be queried in the same format as the OpenAI API. For example:

List the models:

$ curl http://localhost:8000/v1/models | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   484  100   484    0     0   236k      0 --:--:-- --:--:-- --:--:--  236k
{
  "object": "list",
  "data": [
    {
      "id": "facebook/opt-125m",
      "object": "model",
      "created": 1719484379,
      "owned_by": "vllm",
      "root": "facebook/opt-125m",
      "parent": null,
      "max_model_len": 2048,
      "permission": [
        {
          "id": "modelperm-8ba9ddf949764d359f2db7eb1fa92090",
          "object": "model_permission",
          "created": 1719484379,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
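The model listing can also be consumed programmatically. Below is a minimal Python sketch that parses a response of the shape shown above; the `sample` payload is a trimmed copy of the server output used for illustration, not live data. In practice you would fetch the JSON with an HTTP GET against http://localhost:8000/v1/models.

```python
def list_models(payload):
    """Return (model id, max context length) pairs from a /v1/models response."""
    return [(m["id"], m["max_model_len"]) for m in payload["data"]]

# Trimmed copy of the response shown above, for illustration only.
sample = {
    "object": "list",
    "data": [
        {"id": "facebook/opt-125m", "object": "model", "max_model_len": 2048},
    ],
}

print(list_models(sample))  # [('facebook/opt-125m', 2048)]
```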

Text completion:

$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   443  100   308  100   135    306    134  0:00:01  0:00:01 --:--:--   440
{
  "id": "cmpl-f77c7f6e64df4221836d85d64d28ae04",
  "object": "text_completion",
  "created": 1719484486,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "index": 0,
      "text": " great place to live.  I",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 12,
    "completion_tokens": 7
  }
}
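The same completion request can be issued from Python using only the standard library. This is a minimal sketch, assuming the server started above is reachable at localhost:8000; `build_request` and `complete` are illustrative helper names, not part of vLLM.

```python
import json
from urllib import request

API_URL = "http://localhost:8000/v1/completions"  # the server started above

def build_request(model, prompt, max_tokens=7, temperature=0):
    """Build a JSON body matching the curl example above."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode()

def complete(prompt, model="facebook/opt-125m"):
    """POST the prompt to the running vLLM server; return the first choice's text."""
    req = request.Request(
        API_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```

With the container running, `complete("San Francisco is a")` should return a continuation like the one in the curl output above.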

For more information on usage, please refer to the vLLM documentation: https://docs.vllm.ai/en/stable/index.html
