Introduction
Large Language Models (LLMs) are revolutionizing various fields, and vLLM has emerged as a powerful library for LLM inference and serving. Great news for users with ppc64le hardware! A recent change (https://github.com/vllm-project/vllm/pull/5652) added support for this architecture to vLLM. This blog outlines the steps to get started with vLLM on ppc64le.
Build the container image:
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm/
$ podman build --security-opt label=disable --format docker -t vllm:ppc64le -f Dockerfile.ppc64le .
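Once the build finishes, it is worth a quick sanity check that the image really targets the right architecture. A minimal sketch using podman's Go-template inspect output (the expected result is shown on the second line):
$ podman image inspect --format '{{.Architecture}}' localhost/vllm:ppc64le
ppc64le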
Use the built image:
# list the images that were built
$ podman images
REPOSITORY                     TAG      IMAGE ID      CREATED         SIZE
localhost/vllm                 ppc64le  9a44a2021b41  38 minutes ago  4.32 GB
<none>                         <none>   a80bc1de136c  43 minutes ago  2.51 GB
docker.io/mambaorg/micromamba  latest   358d7e727885  9 days ago      137 MB
$
# create a directory for caching models downloaded from Hugging Face
$ mkdir -p ~/.cache/huggingface
$ podman run -ti -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 --ipc=host localhost/vllm:ppc64le
The above command starts the server with the default model (facebook/opt-125m) pulled from Hugging Face. The server can then be queried using the same request format as the OpenAI API, as the examples below show.
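To serve a different model, extra vLLM server flags can be appended to the run command. This is a sketch that assumes the image keeps vLLM's OpenAI-compatible API server as its entrypoint (as the upstream Dockerfiles do); bigscience/bloom-560m is only an illustrative choice:
$ podman run -ti -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 --ipc=host localhost/vllm:ppc64le \
    --model bigscience/bloom-560m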
List the models:
$ curl http://localhost:8000/v1/models | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   484  100   484    0     0   236k      0 --:--:-- --:--:-- --:--:--  236k
{
  "object": "list",
  "data": [
    {
      "id": "facebook/opt-125m",
      "object": "model",
      "created": 1719484379,
      "owned_by": "vllm",
      "root": "facebook/opt-125m",
      "parent": null,
      "max_model_len": 2048,
      "permission": [
        {
          "id": "modelperm-8ba9ddf949764d359f2db7eb1fa92090",
          "object": "model_permission",
          "created": 1719484379,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
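Besides listing models, the server also exposes a lightweight /health route (present in recent vLLM releases) that returns an empty HTTP 200 response once the engine is ready, which is handy for liveness probes:
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health
200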
Completion:
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   443  100   308  100   135    306    134  0:00:01  0:00:01 --:--:--   440
{
  "id": "cmpl-f77c7f6e64df4221836d85d64d28ae04",
  "object": "text_completion",
  "created": 1719484486,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "index": 0,
      "text": " great place to live. I",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 12,
    "completion_tokens": 7
  }
}
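The same endpoint also supports the standard OpenAI streaming protocol: adding "stream": true to the request body returns the completion incrementally as server-sent events instead of a single JSON response (streamed chunks omitted here):
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0,
        "stream": true
    }'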
For more information on usage, please refer to the vLLM documentation: https://docs.vllm.ai/en/stable/index.html