Deploying large language models (LLMs) on IBM Power CPUs can require specific package versions, toolchain configuration, and runtime tuning to achieve reliable, high‑throughput inference—especially when running CPU‑only with bfloat16 (bf16). This blog provides a tested, repeatable setup for vLLM on IBM Power11, covering environment creation, dependency installation (including gcc-toolset-14), build and install steps, execution using fmwork, and what to measure to validate performance.
Scope and outcome
By following the setup, installation, execution steps, and recommended configuration settings described in the blog, you will be able to:
- Set up a Python environment on Power11 for vLLM with bf16.
- Install and build the Power‑ready vLLM and dependencies.
- Run a server/client workflow (via fmwork) to exercise inference.
- Collect metrics and confirm stable throughput and predictable behavior on Power11.
Prerequisites
Before you begin, ensure the following are in place:
System and OS
- You must have an IBM Power11 system with a Linux distribution supported on IBM Power (ppc64le), such as Red Hat Enterprise Linux.
- You must have shell access with permissions to modify environment variables and install Python packages.
Language and tools
- You must have Python 3.12 available on the system.
- You must have gcc-toolset-14 installed or accessible because it is the minimum required GCC version on Power11.
- You must have network access to the IBM Python wheels repository: https://wheels.developerfirst.ibm.com/ppc64le/linux.
Resources
- You must have adequate disk space to build and install packages and models.
- You must have stable internet connectivity to clone repositories and download dependencies.
Start by setting up a Python virtual environment
Use a dedicated virtual environment to isolate dependencies.
python3.12 -m venv testenv
source testenv/bin/activate && \
pip install --upgrade pip
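A quick check confirms that the venv picked up the expected interpreter before you go further:
python -V   # expect Python 3.12.x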
Then, create a requirements.txt file and copy the following packages into it. Note that these package versions work well with the vLLM 0.11.1 build.
Before you proceed, review the following note to understand when you may not need to install every library from the list.
Note: Some dependencies (for example: ffmpeg, libprotobuf, openblas) may already be present as system libraries in certain Power environments. The listed versions reflect a tested configuration using IBM‑provided wheels.
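If you are unsure whether one of these libraries is already present system-wide, a quick lookup of the shared-library cache before installing can help (openblas is shown as an example):
ldconfig -p | grep -i openblas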
abseil-cpp
argon2-cffi-bindings
cachetools
cffi
cmake
dill
datasets
ffmpeg
grpc-cpp
grpcio==1.76.0
h5py==3.13.0
hdf5
httptools
ibm_db
ipykernel
jedi
libprotobuf
libvpx
MarkupSafe
matplotlib
matplotlib-inline
ml-dtypes
mpmath
msgspec
ninja
numba
numpy
onnxruntime
openblas
opencv-python-headless
opus
outlines_core
pandas
pillow
pip
protobuf
psutil
pyarrow==19.0.0
pydantic
pydantic_core
pydantic-extra-types
PyYAML
pyzmq
regex
scikit-learn
scipy==1.15.3
sentencepiece
setuptools
setuptools-scm
sklearn-pandas
sympy
termcolor
tiktoken
tokenizers
torch==2.8.0
torchaudio==2.8.0
torchvision==0.23.0
transformers
tzdata
wrapt
yarl
Once you create the requirements.txt file, install the packages using the following command:
pip install --prefer-binary -r requirements.txt --extra-index-url=https://wheels.developerfirst.ibm.com/ppc64le/linux
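After the install completes, pip can verify that the resolved package set is mutually consistent, and you can spot-check the pinned versions:
pip check
pip list | grep -E 'torch|grpcio|pyarrow|scipy'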
Update gcc-toolset
To successfully build and run vLLM on IBM Power11, you need an updated GCC toolchain because certain dependencies require a modern compiler. The recommended version for Power11 is gcc-toolset-14. This step ensures that your environment uses the correct compiler before proceeding with installation.
Enable the toolset and make sure that your PATH resolves to the new GCC 14 binaries:
scl enable gcc-toolset-14 bash
source scl_source enable gcc-toolset-14
export PATH=/opt/rh/gcc-toolset-14/root/usr/bin/:$PATH
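For example, the following checks confirm that GCC 14 is the active compiler:
gcc --version   # should report gcc (GCC) 14.x
which gcc       # should print /opt/rh/gcc-toolset-14/root/usr/bin/gcc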
Set the environment variables
To ensure that libraries and build tools are correctly located during runtime and compilation, you need to configure several environment variables. These variables define paths for Python packages, shared libraries, and compiler settings, as well as vLLM-specific tuning parameters.
Set the following environment variables:
Important
- Set LD_LIBRARY_PATH entries only if your environment does not already provide these libraries system‑wide; over‑specifying library paths can lead to application binary interface (ABI) conflicts or degraded performance.
- Adjust SITE_PACKAGE_PATH if your virtual environment uses lib instead of lib64.
export SITE_PACKAGE_PATH=$VIRTUAL_ENV/lib64/python3.12/site-packages
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$SITE_PACKAGE_PATH/libprotobuf/lib64/:$SITE_PACKAGE_PATH/openblas/lib/:$SITE_PACKAGE_PATH/:$SITE_PACKAGE_PATH/ffmpeg/lib/:$SITE_PACKAGE_PATH/libvpx/lib/:$SITE_PACKAGE_PATH/lame/lib/"
export CMAKE_PREFIX_PATH=$SITE_PACKAGE_PATH/libprotobuf:$CMAKE_PREFIX_PATH
export CC=/opt/rh/gcc-toolset-14/root/usr/bin/gcc
export CXX=/opt/rh/gcc-toolset-14/root/usr/bin/g++
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND="auto"
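If you are unsure whether your venv uses lib or lib64 (see the note above), a small guard like the following sets SITE_PACKAGE_PATH accordingly; this is a convenience sketch, not a required step:
# Pick lib64 or lib depending on how the venv was created
if [ -d "$VIRTUAL_ENV/lib64/python3.12/site-packages" ]; then
    export SITE_PACKAGE_PATH=$VIRTUAL_ENV/lib64/python3.12/site-packages
else
    export SITE_PACKAGE_PATH=$VIRTUAL_ENV/lib/python3.12/site-packages
fi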
Install vLLM
Once the environment is prepared and the required toolchain is in place, the next step is to install vLLM. This involves cloning the vLLM repository, installing its dependencies, and building it for CPU execution. Use the following commands to complete the installation:
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -r requirements/common.txt
VLLM_TARGET_DEVICE=cpu python3 setup.py install
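To confirm that the build installed into the active environment, a quick import check is enough:
python -c "import vllm; print('vLLM', vllm.__version__)"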
Example: running vLLM with fmwork
The following example demonstrates server and client parameters, tensor parallel size, dtype, and sequence lengths.
fmwork is used here to orchestrate server/client execution and load generation.
WORKSPACE="/home/user/fmwork/infer/vllm"
MODEL_ROOT="/home/user"
MODEL_NAME="granite-3.3-8b-instruct"
# --- Execution ---
./runner \
--dir_work "$WORKSPACE" \
--mode server \
--model_root "$MODEL_ROOT" \
--model_name "$MODEL_NAME" \
-- \
server \
--env PYTHONUNBUFFERED=1 \
--env VLLM_USE_V1=1 \
--tensor-parallel-size 1 \
--max-num-seqs 16 \
--dtype bfloat16 \
--max-model-len 8192 \
--max-num-batched-tokens 32768 \
-- \
client \
--env PYTHONUNBUFFERED=1 \
--dataset-name random \
--random-input-len 2048 \
--random-output-len 1024 \
--num-prompts 1 \
--max-concurrency 1 \
--ignore-eos
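If fmwork is not available, a roughly equivalent server/client flow can be driven with vLLM's own entry points on recent builds; the sketch below assumes the model path from the example above and vLLM's default port of 8000:
# Terminal 1: start the OpenAI-compatible server
vllm serve /home/user/granite-3.3-8b-instruct \
    --dtype bfloat16 \
    --tensor-parallel-size 1 \
    --max-num-seqs 16 \
    --max-model-len 8192 \
    --max-num-batched-tokens 32768
# Terminal 2: generate comparable load with the bench subcommand
vllm bench serve \
    --model /home/user/granite-3.3-8b-instruct \
    --dataset-name random \
    --random-input-len 2048 \
    --random-output-len 1024 \
    --num-prompts 1 \
    --max-concurrency 1 \
    --ignore-eos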
What to record per run
To evaluate performance and ensure reproducibility, it is important to capture key configuration details and metrics for each run. Recording this information will help you compare different setups, identify bottlenecks, and validate tuning changes. Use the following checklist:
- Model and vLLM version or commit: for example, Granite-3.3-8B-Instruct, vLLM 0.11.1
- dtype: bf16
- Threading: VLLM_CPU_OMP_THREADS_BIND, SMT level
- Load shape: input/output token lengths, concurrency, batch limits
- Metrics: TTFT, ITL, Throughput (output tokens)
- From nmon: CPU utilization, context switches, miss rate, and average run queue (see the collection example below)
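For the nmon metrics, a timed background capture per run keeps the data comparable; the interval and count below are illustrative:
nmon -f -s 5 -c 120   # one snapshot every 5 seconds, 120 snapshots, written to a .nmon file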
Troubleshooting
When working through the installation and configuration steps, you may encounter issues related to toolchain paths, library dependencies, or version mismatches. This section provides quick checks and corrective actions for common problems, helping you resolve errors efficiently and continue with the setup.
Use the following checks if you encounter any errors:
- Toolchain/path issues
Symptom: Build fails or compilers not found
Fix:
gcc --version
which gcc
echo $PATH
Ensure that PATH includes /opt/rh/gcc-toolset-14/root/usr/bin/.
- Library resolution issues
Symptom: Runtime errors about missing shared libraries such as libprotobuf, openblas, ffmpeg, or libvpx
Fix:
echo $LD_LIBRARY_PATH
Confirm that the exported directories appear in the output. If they do not, activate the virtual environment again and re-export the variables.
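To see whether a specific extension resolves all of its shared objects, ldd is a quick diagnostic (the torch library path below is illustrative):
ldd "$SITE_PACKAGE_PATH/torch/lib/libtorch_cpu.so" | grep "not found"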
- Python environment conflicts
Symptom: Version mismatches or pip install failures
Fix:
python3.12 -m venv testenv
source testenv/bin/activate
pip install --upgrade pip
Recreate the venv and reinstall from requirements.txt.
- vLLM or Torch version mismatches
Symptom: Import errors or API incompatibility
Fix:
python -c "import torch, sys; print('Torch:', torch.__version__)"
python -c "import vllm; print('vLLM imported OK')"
Verify Torch 2.8.0 and the installed vLLM build.
- Runtime configuration issues
Symptom: Poor throughput or unstable performance
Fix: Adjust VLLM_CPU_OMP_THREADS_BIND and SMT settings; re‑run and compare TTFT/ITL/throughput.
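For example, instead of auto binding, OpenMP threads can be pinned to an explicit core range; the range below is illustrative and should match your machine's core count and SMT layout:
export VLLM_CPU_OMP_THREADS_BIND="0-31"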
Validation and benchmarking
After completing the installation and running vLLM, it is essential to validate that the setup works as expected and measure the performance. This section outlines key checks, metrics to capture, and commands to confirm reproducibility and benchmark throughput on Power11.
After you run the server and client, review the client's benchmark summary and record TTFT, ITL, and output-token throughput for each configuration, together with the checklist items above, so that runs remain directly comparable.
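As a quick liveness check before you drive load, confirm that the server answers on its OpenAI-compatible endpoint (the port assumes vLLM's default of 8000):
curl -s http://localhost:8000/v1/models | python -m json.tool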
Summary
Running vLLM on IBM Power11 with bf16 is achievable with a repeatable setup that covers environment preparation, toolchain alignment, and CPU‑focused build steps. With the recommended configuration and tuning, you can obtain predictable behavior and measure performance consistently across runs.
Additionally, keep the following critical points in view for a smooth and effective setup:
- Proven setup path on Power11 using Python 3.12 and gcc-toolset-14, ensuring compiler compatibility for the build.
- Executable workflow to install dependencies, build vLLM for CPU, and run a server/client sequence with fmwork.
- Practical tuning controls—notably VLLM_CPU_OMP_THREADS_BIND and SMT settings—to stabilize performance and improve throughput.
- Focused validation guidance (TTFT, ITL, throughput) and optional system metrics to benchmark and compare configurations reliably.
- Targeted troubleshooting to resolve common toolchain, library path, and version issues quickly.