Data and AI on Power

IBM Power systems provide a robust and scalable platform for a wide range of data and AI workloads, offering benefits in performance, security, and ease of use.

How to Run a LLM Model on IBM Power10 using llama.cpp

By Amrita H S posted Mon February 24, 2025 01:46 AM

This blog presents the steps required to run Large Language Model (LLM) inference on IBM Power10 systems using llama.cpp.

llama.cpp is a C/C++ library that efficiently processes GGML/GGUF-formatted models, facilitating the execution of LLMs such as LLaMA, Vicuna, or WizardLM on personal computers without requiring a GPU. Although the library is optimized for CPU usage, it also supports GPU acceleration through various BLAS backends such as OpenBLAS.

llama.cpp works by loading the GGUF-formatted model, building a compute graph from the loaded model, tokenizing the prompt, and feeding the tokens to the compute graph in a loop. Each iteration generates a new token using top-K and top-P sampling; the new token is appended to the context, and the updated context is used in the next iteration.
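Once the llama-cli binary is built (see the build steps below), the sampling behaviour can be adjusted directly from the command line. For example (flag names as found in recent llama.cpp builds, and the model path is a placeholder for any GGUF model prepared in the later steps):

./build_llama/bin/llama-cli -m ./models/model.gguf -n 128 -p 'Hello' --top-k 40 --top-p 0.9 --temp 0.8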

Project Setup:

  • Prerequisites

gcc-toolset-13 is the minimum requirement to build llama.cpp. Enable gcc-toolset-13 and add it to the path.

scl enable gcc-toolset-13 bash

source scl_source enable gcc-toolset-13

export PATH=/opt/rh/gcc-toolset-13/root/usr/bin/:$PATH
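To confirm that gcc-toolset-13 is the compiler being picked up, check the gcc on the PATH (the version output will vary with your installation):

which gcc

gcc --version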

  • Build llama.cpp from sources
    • Download sources using git clone

MMA optimizations for float and quantized int8 data types are merged into master, so you can directly clone the master branch of the llama.cpp git repository.

           git clone https://github.com/ggerganov/llama.cpp.git

         If you are interested in a branch other than master, make sure that the MMA optimization patches are present (see the check below).
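A quick heuristic is to search the branch's commit history for MMA-related changes (commit messages may be worded differently, so this is not a definitive check):

git log --oneline | grep -i mma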

    • Use cmake to build llama.cpp.

cmake -B build_llama

cmake --build build_llama --config Release
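The build can optionally be parallelized across available CPUs using the standard CMake jobs option:

cmake --build build_llama --config Release -j$(nproc)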

    • To obtain a debug build of llama.cpp:

cmake -B build_llama -DCMAKE_BUILD_TYPE=Debug

cmake --build build_llama

  • Download LLM models

If you already have the model downloaded somewhere, copy it into the ‘models’ directory in the llama.cpp sources.

If not, follow the steps in the link below to download the models from Hugging Face. If the model file is already in GGUF format, download it and copy it to the ‘models’ folder. Otherwise, download the entire model folder and convert it to GGUF format as described in the link below.
https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#obtaining-and-quantizing-models
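As an alternative, a model can be pulled directly with the Hugging Face CLI. The commands below are a sketch: they assume the huggingface_hub package is installed and, for gated models such as Mistral, that you have accepted the license and logged in with a valid token.

python3 -m pip install -U "huggingface_hub[cli]"

huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 --local-dir ./models/mistral_models/7B-Instruct-v0.3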

  • Convert the LLM model into GGUF format
    • torch is required to convert Hugging Face models to GGUF format. Make sure that the torch package is available in your environment.
    • Install the rest of the dependencies listed in requirements.txt:

python3 -m pip install -r requirements.txt
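To verify that torch can be imported in the environment:

python3 -c "import torch; print(torch.__version__)"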

    • Use convert_hf_to_gguf.py to convert the models.
      • To convert into a GGUF model in 32-bit float (f32) format:

python3 convert_hf_to_gguf.py models/mistral_models/7B-Instruct-v0.3/ --outtype f32

      • To convert into a GGUF model in half-precision float (f16) format:

python3 convert_hf_to_gguf.py models/mistral_models/7B-Instruct-v0.3/ --outtype f16

bf16, q8_0, tq1_0, tq2_0 are the other supported data types.
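The output file name can also be set explicitly with --outfile (the file name below is only an example); for instance, to produce an 8-bit GGUF directly from the converter:

python3 convert_hf_to_gguf.py models/mistral_models/7B-Instruct-v0.3/ --outtype q8_0 --outfile models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-Q8_0.gguf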

  • Quantize the models into different quantized formats

If you want to run quantized models, use the steps below. If you plan to run only non-quantized models, you can skip this step.

    • To quantize the models into 4 bits (using Q4_0)

./build_llama/bin/llama-quantize ./models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-F16.gguf ./models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-Q4_0.gguf Q4_0

    • To quantize the models into 8 bits (using Q8_0)

./build_llama/bin/llama-quantize ./models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-F16.gguf ./models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-Q8_0.gguf Q8_0
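Running llama-quantize without any arguments typically prints its usage text, including the full list of supported quantization types:

./build_llama/bin/llama-quantize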

Execution:

Once the GGUF model file is obtained, run the model using the commands below:

  • To run the 8-bit quantized model:

OMP_NUM_THREADS=80 OMP_PLACES="{0}:80:1" ./build_llama/bin/llama-cli -m ./models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-Q8_0.gguf -n 256 -p 'Please write a python program that calculates the nth fibonacci number' -t 80

  • To run the float model:

OMP_NUM_THREADS=80 OMP_PLACES="{0}:80:1" ./build_llama/bin/llama-cli -m ./models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-F32.gguf -n 256 -p 'Please write a python program that calculates the nth fibonacci number' -t 80

llama.cpp provides a microbenchmark, llama-batched-bench, which can be used to evaluate performance.

OMP_NUM_THREADS=80 OMP_PLACES="{0}:80:1" build_llama/bin/llama-batched-bench -m models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-F32.gguf   -c 262144 -b 2048 -ub 512 -npp 32,64,128,512 -ntg 32,64,128,512 -npl 1,2,4,8 -t 80
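A simpler alternative is llama-bench, which reports prompt-processing and token-generation throughput separately (parameter values below are illustrative):

OMP_NUM_THREADS=80 OMP_PLACES="{0}:80:1" build_llama/bin/llama-bench -m models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-Q8_0.gguf -p 512 -n 128 -t 80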

Note: For maximum performance, set OMP_NUM_THREADS and ‘-t’ to an optimal value, typically half the number of physical cores in the system. The command below shows how to check the core count.
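The physical core count can be determined with lscpu; half of (Socket(s) × Core(s) per socket) is a reasonable starting point for the thread count:

lscpu | grep -E "^(Socket|Core|Thread)"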

Verify the usage of MMA instructions

  • Confirm the presence of MMA instructions (xvf32gerpp for float and xvi8ger4pp for int8) in the library.

$ cd build_llama

$ find . -name libggml-cpu.so

./ggml/src/libggml-cpu.so

$ objdump -D ./ggml/src/libggml-cpu.so | grep xvf | wc -l

136

$ objdump -D ./ggml/src/libggml-cpu.so | grep xvi8 | wc -l

192

  • Collect a profile while running a use case and confirm the usage of MMA instructions, using the perf stat command with the raw event r1000E.

perf stat -e r1000E ./build_llama/bin/llama-cli -m ./models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-Q8_0.gguf -n 256 -p 'Please write a python program that calculates the nth fibonacci number'

Performance counter stats for 'build_llama/bin/llama-cli -m ./models/Meta-Llama-3-8B/ggml-model-Q8_0.gguf -n 256 -p Please write a python program that calculates the  nth fibonacci number':

2,580,391,581     r1000E:u

33.562437810 seconds time elapsed

131.813960000 seconds user

0.330160000 seconds sys

  • Optional: Use the Linux perf tool to view and analyse the model execution profile. Use the perf command to record and report the execution profile:

$ perf record -g ./build_llama/bin/llama-cli -m ./models/Meta-Llama-3-8B/ggml-model-Q8_0.gguf -n 256 -p 'Please write a python program that calculates the nth fibonacci number'

$ perf report -g --sort dso -i perf.data

Refer to the following screenshot for a sample execution profile for the run.
