This blog presents the steps required to run inference with llama.cpp on IBM Power10 systems using a large language model (LLM).
llama.cpp is a C/C++ library that efficiently processes GGML-formatted models, making it possible to run LLMs such as LLaMA, Vicuna, or WizardLM on personal computers without requiring a GPU. Although the library is optimized for CPU usage, it also supports GPU acceleration as well as various BLAS backends such as OpenBLAS.
llama.cpp works by loading the GGML-formatted model, building a compute graph from it, tokenizing the prompt, and feeding the tokens through the compute graph in a loop. Each iteration generates a new token using the top-K and top-P sampling algorithms; the prompt is then extended with the new token and used in the next iteration. (The sampling parameters can be adjusted on the llama-cli command line, as shown in the Execution section below.)
Project Setup:
gcc-toolset-13 is the minimum requirement to build llama.cpp. Enable gcc-toolset-13 and add it to the PATH:
scl enable gcc-toolset-13 bash
source scl_source enable gcc-toolset-13
export PATH=/opt/rh/gcc-toolset-13/root/usr/bin/:$PATH
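To confirm that the newer toolchain is the one being picked up, you can check which gcc is on the PATH; the version shown in the comments is only indicative:
which gcc        # expected: /opt/rh/gcc-toolset-13/root/usr/bin/gcc
gcc --version    # should report GCC 13.x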
- Build llama.cpp from sources
- Download the sources using git clone.
MMA optimizations for the float and quantized int8 data types are merged into master, so you can directly clone the master branch of the llama.cpp git repository.
git clone https://github.com/ggerganov/llama.cpp.git
If you are interested in a branch other than master, make sure the MMA optimization patches are present (a quick way to check this is shown below).
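After cloning, change into the repository directory before running the build commands. If you need to check whether the MMA kernels are present on a given branch, one simple (unofficial) way is to search the ggml CPU sources; the paths below reflect the current repository layout and may differ on older branches:
cd llama.cpp
git log --oneline | grep -i mma | head    # look for the MMA optimization commits
grep -rli mma ggml/src | head             # CPU backend files that reference MMA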
- Use cmake to build llama.cpp.
cmake -B build_llama
cmake --build build_llama --config Release
- To obtain a debug build of llama.cpp:
cmake -B build_llama -DCMAKE_BUILD_TYPE=Debug
cmake --build build_llama
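Optionally, since llama.cpp also supports BLAS backends such as OpenBLAS (mentioned earlier), the build can be pointed at one. The CMake options below are those used by recent llama.cpp versions (older releases used LLAMA_BLAS instead), so consult the repository's build documentation if they are not recognized, and make sure the OpenBLAS development package is installed:
cmake -B build_llama -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build_llama --config Release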
If you already have the model downloaded somewhere, just copy it into the 'models' directory in the llama.cpp sources.
If not, follow the steps at the link below to download the model from Hugging Face. If the model file is already in GGUF format, download it and copy it to the 'models' folder. Otherwise, download the entire model folder and convert it to GGUF format as described at the link below.
https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#obtaining-and-quantizing-models
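As an illustration, the Mistral-7B-Instruct-v0.3 model used in the rest of this blog can be fetched with the Hugging Face CLI. The repository id and target directory below are examples, and gated repositories may require you to log in first with huggingface-cli login:
python3 -m pip install -U "huggingface_hub[cli]"
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 --local-dir models/mistral_models/7B-Instruct-v0.3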
- Convert the LLM model into GGUF format
- torch is required to convert Hugging Face models to GGUF format. Make sure the torch package is available in your environment.
- Install the rest of the dependencies listed in requirements.txt:
python3 -m pip install -r requirements.txt
- Use convert_hf_to_gguf.py to convert the models.
- To convert into a GGUF model in 32-bit float format:
python3 convert_hf_to_gguf.py models/mistral_models/7B-Instruct-v0.3/ --outtype f32
- To convert into a GGUF model in half-precision float format:
python3 convert_hf_to_gguf.py models/mistral_models/7B-Instruct-v0.3/ --outtype f16
The other supported output types are bf16, q8_0, tq1_0, and tq2_0.
- Quantize the models into different quantized formats
If you want to run quantized models, use the steps below. If you plan to run only non-quantized models, you can skip this step.
- To quantize the model to 4 bits (using Q4_0):
./build_llama/bin/llama-quantize ./models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-F16.gguf ./models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-Q4_0.gguf Q4_0
- To quantize the model to 8 bits (using Q8_0):
./build_llama/bin/llama-quantize ./models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-F16.gguf ./models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-Q8_0.gguf Q8_0
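The exact set of supported quantization types depends on the llama.cpp version you built; running llama-quantize without arguments prints its usage text, which includes the list of allowed quantization types:
./build_llama/bin/llama-quantize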
Execution:
Once the GGUF model file is obtained, run the model using the commands below:
- To run the 8-bit quantized model:
OMP_NUM_THREADS=80 OMP_PLACES="{0}:80:1" ./build_llama/bin/llama-cli -m ./models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-Q8_0.gguf -n 256 -p 'Please write a python program that calculates the nth fibonacci number' -t 80
- To run the float32 (non-quantized) model:
OMP_NUM_THREADS=80 OMP_PLACES="{0}:80:1" ./build_llama/bin/llama-cli -m ./models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-F32.gguf -n 256 -p 'Please write a python program that calculates the nth fibonacci number' -t 80
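The top-K/top-P sampling mentioned at the beginning of this blog can also be tuned from the llama-cli command line; the values below are illustrative, not tuned recommendations:
OMP_NUM_THREADS=80 OMP_PLACES="{0}:80:1" ./build_llama/bin/llama-cli -m ./models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-Q8_0.gguf -n 256 -p 'Please write a python program that calculates the nth fibonacci number' -t 80 --top-k 40 --top-p 0.9 --temp 0.8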
llama.cpp provides a microbenchmark, llama-batched-bench, which can be used to evaluate performance:
OMP_NUM_THREADS=80 OMP_PLACES="{0}:80:1" build_llama/bin/llama-batched-bench -m models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-F32.gguf -c 262144 -b 2048 -ub 512 -npp 32,64,128,512 -ntg 32,64,128,512 -npl 1,2,4,8 -t 80
Note: To get maximum performance, set OMP_NUM_THREADS and the -t option to an optimal value, which is half the number of physical cores in the system.
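The physical core count can be read from lscpu output: multiply Core(s) per socket by Socket(s); the Thread(s) per core line shows the SMT level, which should not be counted. For example:
lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core'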
Verify the usage of MMA instructions
- Confirm the presence of MMA instructions (xvf32gerpp for float and xvi8ger4pp for int8) in the library.
$ cd build_llama
$ find . -name libggml-cpu.so
./ggml/src/libggml-cpu.so
$ objdump -D ./ggml/src/libggml-cpu.so | grep xvf | wc -l
136
$ objdump -D ./ggml/src/libggml-cpu.so | grep xvi8 | wc -l
192
- Collect a profile while running a use case and confirm the usage of MMA instructions, using the perf stat command with the r1000E event:
perf stat -e r1000E ./build_llama/bin/llama-cli -m ./models/mistral_models/7B-Instruct-v0.3/7B-Instruct-v0.3-7.2B-7B-Instruct-v0.3-Q8_0.gguf -n 256 -p 'Please write a python program that calculates the nth fibonacci number'
Performance counter stats for 'build_llama/bin/llama-cli -m ./models/Meta-Llama-3-8B/ggml-model-Q8_0.gguf -n 256 -p Please write a python program that calculates the nth fibonacci number':
2,580,391,581 r1000E:u
33.562437810 seconds time elapsed
131.813960000 seconds user
0.330160000 seconds sys
Optional: Use the Linux perf tool to view and analyze the model execution profile. Use the perf record and perf report commands shown below:
$ perf record -g ./build_llama/bin/llama-cli -m ./models/Meta-Llama-3-8B/ggml-model-Q8_0.gguf -n 256 -p 'Please write a python program that calculates the nth fibonacci number'
$ perf report -g --sort dso -i perf.data
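If a text-only report is preferred (for example, to save it to a file instead of browsing the interactive TUI), perf report also supports --stdio:
$ perf report -g --sort dso -i perf.data --stdio > perf_report.txt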
Refer to the following screenshot for a sample execution profile for the run.
