Co-author: Ashwin Srinivas
This is a foundational blog that serves as a building block for setting up an AI environment on IBM Power10. The resulting environment can be used to build higher-level generative AI (gen AI) use cases on IBM Power servers with the help of IBM Watson services, for example, retrieval augmented generation (RAG) use cases that use IBM watsonx Assistant and IBM Watson Discovery.
The infrastructure is augmented with Power10-optimized libraries, such as OpenBLAS and PyTorch, available from the RocketCE Conda channel.
The blog provides the steps to:
- Prepare the on-premises Power10 infrastructure needed for gen AI proofs of concept (PoCs).
- Set up open source large language models (LLMs) from Hugging Face, such as Llama 2 and DeepSeek.
- Run LLM inference on the on-premises Power10 infrastructure.
Infrastructure setup overview
As a prerequisite for creating the environment, you need access to a Power10 logical partition (LPAR) running Red Hat Enterprise Linux (RHEL).
As an option, with an entitled IBMid, you can reserve an IBM Power10 LPAR instance on IBM Technology Zone.
For the example in this blog, a minimal shared starter configuration was used for the Power10 LPAR: 0.8 core, 32 GB RAM, and 100 GB of storage, running RHEL 9.3.
While this blog uses Conda through the Anaconda distribution, there are other options, such as Miniconda and Mamba, to create an AI environment in a Power10 LPAR.
Refer to Sebastian’s blog on LPAR sizing and configuration for AI workloads to size the environment for your specific requirements.
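Optionally, before installing anything, you can confirm that the provisioned LPAR matches the expected sizing. This is only a quick sanity check and uses standard commands that ship with RHEL:
$ lscpu
$ free -h
$ df -h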
Prerequisites
A terminal with a Secure Shell (SSH) client, or a terminal emulator such as PuTTY, to connect to your LPAR over SSH.
Power10 infrastructure for LLM inference
Perform the following steps to get the Power10 infrastructure ready:
- Use SSH to connect to the provisioned LPAR with your credentials:
$ ssh <username>@<LPAR IP>
- Install the GCC 13 toolset and set the environment variables as follows:
$ sudo yum install gcc-toolset-13
$ source /opt/rh/gcc-toolset-13/enable
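To confirm that the newer toolset is active in the current shell, you can check the compiler version, which should report GCC 13:
$ gcc --version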
- Download and install Anaconda:
$ wget https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-ppc64le.sh
$ bash Anaconda3-2023.09-0-Linux-ppc64le.sh
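If the conda command is not found after the installer completes, start a new login shell or source the Conda initialization script, and then verify the installation. The path below assumes the default installation prefix, $HOME/anaconda3:
$ source $HOME/anaconda3/etc/profile.d/conda.sh
$ conda --version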
- Add the required Conda channels and install the packages using the following commands:
$ conda config --prepend channels rocketce
$ conda config --append channels conda-forge
$ conda install pytorch-cpu -c rocketce
$ conda install gfortran -c conda-forge
$ conda install openblas -c rocketce
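As an optional sanity check, you can confirm that the packages were installed and that PyTorch imports cleanly in the Conda base environment:
$ conda list | grep -E "pytorch|openblas"
$ python -c "import torch; print(torch.__version__)"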
- Install the required RHEL packages:
$ sudo yum -y install git make cmake pkgconfig perf
- Build llama.cpp with OpenBLAS support:
$ export PKG_CONFIG_PATH=$HOME/anaconda3/lib/pkgconfig:$PKG_CONFIG_PATH
$ export LD_LIBRARY_PATH=$HOME/anaconda3/lib:$LD_LIBRARY_PATH
$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
$ make LLAMA_OPENBLAS=1
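A successful build produces the main binary in the llama.cpp directory. As an optional check, you can confirm that the binary was created and that it links against the OpenBLAS library from the Conda environment (the exact library path may differ on your system):
$ ls -l main
$ ldd main | grep -i openblas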
LLM inferencing
To perform LLM inferencing:
- Download the Llama 2 and DeepSeek models from Hugging Face:
$ mkdir $HOME/LLMs
$ cd $HOME/LLMs
$ wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q8_0.gguf
$ wget https://huggingface.co/TheBloke/deepseek-llm-7B-chat-GGUF/resolve/main/deepseek-llm-7b-chat.Q8_0.gguf
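Each of these Q8_0 quantized 7B models is several gigabytes in size, so the downloads can take a while. You can confirm that the files were downloaded completely by listing them:
$ ls -lh $HOME/LLMs/*.gguf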
- Use llama.cpp to run the models with your prompts for local inferencing:
$ cd $HOME/llama.cpp/
$ ./main -m $HOME/LLMs/llama-2-7b-chat.Q8_0.gguf -p "What is IBM Power10?"
$ ./main -m $HOME/LLMs/deepseek-llm-7b-chat.Q8_0.gguf -p "What is IBM Power10?"
Note: The above inference runs use the default llama.cpp parameters. Refer to the llama.cpp documentation for the complete list of parameters.
You can verify that BLAS is enabled by checking the system_info line in the llama.cpp output, which should report | BLAS = 1 |.
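As a minimal sketch of tuning a run, you can also pass explicit parameters instead of relying on the defaults, for example, the number of CPU threads (-t), the context size (-c), and the maximum number of tokens to generate (-n). The values below are illustrative only and should be matched to the cores and memory allocated to your LPAR:
$ ./main -m $HOME/LLMs/llama-2-7b-chat.Q8_0.gguf -p "What is IBM Power10?" -t 8 -c 2048 -n 256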
- Optional: Use the Linux perf tool to view and analyze the model execution profile. Use the perf command to record and report the execution profile:
$ perf record ./main -m ~/LLMs/llama-2-7b-chat.Q8_0.gguf -p "What is IBM Power10?"
$ perf report
Refer to the following screenshot for a sample concise execution profile for the Llama 2 model run.
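If you prefer a concise, non-interactive summary of the hottest functions over the interactive report view, you can send the report to standard output, for example:
$ perf report --stdio | head -n 20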
Note: Refer to the Hugging Face website for the latest updates on model availability.
Future enhancements
This blog is likely to be enhanced later this year with instructions for inference with multiple users (concurrent requests).
References
Refer to the following resources for additional details: