Data and AI on Power

IBM Power systems provide a robust and scalable platform for a wide range of data and AI workloads, offering benefits in performance, security, and ease of use.

View Only

Back to Blog List

Prepare your IBM Power10 environments for inferencing with large language models (LLMs)

By VAIBHAV SHANDILYA posted Tue May 07, 2024 07:51 AM

Co-author: Ashwin Srinivas

This is a foundational blog that serves as a building block for setting up an AI environment on IBM Power10. It can be used to build high-level generative AI (gen AI) use cases on IBM Power servers with the help of IBM Watson services (for example, building retrieval augmented generation (RAG) use cases using IBM watsonx Assistant and IBM Watson Discovery).

The infrastructure is enhanced with optimized Power10 libraries, such as OpenBLAS with MMA support, available at the llama.cpp Git repository..

The blog provides the steps to:

Prepare a Power10 on-prem infrastructure needed for Power10 gen AI proof of concepts (PoCs).
Set up open source (Hugging Face) large language models (LLMs) such as Llama 2, DeepSeek, and so on.
Draw inference with LLMs on the Power10 on-prem infrastructure.

Infrastructure setup overview

As a prerequisite to create the environment, users need access to a Power10 logical partition (LPAR) with Red Hat Enterprise Linux.

As an option, with an entitled IBMid, you can reserve an IBM Power10 LPAR instance on IBM Technology Zone.

For the example in this blog, a minimal shared starter configuration for the Power 10 LPAR with the following specifications was used: allocated 0.8 core, 32 GB RAM, and 100 GB storage with Red Hat Linux 9.3.

Refer to Sebastian’s blog on LPAR sizing and configuring AI workloads for your specific requirements.

Prerequisites

A basic terminal with Secure Shell (SSH) client, or a terminal emulator such as PuTTY to connect to your LPAR using SSH.

Power10 infrastructure for LLM inference

Perform the following steps to get the Power10 infrastructure ready:

Use SSH to connect to the provisioned LPAR with your credentials:
$ ssh c<username>@<LPAR IP>
Install toolchain and set environment variables as follows:
$ sudo yum install gcc-toolset-13
$ source /opt/rh/gcc-toolset-13/enable
Build llama.cpp to run LLM inference on the Power10 server:

Please refer to the detailed instructions in this blog post to build an optimized version of llama.cpp from source.

LLM inferencing

To perform LLM inferencing:

Download Llama 2 and DeepSeek models from Hugging Face. $ mkdir $HOME/LLMs $ cd $HOME/LLMs $ wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q8_0.gguf $ wget https://huggingface.co/TheBloke/deepseek-llm-7B-chat-GGUF/resolve/main/deepseek-llm-7b-chat.Q8_0.gguf
Use llama.cpp to run models with your prompts for local inferencing.
$ cd $HOME/llama.cpp/
$ ./build/bin/llama-cli -m $HOME/LLMs/llama-2-7b-chat.Q8_0.gguf -p "What is IBM Power10?"
$ ./build/bin/llama-cli -m $HOME/LLMs/deepseek-llm-7b-chat.Q8_0.gguf -p "What is IBM Power10?"
Note: The above inference runs use the default llama-cpp parameters. Refer to llama-cpp API documentation for a complete list of parameters.
Optional: Use the Linux perf library to view and analyze the model execution profile. Use the perf command to record and report the execution profile:
$ perf record ./build/bin/llama-cli -m ~/LLMs/llama-2-7b-chat.Q8_0.gguf -p "What is IBM Power10?"
$ perf report

Refer to the following screenshot for a sample concise execution profile for the llama-2 model run.

Note: Refer to the Hugging Face website for the latest updates on model availability.

Future enhancements

This blog is likely to be enhanced with instructions for inference with multiple users (concurrent requests).

References

References for additional details:

Sizing and configuring an LPAR for AI workloads: https://community.ibm.com/community/user/powerdeveloper/blogs/sebastian-lehrig/2024/03/26/sizing-for-ai
RocketCE Conda repository: https://anaconda.org/rocketce/repo
Hugging Face Hub: https://huggingface.co/docs/hub/en/models
Llama cpp implementation: https://github.com/ggerganov/llama.cpp

0 comments

106 views

Permalink

https://community.ibm.com/community/user/blogs/vaibhav-shandilya/2024/05/07/prepare-ibm-power10-for-inferencing-with-llms

Data and AI on Power

Data and AI on Power

Prepare your IBM Power10 environments for inferencing with large language models (LLMs)

By VAIBHAV SHANDILYA posted Tue May 07, 2024 07:51 AM

Infrastructure setup overview

Prerequisites

Power10 infrastructure for LLM inference

LLM inferencing

Future enhancements

References

Permalink

Additional
Resources

Office

Quick Links

Data and AI on Power

Data and AI on Power

Prepare your IBM Power10 environments for inferencing with large language models (LLMs)

By VAIBHAV SHANDILYA posted Tue May 07, 2024 07:51 AM

Infrastructure setup overview

Prerequisites

Power10 infrastructure for LLM inference

LLM inferencing

Future enhancements

References

Permalink

Additional Resources

Office

Quick Links

Additional
Resources