Sizing and configuration matter for AI workloads. A well-configured IBM Power10 LPAR with 6 cores can easily outperform a badly configured LPAR with 10 cores assigned. The two key ingredients to look at are memory bandwidth per chip and the number of cores per chip. This blog shows how to optimize those ingredients for AI workloads, so you can get the best out of your IBM Power10 server. Please feel free to leave questions and additional recommendations in the comments, so I can keep this guide up to date.
IBM Power 101
- SCM (Single Chip Module): a module that contains 1x IBM Power10 chip with up to 15 SMT8 cores (=15 physical cores per chip)
- DCM (Dual Chip Module): a module that contains 2x IBM Power10 chips with up to 24 SMT8 cores (=12 physical cores per chip)
- eSCM (entry Single Chip Module): a module that contains two chips, but only one carries the core and memory resources while the other only provides access to additional PCIe interfaces
- SMT (Simultaneous Multi-Threading): each physical core runs multiple hardware threads; common configurations are SMT2, SMT4, and SMT8
- NUMA (Non-Uniform Memory Access): an IBM Power10 chip can access its own local memory faster than non-local memory
- HMC (Hardware Management Console): allows system administrators to manage partitions (LPARs) of IBM Power servers
- LPAR (Logical PARtition): a virtualized subset of a server's hardware resources, hosting a dedicated operating system
Optimal Core Configuration by System
Given NUMA, the optimal core configuration is a 12 or 15 core SCM (E1080); a 24 core DCM (E1050/S1024/L1024) is the second-best option, followed by a 20 core DCM (S1022/L1022) and finally the 8 core eSCM (S1022s):
| System | Module | Cores per chip |
| --- | --- | --- |
| E1080 | 12 or 15 core SCMs (both perform similarly well) | 12 or 15 |
| E1050/S1024/L1024 | 24 core DCMs | 12 |
| S1022/L1022 | 20 core DCMs | 10 |
| S1022s | 8 core eSCMs | 8 |
LPARs with a dedicated assignment of the core counts above can easily host multiple smaller AI models. If you want to deploy large language models of up to ~20B parameters, plan to assign a whole LPAR to such a model. Beyond that, you will need to distribute models over multiple chips (see the section further down).
NUMA Setup
My main recommendation is to take NUMA into account in order to optimize memory-to-core bandwidth:
- Confirm which Power10 module you have (e.g., a 2x12 core DCM configuration means there are 2 DCMs, each with 12/2 = 6 cores per chip).
- Set up an LPAR that allocates the maximum number of cores available on one chip (so if a DCM socket has 12 cores, allocate 6 dedicated cores to the LPAR). This LPAR then corresponds to a so-called "NUMA node" and can access its local memory fast.
- Configure the LPAR as dedicated (and not shared) via the HMC.
- Enable Power Mode in the HMC (so the cores can run at full frequency).
- Set SMT to 2 (but consider experimenting with 4 and 8 as well).
- (Re)start the machine, ensuring that the LPAR from step 2 above is started first; other LPARs should follow later (the VIO server does not seem to cause conflicts here). This ensures that the LPAR gets cores from a single chip only. I have also been given the following recommendations:
- This command gives you the current affinity score of all LPARs:
lsmemopt -m Server-9105-22A-7832351 -r lpar -o currscore
...
lpar_name=d1302-master-7f5d84f3-000037a2,lpar_id=22,curr_lpar_score=60
As you can see, this LPAR has a bad affinity score (60/100).
- This command tells you which score you can reach:
lsmemopt -m Server-9105-22A-7832351 -r lpar -o calcscore
...
lpar_name=d1302-master-7f5d84f3-000037a2,lpar_id=22,curr_lpar_score=60,predicted_lpar_score=100
So in my case, it can reach 100/100.
- This asks to reassign CPU/memory and to prioritize LPAR 22:
optmem -m Server-9105-22A-7832351 -o start -t affinity --id 22
- You can check the progress with
lsmemopt -m Server-9105-22A-7832351
and then reboot the LPAR to apply the new affinity.
- I have also seen the recommendation to use NUMA distance=10 but have not experimented with it yet. It would be great if you could cross-check this, document how you executed those steps, and leave a comment!
- Test via lscpu (or numactl --show) whether it worked. Ideally, you end up with only one NUMA node (only NUMA node0) with all CPUs assigned to it; see also the quick checks after the example output below.
Example output (for a 6 core chip configured as SMT8):
> lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 8
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Model: 2.0 (pvr 0080 0200)
Model name: POWER10 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 32K
L1i cache: 48K
L2 cache: 1024K
L3 cache: 4096K
NUMA node0 CPU(s): 0-47
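If the output looks different, two quick checks help to narrow things down. This is a minimal sketch, assuming the powerpc-utils and numactl packages are installed in the LPAR:
Show the current SMT mode (it should report SMT=2 if you followed the recommendation above; you can also change it at runtime, e.g. via ppc64_cpu --smt=4, for experiments):
ppc64_cpu --smt
Show the NUMA topology; ideally only node 0 is listed, with all CPUs and all memory local to it:
numactl --hardware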
System Memory
Memory just needs to be sufficient for what you want to do. If you plan to work with LLMs, I have seen demands of around 80 GB for a 20B parameter model, so I often size LPARs at 256 GB. However, it is important to populate all memory slots with DIMMs to maximize memory-to-core bandwidth (so rather use several smaller DIMMs than a single large one).
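As a rough rule of thumb (my own back-of-the-envelope estimate, not an official sizing guide): a dense model takes about 4 bytes per parameter in FP32 and about 2 bytes in FP16/BF16, so a 20B parameter model comes to roughly 20B x 4 bytes ≈ 80 GB (or ≈ 40 GB in half precision), plus headroom for activations, runtime buffers, and the operating system.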
Storage
I haven't come across special storage requirements yet. For running demos and proofs of experience, 1 TB of disk space will typically suffice.
Red Hat OpenShift
- Start by sizing & configuring your OpenShift cluster as recommended by Red Hat
- Plan for additional capacity based on your general OpenShift workloads
- Additionally, for AI workloads, create LPARs exactly as described in this blog and promote them to dedicated OpenShift worker nodes. That is, there should be a 1:1 relation between such workers and the NUMA-aware LPAR setup discussed here; a sketch of how to dedicate such nodes follows below.
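One way to make that 1:1 relation explicit is to label the corresponding worker nodes and to schedule AI workloads onto them via a matching node selector. A minimal sketch (the label ai-worker=true is just an illustration, not an official convention):
oc label node <ai-worker-node> ai-worker=true
Deployments that serve the models can then select these nodes via a nodeSelector on that label (and, if the nodes should be reserved exclusively for AI, an additional taint/toleration).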
Distributing AI workloads over multiple NUMA nodes / LPARs
You can also distribute AI workloads (e.g., model deployments) over multiple NUMA nodes to further increase throughput and lower latencies. I will write another blog on that topic once ready and link it here.
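Until then, here is a minimal sketch of the idea for an LPAR that spans more than one chip (and therefore more than one NUMA node): pin each model-serving process to one node with numactl so that it only uses local CPUs and memory (serve_model.py is just a placeholder for your serving script):
numactl --cpunodebind=0 --membind=0 python serve_model.py
numactl --cpunodebind=1 --membind=1 python serve_model.py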
Next Steps
I hope these guidelines help get you started with AI on IBM Power. I recommend my blog article on RocketCE to get things running easily and quickly in such an LPAR. Also, I'd appreciate it if you drop some comments, so I can keep refining and improving this guide. Thanks! :-)