Sizing and configuration matter for AI workloads. A well-configured IBM Power10 LPAR with 6 cores can easily outperform a badly configured LPAR with 10 cores assigned. The two key ingredients to look at are memory bandwidth per chip and the number of cores per chip. This blog shows how to optimize those ingredients for AI workloads, so you can get the best out of your IBM Power10 server. Please feel free to leave questions and additional recommendations in the comments, so I can keep this guide up to date.
IBM Power 101
- SCM (Single Chip Module): a module that contains 1x IBM Power10 chip with up to 15 SMT8 cores (=15 physical cores per chip)
- DCM (Dual Chip Module): a module that contains 2x IBM Power10 chips with up to 24 SMT8 cores (=12 physical cores per chip)
- eSCM (entry Single Chip Module): a module that contains two chips, but only one carries the core and memory resources while the other only provides access to additional PCIe interfaces
- SMT (Simultaneous Multi-Threading): each physical core runs multiple hardware threads; common configurations are SMT2, SMT4, and SMT8
- NUMA (Non-Uniform Memory Access): an IBM Power10 chip can access its own local memory faster than non-local memory
- HMC (Hardware Management Console): allows system administrators to manage partitions (LPARs) of IBM Power servers
- LPAR (Logical PARtition): a virtualized subset of a server's hardware resources, hosting a dedicated operating system
Optimal Core Configuration by System
Given NUMA, the optimal core configuration is a 12 or 15 core SCM (E1080); a 24 core DCM (E1050/S1024/L1024) is the second-best option, followed by a 20 core DCM (S1022/L1022) and finally the 8 core eSCM (S1022s):
| System | Module | Cores per chip |
| --- | --- | --- |
| E1080 | 12 or 15 core SCMs (both perform similarly well) | 12 or 15 |
| E1050/S1024/L1024 | 24 core DCMs | 12 |
| S1022/L1022 | 20 core DCMs | 10 |
| S1022s | 8 core eSCMs | 8 |
LPARs with a dedicated assignment of the core counts above can easily host multiple smaller AI models. If you want to deploy large language models of up to ~20B parameters, plan to assign a whole LPAR to such a model. Beyond that, you will need to distribute models over multiple chips (see the section further down).
NUMA Setup
My main recommendation is to take NUMA into account in order to optimize memory-to-core bandwidth:
- Confirm which Power10 module you have (e.g., a 2x12 core DCM configuration means there are 2 DCMs, each with 12/2 = 6 cores per chip).
- Set up an LPAR that allocates the maximum number of cores available on one chip (so if a DCM socket has 12 cores, allocate 6 dedicated cores to the LPAR). This LPAR then corresponds to a so-called "NUMA node" and can access its local memory fast.
- Configure the LPAR as dedicated (and not shared) via the HMC.
- Enable Power Mode in the HMC (so the cores can run at full frequency).
- Set SMT to 2 (but consider experimenting with 4 and 8 as well).
- (Re)start the machine, ensuring that the LPAR from step 2 above is started first; other LPARs should follow later (the VIO server does not seem to cause conflicts here). This ensures that the LPAR gets cores from a single chip only. I have also been given the following recommendations:
- This command gives you the current affinity score of all LPARs:
lsmemopt -m Server-9105-22A-7832351 -r lpar -o currscore
...
lpar_name=d1302-master-7f5d84f3-000037a2,lpar_id=22,curr_lpar_score=60
As you can see, this LPAR has a bad affinity score (60/100).
- This command tells you which score you can reach:
lsmemopt -m Server-9105-22A-7832351 -r lpar -o calcscore
...
lpar_name=d1302-master-7f5d84f3-000037a2,lpar_id=22,curr_lpar_score=60,predicted_lpar_score=100
So in my case, it can reach 100/100.
- This asks to reassign CPU/memory and to prioritize LPAR 22:
optmem -m Server-9105-22A-7832351 -o start -t affinity --id 22
- You can check the progress with
lsmemopt -m Server-9105-22A-7832351
and then reboot the LPAR to apply the new affinity.
- I have also seen the recommendation to use NUMA distance=10 but have not experimented with it yet. It would be great if you could cross-check this, document how you executed those steps, and leave a comment!
- Test via lscpu (or numactl --show) whether it worked. Ideally, you end up with only one NUMA node (only NUMA node0) with all CPUs assigned to it; see also the quick checks after the example output below.
Example output (for a 6 core chip configured as SMT8):
> lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 8
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Model: 2.0 (pvr 0080 0200)
Model name: POWER10 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 32K
L1i cache: 48K
L2 cache: 1024K
L3 cache: 4096K
NUMA node0 CPU(s): 0-47
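If the output looks different, two quick checks help to narrow things down. This is a minimal sketch, assuming the powerpc-utils and numactl packages are installed in the LPAR:
Show the current SMT mode (it should report SMT=2 if you followed the recommendation above; you can also change it at runtime, e.g. via ppc64_cpu --smt=4, for experiments):
ppc64_cpu --smt
Show the NUMA topology; ideally only node 0 is listed, with all CPUs and all memory local to it:
numactl --hardware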
System Memory
Memory just needs to be sufficient for what you want to do. If you plan to work with LLMs, I have seen demands of around 80 GB for a 20B parameter model, so I often size LPARs at 256 GB. However, it is important to populate all memory slots with DIMMs to maximize memory-to-core bandwidth (so rather use several smaller DIMMs than a single large one).
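As a rough rule of thumb (my own back-of-the-envelope estimate, not an official sizing guide): a dense model takes about 4 bytes per parameter in FP32 and about 2 bytes in FP16/BF16, so a 20B parameter model comes to roughly 20B x 4 bytes ≈ 80 GB (or ≈ 40 GB in half precision), plus headroom for activations, runtime buffers, and the operating system.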
Storage
I haven't come across special storage requirements yet. For running demos and proofs of experience, 1 TB of disk space will typically suffice.
Red Hat OpenShift
- Start by sizing & configuring your OpenShift cluster as recommended by Red Hat
- Plan for additional capacity based on your general OpenShift workloads
- Additionally, for AI workloads, create LPARs exactly as described in this blog and promote them to dedicated OpenShift worker nodes. That is, there should be a 1:1 relation between such workers and the NUMA-aware LPAR setup discussed here; a sketch of how to dedicate such nodes follows below.
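One way to make that 1:1 relation explicit is to label the corresponding worker nodes and to schedule AI workloads onto them via a matching node selector. A minimal sketch (the label ai-worker=true is just an illustration, not an official convention):
oc label node <ai-worker-node> ai-worker=true
Deployments that serve the models can then select these nodes via a nodeSelector on that label (and, if the nodes should be reserved exclusively for AI, an additional taint/toleration).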
Distributing AI workloads over multiple NUMA nodes / LPARs
You can also distribute AI workloads (e.g., model deployments) over multiple NUMA nodes to further increase throughput and lower latencies. I will write another blog on that topic once ready and link it here.
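Until then, here is a minimal sketch of the idea for an LPAR that spans more than one chip (and therefore more than one NUMA node): pin each model-serving process to one node with numactl so that it only uses local CPUs and memory (serve_model.py is just a placeholder for your serving script):
numactl --cpunodebind=0 --membind=0 python serve_model.py
numactl --cpunodebind=1 --membind=1 python serve_model.py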
Next Steps
I hope these guidelines help get you started with AI on IBM Power. I recommend my blog article on RocketCE to get things running easily and quickly in such an LPAR. Also, I'd appreciate it if you drop some comments, so I can keep refining and improving this guide. Thanks! :-)