Power Data and AI

IBM Power systems provide a robust and scalable platform for a wide range of data and AI workloads, offering benefits in performance, security, and ease of use.



Sizing and configuring an LPAR for AI workloads

By Sebastian Lehrig posted Tue March 26, 2024 01:35 PM

  

Sizing and configuration matter for AI workloads. A well-configured IBM Power LPAR with 6 cores can easily outperform a badly configured LPAR with 10 cores assigned. The two key ingredients to look for are (per-chip) memory bandwidth and (per-chip) cores. This blog shows how to optimize those ingredients for AI workloads, so you can get the best out of your IBM Power server. Please feel free to leave questions and additional recommendations in the comments, so I can keep this guide up to date.

IBM Power 101

  • SCM (Single Chip Module): a module that includes 1x IBM Power chip with up to 15 SMT8 cores for IBM Power10 (=15 physical cores per chip) and up to 16 SMT8 cores for IBM Power11 (=16 physical cores per chip)
  • DCM (Dual Chip Module): a module that includes 2x IBM Power chips with up to 24 SMT8 cores for IBM Power10 (=12 physical cores per chip) and up to 30 SMT8 cores for IBM Power11 (=15 physical cores per chip)
  • eSCM (entry Single Chip Module): a module that includes two chips but only one gets the core & memory resource while the other facilitates access to more PCIe interfaces
  • SMT (Simultaneous Multi Threading): multithreaded cores, common configurations are SMT2, SMT4, and SMT8 
  • NUMA (Non-Uniform Memory Access): an IBM Power chip can access its own local memory faster than non-local memory
  • HMC (Hardware Management Console): allows system administrators to manage partitions (LPARs) of IBM Power servers
  • LPAR (Logical PARtition): a virtualized subset of a server's hardware resources, hosting a dedicated operating system

Optimal Core Configuration by System

Given NUMA, the optimal core configuration with IBM Power10 is 12 or 15 core SCMs (E1080). A 24 core DCM (E1050/S1024/L1024) is the second-best option, followed by a 20 core DCM (S1022/L1022) and finally 8 core eSCMs:
System | Module | Cores per Chip
E1080 | 12 or 15 core SCMs (both perform similarly well) | 12 or 15
E1050/S1024/L1024 | 24 core DCMs | 12
S1022/L1022 | 20 core DCMs | 10
S1022s | 8 core eSCMs | 8
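
The "Cores per Chip" column follows from the module type: an SCM has one chip, a DCM splits its cores over two chips, and an eSCM gives all cores to one of its two chips. A minimal sketch of that arithmetic, where the function and dictionary names are hypothetical and chosen only for illustration:

```python
# Chips that carry cores and memory in each module type (see "IBM Power 101").
# An eSCM has two chips, but only one of them gets core and memory resources.
CORE_CHIPS_PER_MODULE = {"SCM": 1, "DCM": 2, "eSCM": 1}


def cores_per_chip(module_type: str, cores_per_module: int) -> int:
    """Return the number of cores on one chip, i.e. the LPAR size to aim for."""
    chips = CORE_CHIPS_PER_MODULE[module_type]
    if cores_per_module % chips:
        raise ValueError("core count does not divide evenly across chips")
    return cores_per_module // chips


if __name__ == "__main__":
    # Rows taken from the Power10 table above.
    for system, module, cores in [
        ("E1080", "SCM", 15),
        ("E1050/S1024/L1024", "DCM", 24),
        ("S1022/L1022", "DCM", 20),
        ("S1022s", "eSCM", 8),
    ]:
        print(f"{system}: {cores_per_chip(module, cores)} cores per chip")
```

The result is the number of dedicated cores an LPAR can use while staying on a single chip.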

LPARs that have a dedicated assignment to the given numbers of cores can easily host multiple smaller AI models. If you want to deploy large language models up to ~13B active parameters, you have to plan for assigning a whole LPAR to such a model. Beyond that, you will need to distribute models over multiple chips (see section further down).

With IBM Power11, the optimal configurations improve to:

System | Module | Cores per Chip
E1180 | 16 core SCMs | 16
E1150/S1124/L1124 | 30 core DCMs | 15
S1122/L1122 | 30 core DCMs | 15

(note: lower core counts still might perform reasonably well given they can operate at higher frequencies; please share your experiments/comparisons with me!)

NUMA Setup

My main recommendation is to take NUMA into account in order to optimize memory<->core bandwidth:

  1. Confirm the P10 module (e.g., a 2x12 core DCM means that there are 2 DCMs with 12/2=6 cores per chip).
  2. Set up an LPAR that allocates the max. number of cores available on the chip (so if you have 12 cores on the socket with a DCM, allocate 6 dedicated cores to the LPAR). This LPAR then corresponds to a so-called "NUMA node" and can access local memory fast.
  3. Configure the LPAR as dedicated (and not shared) via the HMC.
  4. Enable Power Mode in HMC (for full frequency exploitation).
  5. Set SMT to 2 (but also consider experimenting with 4 and 8).
  6. (Re)start the machine while ensuring that the LPAR from step 2 is started first; other LPARs should follow later (VIO does not seem to cause conflicts here). This ensures that the LPAR is allocated cores from a single chip only. I have also been given the following recommendations:
    1. This command gives you the current affinity score of all LPARs:
      lsmemopt -m Server-9105-22A-7832351 -r lpar -o currscore
      ...
      lpar_name=d1302-master-7f5d84f3-000037a2,lpar_id=22,curr_lpar_score=60
      As you see, it has a bad score (60/100)
    2. This command tells you what score you can reach
      lsmemopt -m Server-9105-22A-7832351 -r lpar -o calcscore
      ...
      lpar_name=d1302-master-7f5d84f3-000037a2,lpar_id=22,curr_lpar_score=60,predicted_lpar_score=100
      So in my case it can reach 100/100
    3. This requests a reassignment of CPU/memory, prioritising LPAR 22:
      optmem -m Server-9105-22A-7832351 -o start -t affinity --id 22
    4. You can check the progress with
      lsmemopt -m Server-9105-22A-7832351
      and then reboot the LPAR to apply.
  7. I have also seen the recommendation of using NUMA distance=10 but have not experimented with it. It would be great if you could cross-check this, document how you executed those steps, and leave a comment!
  8. Test via lscpu (or numactl --show) whether it worked. Ideally you have only 1 NUMA node - only NUMA node0 - with its assigned CPUs.

    Example output (for a 6 core chip configured as SMT8):
    > lscpu
    Architecture:    ppc64le
    Byte Order:     Little Endian
    CPU(s):       48
    On-line CPU(s) list: 0-47
    Thread(s) per core: 8
    Core(s) per socket: 6
    Socket(s):      1
    NUMA node(s):    1
    Model:        2.0 (pvr 0080 0200)
    Model name:     POWER10 (architected), altivec supported
    Hypervisor vendor:  pHyp
    Virtualization type: para
    L1d cache:      32K
    L1i cache:      48K
    L2 cache:      1024K
    L3 cache:      4096K
    NUMA node0 CPU(s):  0-47
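
The checks in steps 6 and 8 can be scripted once the command output has been captured as text. The following is a minimal sketch, assuming the `key=value,key=value` layout of the `lsmemopt` lines and the `NUMA nodeN CPU(s):` lines of `lscpu` shown above. The function names are hypothetical, and quoted values containing commas are not handled.

```python
import re


def parse_key_values(line: str) -> dict:
    """Split one 'key=value,key=value' line into a dict."""
    fields = line.strip().split(",")
    return dict(field.split("=", 1) for field in fields if "=" in field)


def lpars_below(lines, threshold=100):
    """Yield (lpar_name, score) for LPARs with an affinity score below threshold."""
    for line in lines:
        fields = parse_key_values(line)
        score = fields.get("curr_lpar_score", "")
        # Skip lines that carry no numeric score.
        if score.isdigit() and int(score) < threshold:
            yield fields["lpar_name"], int(score)


def numa_nodes(lscpu_text: str) -> dict:
    """Map NUMA node id to its CPU list, as printed by lscpu."""
    pattern = re.compile(
        r"^[ \t]*NUMA node(\d+) CPU\(s\):[ \t]*(\S*)", re.MULTILINE
    )
    return {int(node): cpus for node, cpus in pattern.findall(lscpu_text)}


def is_single_numa_node(lscpu_text: str) -> bool:
    """True if lscpu reports exactly one NUMA node."""
    return len(numa_nodes(lscpu_text)) == 1


if __name__ == "__main__":
    # Sample lines copied from the outputs shown in this article.
    scores = ["lpar_name=d1302-master-7f5d84f3-000037a2,lpar_id=22,curr_lpar_score=60"]
    for name, score in lpars_below(scores):
        print(f"{name}: affinity score {score}/100")

    lscpu_text = "NUMA node(s):    1\nNUMA node0 CPU(s):  0-47\n"
    print("single NUMA node:", is_single_numa_node(lscpu_text))
```

Both helpers take plain text, so they work on saved output from the HMC and from the LPAR without needing access to either system.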

System Memory

Memory just needs to be sufficient for what you want to do. If you are planning to work with LLMs, I've seen demands around 50 GB for 13B parameter models. So I often size LPARs for 128 GB. However, it is important to populate all slots with DIMMs to maximize memory<->core bandwidth (so use several smaller DIMMs rather than a single big one).
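
As a rough cross-check of the ~50 GB figure, the memory for the model weights alone can be estimated as parameter count times bytes per parameter. The sketch below assumes 4 bytes per parameter (32-bit weights); that is an assumption made here for illustration, not something stated about the measurements above, and it ignores runtime overhead such as caches and activations. The function name is hypothetical.

```python
def weights_memory_gb(parameters: float, bytes_per_parameter: int = 4) -> float:
    """Memory needed for the model weights alone, in GB (10**9 bytes)."""
    return parameters * bytes_per_parameter / 1e9


if __name__ == "__main__":
    # 13B parameters at 4 bytes each (assumed 32-bit weights).
    print(f"{weights_memory_gb(13e9):.0f} GB")     # prints "52 GB"
    # The same model at 2 bytes each (assumed 16-bit weights).
    print(f"{weights_memory_gb(13e9, 2):.0f} GB")  # prints "26 GB"
```

Under that assumption a 13B model needs about 52 GB for weights, which leaves room for the runtime and other workloads in a 128 GB LPAR.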

Storage

I haven't come across special storage requirements yet. For running demos and proof-of-experiences, 1 TB disk space will typically easily suffice.

Red Hat OpenShift

  1. Start by sizing & configuring your OpenShift cluster as recommended by Red Hat
  2. Plan for additional capacity based on your general OpenShift workloads
  3. Additionally, for AI workloads, spawn LPARs just as described in this blog and promote these LPARs to dedicated OpenShift worker nodes for AI workloads. That is, there should be a 1:1 relation between such workers and the NUMA-aware LPAR setup discussed here.

Distributing AI workloads over multiple NUMA nodes / LPARs

You can also distribute AI workloads (e.g., model deployments) over multiple NUMA nodes to further increase throughput and lower latencies. I will write another blog on that topic once ready and link it here. 

Next Steps

I hope these guidelines help you get started with AI on IBM Power. I recommend my blog article on RocketCE to get things started easily and quickly in such an LPAR. Also, I would appreciate it if you left some comments, so I can keep refining and improving this guide. Thanks! :-)

Comments

Mon April 08, 2024 05:31 PM

For OCP LPAR placement, the following blog can also serve as a reference.

OCP LPAR placement

Wed March 27, 2024 08:30 AM

Great article... the principles here are true for memory/processor intensive workloads.