Sizing and configuring an LPAR for AI workloads

By Sebastian Lehrig posted Tue March 26, 2024 01:35 PM

Sizing and configuration matters for AI workloads. A well-configured IBM Power10 LPAR with 6 cores can easily outperform a badly configured LPAR with 10 cores assigned. The two key ingredients you have to look for are (per-chip) memory bandwidth and (per-chip) cores. This blog shows how to optimize those ingredients for AI workloads, so you can get the best out of your IBM Power10 server. Please feel free to leave questions and additional recommendations in the comments, so I can always keep this guide up-to-date.

IBM Power 101

  • SCM (Single Chip Module): a module that includes 1x IBM Power10 chip with up to 15 SMT8 cores (= 15 physical cores per chip)
  • DCM (Dual Chip Module): a module that includes 2x IBM Power10 chips with up to 24 SMT8 cores (= 12 physical cores per chip)
  • eSCM (entry Single Chip Module): a module that includes two chips, but only one provides cores and memory while the other facilitates access to additional PCIe interfaces
  • SMT (Simultaneous Multi Threading): multithreaded cores, common configurations are SMT2, SMT4, and SMT8 
  • NUMA (Non-Uniform Memory Access): an IBM Power10 chip can access its own local memory faster than non-local memory
  • HMC (Hardware Management Console): allows system administrators to manage partitions (LPARs) of IBM Power servers
  • LPAR (Logical PARtition): a virtualized subset of a server's hardware resources, hosting a dedicated operating system

Optimal Core Configuration by System

Given NUMA, the optimal core configuration is a 12 or 15 core SCM (E1080); a 24 core DCM (E1050/S1024/L1024) is the second-best option, followed by a 20 core DCM (S1022/L1022) and, finally, an 8 core eSCM (S1022s):

System             Module              Cores per Chip
E1080              12 or 15 core SCMs  12 or 15 (both perform similarly well)
E1050/S1024/L1024  24 core DCMs        12
S1022/L1022        20 core DCMs        10
S1022s             8 core eSCMs        8

LPARs with a dedicated assignment of the given number of cores can easily host multiple smaller AI models. If you want to deploy large language models with up to ~20B parameters, plan to assign a whole LPAR to such a model. Beyond that, you will need to distribute models over multiple chips (see the section further down).

NUMA Setup

My main recommendation is to take NUMA into account in order to optimize memory<->core bandwidth:

  1. Confirm the P10 module (e.g., a 2x12 core DCM means that there are 2 DCMs with 12/2=6 cores per chip).
  2. Set up an LPAR that allocates the max. number of cores available on the chip (so if you have 12 cores on the socket with a DCM, allocate 6 dedicated cores to the LPAR). This LPAR then corresponds to a so-called "NUMA node" and can access local memory fast.
  3. Configure the LPAR as dedicated (and not shared) via the HMC.
  4. Enable Power Mode in the HMC (for full frequency exploitation).
  5. Set SMT to 2 (but consider experimenting with 4 and 8); see the sketch after this list for setting SMT from within the LPAR.
  6. (Re)start the machine while ensuring that the LPAR from step 2 is started first; other LPARs should follow later (VIOS does not seem to cause conflicts here). This ensures that the LPAR allocates only cores from a single chip. I have also been given the following recommendations:
    1. This command gives you the current affinity score of all LPARs:
      lsmemopt -m Server-9105-22A-7832351 -r lpar -o currscore
      ...
      lpar_name=d1302-master-7f5d84f3-000037a2,lpar_id=22,curr_lpar_score=60
      As you can see, it has a bad score (60/100).
    2. This command tells you what score you can reach:
      lsmemopt -m Server-9105-22A-7832351 -r lpar -o calcscore
      ...
      lpar_name=d1302-master-7f5d84f3-000037a2,lpar_id=22,curr_lpar_score=60,predicted_lpar_score=100
      So in my case it can reach 100/100
    3. This command requests a reassignment of CPU/memory, prioritizing LPAR 22:
      optmem -m Server-9105-22A-7832351 -o start -t affinity --id 22
    4. You can check the progress with
      lsmemopt -m Server-9105-22A-7832351
      and then reboot the LPAR to apply the changes.
  7. I have also seen the recommendation to use NUMA distance=10 but have not experimented with it. It would be great if you could cross-check, document how you executed those steps, and leave a comment!
  8. Test via lscpu (or numactl --show) whether it worked. Ideally, you have only 1 NUMA node (only NUMA node0) with all assigned CPUs.

    Example output (for a 6 core chip configured as SMT8):
    > lscpu
    Architecture:         ppc64le
    Byte Order:           Little Endian
    CPU(s):               48
    On-line CPU(s) list:  0-47
    Thread(s) per core:   8
    Core(s) per socket:   6
    Socket(s):            1
    NUMA node(s):         1
    Model:                2.0 (pvr 0080 0200)
    Model name:           POWER10 (architected), altivec supported
    Hypervisor vendor:    pHyp
    Virtualization type:  para
    L1d cache:            32K
    L1i cache:            48K
    L2 cache:             1024K
    L3 cache:             4096K
    NUMA node0 CPU(s):    0-47
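
As a complement to the lscpu check, here is a minimal sketch for setting SMT from within the LPAR (step 5) and verifying the single-node NUMA layout. It assumes the powerpc-utils and numactl packages are installed:

    # Show the current SMT mode (powerpc-utils)
    ppc64_cpu --smt

    # Set SMT to 2 as recommended in step 5; try 4 or 8 to compare
    sudo ppc64_cpu --smt=2

    # Verify the NUMA layout: ideally exactly one node (node0) owning all CPUs and memory
    numactl --hardware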

System Memory

Memory just needs to be sufficient for what you want to do. If you are planning to work with LLMs, I have seen demands of around 80 GB for a 20B parameter model, so I often size LPARs with 256 GB. However, it is important to populate all slots with DIMMs to maximize memory<->core bandwidth (so rather use several smaller DIMMs than one big one).
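
As a rough rule of thumb (my own back-of-the-envelope assumption, not an official sizing formula): fp32 weights take about 4 bytes per parameter, so a 20B parameter model needs roughly 20 x 4 = 80 GB for the weights alone, which matches the demand mentioned above. A minimal sketch:

    # Back-of-the-envelope estimate for model memory (weights only)
    # Assumption: 4 bytes per parameter (fp32); use 2 for bf16/fp16
    PARAMS_IN_BILLIONS=20
    BYTES_PER_PARAM=4
    echo "~$((PARAMS_IN_BILLIONS * BYTES_PER_PARAM)) GB for weights (activations and KV cache come on top)"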

Storage

I haven't come across special storage requirements yet. For running demos and proof-of-experience engagements, 1 TB of disk space will typically suffice.

Red Hat OpenShift

  1. Start by sizing & configuring your OpenShift cluster as recommended by Red Hat.
  2. Plan for additional capacity based on your general OpenShift workloads.
  3. Additionally, for AI workloads, spawn LPARs just as described in this blog and promote these LPARs to dedicated OpenShift worker nodes for AI workloads. That is, there should be a 1:1 relation between such workers and the NUMA-aware LPAR setup discussed here; a sketch for dedicating such a worker follows below.
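
As a minimal sketch for that promotion (the node name and label key are hypothetical examples, not fixed conventions), you could label and taint such a worker so that only AI workloads are scheduled onto it:

    # Label the NUMA-aware worker node (worker-ai-0 is a hypothetical name)
    oc label node worker-ai-0 workload=ai

    # Optionally taint it so only pods with a matching toleration are scheduled
    oc adm taint nodes worker-ai-0 workload=ai:NoSchedule

AI deployments then target this node with a matching nodeSelector and toleration.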

Distributing AI workloads over multiple NUMA nodes / LPARs

You can also distribute AI workloads (e.g., model deployments) over multiple NUMA nodes to further increase throughput and lower latency. I will write another blog on that topic once ready and link it here.
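
Until then, here is a minimal sketch of the underlying idea (serve_model.py and its flags are hypothetical placeholders): on an LPAR spanning two NUMA nodes, numactl can pin one model deployment per node so that each uses only local cores and memory:

    # Pin one model deployment per NUMA node (serve_model.py is a placeholder)
    numactl --cpunodebind=0 --membind=0 python serve_model.py --port 8080 &
    numactl --cpunodebind=1 --membind=1 python serve_model.py --port 8081 &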

Next Steps

I hope these guidelines help get you started with AI on IBM Power. To get things running easily and quickly in such an LPAR, I recommend my blog article on RocketCE. I would also appreciate it if you drop some comments, so I can keep refining and improving this guide. Thanks! :-)


Comments

Mon April 08, 2024 05:31 PM

For the OCP LPAR placement, the following blog can also be a reference:

OCP LPAR placement

Wed March 27, 2024 08:30 AM

Great article... the principles here are true for memory/processor-intensive workloads.