Data and AI on Power

Install and use RocketCE in a Linux LPAR

By Sebastian Lehrig posted Fri February 09, 2024 05:33 AM

I regularly get questions like "How do I set up a simple data science environment on IBM Power?", "How can I use OpenCE/RocketCE libraries?", "How can I make use of IBM Power-optimized libraries?", and "How do I get started with AI on IBM Power easily?". Time to put the answer to (e)paper, because it is really not that hard. Just follow this step-by-step guide for a quick start. Enjoy trying it out and leave comments if you have additional tips and tricks!

Step-by-Step Guide

  1. Spawn a Linux LPAR. You can freely choose the OS (RHEL, Ubuntu, ...). I typically use the latest RHEL.
  2. SSH into your LPAR.
  3. Install micromamba (note that the install script asks a few interactive configuration questions):
    dnf install bzip2 libxcrypt-compat vim -y
    "${SHELL}" <(curl -L micro.mamba.pm/install.sh)
    echo "micromamba activate base" >> ${HOME}/.bashrc
    source ${HOME}/.bashrc
  4. Configure micromamba to use RocketCE:
    cat > ~/.condarc <<'EOF'
    # Conda configuration; see https://conda.io/projects/conda/en/latest/configuration.html
    auto_update_conda: false
    show_channel_urls: true
    channel_priority: flexible
    channels:
      - rocketce
      - defaults
    EOF
  5. Install Power-optimized packages (e.g., Python, PyTorch, ...):

    micromamba install --yes python=3.10 pytorch-cpu mamba conda pip
  6. Give it a try:
    python -c 'import torch; print(torch.__version__)'
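
For a slightly bigger smoke test, you can run a small matrix multiplication, which should print a tensor value without errors on the CPU build installed above; the second command prints PyTorch's build configuration so you can inspect which optimizations your build carries (a minimal sketch):

    python -c 'import torch; x = torch.rand(1000, 1000); print((x @ x).sum())'
    python -c 'import torch; print(torch.__config__.show())'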

Caution: Adding Anaconda's defaults channel to the above configuration requires an Anaconda license if you use it in a commercial context.
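
If you want to avoid Anaconda's defaults channel entirely, you can simply drop it from ~/.condarc; a sketch of such a configuration (note that fewer packages may then be resolvable):

    cat > ~/.condarc <<'EOF'
    auto_update_conda: false
    show_channel_urls: true
    channel_priority: flexible
    channels:
      - rocketce
    EOF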

Tip: Using conda-forge

Some packages might be unavailable via the defaults and rocketce Conda channels. In that case, you can try installing them via the conda-forge channel. Be aware that conda-forge provides community-built packages, whereas the defaults and rocketce channels provide enterprise-grade builds and support.

  1. Install a package from conda-forge (e.g., accelerate):
    mamba install --yes 'conda-forge::accelerate'
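
To double-check which channel a package was actually installed from, list it and inspect the channel column:

    mamba list accelerate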

Tip: Add pip to the mix

Some packages might still be unavailable via any Conda channel. In those cases, I recommend trying to install them via pip (minimize the use of pip where possible, so as to keep your Conda environment clean and well-managed). I typically preconfigure pip to use pre-built Python wheels from a repository by Power champion Marvin Gießing, which speeds up package installations:

  1. Optional: configure pip with Marvin's repository (recommended for rapid testing):
    mkdir -p ~/.pip && cat > ~/.pip/pip.conf <<'EOF'
    [global]
    extra-index-url = https://repo.fury.io/mgiessing
    EOF
  2. Install prerequisites from Conda channels (in this example, these are needed for librosa):
    mamba install --yes 'conda-forge::msgpack-python' 'conda-forge::soxr-python'
  3. Install packages:
    pip install --prefer-binary \
        "librosa" \
        "openai-whisper"

Tip: Use JupyterLab as IDE

JupyterLab is a great IDE for rapid prototyping and often comes in handy:

  1. Install JupyterLab:
    mamba install --yes jupyterlab
    mkdir notebooks
  2. Start JupyterLab:
    nohup jupyter lab \
        --notebook-dir=${HOME}/notebooks \
        --ip=0.0.0.0 \
        --no-browser \
        --allow-root \
        --port=8888 \
        --NotebookApp.allow_origin='*' \
        --NotebookApp.token='' \
        --NotebookApp.password='' &
  3. Connect to JupyterLab:
    http(s)://<server>:8888/lab
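
Note that the flags above disable authentication, so anyone who can reach port 8888 can access your JupyterLab. If your LPAR is not on a trusted network, one alternative is SSH port forwarding from your local machine (a sketch; replace <user> and <server> with your values):

    ssh -L 8888:localhost:8888 <user>@<server>

Then open http://localhost:8888/lab in your local browser.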

Technologies Used

  • logical partition (LPAR): a virtualized computer carved out of a subset of a physical machine's hardware resources; similar to, but typically more efficient than, a virtual machine.
  • RHEL: Red Hat Enterprise Linux, an enterprise-grade Linux distribution.
  • micromamba: for managing and resolving Python packages; Mamba is implemented in C++ and is typically much faster than the alternative, Conda.
  • OpenCE: an open-source community project providing build scripts for 300+ Python packages for AI (such as PyTorch, TensorFlow, and even Python itself), ensuring packages are mutually compatible and optimized for IBM Power where possible.
  • RocketCE: a free enterprise-grade distribution of OpenCE; commercial support options are available.
  • JupyterLab: a web-based IDE for (Python) notebooks, code, and data.

Comments

Tue March 26, 2024 01:39 PM

For sizing & configuration, I have just published a dedicated blog: https://community.ibm.com/community/user/powerdeveloper/blogs/sebastian-lehrig/2024/03/26/sizing-for-ai

Mon March 25, 2024 06:25 AM

@K_G: This will be sufficient for POCs/demos. I have often deployed my GenAI demos in such environments, which already allows showcasing quite a lot (e.g., using Llama2 and Mistral models on Power10): https://github.com/lehrig/genai-on-ibm-power-demos/

Fri March 22, 2024 01:25 PM

@Sebastian - I have two 24-way S1024 servers running a mix of things for customer POCs and customer-facing demos.  We've got a bunch of RHEL, AIX, Rocky Linux, and an OCP cluster.  I can probably devote 6 right now.  1TB of memory available.

Mon March 04, 2024 08:27 AM

@K G: For AI workloads, my main recommendation is to use a NUMA setup, so as to optimize memory<->core bandwidth:

  1. Confirm the P10 module (e.g., a 2x12 core DCM and hence 6 cores per chip)
  2. Set up an LPAR that allocates the maximum number of cores available on the chip (so if you have 12 cores on the socket with a dual-chip module, allocate 6 dedicated cores to the LPAR). This LPAR then corresponds to a so-called "NUMA node". Configure the LPAR as dedicated (not shared) via the HMC. Enable Power Mode in the HMC (for full frequency exploitation). Set SMT to 2 (but consider experimenting with 4 and 8 as well). I have also seen the recommendation of using NUMA distance=10 but have not experimented with it. It would be great if you cross-check and document how you executed those steps.
  3. (Re)start the machine while ensuring that the LPAR from step 2 is started first; other LPARs should follow later (VIO does not seem to cause conflicts here). This ensures that the LPAR allocates cores from a single chip only.
  4. Test via lscpu (or numactl --show) whether it worked, as shown below. Ideally you end up with only one NUMA node - only NUMA node0 - with all CPUs assigned to it.
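
For reference, a quick way to inspect the NUMA topology (numactl may need to be installed first, e.g., via dnf install numactl -y):

    lscpu | grep -i numa
    numactl --hardware
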
Given that, the best option for cores would be the 12- or 15-core SCMs (E1080); a 24-core DCM (E1050/S1024/L1024) is the second-best option, followed by a 20-core DCM (S1022/L1022).
Memory just needs to be sufficient for what you want to do. If you are planning to work with LLMs, I have seen demands around 80 GB for a 20B-parameter model (which matches roughly 4 bytes per parameter at fp32 precision: 20e9 x 4 B ≈ 80 GB). So I often size LPARs for 256 GB. However, it is important to populate all slots with DIMMs to maximize memory<->core bandwidth (so rather use several smaller DIMMs than a single big one).
1 TB disk space will typically easily suffice.
Hope that helps; I should probably write a blog on that sooner rather than later :)
Edit: done - https://community.ibm.com/community/user/powerdeveloper/blogs/sebastian-lehrig/2024/03/26/sizing-for-ai

Sat March 02, 2024 12:47 AM

For grins... how big did you make your Linux VM?  Core/memory/disk space... any other considerations for swap and file system config?