Data and AI on Power


Question about MMA, GPU, CPU and types of LLMs

  • 1.  Question about MMA, GPU, CPU and types of LLMs

    Posted Thu July 04, 2024 02:10 PM

    Hello,
    As part of our AI learning, we installed Red Hat (ppc64le) on one of our Power10 machines. We are at the beginning of our AI journey and some things are difficult for us to understand, so a few questions come to mind:
    1) How does MMA in Power10 processors compare to technologies such as GPUs and TPUs? What does performance look like? (For example, I see that a MacBook with an M1 processor generates text faster.)
    2) We downloaded two models, 8B-SPPO-Iter3-Q8_0.gguf and 8B-SPPO-Iter3-Q6_K.gguf. The Q8_0 model is clearly faster even though it is larger (in theory it should be slower). Why is this? Should we choose a specific type of LLM from huggingface.co for MMA technology?

    We use llama.cpp, following this manual: https://community.ibm.com/community/user/powerdeveloper/blogs/vaibhav-shandilya/2024/05/07/prepare-ibm-power10-for-inferencing-with-llms



    ------------------------------
    Kamil
    ------------------------------


  • 2.  RE: Question about MMA, GPU, CPU and types of LLMs

    Posted Fri July 12, 2024 05:58 AM

    Hi Kamil,

    1) Power10 provides acceleration for AI workloads directly on each Power10 chip, through capabilities such as MMA, SIMD units, and high memory bandwidth between system memory and the chip. All of these, not only MMA, improve the performance of AI workloads such as LLM inferencing. Because this works directly on the CPU, you can leverage system memory (so you're not restricted to GPU memory) and you don't need to mess around with CUDA. For example, compare https://huggingface.co/google/flan-t5-base#running-the-model-on-a-cpu with https://huggingface.co/google/flan-t5-base#running-the-model-on-a-gpu: running on the CPU even leads to code that is easier to understand and maintain than GPU-aware code.
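
    Roughly, the difference between those two linked snippets looks like this (paraphrasing the flan-t5-base model card, assuming the Hugging Face transformers library); the CPU version needs no device management at all:

    ```python
    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

    input_ids = tokenizer("translate English to German: How old are you?",
                          return_tensors="pt").input_ids

    # The GPU variant additionally has to move the model and the inputs
    # to the accelerator, e.g.:
    #   model = model.to("cuda")
    #   input_ids = input_ids.to("cuda")

    outputs = model.generate(input_ids)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    ```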

    2) In terms of performance, you need to configure your Power10 server appropriately; then you can easily handle LLMs with billions of parameters, so the 8B models you are referencing shouldn't be a problem: https://community.ibm.com/community/user/powerdeveloper/blogs/sebastian-lehrig/2024/03/26/sizing-for-ai
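
    As a starting point on the threading side of that sizing, here is a hypothetical sketch (assuming the llama-cpp-python bindings on Linux; the model file name is the one from your question) that pins llama.cpp's thread count to the CPUs visible to the process:

    ```python
    import os
    from llama_cpp import Llama  # pip install llama-cpp-python

    # Logical CPUs this process may run on; with SMT enabled, dividing
    # by the SMT level (e.g., 8 for SMT8) to target physical cores can
    # give better throughput.
    n_cpus = len(os.sched_getaffinity(0))

    llm = Llama(
        model_path="8B-SPPO-Iter3-Q8_0.gguf",  # model file from the question
        n_threads=n_cpus,                      # threads used for inference
    )

    out = llm("Q: What is MMA on Power10? A:", max_tokens=64)
    print(out["choices"][0]["text"])
    ```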

    We have optimized for INT8 quantization in combination with SIMD/MMA instructions. The 6-bit Q6_K quantization is probably not as performant for that reason.
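
    For intuition about why Q8_0 maps so well onto those instructions, here is a simplified numpy sketch of the Q8_0 idea: blocks of 32 weights, one scale per block, one signed 8-bit integer per weight (the real llama.cpp format stores a float16 scale per block). The int8 values can feed INT8 SIMD/MMA dot products almost directly, whereas 6-bit K-quant weights first have to be unpacked from a packed bit layout:

    ```python
    import numpy as np

    def quantize_q8_0(block):
        """Simplified Q8_0: one scale per 32-weight block, int8 values."""
        assert block.shape == (32,)
        scale = float(np.abs(block).max()) / 127.0
        if scale == 0.0:
            scale = 1.0  # avoid division by zero for an all-zero block
        q = np.round(block / scale).astype(np.int8)
        return scale, q

    def dequantize_q8_0(scale, q):
        return scale * q.astype(np.float32)

    weights = np.random.randn(32).astype(np.float32)
    scale, q = quantize_q8_0(weights)
    print("max error:", np.abs(weights - dequantize_q8_0(scale, q)).max())
    ```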



    ------------------------------
    Sebastian Lehrig
    ------------------------------