Inference speed for foundation models

    IBM Champion
    Posted Tue February 27, 2024 10:51 AM
    Hi all

    We are using watsonx.ai to deploy and consume foundation models, specifically the newly added Mixtral-8x7B model. However, inference with Mixtral-8x7B is very slow compared to other large LLM inference platforms (OpenAI, Hugging Face Inference). I suspect this comes down to the compute allocated. Right now, a call with ~200 input tokens and 50 output tokens can take upwards of 30 seconds, and over a minute with a LangChain ReAct agent. That is far too slow to be useful in a customer-facing scenario. If we run the exact same model elsewhere, the latency is ~20% of this.

    Question 1: Has anyone faced the same issue?

    Question 2: Does anyone know how to increase the compute allocation, or at least see what environment (which GPUs, how many vCPU cores, etc.) we are running on?

    Question 3: Is there a way we can decrease the latency?

    Latency test

    IBM Client Engineering told us that this would likely be faster when run through Prompt Lab directly. So we ran a speed/latency test comparing calls made directly from Prompt Lab with calls made via a deployment. See the results in the image below. The task was to translate a ~947-character text (including prompt instructions) from English to Danish. I have attached the notebook for reference.
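
    For anyone who wants to reproduce a comparison like this, a minimal timing harness is sketched below. It measures wall-clock latency of any callable over several repetitions and reports min/median/max, which smooths out one-off network spikes. The actual inference call is a stand-in here: you would substitute your own wrapper around the Prompt Lab endpoint or the deployment endpoint (not shown; the client call is assumed, not taken from the attached notebook).

    ```python
    import time
    import statistics

    def time_call(fn, *args, repeats=5, **kwargs):
        """Time a callable over several repetitions; return latency stats in seconds.

        `fn` stands in for the real inference call, e.g. a function that
        POSTs your prompt to the watsonx.ai deployment (hypothetical here --
        plug in whatever client you use).
        """
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn(*args, **kwargs)
            samples.append(time.perf_counter() - start)
        return {
            "min": min(samples),
            "median": statistics.median(samples),
            "max": max(samples),
        }

    # Demo with a stand-in "model call" that sleeps ~10 ms:
    stats = time_call(lambda: time.sleep(0.01), repeats=3)
    print(f"median latency: {stats['median'] * 1000:.1f} ms")
    ```

    Using the median rather than the mean keeps a single slow cold-start call from skewing the comparison between the two endpoints.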


    Nicolai Thomsen


    Attachment: wxai-latency-test.ipynb (16 KB)