watsonx.ai

Inference speed for foundation models


    IBM Champion
    Posted Tue February 27, 2024 10:51 AM

    Hi all

    We are using watsonx.ai to deploy and consume foundation models, most recently the newly added Mixtral-8x7b model. However, inference with Mixtral-8x7b is much slower than on other, larger LLM inference platforms (OpenAI, Hugging Face Inference), and I suspect this comes down to the compute allocated. Right now a call with ~200 input tokens and 50 output tokens can take upwards of 30 seconds, and over a minute with a LangChain ReAct agent, which is far too slow to be useful in a customer-facing scenario. If we run the exact same model elsewhere, the latency is roughly 20% of this. A rough sketch of how we call the model follows the questions below.

    Question 1: Has anyone faced the same issue?

    Question 2: Does anyone know how to increase the compute allocation, or at least see which environment (which GPUs, how many vCPU cores, etc.) we are running on?

    Question 3: Is there a way we can decrease the latency?
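
    For reference, the sketch below shows the kind of call we are timing: a single generation request against Mixtral-8x7b with the latency measured client-side. It is a minimal sketch assuming the ibm-watsonx-ai Python SDK; the model id, generation parameters, and environment variables are placeholders rather than our exact setup, so verify them against the current SDK documentation.

```python
# Minimal latency probe (sketch): time one generation call to Mixtral-8x7b on watsonx.ai.
# Class and parameter names assume the ibm-watsonx-ai SDK; the URL, API key, and
# project id are read from placeholder environment variables.
import os
import time

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

credentials = Credentials(
    url=os.environ["WATSONX_URL"],          # e.g. https://eu-de.ml.cloud.ibm.com
    api_key=os.environ["WATSONX_APIKEY"],
)

model = ModelInference(
    model_id="mistralai/mixtral-8x7b-instruct-v01",  # verify the exact model id in your instance
    credentials=credentials,
    project_id=os.environ["WATSONX_PROJECT_ID"],
    params={"max_new_tokens": 50},          # ~50 output tokens, as in the test described above
)

prompt = "..."  # ~200 input tokens in our test

start = time.perf_counter()
text = model.generate_text(prompt=prompt)
elapsed = time.perf_counter() - start

print(f"generated {len(text)} characters in {elapsed:.2f} s")
```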

    Latency test

    IBM Client Engineering told us that this would likely be faster if run through Prompt Lab directly. So we ran a speed/latency test comparing calls made directly from Prompt Lab with calls made via a deployment. See the results in the image below. The task was to translate a ~947-character text (including the prompt instructions) from English to Danish. I have attached the notebook for reference.
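
    The notebook itself is attached at the bottom of the post; the following is only a simplified sketch of the comparison it runs, not the notebook verbatim: the same translation prompt is timed against the model id directly and against our deployment. Class and parameter names again assume the ibm-watsonx-ai SDK, and the deployment, space, and project ids are placeholders.

```python
# Sketch of the latency comparison: direct model call vs. call via a deployment.
# Names assume the ibm-watsonx-ai SDK; all ids and environment variables are placeholders.
import os
import statistics
import time

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

credentials = Credentials(
    url=os.environ["WATSONX_URL"],
    api_key=os.environ["WATSONX_APIKEY"],
)
params = {"max_new_tokens": 200}

# Route 1: call the hosted model directly by model id.
direct = ModelInference(
    model_id="mistralai/mixtral-8x7b-instruct-v01",
    credentials=credentials,
    project_id=os.environ["WATSONX_PROJECT_ID"],
    params=params,
)

# Route 2: call the same model through a deployment (pass a space id if the
# deployment lives in a deployment space rather than a project).
deployed = ModelInference(
    deployment_id=os.environ["WATSONX_DEPLOYMENT_ID"],
    credentials=credentials,
    space_id=os.environ["WATSONX_SPACE_ID"],
    params=params,
)

# ~947-character prompt (instructions plus the text to translate), as in the test.
prompt = "Translate the following text from English to Danish:\n..."

def time_calls(model, n=5):
    """Return per-call latencies in seconds for n identical requests."""
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        model.generate_text(prompt=prompt)
        timings.append(time.perf_counter() - start)
    return timings

for name, m in [("direct (model id)", direct), ("via deployment", deployed)]:
    t = time_calls(m)
    print(f"{name}: median {statistics.median(t):.2f} s over {len(t)} calls")
```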


    #watsonx.ai
    #PromptLab
    #GenerativeAI

    ------------------------------
    Nicolai Thomsen
    ------------------------------

    Attachment(s)

    wxai-latency-test.ipynb (16 KB)