Hi all
We are using watsonx.ai to deploy and consume foundation models, specifically the newly added Mixtral-8x7B model. However, inference with Mixtral-8x7B is much slower than on other, larger LLM inference platforms (OpenAI, HF Inference). I suspect this comes down to the compute allocated. Right now, a call with ~200 input tokens and 50 output tokens can take anywhere from 30 seconds to over a minute with a LangChain ReAct agent, which is far too slow to be useful in a customer-facing scenario. If we run the exact same model elsewhere, the latency is roughly 20% of this.
Question 1: Has anyone faced the same issue?
Question 2: Does anyone know how to increase the compute allocation, or at least see which environment (which GPUs, number of vCPU cores, etc.) we are running on?
Question 3: Is there a way we can decrease the latency?
Latency test
IBM Client Engineering told us that this would likely be faster when run through Prompt Lab directly. So we ran a speed/latency test comparing calls made directly from Prompt Lab against calls made via a deployment. See the results in the image below. The task was to translate a ~947-character text (including prompt instructions) from English to Danish. I have attached the notebook for reference.
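In case it is useful, here is a minimal sketch of how such a direct-call vs. deployment comparison can be scripted with the ibm_watsonx_ai Python SDK. The attached notebook has the actual test code; the endpoint URL, model ID, prompt, generation parameters, and project/space/deployment IDs below are placeholders, and the deployment call assumes the deployment lives in a deployment space (hence space_id).

```python
import time

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

# Placeholder values -- replace with your own region URL, API key and IDs.
credentials = Credentials(
    url="https://eu-de.ml.cloud.ibm.com",
    api_key="<YOUR_IBM_CLOUD_API_KEY>",
)
PROJECT_ID = "<YOUR_PROJECT_ID>"        # project the foundation model runs in
SPACE_ID = "<YOUR_SPACE_ID>"            # deployment space holding the deployment
DEPLOYMENT_ID = "<YOUR_DEPLOYMENT_ID>"

PROMPT = "Translate the following text from English to Danish:\n<~947-character text here>"
PARAMS = {"decoding_method": "greedy", "max_new_tokens": 200}


def time_call(label: str, model: ModelInference) -> None:
    """Run one generation and print the wall-clock latency."""
    start = time.perf_counter()
    model.generate_text(prompt=PROMPT)
    print(f"{label}: {time.perf_counter() - start:.2f} s")


# Path 1: call the foundation model directly by model_id
# (essentially what Prompt Lab does behind the scenes).
direct_model = ModelInference(
    model_id="mistralai/mixtral-8x7b-instruct-v01",
    credentials=credentials,
    project_id=PROJECT_ID,
    params=PARAMS,
)
time_call("Direct model_id call", direct_model)

# Path 2: call the same model through a deployment endpoint.
deployed_model = ModelInference(
    deployment_id=DEPLOYMENT_ID,
    credentials=credentials,
    space_id=SPACE_ID,
    params=PARAMS,
)
time_call("Deployment call", deployed_model)
```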
#watsonx.ai #PromptLab #GenerativeAI
------------------------------
Nicolai Thomsen
------------------------------