Improve Inference Speed Without Direct CPU/GPU Configuration for LLM
Hi Maurizio, you are not alone in trying to figure this out. Currently, watsonx.ai does not allow direct configuration of CPU or GPU resources the way typical IaaS or PaaS solutions do (e.g., containers or VMs). So if you are looking for a way to tweak the number of CPUs or GPUs, that option simply does not exist on the Essentials plan today.
In practice, I have noticed that watsonx.ai handles most workloads comfortably when max_tokens is in the range of 200–500; the lower the limit, the more responsive the call feels. Once your application starts to exceed 1000 tokens, you will likely notice a slowdown in response time, simply because the LLM needs more time to generate the longer output.
From my experimentation with various inference parameters on watsonx.ai LLMs, here are some ways to improve performance:
- Enable Streamed Responses
- Streamed output lets you start rendering tokens as soon as they are generated, rather than waiting for the entire completion. It does not drastically reduce total generation time, but seeing the text appear as it is produced feels much more responsive to the end user (a minimal streaming sketch follows this list).
- Adjust max_tokens
- You currently have it set to 1200, and higher token limits add latency directly. Tuning your prompts to be more concise and lowering max_tokens can improve responsiveness. Study the output you actually need and tighten your prompt so it works with a lower max_tokens (the first sketch after this list uses 300 as an example).
- Batch or Parallelize Requests
- If you are sending multiple requests to the LLM for a single objective, consider batching them or using multiple instances to run the tasks in parallel. I am currently experimenting with this (see the second sketch after this list).
- Cache Frequent Calls
- If your RAG pipeline calls the LLM repeatedly for the same documents, consider implementing caching so you avoid redundant requests that slow down the overall response (see the third sketch after this list).
- Consider Greedy vs. Sampling
- Greedy decoding is generally faster because it picks the single most probable token at each step. Sampling-based methods (top-k, top-p, temperature, etc.) add variability that can also introduce additional computational overhead. I am experimenting with this as well; the first sketch after this list uses greedy decoding.
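To make the streaming, max_tokens, and decoding points concrete, here is a rough sketch of how I have been calling the model with the ibm-watsonx-ai Python SDK. The model_id, project_id, and credentials are placeholders, and the parameter names (decoding_method, max_new_tokens) are the ones the SDK accepted in my tests, so verify them against the current documentation for your plan:

```python
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

# Placeholder credentials -- substitute your own endpoint, API key, and project.
credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",
    api_key="YOUR_API_KEY",
)

model = ModelInference(
    model_id="ibm/granite-13b-instruct-v2",  # placeholder model id
    credentials=credentials,
    project_id="YOUR_PROJECT_ID",
)

# Greedy decoding skips sampling overhead, and a lower max_new_tokens bounds
# how long a single generation can run.
params = {
    "decoding_method": "greedy",
    "max_new_tokens": 300,
}

prompt = "Summarize the retrieved passages in three sentences:\n..."

# Stream tokens as they are produced instead of waiting for the full completion.
for chunk in model.generate_text_stream(prompt=prompt, params=params):
    print(chunk, end="", flush=True)
```

Streaming on its own does not shorten the total completion time, but combined with greedy decoding and a tighter max_new_tokens the perceived latency in my tests dropped noticeably.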
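For parallelizing independent calls, this is the kind of plain-Python scaffolding I am experimenting with; it reuses the model and params objects from the sketch above, so treat it as an illustration rather than a drop-in solution:

```python
from concurrent.futures import ThreadPoolExecutor

def generate_many(prompts, max_workers=4):
    # Running independent prompts concurrently does not make any single
    # completion faster, but it hides per-request latency when one objective
    # needs several separate LLM calls.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(
            lambda p: model.generate_text(prompt=p, params=params),
            prompts,
        ))

answers = generate_many([
    "Summarize section 1 of the retrieved context: ...",
    "Summarize section 2 of the retrieved context: ...",
])
```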
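And for caching, a minimal in-process version can be as simple as the sketch below (again reusing model and params from the first sketch); in a production RAG service you would likely swap the functools cache for something shared such as Redis:

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_generate(prompt: str) -> str:
    # Identical prompts (for example, repeated questions over the same
    # document chunk) are answered from this in-process cache instead of
    # triggering another model call.
    return model.generate_text(prompt=prompt, params=params)
```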
Ultimately, improving your LLM’s inference speed under these constraints boils down to optimizing usage rather than reconfiguring hardware. By adjusting your token limits, decoding strategy, request batching, and caching, you can often achieve a noticeable boost in performance without the ability to directly allocate more CPUs or GPUs.
If you need further scalability, you may want to explore the higher-tier plans or contact IBM Support at support.ibm.com to learn about additional options for improving inference performance for your use case.
Resources
RAG Development on watsonx.ai - https://www.ibm.com/products/watsonx-ai/rag-development
IBM watsonx.ai Documentation - https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/welcome-main.html
IBM watsonx.ai Developer Hub - https://www.ibm.com/watsonx/developer/
IBM Support - https://www.ibm.com/mysupport/s/?language=en_US