Improve Inference Speed Without Direct CPU/GPU Configuration for LLM
Hi Maurizio, you are not alone in trying to figure this out. Currently, watsonx.ai does not allow direct configuration of CPU or GPU resources the way typical IaaS or PaaS solutions do (e.g., containers or VMs). So if you are looking for a way to tweak the number of CPUs or GPUs, that option simply does not exist on the Essentials plan today.
In practice, I have noticed that watsonx.ai handles most workloads comfortably when max_tokens is in the range of 200–500; the lower the limit, the more responsive the call feels. Once your application starts to exceed 1000 tokens, you will likely notice a slowdown in response time, simply because the LLM needs more time to generate the longer output.
From my experimentation with various inference parameters on watsonx.ai LLMs, here are some ways to improve performance:
- Enable Streamed Responses
- Streamed output lets you start rendering tokens as soon as they are generated, rather than waiting for the entire completion. It does not drastically reduce total generation time, but seeing the text appear as it is produced feels much more responsive to the end user (a minimal streaming sketch follows this list).
- Adjust max_tokens
- You currently have it set to 1200, and higher token limits add latency directly. Tuning your prompts to be more concise and lowering max_tokens can improve responsiveness. Study the output you actually need and tighten your prompt so it works with a lower max_tokens (the first sketch after this list uses 300 as an example).
- Batch or Parallelize Requests
- If you are sending multiple requests to the LLM for a single objective, consider batching them or using multiple instances to run the tasks in parallel. I am currently experimenting with this (see the second sketch after this list).
- Cache Frequent Calls
- If your RAG pipeline calls the LLM repeatedly for the same documents, consider implementing caching so you avoid redundant requests that slow down the overall response (see the third sketch after this list).
- Consider Greedy vs. Sampling
- Greedy decoding is generally faster because it picks the single most probable token at each step. Sampling-based methods (top-k, top-p, temperature, etc.) add variability that can also introduce additional computational overhead. I am experimenting with this as well; the first sketch after this list uses greedy decoding.
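To make the streaming, max_tokens, and decoding points concrete, here is a rough sketch of how I have been calling the model with the ibm-watsonx-ai Python SDK. The model_id, project_id, and credentials are placeholders, and the parameter names (decoding_method, max_new_tokens) are the ones the SDK accepted in my tests, so verify them against the current documentation for your plan:

```python
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

# Placeholder credentials -- substitute your own endpoint, API key, and project.
credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",
    api_key="YOUR_API_KEY",
)

model = ModelInference(
    model_id="ibm/granite-13b-instruct-v2",  # placeholder model id
    credentials=credentials,
    project_id="YOUR_PROJECT_ID",
)

# Greedy decoding skips sampling overhead, and a lower max_new_tokens bounds
# how long a single generation can run.
params = {
    "decoding_method": "greedy",
    "max_new_tokens": 300,
}

prompt = "Summarize the retrieved passages in three sentences:\n..."

# Stream tokens as they are produced instead of waiting for the full completion.
for chunk in model.generate_text_stream(prompt=prompt, params=params):
    print(chunk, end="", flush=True)
```

Streaming on its own does not shorten the total completion time, but combined with greedy decoding and a tighter max_new_tokens the perceived latency in my tests dropped noticeably.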
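For parallelizing independent calls, this is the kind of plain-Python scaffolding I am experimenting with; it reuses the model and params objects from the sketch above, so treat it as an illustration rather than a drop-in solution:

```python
from concurrent.futures import ThreadPoolExecutor

def generate_many(prompts, max_workers=4):
    # Running independent prompts concurrently does not make any single
    # completion faster, but it hides per-request latency when one objective
    # needs several separate LLM calls.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(
            lambda p: model.generate_text(prompt=p, params=params),
            prompts,
        ))

answers = generate_many([
    "Summarize section 1 of the retrieved context: ...",
    "Summarize section 2 of the retrieved context: ...",
])
```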
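And for caching, a minimal in-process version can be as simple as the sketch below (again reusing model and params from the first sketch); in a production RAG service you would likely swap the functools cache for something shared such as Redis:

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_generate(prompt: str) -> str:
    # Identical prompts (for example, repeated questions over the same
    # document chunk) are answered from this in-process cache instead of
    # triggering another model call.
    return model.generate_text(prompt=prompt, params=params)
```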
Ultimately, improving your LLM’s inference speed under these constraints boils down to optimizing usage rather than reconfiguring hardware. By adjusting your token limits, decoding strategy, request batching, and caching, you can often achieve a noticeable boost in performance without the ability to directly allocate more CPUs or GPUs.
If you need further scalability, you may want to explore the higher-tier plans or contact IBM Support at support.ibm.com to learn about additional options for improving inference performance for your use case.
Resources
RAG Development on watsonx.ai - https://www.ibm.com/products/watsonx-ai/rag-development
IBM watsonx.ai Documentation - https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/welcome-main.html
IBM watsonx.ai Developer Hub - https://www.ibm.com/watsonx/developer/
IBM Support - https://www.ibm.com/mysupport/s/?language=en_US