
Frequently Asked Questions on Large Language Models for Enterprise Applications

By NICK PLOWDEN posted Fri March 08, 2024 09:57 AM

  

This post summarizes (and answers) the questions I have fielded most often from clients over the last six months.

Which models should I use? Which model(s) are the best?

Well, there is no such thing as the perfect model; it all depends on your context and your use case. Start with an open-source model such as Llama-2 70B, experiment from there, and understand the costs, guardrails, monitoring, performance, and impact on your MLOps pipeline before committing.

How are LLMs priced? What is the business case for using GenAI?

Almost all LLMs are priced on the same metrics:

  1. Tokens sent and received (per 1K tokens)
  2. Number of users
  3. Fine-tuning models (again, per 1K tokens)

If you have 500 users making 50 calls/day to the LLM, and each call is, say, 2K tokens (sent + received), your total monthly tokens = 500 * 50 * 2000 * 30 = 1,500,000,000 tokens. Inference pricing for Llama-2-70B on IBM’s watsonx.ai, for example, is $0.0018 per 1K tokens, so the monthly cost for inferencing = (1,500,000,000 tokens / 1000) * $0.0018 = $2,700.00.

For the same use case, if a different model is picked (say, flan-t5-xl-3b at $0.0006 per 1K tokens), the monthly cost for inferencing = (1,500,000,000 tokens / 1000) * $0.0006 = $900.00.
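For quick what-if analysis, the arithmetic above is easy to script. A minimal sketch in Python, using the illustrative per-1K-token rates from the examples (always confirm against your provider's current pricing page):

```python
# Back-of-the-envelope monthly inference cost estimator.
# Rates below are illustrative; confirm against your provider's pricing page.

def monthly_inference_cost(users, calls_per_day, tokens_per_call,
                           price_per_1k_tokens, days=30):
    """Estimate monthly LLM inference cost in dollars."""
    total_tokens = users * calls_per_day * tokens_per_call * days
    return (total_tokens / 1000) * price_per_1k_tokens

# The worked example above: 500 users, 50 calls/day, ~2K tokens per call.
print(monthly_inference_cost(500, 50, 2000, 0.0018))  # llama-2-70b rate -> 2700.0
print(monthly_inference_cost(500, 50, 2000, 0.0006))  # flan-t5-xl-3b rate -> 900.0
```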

See the OpenAI pricing page and the IBM watsonx.ai pricing page for current rates.

Can I deploy an LLM on premises (or on my choice of hyperscaler)?

Err, it depends.

  1. OpenAI’s ChatGPT and Other Models: OpenAI offers API access to its models, including ChatGPT, which can be integrated into applications running on any cloud or on-premises environment. However, the models themselves are hosted by OpenAI, meaning you cannot deploy the model binaries in your own environment; you can only interact with them through API calls. Similarly, Google hosts and provides API access to Gemini, just as Anthropic provides access to Claude.
  2. LLaMA and Other Open-Source Models by Meta: Meta has released models like Llama under open licenses, allowing far more flexibility in deployment. Llama is available on Hugging Face, AWS, Azure, and watsonx; a minimal self-hosting sketch follows.
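If you go the open-weights route, self-hosting can be as simple as the sketch below, assuming you have accepted Meta's license for the gated Llama-2 repository on Hugging Face and have a GPU with enough memory for the 7B chat variant (larger variants need proportionally more):

```python
# Minimal self-hosting sketch using Hugging Face transformers.
# Requires: transformers, accelerate, a granted-access token for the
# gated meta-llama repo, and sufficient GPU memory.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",  # spread weights across available GPUs automatically
)

result = generator("Summarize the benefits of self-hosting an LLM:",
                   max_new_tokens=100)
print(result[0]["generated_text"])
```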

What are parameters? (Or) I will just pick the model with the most parameters — that’s all we need, right?

In LLMs, parameters refer to the numerical values within the model’s artificial neural network. These values determine the strength of the connections between individual neurons of the network that powers the LLM, shaping the model’s ability to learn and process information.

  • Parameters act as tunable knobs during the training process. By adjusting these values based on the training data, the model learns complex relationships between words, sentences, and broader concepts.
  • The number of parameters directly influences the model’s capacity and complexity. Generally, more parameters enable the model to capture intricate patterns and nuances in the training data, potentially leading to improved performance on various language tasks.

Impact:

  • Enhanced performance: LLMs with larger parameter counts tend to achieve better results on benchmarks like language translation, question answering, and text summarization.
  • Greater computational demands: Training and running models with a vast number of parameters requires significant computational resources, including powerful GPUs and substantial memory (see the estimate below).
  • Potential for overfitting: If not carefully tuned, models with excessive parameters can become overly reliant on specific training data patterns, hindering their ability to generalize to unseen examples.

[Figure: LLM parameters and their impact]
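To put the computational-demand point in perspective, here is a quick back-of-the-envelope estimate of serving memory. It counts weights only at fp16; the KV cache and activations add more on top, so treat it as a floor:

```python
# Rough rule of thumb: weight memory = parameter count x bytes per parameter.
# fp16 uses 2 bytes per parameter; quantization (int8/int4) shrinks this.

def weight_memory_gb(params_billions, bytes_per_param=2):
    return params_billions * bytes_per_param  # billions of params * bytes ~= GB

for name, size_b in [("flan-t5-xl (3B)", 3), ("llama-2-70b (70B)", 70)]:
    print(f"{name}: ~{weight_memory_gb(size_b):.0f} GB of weights at fp16")
# llama-2-70b: ~140 GB -> multiple 80 GB GPUs before you serve a single token
```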

It’s important to remember that the number of parameters is just one factor influencing an LLM’s performance. Other aspects like the model architecture, training data quality, and optimization techniques also play crucial roles. While increasing parameters can often lead to improvements, it’s not a guaranteed path to superior performance, and careful consideration of trade-offs between complexity, efficiency, and generalizability is essential.

OK, so how do I select the right LLM?

Here’s a more comprehensive approach to selecting an LLM:

[Figure: Criteria for selecting the right LLM]

1. Understand your specific needs:

  • Task: What specific language tasks do you intend to use the LLM for in your project? (e.g., text generation, translation, question answering)
  • Data: How sensitive is the data that you will be sending to the LLM? Can that data leave your premises (or VPC in a cloud)?
  • Scalability: How will your project or use case scale? Will the number of API calls (which incur cost) grow as the project matures and adoption increases?

2. Model Characteristics:

  • Performance requirements: What level of accuracy, fluency, and efficiency do you need from the model?
  • Size: How big is the model? Can you host it privately if you need to? Consider the hardware and computational resources available to you; models with a vast number of parameters might require powerful GPUs or TPUs for efficient operation.
  • Cost: What is the cost of the model? How is it priced for inferencing and training? What would be the environmental impact of training and fine-tuning it?

3. Ethical Considerations:

  • What bias, fairness and transparency capabilities does the model (or the model provider) offer? Do you have visibility into the dataset used to train the model, and the model performance against various benchmarks?

4. Vendor Support:

  • How large and active is the community around the model (and its API)? What extensions or integrations are offered?
  • How good is the documentation? Can you fine-tune the model easily? Are there code examples to get you started?
  • How frequently is the model updated? When the model undergoes maintenance, is the service disrupted entirely?

Once you’ve narrowed down your options based on the factors mentioned above, experiment and compare.

  • If possible, try out different models on your specific data and tasks to gauge their performance firsthand (a minimal harness sketch follows this list). This can provide valuable insights beyond theoretical comparisons based solely on parameters.
  • Consider the parameter count as an additional data point. Generally, models with more parameters tend to perform better on complex tasks, but this is not always a guarantee.
  • Be cautious of overfitting, where models with excessive parameters become overly reliant on training data and struggle to generalize to unseen examples.
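Here is what a minimal comparison harness might look like. Everything in it is a placeholder: `call_model` stands in for however you reach each candidate (hosted API, local pipeline, etc.), and the test case is illustrative — substitute your own prompts and expected answers:

```python
# Minimal side-by-side evaluation harness (sketch).

def call_model(model_name: str, prompt: str) -> str:
    # Stub: replace with a real call to your provider's SDK or local pipeline.
    return "Refunds are accepted within 30 days of purchase."

test_cases = [
    {"prompt": "What is our refund window?", "must_contain": "30 days"},
    # ...add cases that reflect your real tasks and data
]

for model in ["llama-2-70b", "flan-t5-xl-3b"]:
    passed = 0
    for case in test_cases:
        answer = call_model(model, case["prompt"])
        passed += case["must_contain"].lower() in answer.lower()
    print(f"{model}: {passed}/{len(test_cases)} checks passed")
```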

Remember: Choosing the right LLM is an iterative process that requires careful consideration of your specific needs, available resources, and a holistic evaluation of various factors beyond just the number of parameters.

Can you guarantee your model will not hallucinate?

Completely eliminating hallucinations from LLMs remains an ongoing challenge.

Minimizing hallucinations in responses involves a combination of strategies during the model’s development, training, and deployment phases, as well as careful handling by users. Here are several approaches to consider:

[Figure: Techniques to minimize hallucinations]

1. During Development & Training:

  • High-quality, accurate, and relevant training data: Using clean, factual, and well-structured data helps reduce the likelihood of the model generating outputs based on biases, inconsistencies, or false information present in the training data.
  • Data filtering and debiasing: Techniques like removing irrelevant or harmful content from training data can further minimize the model’s exposure to misleading or biased information. Transferring knowledge from a larger, pre-trained model to a smaller one (knowledge distillation) can also improve accuracy and reduce the risk of hallucinations in the smaller model.
  • Model Architecture and Training Techniques: Regularization techniques (such as dropout, weight decay, and early stopping) methods help prevent overfitting, where the model becomes overly reliant on specific training patterns and struggles to generalize to unseen examples.

2. Runtime / Inference-time Adjustments:

  • Crafting clear and specific prompts: Providing detailed instructions and relevant context to the LLM can guide its generation process and reduce the likelihood of it going off on tangents or producing factually incorrect outputs. Providing the model with a few examples (Few-shot learning) of desired outputs for a given task can help it understand the expected format and content, leading to more accurate and relevant generation.
  • Confidence scoring: Assigning confidence scores to generated outputs (and filtering responses that fall below a threshold) can reduce unreliable or factually incorrect information; see the sketch after this list.
  • Human-in-the-loop validation: Integrating human oversight and review processes to catch and correct hallucinations before they are used in downstream applications.
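As a concrete illustration of the confidence-scoring idea, here is a minimal sketch. It assumes your serving stack can return per-token log-probabilities (many OpenAI-style APIs and local inference servers can); the threshold is arbitrary and should be calibrated on your own validation set:

```python
# Confidence-threshold filtering (sketch). Assumes per-token log-probs are
# available from your serving stack; calibrate the threshold on real data.
import math

def average_confidence(token_logprobs):
    """Geometric-mean token probability as a crude confidence score."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def filter_response(text, token_logprobs, threshold=0.70):
    if average_confidence(token_logprobs) < threshold:
        return "I'm not confident enough to answer; escalating to a human."
    return text

# High average log-prob -> the answer passes the gate.
print(filter_response("Paris is the capital of France.",
                      [-0.02, -0.05, -0.01, -0.03, -0.04]))
```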

How can you protect us from any IP infringement claims?

One of the harder problems in building LLMs is sourcing high-quality, accurate, unbiased training data.

Most LLMs are trained on corpora built by crawling the web, i.e., Common Crawl — an ever-growing corpus of over 250 billion pages amassed over the past 17 years, adding 3 to 5 billion pages per month. It feeds a wealth of training sources such as C4, and is typically combined with GitHub, Books, Wikipedia, StackExchange, and similar datasets; but none of the crawled content is vetted, so there is no guarantee about the accuracy, biases, or other anomalies that may exist in it.

[Figure: Approaches for protecting clients from IP infringement in LLM use]

Protecting clients from IP infringement claims arising from LLM use is a complex issue with no easy solutions. LLM providers (such as IBM, OpenAI, Google, Anthropic, etc.) can implement various strategies to minimize the risk, but there are inherent limitations and ongoing legal uncertainties. Here are some strategies that LLM providers are using to protect their clients:

  1. Transparency: Provide visibility into the datasets used to train the model. For example, LLM360 releases its training data with the full data sequence, source code, logs, and metrics.
  2. Content Filtering: Ensure the training data used for developing LLMs does not infringe on existing copyrights; that means training on data that is in the public domain, licensed for such use, or created specifically for training purposes.
  3. User Agreements: Provide clear guidance to users on the terms of use and IP policies regarding LLM output, dependence on data sources, and so on.
  4. Ethics Oversight: The LLM provider should have an AI ethics board that regularly audits LLM outputs to catch potential IP infringements.
  5. Content Partnerships: LLM providers should collaborate with content providers to ensure respectful and lawful use of content.
  6. User Education: Providing education and resources to users about IP laws and the potential risks of using AI-generated content empowers them to make informed decisions and take necessary precautions, such as obtaining licenses for commercial use of generated content.

I am starting with internal use cases. I don’t need any governance. Right?

Umm, starting with internal use cases for AI or any technology project is a common approach, especially when you’re in the exploratory phase or aiming to understand the technology’s capabilities and limitations without the pressure of external expectations. However, the assumption that you don’t need any governance for internal use cases is not entirely accurate. Even for internal projects, certain aspects of governance are crucial for ensuring the project’s success and mitigating risks.

Here’s why governance is important, even for internal use cases:

  • Data Privacy and Security: Even if a project is intended for internal use, it may involve processing or accessing sensitive company data. Governance policies help in ensuring that data is handled securely and in compliance with any relevant privacy laws and company policies.
  • Scalability and Maintenance: Governance frameworks can set the stage for future scalability and maintenance of the project. Without governance, an internal project may be developed in a way that makes it difficult to scale or maintain.
  • Improves Transparency and Accountability: Establishing clear guidelines and oversight mechanisms can enhance transparency around GenAI development and deployment, making it easier to understand how decisions are made and who is accountable for outcomes.
  • Facilitates Responsible Innovation: A well-defined governance framework can encourage responsible innovation by promoting ethical considerations, fairness, and alignment with organizational values throughout the GenAI lifecycle.

That said, I agree: start small and simple, and add the rest over time. Begin by focusing on transparency, monitor closely, and scale up governance as use cases expand or risk increases.

We have a mature AI practice with “thousands of predictive AI models” in production. How do we add governance?

Oh boy. (My IBM bias kicking in) Here’s what I suggest:

1. Assess Current State:

  • Inventory and document your existing models: Catalog all models in production, including their purpose, data sources, training methods, performance metrics, and deployment environments (a sketch of such a record follows this list). IBM’s watsonx.governance can help discover all models by importing model metadata, regardless of where the models are running.
  • Identify potential risks and biases: Start monitoring the models with watsonx.governance (specifically, the OpenScale component) to analyze your models for potential biases, fairness issues, and explainability challenges.
  • Evaluate compliance requirements: Understand relevant regulations and industry standards that apply to your AI practices and data usage.
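As a concrete starting point for the inventory step, here is a minimal sketch of the metadata worth capturing per model. This is a plain Python data structure, not the watsonx.governance API; the field names are illustrative:

```python
# Minimal model-inventory record (sketch). Field names are illustrative;
# a governance tool would capture these plus lineage and approvals.
from dataclasses import dataclass, field

@dataclass
class ModelRecord:
    name: str
    purpose: str
    owner: str
    data_sources: list
    training_method: str        # e.g. "gradient boosting", "fine-tuned LLM"
    deployment_env: str         # e.g. "on-prem k8s", "AWS SageMaker"
    metrics: dict = field(default_factory=dict)

inventory = [
    ModelRecord("churn-predictor-v3", "flag at-risk accounts", "cs-analytics",
                ["crm_events", "billing_history"], "gradient boosting",
                "on-prem k8s", {"auc": 0.87}),
]
print(f"{len(inventory)} model(s) cataloged")
```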

2. Develop a Governance Framework:

  • Define core principles: Establish clear principles for responsible AI development and deployment, considering fairness, transparency, accountability, and alignment with your organizational values. For example, see IBM’s principles & pillars for AI.
  • Establish roles and responsibilities: Assign clear roles and responsibilities for various aspects of governance, including model development, deployment, monitoring, and oversight.
  • Define governance processes: Implement processes for model development lifecycle management, risk assessment, bias mitigation, and continuous monitoring.

3. Benefits of IBM watsonx.governance:

[Figure: Example watsonx.governance dashboard for AI governance]

  • Centralized model management: watsonx.governance provides a central platform to inventory, track, and manage all your AI models, streamlining governance processes.
  • Automated risk assessment: The platform utilizes AI-powered risk assessments to identify potential biases, fairness issues, and explainability challenges in your models.
  • Regulatory compliance support: The platform offers features to help you track compliance with relevant regulations and document your governance practices.

Remember, implementing effective AI governance is an ongoing process. By leveraging a tool like IBM watsonx.governance, and continuously improving your practices, you can ensure responsible and ethical development and deployment of your AI models, even within a large and established practice.

We have a lot of documents with “if and then” conditions. We don’t think LLMs will be able to parse these documents and give valid answers. (More of a statement than a question.)

It’s not entirely accurate to say that LLMs are incapable of parsing documents with “if-then” conditions and providing valid answers; in practice, they can be quite proficient at understanding such documents.

The tokenization process breaks down text into manageable pieces which helps in processing and understanding the text at a granular level. One can argue tokenization enables parsing of conditional statements, allowing the model to recognize and differentiate between the conditional (“if”) part and the consequent (“then”) part of statements.

There is some recent research that suggests LLMs can learn causal relationships, although there are limitations. LLMs might struggle with highly complex or nested “if-then” statements involving intricate logical reasoning or domain-specific knowledge. The quality and relevance of the training data significantly impact the LLM’s ability to handle specific types of conditional statements. Further, some experiments show that, compared to text-only LLMs, Code-LLMs with code prompts are significantly better at causal reasoning.

While LLMs might not be perfect at handling all “if-then” scenarios, they have shown promising capabilities in this area. Here are some suggestions to improve the likelihood of success, supported by research:

  • Choose an LLM specifically designed for code or logic processing: LLMs trained on code- or logic-heavy datasets are potentially better suited for reasoning tasks (see the sketch after this list).
  • Provide clear and well-structured data: Ensure your documents are well-formatted and use consistent language for “if-then” statements.
  • Start with simpler examples: Begin by testing the LLM with straightforward “if-then” conditions and gradually increase complexity as it demonstrates success.
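To make the code-prompt suggestion concrete, here is a sketch of recasting an “if-then” policy as code inside the prompt, which the research above suggests helps code-trained LLMs reason about conditions. The policy content and expected answer are illustrative:

```python
# Recasting an "if-then" policy as code inside the prompt (sketch).
# The policy content is illustrative; send `prompt` to a code-capable LLM.

policy_as_code = '''
def shipping_cost(order_total, is_member):
    if is_member:
        return 0.00
    if order_total >= 50:
        return 0.00
    return 5.99
'''

prompt = (
    "The following function encodes our shipping policy:\n"
    f"{policy_as_code}\n"
    "Question: A non-member places a $42 order. What do they pay for "
    "shipping, and which rule applies?"
)
# Expected answer: $5.99, because the order is under $50 and the
# customer is not a member.
print(prompt)
```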

Our data is not accurate but we want to get started with GenAI. Should we focus on data accuracy first?

The old adage of “garbage in, garbage out” still holds true. Building applications using GenAI models that fetch inaccurate data for their context (for RAG use cases), leads to unreliable and potentially misleading outputs. GenAI models can amplify existing biases and inaccuracies present in the training data, potentially leading to discriminatory or harmful outputs.

Data integrity is massively important as it enables organizations to avoid costly consequences, make better decisions, implement more reliable processes and reduce compliance issues. It can also lead to better customer and stakeholder experiences, increased revenue and market share and reduced risk. Without quality data, companies will have a hard time managing these increasingly complex applications and ecosystems.

Strategies for Moving Forward:

1. Data Cleaning and Improvement:

  • Focus on data quality: Prioritize cleaning, correcting, and verifying your data to ensure its accuracy and relevance for your intended GenAI application (a sketch of simple quality gates follows this list).
  • Implement data governance practices: Establish processes to ensure data quality is maintained throughout the GenAI lifecycle, from collection to deployment.

2. Start Small and Experiment:

  • Pilot projects: Begin with small-scale GenAI projects using high-quality data to gain experience and assess the potential benefits and limitations of the technology. Focus on a specific, well-defined problem and use a small dataset to train your model.
  • Continuously monitor and evaluate: Monitor and evaluate the performance of your GenAI models, making adjustments to data and training processes as needed.
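Here is a minimal sketch of pre-ingestion quality gates for a RAG corpus. The field names and thresholds are assumptions; tune them to your own data:

```python
# Pre-ingestion quality gates for a RAG corpus (sketch).
# Field names and thresholds are assumptions; adjust for your data.
from datetime import datetime, timedelta

def passes_quality_gate(doc, max_age_days=365):
    text = doc.get("text", "")
    if len(text.split()) < 50:          # too short to be useful context
        return False
    if not doc.get("source"):           # unattributable content is risky
        return False
    updated = doc.get("last_updated")
    if updated and datetime.now() - updated > timedelta(days=max_age_days):
        return False                    # likely stale
    return True

docs = [{"text": "refund policy " * 60, "source": "policy-wiki",
         "last_updated": datetime.now() - timedelta(days=30)}]
clean_corpus = [d for d in docs if passes_quality_gate(d)]
print(f"{len(clean_corpus)}/{len(docs)} documents passed the gate")
```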

We want to build our own LLMs. How do we get started?

Allow me to introduce you to my “LLM tuning hierarchy” of needs model :)

[Figure: LLM tuning hierarchy]

  • Building from Scratch: This requires the most effort, as it involves designing the entire LLM architecture, training it on a massive dataset, and optimizing its performance. This is a complex and resource-intensive undertaking, typically attempted by research institutions or large companies with significant expertise and computational resources. Your team must understand the theoretical foundations of LLM architectures; collect, clean, and prepare massive datasets; acquire and manage a large-scale GPU farm; and engage in ongoing research and development to ultimately build the LLM.
  • Retraining an LLM: This involves taking an existing pre-trained LLM and training it on a new dataset specific to your task. This requires less effort than building from scratch but still demands significant computational resources and expertise in training large language models. Your team must still gather the domain-specific datasets; modify (or tune) the hyperparameters; and ultimately incur the lower (but still substantial) costs of the GPU infrastructure.
  • Fine-tuning an LLM: This involves making smaller adjustments to an existing pre-trained LLM by tuning specific parameters on a smaller dataset relevant to your task. This requires less effort than retraining and is often used for tasks like question answering or sentiment analysis. Your team will adapt a pre-trained model with a modest dataset (so focus on data quality, as discussed above); tune the model for the specific use case; and incur proportionally lower costs. A minimal fine-tuning sketch follows this list.
  • Prompt Tuning/Prompt Engineering: This involves crafting specific prompts and instructions that guide the LLM towards generating the desired outputs. This requires the least effort but still necessitates understanding the LLM’s capabilities and tailoring prompts effectively. Your team must provide a minimal dataset; and focus on prompt design for specific requests.
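To make the fine-tuning rung concrete, here is a minimal parameter-efficient (LoRA) setup using the Hugging Face peft library. The model name assumes granted access to the gated Llama-2 repository, and the actual training loop (a Trainer plus your dataset) is omitted:

```python
# Minimal LoRA fine-tuning setup (sketch). Requires transformers and peft,
# plus access to the gated meta-llama repo; the training loop is omitted.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically <1% of the base model's weights
# From here, train with transformers.Trainer (or your own loop) on your
# domain-specific dataset.
```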