IBM TechXchange Group


Democratizing AI: My Interview with Brad Topol on InstructLab

By Marius Ciortea posted Wed July 10, 2024 11:10 AM

  


As I've been exploring the world of artificial intelligence, I've become fascinated by the open-source initiatives that are democratizing AI development. One project that's caught my attention is InstructLab (https://instructlab.ai), an innovative collaboration between IBM and Red Hat. I recently had the privilege of chatting over Slack with Brad Topol, IBM Distinguished Engineer, Director of Open Technologies, and Chief Developer Advocate, and one of the people behind the project.

In our Slack conversation, Brad offered me a deep dive into InstructLab's community-driven approach to generative AI and shared his insights on how it could reshape the way we develop and fine-tune large language models. I'm excited to share our exchange with you, as I believe it sheds light on a transformative shift in the AI landscape.

For those of you who find this topic as intriguing as I do, I'd highly recommend checking out Brad's upcoming session at TechXchange Day on July 16th (http://ibm.biz/techxchangeday). He'll be presenting "Leveraging InstructLab for Community Based Tuning and Enterprise Customization of LLMs," which promises to be an excellent opportunity to explore the practical applications of this technology.

Now, let's dive into my interview with Brad and learn more about InstructLab.

Brad, can you tell me about the origins of InstructLab and what motivated IBM and Red Hat to initiate this project?

Brad: InstructLab originated as an IBM research project, first published in early March 2024. The key innovation was developing a method to quickly fine-tune large language models using question-answer pairs and a "teacher model" to generate synthetic data. This approach allows for efficient injection of new knowledge into large language models.

The potential of this technique to enable community contributions to large language models excited IBM and Red Hat executives. They saw it as a way to overcome the inefficiency of creating numerous model variants, instead allowing a whole community to benefit from shared knowledge contributions.

A new development team, composed of the original IBM research team and elite open source contributors from both IBM and Red Hat, was quickly formed. In about 80 days, they transformed the proof of concept into a fully featured open source project, complete with a robust command line interface, refined taxonomy model, governance documents, and more.

I'm intrigued by how InstructLab emphasizes community contributions from both technical and non-technical experts. How do you see this diversity of input improving the model's performance?

Brad: The beauty of InstructLab is its accessibility. Contributors can add new knowledge through question-answer pairs and associated markdown documents, or add new skills solely through questions and answers written in YAML. This approach significantly lowers the technical barrier, enabling a much broader range of subject matter experts to contribute. By allowing more diverse inputs, we can improve the model's performance across a wider range of topics and applications.
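To make this concrete, a skill contribution boils down to a single `qna.yaml` file placed in the taxonomy tree. The sketch below is illustrative only: the file path, topic, and example content are invented, and the exact schema fields have evolved over time, so check the InstructLab taxonomy repository before submitting a real contribution.

```yaml
# taxonomy/compositional_skills/writing/freeform/haiku/qna.yaml
# Illustrative sketch -- verify field names against the current
# InstructLab taxonomy schema before opening a pull request.
version: 2
task_description: Teach the model to write haiku on a given topic.
created_by: your-github-username
seed_examples:
  - question: Write a haiku about open source software.
    answer: |
      Many hands typing
      code shared freely, bug by bug
      the commons grows strong
  - question: Write a haiku about morning coffee.
    answer: |
      Steam curls from the cup
      the first sip wakes sleeping thoughts
      day begins in warmth
```

A handful of seed examples like these is all the teacher model needs to generate a much larger body of synthetic training data, which is what keeps the barrier to contribution so low.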

The project claims to be cost-effective. Can you elaborate on how InstructLab manages to keep costs down while still producing high-quality models?

Brad: InstructLab achieves cost-effectiveness in several ways. First, its fine-tuning phase is very efficient and runs quickly, contrasting with other training approaches that can take months. Additionally, it allows users to start with a smaller model and enhance it with new knowledge, skills, and synthetic data generated from a larger "teacher" model. The result is a highly customized smaller model that often outperforms larger, non-tuned models on specific tasks, offering better performance at a lower cost.

Could you walk me through the process of how a non-technical person might contribute their knowledge or skills to InstructLab?

Brad: The process is quite straightforward:

1. Users chat with their chosen model using InstructLab on their laptop.

2. If they encounter an incorrect answer, they can create new question-answer pairs and a markdown document.

3. This new content is added to the appropriate location in the InstructLab taxonomy folder.

4. Users can then validate the format, generate synthetic data, and train a new version of the model on their laptop.

5. After testing the new model and confirming improvements, they can submit their knowledge as a pull request for the InstructLab triage team to review and potentially add to the original model.

This process allows even non-technical users to contribute meaningful improvements to the model.
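The five steps above map onto the InstructLab command line roughly as follows. This is a sketch: the command names reflect the `ilab` CLI as it existed around the time of this interview and may differ in newer releases, so consult `ilab --help` for the current equivalents.

```shell
# Illustrative ilab workflow (mid-2024 CLI; names may have changed)
ilab init          # one-time setup: config plus a local taxonomy clone
ilab download      # fetch a quantized base model to run on your laptop
ilab serve         # serve the model locally (in a separate terminal)
ilab chat          # step 1: chat with the model and spot a wrong answer

# steps 2-3: write a qna.yaml (plus a markdown document for knowledge
# contributions) into the appropriate taxonomy folder

ilab diff          # step 4: validate the format of the new taxonomy entries
ilab generate      # step 4: generate synthetic data from the seed examples
ilab train         # step 4: fine-tune a new version of the model locally
ilab chat          # step 5: re-chat to confirm the improvement, then open
                   #         a pull request against the taxonomy repository
```

The loop is deliberately short: everything up to the pull request runs on a laptop, which is what makes the workflow approachable for non-technical contributors.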

Finally, Brad, what's your vision for InstructLab's future? How do you see it evolving in the coming years?

Brad: We see InstructLab continuing to grow, becoming more efficient and user-friendly. We anticipate developing better tools that will allow users to contribute knowledge and skills without needing to know YAML or markdown.

We also expect to see many enterprise companies using InstructLab behind their firewalls to customize large language models with proprietary data. This will enhance the efficacy of their AI solutions beyond what's possible with out-of-the-box large language models.

Furthermore, we envision companies leveraging InstructLab to improve results when using existing Retrieval-Augmented Generation (RAG) techniques, further expanding its utility in various AI applications.

I'm excited about InstructLab's potential. It represents a significant step forward in democratizing AI development and improvement. By lowering barriers to contribution and fostering a diverse community of contributors, it promises to accelerate the pace of AI advancement while ensuring that the resulting models are more representative and capable across a wide range of domains. As the project evolves, it will be fascinating to see how it shapes the future of AI development and application in both open-source and enterprise contexts.

Let me know your thoughts in the comments below.

Marius Ciortea in collaboration with Brad Topol and proofread by Claude

#techxchangeday #instructlab #ai #opensourceai #techxchange #ibm #redhat #LLM #ibmdeveloper
