Learn how to run Llama2 in Watsonx step by step
Llama2 from Meta AI is the second version of their open-source large language model, now available free of charge for research and commercial use. Llama2 is a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. But running big models is still challenging. To run it, we will need a large amount of resources provided by the Watsonx platform. To increase the challenge, we will install and run a quantized version of Llama2 (using GPTQ). The model we will use is TheBloke/Llama-2-7B-GPTQ.
Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description
main | 4 | 128 | False | 3.90 GB | True | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options.
AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm. GPTQ stands for Generative Post-Training Quantization and is used for generative pre-trained transformers. GPTQ is a one-shot weight quantization method based on approximate second-order information that is both highly accurate and highly efficient. GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bit width down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline (source: https://arxiv.org/pdf/2210.17323.pdf).
Due to their massive size, even inference with large GPT models may require several high-performance GPUs, which limits the usability of these models. By using this quantization, we will be able to run Llama2 with much lower resource consumption.
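For illustration, the quantization settings behind the main branch in the table above (4-bit weights, group size 128, no act order) could be expressed with AutoGPTQ's BaseQuantizeConfig. This is only a sketch for context; in this tutorial we download a model that TheBloke has already quantized, so we never run the quantization step ourselves.
from auto_gptq import BaseQuantizeConfig
# Illustrative only: settings matching the "main" GPTQ branch above
# (4-bit weights, group size 128, act order / desc_act disabled).
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)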
First of all, you need to verify that your Watson Studio instance has a Professional plan. You can do this by clicking Administration / Service instances.
Verify your Plan.
The Professional plan allows you to use GPUs. If that is not the case, click the triple dots and select Upgrade.
As you can see, this plan gives access to NVIDIA V100 GPUs.
Now create a New task / Work with data and models in Python or R notebooks, then select a Python runtime that includes a GPU.
You arrive in the Notebook. In the first cell, check that CUDA is available with a few lines of Python such as:
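import torch
# Check that the notebook runtime exposes a CUDA-capable GPU to PyTorch
print(torch.cuda.is_available())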
This verifies that CUDA is available. If torch.cuda.is_available() returns True, we next need to know where CUDA is installed. Due to a bug (?), CUDA_HOME is not defined in the runtime and must be configured manually. Copy and paste the following cells.
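A convenient way to locate the CUDA-enabled conda environment is to print the runtime's PATH, for example:
!echo $PATH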
/opt/conda/envs/Python-RT23.1-CUDA/bin:/opt/conda/condabin:/opt/conda/bin:/usr/bin:/opt/ibm/dsdriver/bin
import os
# Point CUDA_HOME at the CUDA-enabled conda environment found in the PATH above
os.environ["CUDA_HOME"] = "/opt/conda/envs/Python-RT23.1-CUDA/"
We need to install the required Python modules.
!pip install transformers
!pip install auto-gptq
Now let's write the Python class to generate the tokens.
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import time
class Llama2_7B_gptq:
    """
    Branch                        Bits  Group Size  Act Order (desc_act)  File Size  ExLlama Compatible?  Made With  Description
    main                          4     128         False                 3.90 GB    True                 AutoGPTQ   Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options.
    gptq-4bit-32g-actorder_True   4     32          True                  4.28 GB    True                 AutoGPTQ   4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed.
    gptq-4bit-64g-actorder_True   4     64          True                  4.02 GB    True                 AutoGPTQ   4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed.
    gptq-4bit-128g-actorder_True  4     128         True                  3.90 GB    True                 AutoGPTQ   4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed.
    """
    MODEL_NAME_OR_PATH = "TheBloke/Llama-2-7B-GPTQ"
    REVISION = "main"  # See branches above - e.g. gptq-4bit-128g-actorder_True
    MODEL_BASENAME = "model"
    DEVICE = "cuda:0"
    USE_TRITON = False  # Works only on Linux today

    def __init__(self):
        start_time = time.time()
        # Load the tokenizer and the pre-quantized GPTQ model from the Hugging Face Hub
        self.tokenizer = AutoTokenizer.from_pretrained(self.MODEL_NAME_OR_PATH, use_fast=True)
        self.model = AutoGPTQForCausalLM.from_quantized(self.MODEL_NAME_OR_PATH,
                                                        model_basename=self.MODEL_BASENAME,
                                                        revision=self.REVISION,
                                                        use_safetensors=True,
                                                        trust_remote_code=True,
                                                        device=self.DEVICE,
                                                        use_triton=self.USE_TRITON,
                                                        quantize_config=None)
        print("Loading model in: {:.2f} seconds ---".format(time.time() - start_time))

    def infer(self, query: str, max_new_tokens: int = 256, temperature: float = 0.9, top_p: float = 0.92, top_k: int = 0, repetition_penalty: float = 1.0, **kwargs) -> str:
        print("*********** QUERY:\n{}".format(query))
        print("\n*********** PARAMETERS")
        print("\t- Max new tokens    : {}".format(max_new_tokens))
        print("\t- Temperature       : {}".format(temperature))
        print("\t- Top P             : {}".format(top_p))
        print("\t- Top K             : {}".format(top_k))
        print("\t- Repetition penalty: {}".format(repetition_penalty))
        start_time = time.time()
        do_sample = True
        # Tokenize the prompt and move it to the GPU
        input_ids = self.tokenizer(query, return_tensors='pt').input_ids.cuda()
        output = self.model.generate(inputs=input_ids,
                                     max_new_tokens=max_new_tokens,
                                     min_length=20,
                                     temperature=temperature,
                                     top_p=top_p,
                                     top_k=top_k,
                                     repetition_penalty=repetition_penalty,
                                     do_sample=do_sample,
                                     early_stopping=True,
                                     no_repeat_ngram_size=2,
                                     pad_token_id=self.tokenizer.eos_token_id)
        answer = self.tokenizer.decode(output[0])
        delay = time.time() - start_time
        print("Inference time: {:.2f} seconds = {} minute(s) ---\n\n".format(delay, int(delay / 60)))
        return answer
llama2_7b_gptq = Llama2_7B_gptq()
The llama2_7b_gptq instance of the class is now loaded into the notebook's memory and can be used to generate new tokens.
prompt = "Tell me about Watson"
prompt_template=f'''{prompt}
'''
print("\n\n*** Generate:")
print(llama2_7b_gptq.infer(prompt_template))
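The sampling parameters defined in infer() can also be overridden on each call; the values below are only illustrative:
print(llama2_7b_gptq.infer(prompt_template, max_new_tokens=128, temperature=0.7, repetition_penalty=1.1))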
If all goes well, you'll see your result after your prompt.
<s> Tell me about Watson
I read the article, which described how Watson was developed. But it didn't say how it works. What is it that Watson really is?
Watson is a statistical analysis system that includes both software and services (the ability to perform analysis, store the results, and access the information). The core of the system is its analysis engine, or cognitive software. This part of Watson—the core brain—is what makes Watson able to learn and to teach. The system's ability of _learning_ and _teaching_ comes from Watson's foundation, its computational engine and its core software stack.
[...]
Enjoy!
#ArtificialIntelligence(AI)
#watsonx
#cuda
#IBMChampions
#IBMChampion #ibmchampions-featured-library-home #ibmchampions-featured-library