Learn how to run Llama2 in watsonx step by step

By Patrick Meyer posted Wed August 23, 2023 07:00 PM

Llama2 from Meta AI is the second version of their open-source large language model, now available for free for research and commercial use. Llama2 is a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. But running big models is still challenging. To run it, we will need a large amount of resources, provided by the watsonx platform. To increase the challenge, we will install and run a quantized version (using GPTQ) of Llama2. The model we will use is TheBloke/Llama-2-7b-Chat-GPTQ.

Branch: main
Bits: 4
Group Size: 128
Act Order (desc_act): False
File Size: 3.90 GB
ExLlama Compatible?: True
Made With: AutoGPTQ
Description: Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options.

AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm. GPTQ stands for Generative Post-Training Quantization and is used for generative pre-trained transformers. GPTQ is a one-shot weight quantization method based on approximate second-order information that is both highly accurate and highly efficient. GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline (source: https://arxiv.org/pdf/2210.17323.pdf).
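
For context, producing such a quantized model with AutoGPTQ looks roughly like the sketch below. You do not need to run it for this tutorial, since TheBloke already publishes the quantized weights; the base model name, output directory and single calibration sentence are illustrative only.

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "meta-llama/Llama-2-7b-chat-hf"  # illustrative: any causal LM you have access to
output_dir = "llama-2-7b-chat-gptq-4bit"      # illustrative output folder

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)

# A real quantization run uses a few hundred calibration samples; one is shown here
examples = [tokenizer("GPTQ is a one-shot weight quantization method.")]

# 4 bits, group size 128, no act-order: the same settings as the 'main' branch above
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)
model.quantize(examples)          # one-shot GPTQ quantization of the weights
model.save_quantized(output_dir)  # write the quantized checkpoint to disk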

Due to their massive size, even the inference of large GPT models may require several high-performance GPUs, which limits the usability of these models. By using this quantization, we will be able to run Llama2 with much lower resource consumption.
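
A rough back-of-the-envelope calculation shows why. The figures below are approximate and count the weights only, ignoring activations and runtime overhead.

# Approximate weight memory for a 7B-parameter model (weights only, illustrative)
params = 7e9
fp16_gb = params * 2 / 1024**3    # 2 bytes per weight in half precision -> ~13 GB
int4_gb = params * 0.5 / 1024**3  # 4 bits per weight -> ~3.3 GB before GPTQ overhead
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")

That is roughly consistent with the 3.90 GB file size of the quantized branch above, and it explains why the model fits on a single V100.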

First of all, you need to verify that your Watson Studio instance has a Professional plan. You can do this by clicking on Administration / Service instances.

Verify your Plan.

The Professional plan allows you to use GPUs. If that is not the case, click on the triple dots and select Upgrade,
then select the Professional plan:
As you can see, this plan gives access to NVIDIA V100 GPUs.
Now create a New task / Work with data and models in Python or R notebooks, then select a Python runtime that uses a GPU.
You arrive in the Notebook. In the first cell, write this Python code:
import torch

# Allocate a small tensor on the GPU and confirm that CUDA is available
print(torch.randn(1).cuda())
print(torch.cuda.is_available())
This verifies that CUDA is available. If is_available returns True, we need to find out where CUDA is installed. Due to a bug (?), CUDA_HOME is not defined and must be configured manually. Then copy and paste the following cells.
!echo $PATH
/opt/conda/envs/Python-RT23.1-CUDA/bin:/opt/conda/condabin:/opt/conda/bin:/usr/bin:/opt/ibm/dsdriver/bin

Retrieve the CUDA path and assign this value to CUDA_HOME.

import os
os.environ["CUDA_HOME"]="/opt/conda/envs/Python-RT23.1-CUDA/"

We need to install the required Python modules.

!pip install transformers
!pip install auto-gptq
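
If you want to confirm that both packages are visible from the notebook kernel, a quick optional check is:

from importlib.metadata import version

# Print the installed versions; restart the kernel if these lookups fail
print("transformers:", version("transformers"))
print("auto-gptq   :", version("auto-gptq"))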

Now let's write the Python class to generate the tokens.

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import time

class Llama2_7B_gptq:

    """
    Branch	                        Bits	Group Size	Act Order (desc_act)	File Size	ExLlama Compatible?	Made With	Description
    main	                        4	    128	        False	                3.90 GB	    True	            AutoGPTQ	Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options.
    gptq-4bit-32g-actorder_True	    4	    32	        True	                4.28 GB	    True	            AutoGPTQ	4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed.
    gptq-4bit-64g-actorder_True	    4	    64	        True	                4.02 GB	    True	            AutoGPTQ	4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed.
    gptq-4bit-128g-actorder_True	4	    128	        True	                3.90 GB	    True	            AutoGPTQ	4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed.
    """

    MODEL_NAME_OR_PATH  = "TheBloke/Llama-2-7B-GPTQ"
    REVISION            = "main" # See branch - gptq-4bit-128g-actorder_True
    MODEL_BASENAME      = "model"
    DEVICE              = "cuda:0"
    USE_TRITON          = False  # Works only on Linux today

    def __init__(self):
        start_time = time.time()
        self.tokenizer = AutoTokenizer.from_pretrained(self.MODEL_NAME_OR_PATH, use_fast=True)

        self.model = AutoGPTQForCausalLM.from_quantized(self.MODEL_NAME_OR_PATH,
                model_basename      = self.MODEL_BASENAME,
                revision            = self.REVISION,
                use_safetensors     = True,
                trust_remote_code   = True,
                device              = self.DEVICE,
                use_triton          = self.USE_TRITON,
                quantize_config     = None)
        print("Loading model in: {:.2f} seconds ---".format(time.time() - start_time))


    def infer(self, query:str, max_new_tokens:int=256, temperature:float=0.9, top_p: float=0.92, top_k:int=0, repetition_penalty:float=1.0, **kwargs) -> str:
            print("*********** QUERY:\n{}".format(query))
            print("\n*********** PARAMETERS")
            print("\t- Maw new tokens    : {}".format(max_new_tokens))
            print("\t- Temperature       : {}".format(temperature))
            print("\t- Top P             : {}".format(top_p))
            print("\t- Top K             : {}".format(top_k))
            print("\t- Repetition penalty: {}".format(repetition_penalty))
            start_time = time.time()
            do_sample = True
            input_ids = self.tokenizer(query, return_tensors='pt').input_ids.cuda()
            output = self.model.generate(inputs=input_ids, 
                                        max_new_tokens=max_new_tokens, 
                                        min_length=20,
                                        temperature=temperature, 
                                        top_p=top_p, 
                                        top_k=top_k, 
                                        repetition_penalty=repetition_penalty,
                                        do_sample=do_sample,
                                        early_stopping=True,
                                        no_repeat_ngram_size=2,
                                        pad_token_id=self.tokenizer.eos_token_id,)
            answer = self.tokenizer.decode(output[0])
            delay = time.time() - start_time
            print("Inference time: {:.2f} seconds = {} minute(s)---\n\n".format(delay, int(delay / 60)))
            return answer


llama2_7b_gptq = Llama2_7B_gptq()

The llama2_7b_gptq instance is now loaded into the notebook's memory and can be used to generate new tokens.

prompt = "Tell me about Watson"
prompt_template=f'''{prompt}
'''
print("\n\n*** Generate:")
print(llama2_7b_gptq.infer(prompt_template))

If all goes well, you'll see your result after your prompt.

<s> Tell me about Watson
 I read the article, which described how Watson was developed. But it didn't say how it works. What is it that Watson really is?
 Watson is a statistical analysis system that includes both software and services (the ability to perform analysis, store the results, and access the information). The core of the system is its analysis engine, or cognitive software. This part of Watson—the core brain—is what makes Watson able to learn and to teach. The system's ability of _learning_ and _teaching_ comes from Watson's foundation, its computational engine and its core software stack.
[...]
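
The infer() method exposes the usual sampling parameters, so you can experiment with them directly; the prompt and values below are just an illustration.

# Illustrative call with different sampling settings
answer = llama2_7b_gptq.infer(
    "Explain GPTQ quantization in one paragraph.",
    max_new_tokens=128,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
)
print(answer)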

Enjoy!

#ArtificialIntelligence(AI)

#watsonx

#cuda

#IBMChampions

