Learn how to run Llama2 in Watsonx step by step
Llama2 from Meta AI is the second version of their open-source large language model, now available for free for research and commercial use. Llama2 is a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. But running big models is still challenging. To run it, we will need the large amount of resources provided by the Watsonx platform. To increase the challenge, we will install and run a quantized version (using GPTQ) of Llama2. The model we will use is TheBloke/Llama-2-7b-Chat-GPTQ.
| GPTQ option | Description |
| --- | --- |
| Act Order (desc_act) | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
AutoGPTQ is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm. GPTQ stands for Generative Post-Training Quantization and is used for generative pre-trained transformers. GPTQ is a one-shot weight quantization method based on approximate second-order information that is both highly accurate and highly efficient. It can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline (source: https://arxiv.org/pdf/2210.17323.pdf).
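To build intuition for what "reducing the bitwidth down to 3 or 4 bits per weight" buys us, here is a minimal toy sketch of symmetric 4-bit weight quantization in NumPy. It is only an illustration of the storage savings, not the GPTQ algorithm itself (which additionally uses approximate second-order information to minimize the quantization error); all names are illustrative.

```python
import numpy as np

# Toy illustration of 4-bit weight quantization.
# NOT the full GPTQ algorithm, which also uses second-order
# information to choose the quantized values more carefully.
rng = np.random.default_rng(0)
weights = rng.normal(size=1024).astype(np.float32)  # fp32 weights: 4 bytes each

# Symmetric per-tensor quantization to 4-bit signed integers in [-8, 7]
scale = np.abs(weights).max() / 7
q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)

# Dequantize (what inference kernels do on the fly)
deq = q.astype(np.float32) * scale

fp32_bytes = weights.nbytes          # 4096 bytes
int4_bytes = weights.size * 4 // 8   # 512 bytes when packed two per byte
print(f"fp32: {fp32_bytes} B, packed 4-bit: {int4_bytes} B "
      f"({fp32_bytes / int4_bytes:.0f}x smaller)")
print("max abs rounding error:", float(np.abs(weights - deq).max()))
```

Packed 4-bit storage is 8x smaller than fp32, and the rounding error per weight is bounded by half the quantization step (`scale / 2`), which is why accuracy degrades so little in practice.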
Due to their massive size, even the inference of large GPT models may require several high-performance GPUs, which limits the usability of these models. By using this quantization, we will be able to run Llama2 with less resource consumption.
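As a preview of where we are heading, the snippet below is a minimal sketch of loading the quantized model with AutoGPTQ's `AutoGPTQForCausalLM.from_quantized` and generating a reply. It assumes a CUDA GPU and the `auto-gptq` and `transformers` packages are installed (`pip install auto-gptq transformers`); the `generate` helper name and its parameters are our own illustration, not part of either library.

```python
# Hedged sketch: run TheBloke/Llama-2-7b-Chat-GPTQ with AutoGPTQ.
# Assumes: a CUDA GPU, plus `pip install auto-gptq transformers`.
def generate(prompt: str,
             model_id: str = "TheBloke/Llama-2-7b-Chat-GPTQ") -> str:
    """Download the quantized model (first call only) and generate text."""
    from transformers import AutoTokenizer
    from auto_gptq import AutoGPTQForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    # from_quantized loads the pre-quantized GPTQ weights directly,
    # so no GPTQ calibration pass is needed on our side.
    model = AutoGPTQForCausalLM.from_quantized(
        model_id, device="cuda:0", use_safetensors=True)

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
    output = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

A call such as `generate("What is Watsonx?")` would return the model's answer; the imports are deferred inside the function so the module loads even on machines without the GPU stack installed.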
First of all, you need to verify that your Watson Studio instance has a Professional plan. You can do this by clicking on Administration / Service instances.
Verify your Plan.