Deploying LLMs such as LLAMA 3.1 or other transformer-based models requires significant GPU resources. Accurate estimation of GPU capacity is crucial to balance performance, cost, and scalability. This guide explores the variables and calculations needed to determine the GPU capacity requirements for deploying LLMs, with a detailed worked example using the LLAMA 3.1 70B Instruct model.
Model card: https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct
Key Factors Affecting GPU Capacity
1. Model Size (Number of Parameters)
- Definition: The total number of learnable parameters in the model.
- Impact: Larger models consume more memory and require more compute power.
2. Data Types and Precision
- Common Data Types:
- FP32 (32-bit floating point)
- FP16/BF16 (16-bit floating point)
- INT8 (8-bit integer)
- Impact: Lower precision reduces memory footprint and can accelerate computation.
3. Context Size (Sequence Length)
- Definition: The maximum length of input sequences the model can process.
- Impact: Longer sequences increase activation memory and computational complexity.
4. Batch Size
- Definition: The number of input samples processed simultaneously.
- Impact: Larger batch sizes improve throughput but require more memory.
5. Safetensors Format
- Definition: A serialization format for model weights that avoids unsafe pickle-based loading and supports memory-mapped access.
- Impact: Affects how tensors are loaded and stored, influencing load time and memory management.
6. CUDA Graphs
- Definition: A feature in CUDA that captures a sequence of operations to reduce CPU overhead.
- Impact: Improves performance but may introduce additional memory overhead.
Memory Requirements Calculation:
Calculating the memory requirements involves summing up the following memory components; a short code sketch combining them follows this list:
1. Model Parameters Memory Footprint
- Calculation: Memory = Number of Parameters × Size per Parameter
- Size per Parameter depends on the data type:
- FP32: 4 bytes
- FP16/BF16: 2 bytes
- INT8: 1 byte
2. Activation Memory
- Definition: Memory used to store intermediate outputs during forward and backward passes.
- Impact of Context Size: Longer sequences increase the activation memory linearly.
3. Workspace Memory
- Definition: Temporary memory for computations (e.g., for optimizer states, temporary buffers).
- Considerations:
- Varies based on the operations and libraries used.
- Can be optimized with memory-efficient implementations.
4. Safe Tensors Impact
- Definition: The safetensors format stores tensors safely (no arbitrary code execution on load) with minimal framing overhead.
- Impact:
- Has only a marginal effect on the overall memory footprint.
- Supports memory-mapped, shard-by-shard loading, which is useful in multi-GPU setups.
5. CUDA Graphs Overhead
- Definition: CUDA Graphs can improve performance by reducing CPU-GPU synchronization.
- Impact:
- Memory: Minimal overhead but requires enough memory to store the graph.
- Performance: Can significantly improve throughput.
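To illustrate how these components add up, here is a minimal sketch of an estimator that combines parameter memory, a KV cache estimate, and a flat overhead fraction. The function name and the 5% default overhead are assumptions for illustration, not measured values.

# Rough GPU memory estimator for LLM inference (illustrative sketch only).
def estimate_serving_memory_gb(num_params, bytes_per_param, kv_cache_gb, overhead_fraction=0.05):
    # Model weights: parameter count times bytes per parameter (FP32=4, FP16/BF16=2, INT8=1)
    weights_gb = num_params * bytes_per_param / 1e9
    # KV cache is supplied directly (measured or estimated for the target context length)
    subtotal_gb = weights_gb + kv_cache_gb
    # Flat fraction for activations, workspace buffers, and CUDA graph capture (assumed)
    return subtotal_gb * (1 + overhead_fraction)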
Compute Requirements:
1. GPU Compute Capability
- Definition: The ability of a GPU to perform computations, determined by its architecture and specifications (e.g., CUDA cores, tensor cores).
- Considerations:
- Throughput: Number of operations per second.
- Memory Bandwidth: Affects how quickly data can be read/written.
2. Throughput and Latency Considerations
- Batch Size and Latency:
- Larger batch sizes improve throughput but increase latency.
- Context Size and Compute:
- Longer sequences (context length) increase computational complexity quadratically in self-attention layers.
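As a rough illustration of the quadratic term, the sketch below counts only the attention-score and attention-times-value matrix multiplications; the hidden size and layer count are placeholder assumptions, and optimized kernels (e.g., FlashAttention) change the constants but not the quadratic growth.

# Illustrative self-attention FLOPs per forward pass at batch size 1 (sketch, not a profiler)
def attention_flops(seq_len, d_model=8192, n_layers=80):
    # QK^T and softmax(QK^T)V each cost roughly 2 * seq_len^2 * d_model FLOPs per layer
    return 4 * (seq_len ** 2) * d_model * n_layers

# Doubling the context length roughly quadruples the attention compute
print(attention_flops(8192) / attention_flops(4096))  # ~4.0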
Practical Optimization Techniques
1. Memory Optimization
- Mixed Precision Training:
- Uses lower precision (FP16/BF16) to reduce memory and accelerate computation.
2. Efficient Inference Techniques
- Quantization:
- Reduces model size by representing weights with lower precision (e.g., INT8 or AWQ INT4); a footprint comparison is sketched below.
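To see how quantization changes the weight footprint, here is a quick sketch applying the bytes-per-parameter values from above to a 70.6B-parameter model; the INT4 figure is approximated as 0.5 bytes per weight and ignores quantization scales and metadata.

# Approximate weight memory of a 70.6B-parameter model at different precisions
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "awq-int4": 0.5}
num_params = 70.6e9
for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype}: {num_params * nbytes / 1e9:.1f} GB")
# fp32: 282.4 GB, fp16/bf16: 141.2 GB, int8: 70.6 GB, awq-int4: 35.3 GB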
Calculation
LLAMA 3.1 70B Instruct Model
Scenario: Deploying the LLAMA 3.1 70B model with the following specifications:
- Number of Parameters: 70.6 billion
- Data Type: BF16/FP16 (2 bytes per parameter)
- Context Length: 128k tokens
- Additional Requirements: Space for KV cache, context window, and CUDA graphs
1. Calculate Model Parameters Memory Footprint
- Size per Parameter: 2 bytes (BF16/FP16)
- Total Memory for Model Parameters: Memory = 70.6 × 10⁹ × 2 bytes ≈ 141.2 GB
- Explanation: Each of the 70.6 billion parameters occupies 2 bytes, resulting in approximately 141.2 GB of memory required to load the model parameters.
2. Calculate KV Cache Memory
The Key-Value (KV) cache is used during inference to store intermediate activations, especially important for models with self-attention mechanisms over long contexts.
- Baseline KV Cache Memory for 32k Context: Approximately 14 GB
- Scaling to 128k Context:
- Since 128k is four times 32k, the KV cache memory scales linearly.
- Memory = 14 GB × 4 = 56 GB
Explanation: The KV cache memory increases linearly with the context length. For a context length of 128k tokens, the KV cache requires approximately 56 GB of memory.
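In code, this linear scaling from the 14 GB baseline at 32k tokens looks like the following sketch (the baseline figure is the estimate used above, not a measured value):

# KV cache scales roughly linearly with context length (sketch based on the 32k baseline above)
def kv_cache_gb(context_tokens, baseline_gb=14, baseline_tokens=32_000):
    return baseline_gb * context_tokens / baseline_tokens

print(kv_cache_gb(128_000))  # 56.0 GB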
3. Additional Memory for CUDA Graphs and Overheads
- CUDA Graphs Memory Overhead: Generally minimal but should be accounted for.
- Other Overheads: Memory for activations, workspace, and any additional buffers.
Assuming an estimated overhead of 5% of the total memory so far:
Subtotal = 141.2 GB + 56 GB = 197.2 GB
Memory overhead = 0.05 × 197.2 GB = 9.86 GB
4. Final Memory Requirement
- Total Memory Required: Total Memory = 197.2 GB + 9.86 GB ≈ 207 GB
Explanation: Adding the overheads to the initial memory gives us a total memory requirement of approximately 207 GB.
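Plugging the numbers from steps 1 and 2 into the estimator sketched earlier reproduces this figure:

# 70.6B parameters at 2 bytes each, 56 GB KV cache, 5% overhead
print(estimate_serving_memory_gb(70.6e9, 2, 56))  # ≈ 207 GB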
5. Compute Requirements
- GPU Memory: Requires a GPU (or combination of GPUs) with at least 210 GB of memory to accommodate the model parameters, KV cache, and overheads.
- GPU Compute Capability: The GPU should support BF16/FP16 precision and have sufficient compute power to handle the large context size.
6. Practical Considerations
- Multi-GPU Setup: Since a single GPU with 210 GB of memory is not commonly available, a multi-GPU setup using model parallelism is necessary.
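A quick sanity check on how the ~207 GB maps onto individual GPUs (assuming roughly even sharding via tensor parallelism; in practice each GPU also needs headroom for its own activations and buffers):

import math

total_gb = 207
gpu_memory_gb = 40           # e.g., NVIDIA A100 40 GB
min_gpus = math.ceil(total_gb / gpu_memory_gb)
per_gpu_on_8 = total_gb / 8  # per-GPU share on an 8-GPU node
print(min_gpus, round(per_gpu_on_8, 1))  # 6 GPUs minimum, ~25.9 GB per GPU when using 8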
7. AWS Instance Selection
We need to select an instance that provides more than 210 GB of GPU memory. The p4d.24xlarge instance is a good fit for deploying the BF16 weights of the Meta LLAMA 3.1 70B Instruct model: it has 8 NVIDIA A100 GPUs (40 GB each), for a total of 320 GB of GPU memory.

Sample code to deploy the model on Amazon SageMaker from the Hugging Face Hub:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'meta-llama/Meta-Llama-3-70B-Instruct',
    'SM_NUM_GPUS': json.dumps(8),  # shard the model across all 8 A100 GPUs
    'HUGGING_FACE_HUB_TOKEN': '<REPLACE WITH YOUR TOKEN>'
}
assert hub['HUGGING_FACE_HUB_TOKEN'] != '<REPLACE WITH YOUR TOKEN>', "You have to provide a token."

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="2.2.0"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="p4d.24xlarge",
    container_startup_health_check_timeout=2100,  # large models need a long startup window
)

# send request
predictor.predict({
    "inputs": "My name is Clara and I am",
})
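If needed, the same predictor can take TGI-style generation parameters in the payload, and the endpoint should be deleted when you are done to stop incurring charges. A minimal sketch using the predictor object created above:

# optional: pass generation parameters with the request
predictor.predict({
    "inputs": "My name is Clara and I am",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7},
})

# clean up to avoid ongoing instance charges
predictor.delete_model()
predictor.delete_endpoint()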