Deploying LLMs such as LLAMA 3.1 or other transformer-based models requires significant GPU resources. Accurate estimation of GPU capacity is crucial to balance performance, cost, and scalability. This guide explores the variables and calculations needed to determine the GPU capacity requirements for deploying LLMs, with a detailed worked example using the LLAMA 3.1 70B Instruct model.
Model card: https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct
Key Factors Affecting GPU Capacity
1. Model Size (Number of Parameters)
- Definition: The total number of learnable parameters in the model.
- Impact: Larger models consume more memory and require more compute power.
2. Data Types and Precision
- Common Data Types:
- FP32 (32-bit floating point)
- FP16/BF16 (16-bit floating point)
- INT8 (8-bit integer)
- Impact: Lower precision reduces memory footprint and can accelerate computation.
3. Context Size (Sequence Length)
- Definition: The maximum length of input sequences the model can process.
- Impact: Longer sequences increase activation memory and computational complexity.
4. Batch Size
- Definition: The number of input samples processed simultaneously.
- Impact: Larger batch sizes improve throughput but require more memory.
5. Safetensors Format
- Definition: A serialization format for model weights that avoids unsafe pickle-based loading and supports memory-mapped access.
- Impact: Affects how tensors are loaded and stored, influencing load time and memory management.
6. CUDA Graphs
- Definition: A feature in CUDA that captures a sequence of operations to reduce CPU overhead.
- Impact: Improves performance but may introduce additional memory overhead.
Memory Requirements Calculation:
Calculating the memory requirements involves summing up the following memory components; a short code sketch combining them follows this list:
1. Model Parameters Memory Footprint
- Calculation: Memory = Number of Parameters × Size per Parameter
- Size per Parameter depends on the data type:
- FP32: 4 bytes
- FP16/BF16: 2 bytes
- INT8: 1 byte
2. Activation Memory
- Definition: Memory used to store intermediate outputs during forward and backward passes.
- Impact of Context Size: Longer sequences increase the activation memory linearly.
3. Workspace Memory
- Definition: Temporary memory for computations (e.g., for optimizer states, temporary buffers).
- Considerations:
- Varies based on the operations and libraries used.
- Can be optimized with memory-efficient implementations.
4. Safe Tensors Impact
- Definition: The safetensors format stores tensors safely (no arbitrary code execution on load) with minimal framing overhead.
- Impact:
- Has only a marginal effect on the overall memory footprint.
- Supports memory-mapped, shard-by-shard loading, which is useful in multi-GPU setups.
5. CUDA Graphs Overhead
- Definition: CUDA Graphs can improve performance by reducing CPU-GPU synchronization.
- Impact:
- Memory: Minimal overhead but requires enough memory to store the graph.
- Performance: Can significantly improve throughput.
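To illustrate how these components add up, here is a minimal sketch of an estimator that combines parameter memory, a KV cache estimate, and a flat overhead fraction. The function name and the 5% default overhead are assumptions for illustration, not measured values.

# Rough GPU memory estimator for LLM inference (illustrative sketch only).
def estimate_serving_memory_gb(num_params, bytes_per_param, kv_cache_gb, overhead_fraction=0.05):
    # Model weights: parameter count times bytes per parameter (FP32=4, FP16/BF16=2, INT8=1)
    weights_gb = num_params * bytes_per_param / 1e9
    # KV cache is supplied directly (measured or estimated for the target context length)
    subtotal_gb = weights_gb + kv_cache_gb
    # Flat fraction for activations, workspace buffers, and CUDA graph capture (assumed)
    return subtotal_gb * (1 + overhead_fraction)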
Compute Requirements:
1. GPU Compute Capability
- Definition: The ability of a GPU to perform computations, determined by its architecture and specifications (e.g., CUDA cores, tensor cores).
- Considerations:
- Throughput: Number of operations per second.
- Memory Bandwidth: Affects how quickly data can be read/written.
2. Throughput and Latency Considerations
- Batch Size and Latency:
- Larger batch sizes improve throughput but increase latency.
- Context Size and Compute:
- Longer sequences (context length) increase computational complexity quadratically in self-attention layers.
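As a rough illustration of the quadratic term, the sketch below counts only the attention-score and attention-times-value matrix multiplications; the hidden size and layer count are placeholder assumptions, and optimized kernels (e.g., FlashAttention) change the constants but not the quadratic growth.

# Illustrative self-attention FLOPs per forward pass at batch size 1 (sketch, not a profiler)
def attention_flops(seq_len, d_model=8192, n_layers=80):
    # QK^T and softmax(QK^T)V each cost roughly 2 * seq_len^2 * d_model FLOPs per layer
    return 4 * (seq_len ** 2) * d_model * n_layers

# Doubling the context length roughly quadruples the attention compute
print(attention_flops(8192) / attention_flops(4096))  # ~4.0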
Practical Optimization Techniques
1. Memory Optimization
- Mixed Precision Training:
- Uses lower precision (FP16/BF16) to reduce memory and accelerate computation.
2. Efficient Inference Techniques
- Quantization:
- Reduces model size by representing weights with lower precision (e.g., INT8 or AWQ INT4); a footprint comparison is sketched below.
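To see how quantization changes the weight footprint, here is a quick sketch applying the bytes-per-parameter values from above to a 70.6B-parameter model; the INT4 figure is approximated as 0.5 bytes per weight and ignores quantization scales and metadata.

# Approximate weight memory of a 70.6B-parameter model at different precisions
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "awq-int4": 0.5}
num_params = 70.6e9
for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype}: {num_params * nbytes / 1e9:.1f} GB")
# fp32: 282.4 GB, fp16/bf16: 141.2 GB, int8: 70.6 GB, awq-int4: 35.3 GB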
Calculation
LLAMA 3.1 70B Instruct Model
Scenario: Deploying the LLAMA 3.1 70B model with the following specifications:
- Number of Parameters: 70.6 billion
- Data Type: BF16/FP16 (2 bytes per parameter)
- Context Length: 128k tokens
- Additional Requirements: Space for KV cache, context window, and CUDA graphs
1. Calculate Model Parameters Memory Footprint
- Size per Parameter: 2 bytes (BF16/FP16)
- Total Memory for Model Parameters: Memory = 70.6 × 10⁹ × 2 bytes ≈ 141.2 GB
- Explanation: Each of the 70.6 billion parameters occupies 2 bytes, resulting in approximately 141.2 GB of memory required to load the model parameters.
2. Calculate KV Cache Memory
The Key-Value (KV) cache is used during inference to store intermediate activations, especially important for models with self-attention mechanisms over long contexts.
- Baseline KV Cache Memory for 32k Context: Approximately 14 GB
- Scaling to 128k Context:
- Since 128k is four times 32k, the KV cache memory scales linearly.
- Memory = 14 GB × 4 = 56 GB
Explanation: The KV cache memory increases linearly with the context length. For a context length of 128k tokens, the KV cache requires approximately 56 GB of memory.
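In code, this linear scaling from the 14 GB baseline at 32k tokens looks like the following sketch (the baseline figure is the estimate used above, not a measured value):

# KV cache scales roughly linearly with context length (sketch based on the 32k baseline above)
def kv_cache_gb(context_tokens, baseline_gb=14, baseline_tokens=32_000):
    return baseline_gb * context_tokens / baseline_tokens

print(kv_cache_gb(128_000))  # 56.0 GB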
3. Additional Memory for CUDA Graphs and Overheads
- CUDA Graphs Memory Overhead: Generally minimal but should be accounted for.
- Other Overheads: Memory for activations, workspace, and any additional buffers.
Assuming an estimated overhead of 5% of the total memory so far:
Subtotal = 141.2 GB + 56 GB = 197.2 GB
Memory overhead = 0.05 × 197.2 GB = 9.86 GB
4. Final Memory Requirement
- Total Memory Required: Total Memory = 197.2 GB + 9.86 GB ≈ 207 GB
Explanation: Adding the overheads to the initial memory gives us a total memory requirement of approximately 207 GB.
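Plugging the numbers from steps 1 and 2 into the estimator sketched earlier reproduces this figure:

# 70.6B parameters at 2 bytes each, 56 GB KV cache, 5% overhead
print(estimate_serving_memory_gb(70.6e9, 2, 56))  # ≈ 207 GB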
5. Compute Requirements
- GPU Memory: Requires a GPU (or combination of GPUs) with at least 210 GB of memory to accommodate the model parameters, KV cache, and overheads.
- GPU Compute Capability: The GPU should support BF16/FP16 precision and have sufficient compute power to handle the large context size.
6. Practical Considerations
- Multi-GPU Setup: Since a single GPU with 210 GB of memory is not commonly available, a multi-GPU setup using model parallelism is necessary.
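A quick sanity check on how the ~207 GB maps onto individual GPUs (assuming roughly even sharding via tensor parallelism; in practice each GPU also needs headroom for its own activations and buffers):

import math

total_gb = 207
gpu_memory_gb = 40           # e.g., NVIDIA A100 40 GB
min_gpus = math.ceil(total_gb / gpu_memory_gb)
per_gpu_on_8 = total_gb / 8  # per-GPU share on an 8-GPU node
print(min_gpus, round(per_gpu_on_8, 1))  # 6 GPUs minimum, ~25.9 GB per GPU when using 8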
7. AWS Instance Selection
We need to select an instance that provides more than 210 GB of GPU memory. The p4d.24xlarge instance is a good fit for deploying the BF16 weights of the Meta LLAMA 3.1 70B Instruct model: it has 8 NVIDIA A100 GPUs (40 GB each), for a total of 320 GB of GPU memory.

Sample code to deploy the model on Amazon SageMaker from the Hugging Face Hub:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'meta-llama/Meta-Llama-3-70B-Instruct',
    'SM_NUM_GPUS': json.dumps(8),  # shard the model across all 8 A100 GPUs
    'HUGGING_FACE_HUB_TOKEN': '<REPLACE WITH YOUR TOKEN>'
}
assert hub['HUGGING_FACE_HUB_TOKEN'] != '<REPLACE WITH YOUR TOKEN>', "You have to provide a token."

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="2.2.0"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="p4d.24xlarge",
    container_startup_health_check_timeout=2100,  # large models need a long startup window
)

# send request
predictor.predict({
    "inputs": "My name is Clara and I am",
})
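If needed, the same predictor can take TGI-style generation parameters in the payload, and the endpoint should be deleted when you are done to stop incurring charges. A minimal sketch using the predictor object created above:

# optional: pass generation parameters with the request
predictor.predict({
    "inputs": "My name is Clara and I am",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7},
})

# clean up to avoid ongoing instance charges
predictor.delete_model()
predictor.delete_endpoint()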