Introduction
In the era of rapid artificial intelligence development, large language models (LLMs) are changing how we process and generate language. A recent study ran over 4,000 experiments using up to 512 GPUs to investigate efficient training practices. The study focused on two key metrics: throughput (tokens processed per second) and GPU utilization. Both metrics were normalized by model size to compare performance across hardware configurations.
Three Major Challenges in AI Model Training
Memory Usage: Limited Capacity
Memory is a hard constraint during training. If a training step requires more memory than a GPU can provide, the training cannot proceed. Managing complex computations within limited GPU memory is essential.
Compute Efficiency: Keeping Hardware Busy
High-performance GPUs are costly, and ensuring they spend most of their time on computation rather than waiting for data or other GPUs is critical. Idle time caused by data transfers or synchronization reduces overall throughput. The goal is to minimize these wait times and maximize compute utilization.
Communication Overhead: Reducing Internal Overhead
When multiple GPUs collaborate, frequent information exchange can consume time and resources and leave GPUs idle. Efficiently using intra-node (fast) and inter-node (slower) bandwidth, and overlapping communication with computation, reduces communication overhead.
Starting from a Single GPU
Before scaling to thousands of GPUs, begin with single-GPU training. This foundational step is essential for understanding the training process and debugging performance and memory issues.
Three Basic Steps in Single-GPU Training
- Forward pass: Feed inputs through the model to obtain outputs.
- Backward pass: Compute gradients to identify how the model should change.
- Optimization step: Update model parameters using the computed gradients.
Key Hyperparameter: Batch Size
Batch size determines how much data is fed to the model each step and strongly affects training dynamics and efficiency.
Small Batch Sizes
Small batches can help the model discover useful directions early, but gradients tend to be noisy, which may hinder final performance if kept too small.
Large Batch Sizes
Large batches yield more accurate gradient estimates but may reduce sample efficiency, increase training time per update, and waste compute if not carefully managed.
Batch Size and Training Time
Smaller batches require more optimization steps to process the same amount of data, and each step carries fixed overhead, so total training time increases. In LLM pretraining, batch size is commonly measured in tokens rather than examples (batch size in tokens = number of sequences × sequence length), which makes it independent of the input sequence length.
What to Do When Memory Is Insufficient
Scaling up the batch size will eventually hit GPU memory limits. To address this, first understand what occupies memory during training:
- Model weights
- Model gradients
- Optimizer states
- Activations used for gradient computation
Memory Dynamics During Training
Memory usage fluctuates during a training step: the forward pass builds up activations, the backward pass accumulates gradients while activations are gradually freed, and the optimization step updates parameters and optimizer states. Note that the first step is not representative: PyTorch's caching allocator warms up, and optimizer states are only materialized after the first optimizer step, so a run that survives step one can still run out of memory on later steps.
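To observe these dynamics directly, you can read PyTorch's built-in memory counters at each phase of a step. A minimal sketch, where `model`, `optimizer`, `loss_fn`, `batch`, and `targets` are placeholders for your own objects:

```python
import torch

def report(tag):
    # Currently allocated vs. peak allocated memory on the default GPU, in GiB.
    allocated = torch.cuda.memory_allocated() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"{tag:<22} allocated={allocated:6.2f} GiB  peak={peak:6.2f} GiB")

def train_step(model, optimizer, loss_fn, batch, targets):
    report("before forward")
    loss = loss_fn(model(batch), targets)   # activations accumulate here
    report("after forward")
    loss.backward()                         # gradients allocated, activations freed
    report("after backward")
    optimizer.step()                        # optimizer state materialized on step 1
    optimizer.zero_grad(set_to_none=True)
    report("after optimizer step")
```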
Estimating Model Memory Requirements
To manage memory, estimate the memory required by weights, gradients, and optimizer states. For a Transformer-style model, parameter count depends on hidden dimension h, vocabulary size v, and number of layers L. The h^2 term grows rapidly as hidden dimension increases and becomes a dominant factor.
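As a back-of-the-envelope estimate for a GPT-style decoder (attention plus a 4h feedforward block, with biases and LayerNorms, and tied embeddings), the parameter count is often approximated as below; exact counts vary with architectural details, so treat this as a rough sketch:

```python
def approx_param_count(h, v, L):
    """Rough parameter count for a GPT-style Transformer.

    Per layer: ~12*h**2 weight parameters (attention Q/K/V/output projections
    plus a 4h feedforward block) and ~13*h bias/LayerNorm parameters;
    plus v*h token embeddings and a final LayerNorm.
    """
    return v * h + L * (12 * h ** 2 + 13 * h) + 2 * h

# Example: h=4096, v=32000, L=32 gives roughly 6.6 billion parameters.
print(approx_param_count(h=4096, v=32000, L=32) / 1e9)
```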
In full precision (FP32) training, each parameter and each gradient typically occupies 4 bytes. Optimizers like Adam keep additional momentum and variance buffers, adding roughly 8 bytes per parameter of optimizer state. In mixed precision training (for example BF16 compute with an FP32 master copy), weights and gradients are stored in 2-byte BF16 while an FP32 copy (4 bytes) of the weights is kept for stability; many setups also accumulate gradients in FP32, adding another 4 bytes per parameter. Typical memory components then include:
- Model weights (BF16)
- Gradients (BF16)
- FP32 master copy of weights
- FP32 gradient accumulation copy (in some setups, as assumed in the table below)
- Optimizer state (FP32 momentum and variance)
| Model Parameter Count | Full Precision (FP32) | Mixed Precision (BF16 + FP32 master) |
|---|---|---|
| 1 billion parameters | 16 GB | 20 GB |
| 7 billion parameters | 112 GB | 140 GB |
| 70 billion parameters | 1120 GB | 1400 GB |
| 405 billion parameters | 6480 GB | 8100 GB |
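The table corresponds to 16 bytes per parameter for FP32 training and 20 bytes per parameter for the mixed precision setup above (including the FP32 gradient copy), ignoring activations. A small sketch to reproduce the numbers:

```python
GB = 1e9  # the table uses decimal gigabytes

def training_memory_gb(n_params, mixed_precision=True):
    """Weights + gradients + Adam optimizer state, excluding activations.

    FP32:  4 (weights) + 4 (grads) + 8 (Adam momentum + variance)      = 16 B/param
    Mixed: 2 (BF16 weights) + 2 (BF16 grads) + 4 (FP32 master weights)
           + 4 (FP32 grads) + 8 (Adam momentum + variance)             = 20 B/param
    """
    bytes_per_param = 20 if mixed_precision else 16
    return n_params * bytes_per_param / GB

for n in (1e9, 7e9, 70e9, 405e9):
    print(f"{n / 1e9:>5.0f}B params: FP32 ~{training_memory_gb(n, False):.0f} GB, "
          f"mixed ~{training_memory_gb(n, True):.0f} GB")
```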
Activation Memory: A Major Consumer
Activation memory depends on model structure, input sequence length, and batch size. It often dictates whether a model can be trained on available hardware. Analysis of the Transformer backward pass shows activation memory scales with number of layers L, sequence length seq, batch size bs, hidden dimension h, and number of attention heads n_heads.
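A commonly used estimate (from the Megatron-LM activation recomputation analysis, assuming half-precision activations, a standard attention implementation, and no recomputation) is roughly the following; treat it as an upper-bound sketch, since fused or flash attention kernels store far less than the attention term suggests:

```python
def activation_memory_bytes(L, seq, bs, h, n_heads):
    """Approximate activation memory for a Transformer in BF16/FP16,
    without recomputation: seq * bs * h * (34 + 5 * n_heads * seq / h)
    bytes per layer, times L layers."""
    per_layer = seq * bs * h * (34 + 5 * n_heads * seq / h)
    return L * per_layer

# Example: a 7B-class model (L=32, h=4096, n_heads=32) at seq=4096, bs=1
# needs on the order of 100 GB of activations by this estimate.
print(activation_memory_bytes(L=32, seq=4096, bs=1, h=4096, n_heads=32) / 1e9)
```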
Activation Recomputation: Trading Compute for Memory
Activation recomputation stores only key checkpoints and discards other activations, recomputing them as needed during the backward pass. This trades additional compute for reduced memory use. Common strategies include:
- Full recomputation: Recompute every layer's forward pass during backward. This maximizes memory savings but increases compute time by roughly 30-40%.
- Selective recomputation: Recompute only memory-heavy parts such as attention activations while keeping feedforward activations. For example, selective recomputation on a 175B-parameter GPT-3-like model can cut activation memory by about 70% while increasing compute cost by only a few percent.
Gradient Accumulation: Simulating Large Batches with Micro-Batches
Gradient accumulation splits a large global batch into multiple micro-batches. Each micro-batch performs forward and backward passes, and gradients are accumulated. After several micro-batches, the accumulated gradients are used to update parameters, allowing simulation of a larger batch without increasing per-step memory.
Define micro batch size (MBS) as the batch size processed per forward/backward pass and global batch size (GBS) as the total batch size between optimizer steps. If you accumulate over k micro-batches, then GBS = MBS × k. While gradient accumulation reduces memory use, the micro-batch forward/backward passes run sequentially, increasing the wall-clock time per optimizer step. Because the micro-batch computations are independent of one another, however, they can be parallelized across GPUs.
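A minimal PyTorch sketch of gradient accumulation, where `model`, `optimizer`, `loss_fn`, and `micro_batches` are placeholders for your own objects:

```python
accum_steps = 4  # k micro-batches per optimizer step, so GBS = MBS * accum_steps

optimizer.zero_grad(set_to_none=True)
for step, (inputs, targets) in enumerate(micro_batches):
    outputs = model(inputs)
    # Divide by k so the accumulated gradient equals the gradient of the
    # mean loss over the full global batch.
    loss = loss_fn(outputs, targets) / accum_steps
    loss.backward()  # gradients are summed into .grad across micro-batches

    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```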
Data Parallelism: Scaling with Multiple GPUs
In data parallelism, each GPU processes an independent micro-batch and computes gradients for its own model replica. To keep the replicas consistent, gradients are synchronized via an all-reduce operation, the first fundamental distributed communication primitive we encounter: it sums the gradients across all GPUs and distributes the result back to every replica.
Three Optimizations for Data Parallelism
- Overlap computation and communication: Instead of waiting for the entire backward pass to finish before synchronizing gradients, begin synchronizing gradients of later layers immediately as they become available. In PyTorch, adding an all-reduce hook for each parameter allows immediate reduction when that parameter's gradient is ready, thereby overlapping communication with ongoing backward computation.
- Gradient bucketing: Group small gradients into larger tensors before performing all-reduce. GPUs and network stacks are often more efficient with fewer large transfers than many small transfers, so bucketing reduces communication overhead.
- Combine with gradient accumulation: Delay all-reduce until after accumulating gradients over micro-batches when appropriate, reducing the number of synchronization events and saving communication bandwidth. A sketch combining these ideas follows this list.
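A minimal sketch using PyTorch's DistributedDataParallel, which already implements overlapped, bucketed all-reduce; gradient synchronization is skipped on all but the last micro-batch via no_sync(). It assumes the process group has been initialized (for example via torchrun) and that `model`, `optimizer`, `loss_fn`, and `loader` are your own objects:

```python
import contextlib

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(
    model.cuda(),
    bucket_cap_mb=25,  # gradient bucketing: fewer, larger all-reduce calls
)                      # DDP overlaps these all-reduces with the backward pass

accum_steps = 4
optimizer.zero_grad(set_to_none=True)
for step, (inputs, targets) in enumerate(loader):
    sync = (step + 1) % accum_steps == 0
    # Skip gradient synchronization on all but the last micro-batch.
    context = contextlib.nullcontext() if sync else ddp_model.no_sync()
    with context:
        loss = loss_fn(ddp_model(inputs.cuda()), targets.cuda()) / accum_steps
        loss.backward()
    if sync:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```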
Combining Data Parallelism and Gradient Accumulation
Global batch size (GBS) is a key hyperparameter affecting convergence and efficiency. It can be expressed as:
GBS = MBS × GA × DP
where MBS is the micro batch size, GA is the number of gradient accumulation steps, and DP is the number of data parallel replicas. In practice, prefer adding data parallel replicas over gradient accumulation steps, since replicas process their micro-batches concurrently while accumulation processes them sequentially. Increase gradient accumulation only when additional data parallel replicas are unavailable.
Determining Training Configuration
- Set target GBS: Determine the desired global batch size in tokens based on literature or experiments.
- Choose sequence length: Typical choices are 2k–8k tokens depending on the task and available compute.
- Find max MBS per GPU: Increase micro batch size until a single GPU runs out of memory.
- Decide GPU count: Based on the desired number of data parallel replicas, compute the required gradient accumulation steps as GA = GBS / (DP × MBS); a helper for this calculation is sketched below.
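Putting these steps together, a small helper (names and numbers are illustrative) that derives the gradient accumulation steps from a token-based global batch size target:

```python
def accumulation_steps(gbs_tokens, seq_len, mbs, dp):
    """Gradient accumulation steps needed to reach a target global batch size.

    gbs_tokens: target global batch size, in tokens
    seq_len:    sequence length, in tokens
    mbs:        micro batch size (sequences per GPU per forward/backward pass)
    dp:         number of data parallel replicas
    """
    gbs_sequences = gbs_tokens / seq_len
    ga = gbs_sequences / (mbs * dp)
    assert ga >= 1 and ga == int(ga), "adjust GBS, MBS, or DP so GA is a whole number"
    return int(ga)

# Example: a 4M-token global batch at 4k sequence length with MBS=2 on 128 GPUs
# requires GA=4.
print(accumulation_steps(gbs_tokens=4 * 2**20, seq_len=4096, mbs=2, dp=128))
```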
ZeRO: Memory-Efficient Optimizer Sharding
DeepSpeed's ZeRO reduces memory redundancy by sharding optimizer state, gradients, and parameters across data parallel replicas while still allowing full-model training. ZeRO has three stages:
ZeRO-1: Optimizer State Partitioning
ZeRO-1 shards the optimizer state across data parallel replicas so that each replica stores only 1/N_d of it (where N_d is the number of data parallel replicas) and updates only its portion during the optimization step, reducing memory duplication.
ZeRO-2: Add Gradient Partitioning
ZeRO-2 also shards gradients, enabling reduce-scatter in the backward pass instead of all-reduce. Each replica keeps only 1/N_d of the gradients, saving more memory.
ZeRO-3: Add Parameter Partitioning
ZeRO-3 (also known as Fully Sharded Data Parallel, FSDP) additionally shards model parameters. Each replica gathers required parameter shards on demand and releases them after use, minimizing memory usage further.
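Using the standard ZeRO accounting (Ψ parameters; 2 bytes each for BF16 weights and gradients; k = 12 bytes per parameter of FP32 optimizer state for Adam, counting the master weights, momentum, and variance; N_d data parallel replicas), the per-GPU memory for model state is roughly as follows. This is a sketch of the paper's formulas, not a measurement:

```python
def zero_model_state_bytes_per_gpu(psi, n_d, stage, k=12):
    """Approximate per-GPU memory for weights + gradients + optimizer state
    under ZeRO, following the ZeRO paper's accounting (activations excluded):
    params = 2*psi (BF16), grads = 2*psi (BF16), optimizer state = k*psi (FP32).
    """
    params, grads, opt = 2 * psi, 2 * psi, k * psi
    if stage == 0:   # plain data parallelism: everything replicated
        return params + grads + opt
    if stage == 1:   # shard optimizer state
        return params + grads + opt / n_d
    if stage == 2:   # shard optimizer state + gradients
        return params + grads / n_d + opt / n_d
    if stage == 3:   # shard parameters as well (FSDP-style)
        return (params + grads + opt) / n_d
    raise ValueError(f"unknown ZeRO stage: {stage}")

# Example: a 7B-parameter model on 64 replicas.
for stage in (0, 1, 2, 3):
    gb = zero_model_state_bytes_per_gpu(7e9, n_d=64, stage=stage) / 1e9
    print(f"ZeRO-{stage}: ~{gb:.0f} GB per GPU")
```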
Tensor Parallelism: Splitting Layers to Overcome Memory Limits
When a single layer cannot fit on one GPU, tensor parallelism partitions model parameters, gradients, optimizer states, and activations across multiple GPUs without requiring full-parameter collection before compute.
Basic Principle of Tensor Parallelism
Tensor parallelism leverages matrix multiplication properties. For a matrix product X × W:
- X is input or activations
- W is layer weights
Partitioning can be done by columns or rows:
Column partitioning
Split W into column blocks; each GPU multiplies the full X by its block, and the partial outputs are concatenated (an all-gather in the distributed setting).
Row partitioning
Split W into row blocks and X into matching column blocks; each GPU multiplies its pair, and the partial results are summed (an all-reduce in the distributed setting).
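A small single-device PyTorch sketch of both schemes; in a real tensor parallel setup each chunk would live on a different GPU and the concatenation or summation would be a collective (all-gather or all-reduce):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)   # activations: (batch, hidden_in)
W = torch.randn(8, 6)   # weights:     (hidden_in, hidden_out)
reference = X @ W

# Column parallelism: split W along its output (column) dimension.
W_cols = W.chunk(2, dim=1)                            # one chunk per "GPU"
col_out = torch.cat([X @ w for w in W_cols], dim=1)   # concatenate partial outputs

# Row parallelism: split W along its input (row) dimension and X along its
# last dimension to match; the partial products are summed.
W_rows = W.chunk(2, dim=0)
X_cols = X.chunk(2, dim=1)
row_out = sum(x @ w for x, w in zip(X_cols, W_rows))

assert torch.allclose(col_out, reference, atol=1e-5)
assert torch.allclose(row_out, reference, atol=1e-5)
```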
Tensor Parallelism in Multi-Head Attention
Multi-head attention uses multiple matrix multiplications (Q, K, V). Tensor parallelism applies naturally:
- Column-parallel: Split Q, K, V columns across GPUs so each GPU computes outputs for one or more attention heads independently.
- Row-parallel: Split output projection rows across GPUs to reduce per-GPU memory for the projection.
Sequence Parallelism
Sequence parallelism splits activations and computation along the input sequence dimension rather than the hidden dimension. This is useful for operations that require the full hidden dimension, such as LayerNorm and Dropout. LayerNorm needs the full hidden dimension to compute mean and variance, so partitioning along sequence length distributes activation memory across GPUs and reduces per-GPU memory demand.
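Because LayerNorm normalizes each token over the hidden dimension independently of other tokens, activations can be split along the sequence dimension and normalized chunk by chunk with identical results. A minimal single-device illustration (the chunks would live on different GPUs in a real sequence parallel setup):

```python
import torch

torch.manual_seed(0)
bs, seq, h = 2, 8, 16
x = torch.randn(bs, seq, h)
norm = torch.nn.LayerNorm(h)

full = norm(x)  # normalize over the hidden dimension for every token

# Split along the sequence dimension, as sequence parallelism does across GPUs,
# and normalize each chunk independently; the result matches the full computation.
chunks = x.chunk(4, dim=1)
parallel = torch.cat([norm(c) for c in chunks], dim=1)

assert torch.allclose(full, parallel, atol=1e-6)
```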