Introduction
In the era of rapid artificial intelligence development, large language models (LLMs) are changing how we process and generate language. A recent study ran over 4,000 experiments using up to 512 GPUs to investigate efficient training practices. The study focused on two key metrics: throughput (tokens processed per second) and GPU utilization. Both metrics were normalized by model size to compare performance across hardware configurations.
Three Major Challenges in AI Model Training
Memory Usage: Limited Capacity
Memory is a hard constraint during training. If a training step requires more memory than a GPU can provide, the training cannot proceed. Managing complex computations within limited GPU memory is essential.
Compute Efficiency: Keeping Hardware Busy
High-performance GPUs are costly, and ensuring they spend most of their time on computation rather than waiting for data or other GPUs is critical. Idle time caused by data transfers or synchronization reduces overall throughput. The goal is to minimize these wait times and maximize compute utilization.
Communication Overhead: Reducing Internal Overhead
When multiple GPUs collaborate, frequent information exchange can consume time and resources and leave GPUs idle. Efficiently using intra-node (fast) and inter-node (slower) bandwidth, and overlapping communication with computation, reduces communication overhead.
Starting from a Single GPU
Before scaling to thousands of GPUs, begin with single-GPU training. This foundational step is essential for understanding the training process and debugging performance and memory issues.
Three Basic Steps in Single-GPU Training
- Forward pass: Feed inputs through the model to obtain outputs.
- Backward pass: Compute gradients to identify how the model should change.
- Optimization step: Update model parameters using the computed gradients.
Key Hyperparameter: Batch Size
Batch size determines how much data is fed to the model each step and strongly affects training dynamics and efficiency.
Small Batch Sizes
Small batches can help the model discover useful directions early, but gradients tend to be noisy, which may hinder final performance if kept too small.
Large Batch Sizes
Large batches yield more accurate gradient estimates but may reduce sample efficiency, increase training time per update, and waste compute if not carefully managed.
Batch Size and Training Time
Smaller batches require more optimization steps to process the same amount of data, and each step carries fixed overhead, so total training time increases. In LLM pretraining, batch size is commonly measured in tokens rather than examples (batch size in tokens = number of sequences × sequence length), which makes it independent of the input sequence length.
What to Do When Memory Is Insufficient
Scaling up the batch size will eventually hit GPU memory limits. To address this, first understand what occupies memory during training:
- Model weights
- Model gradients
- Optimizer states
- Activations used for gradient computation
Memory Dynamics During Training
Memory usage fluctuates during a training step: the forward pass builds up activations, the backward pass accumulates gradients while activations are gradually freed, and the optimization step updates parameters and optimizer states. Note that the first step is not representative: PyTorch's caching allocator warms up, and optimizer states are only materialized after the first optimizer step, so a run that survives step one can still run out of memory on later steps.
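To observe these dynamics directly, you can read PyTorch's built-in memory counters at each phase of a step. A minimal sketch, where `model`, `optimizer`, `loss_fn`, `batch`, and `targets` are placeholders for your own objects:

```python
import torch

def report(tag):
    # Currently allocated vs. peak allocated memory on the default GPU, in GiB.
    allocated = torch.cuda.memory_allocated() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"{tag:<22} allocated={allocated:6.2f} GiB  peak={peak:6.2f} GiB")

def train_step(model, optimizer, loss_fn, batch, targets):
    report("before forward")
    loss = loss_fn(model(batch), targets)   # activations accumulate here
    report("after forward")
    loss.backward()                         # gradients allocated, activations freed
    report("after backward")
    optimizer.step()                        # optimizer state materialized on step 1
    optimizer.zero_grad(set_to_none=True)
    report("after optimizer step")
```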
Estimating Model Memory Requirements
To manage memory, estimate the memory required by weights, gradients, and optimizer states. For a Transformer-style model, parameter count depends on hidden dimension h, vocabulary size v, and number of layers L. The h^2 term grows rapidly as hidden dimension increases and becomes a dominant factor.
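As a back-of-the-envelope estimate for a GPT-style decoder (attention plus a 4h feedforward block, with biases and LayerNorms, and tied embeddings), the parameter count is often approximated as below; exact counts vary with architectural details, so treat this as a rough sketch:

```python
def approx_param_count(h, v, L):
    """Rough parameter count for a GPT-style Transformer.

    Per layer: ~12*h**2 weight parameters (attention Q/K/V/output projections
    plus a 4h feedforward block) and ~13*h bias/LayerNorm parameters;
    plus v*h token embeddings and a final LayerNorm.
    """
    return v * h + L * (12 * h ** 2 + 13 * h) + 2 * h

# Example: h=4096, v=32000, L=32 gives roughly 6.6 billion parameters.
print(approx_param_count(h=4096, v=32000, L=32) / 1e9)
```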
In full precision (FP32) training, each parameter and each gradient typically occupies 4 bytes. Optimizers like Adam keep additional momentum and variance buffers, adding roughly 8 bytes per parameter of optimizer state. In mixed precision training (for example BF16 compute with an FP32 master copy), weights and gradients are stored in 2-byte BF16 while an FP32 copy (4 bytes) of the weights is kept for stability; many setups also accumulate gradients in FP32, adding another 4 bytes per parameter. Typical memory components then include:
- Model weights (BF16)
- Gradients (BF16)
- FP32 master copy of weights
- FP32 gradient accumulation copy (in some setups, as assumed in the table below)
- Optimizer state (FP32 momentum and variance)
| Model Parameter Count | Full Precision (FP32) | Mixed Precision (BF16 + FP32 master) |
|---|---|---|
| 1 billion parameters | 16 GB | 20 GB |
| 7 billion parameters | 112 GB | 140 GB |
| 70 billion parameters | 1120 GB | 1400 GB |
| 405 billion parameters | 6480 GB | 8100 GB |
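The table corresponds to 16 bytes per parameter for FP32 training and 20 bytes per parameter for the mixed precision setup above (including the FP32 gradient copy), ignoring activations. A small sketch to reproduce the numbers:

```python
GB = 1e9  # the table uses decimal gigabytes

def training_memory_gb(n_params, mixed_precision=True):
    """Weights + gradients + Adam optimizer state, excluding activations.

    FP32:  4 (weights) + 4 (grads) + 8 (Adam momentum + variance)      = 16 B/param
    Mixed: 2 (BF16 weights) + 2 (BF16 grads) + 4 (FP32 master weights)
           + 4 (FP32 grads) + 8 (Adam momentum + variance)             = 20 B/param
    """
    bytes_per_param = 20 if mixed_precision else 16
    return n_params * bytes_per_param / GB

for n in (1e9, 7e9, 70e9, 405e9):
    print(f"{n / 1e9:>5.0f}B params: FP32 ~{training_memory_gb(n, False):.0f} GB, "
          f"mixed ~{training_memory_gb(n, True):.0f} GB")
```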
Activation Memory: A Major Consumer
Activation memory depends on model structure, input sequence length, and batch size. It often dictates whether a model can be trained on available hardware. Analysis of the Transformer backward pass shows activation memory scales with number of layers L, sequence length seq, batch size bs, hidden dimension h, and number of attention heads n_heads.
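A commonly used estimate (from the Megatron-LM activation recomputation analysis, assuming half-precision activations, a standard attention implementation, and no recomputation) is roughly the following; treat it as an upper-bound sketch, since fused or flash attention kernels store far less than the attention term suggests:

```python
def activation_memory_bytes(L, seq, bs, h, n_heads):
    """Approximate activation memory for a Transformer in BF16/FP16,
    without recomputation: seq * bs * h * (34 + 5 * n_heads * seq / h)
    bytes per layer, times L layers."""
    per_layer = seq * bs * h * (34 + 5 * n_heads * seq / h)
    return L * per_layer

# Example: a 7B-class model (L=32, h=4096, n_heads=32) at seq=4096, bs=1
# needs on the order of 100 GB of activations by this estimate.
print(activation_memory_bytes(L=32, seq=4096, bs=1, h=4096, n_heads=32) / 1e9)
```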
Activation Recomputation: Trading Compute for Memory
Activation recomputation stores only key checkpoints and discards other activations, recomputing them as needed during the backward pass. This trades additional compute for reduced memory use. Common strategies include:
- Full recomputation: Recompute every layer's forward pass during backward. This maximizes memory savings but increases compute time by roughly 30-40%.
- Selective recomputation: Recompute only memory-heavy parts such as attention activations while keeping feedforward activations. For example, selective recomputation on a 175B-parameter GPT-3-like model can cut activation memory by about 70% while increasing compute cost by only a few percent.
Gradient Accumulation: Simulating Large Batches with Micro-Batches
Gradient accumulation splits a large global batch into multiple micro-batches. Each micro-batch performs forward and backward passes, and gradients are accumulated. After several micro-batches, the accumulated gradients are used to update parameters, allowing simulation of a larger batch without increasing per-step memory.
Define micro batch size (MBS) as the batch size processed per forward/backward pass and global batch size (GBS) as the total batch size between optimizer steps. If you accumulate over k micro-batches, then GBS = MBS × k. While gradient accumulation reduces memory use, the micro-batch forward/backward passes run sequentially, increasing the wall-clock time per optimizer step. Because the micro-batch computations are independent of one another, however, they can be parallelized across GPUs.
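A minimal PyTorch sketch of gradient accumulation, where `model`, `optimizer`, `loss_fn`, and `micro_batches` are placeholders for your own objects:

```python
accum_steps = 4  # k micro-batches per optimizer step, so GBS = MBS * accum_steps

optimizer.zero_grad(set_to_none=True)
for step, (inputs, targets) in enumerate(micro_batches):
    outputs = model(inputs)
    # Divide by k so the accumulated gradient equals the gradient of the
    # mean loss over the full global batch.
    loss = loss_fn(outputs, targets) / accum_steps
    loss.backward()  # gradients are summed into .grad across micro-batches

    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```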
Data Parallelism: Scaling with Multiple GPUs
In data parallelism, each GPU processes an independent micro-batch and computes gradients for its own model replica. To keep the replicas consistent, gradients are synchronized via an all-reduce operation, the first fundamental distributed communication primitive we encounter: it sums the gradients across all GPUs and distributes the result back to every replica.
Three Optimizations for Data Parallelism
- Overlap computation and communication: Instead of waiting for the entire backward pass to finish before synchronizing gradients, begin synchronizing gradients of later layers immediately as they become available. In PyTorch, adding an all-reduce hook for each parameter allows immediate reduction when that parameter's gradient is ready, thereby overlapping communication with ongoing backward computation.
- Gradient bucketing: Group small gradients into larger tensors before performing all-reduce. GPUs and network stacks are often more efficient with fewer large transfers than many small transfers, so bucketing reduces communication overhead.
- Combine with gradient accumulation: Delay all-reduce until after accumulating gradients over micro-batches when appropriate, reducing the number of synchronization events and saving communication bandwidth. A sketch combining these ideas follows this list.
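A minimal sketch using PyTorch's DistributedDataParallel, which already implements overlapped, bucketed all-reduce; gradient synchronization is skipped on all but the last micro-batch via no_sync(). It assumes the process group has been initialized (for example via torchrun) and that `model`, `optimizer`, `loss_fn`, and `loader` are your own objects:

```python
import contextlib

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(
    model.cuda(),
    bucket_cap_mb=25,  # gradient bucketing: fewer, larger all-reduce calls
)                      # DDP overlaps these all-reduces with the backward pass

accum_steps = 4
optimizer.zero_grad(set_to_none=True)
for step, (inputs, targets) in enumerate(loader):
    sync = (step + 1) % accum_steps == 0
    # Skip gradient synchronization on all but the last micro-batch.
    context = contextlib.nullcontext() if sync else ddp_model.no_sync()
    with context:
        loss = loss_fn(ddp_model(inputs.cuda()), targets.cuda()) / accum_steps
        loss.backward()
    if sync:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```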
Combining Data Parallelism and Gradient Accumulation
Global batch size (GBS) is a key hyperparameter affecting convergence and efficiency. It can be expressed as:
GBS = MBS × GA × DP
where MBS is the micro batch size, GA is the number of gradient accumulation steps, and DP is the number of data parallel replicas. In practice, prefer adding data parallel replicas over gradient accumulation steps, since replicas process their micro-batches concurrently while accumulation processes them sequentially. Increase gradient accumulation only when additional data parallel replicas are unavailable.
Determining Training Configuration
- Set target GBS: Determine the desired global batch size in tokens based on literature or experiments.
- Choose sequence length: Typical choices are 2k–8k tokens depending on the task and available compute.
- Find max MBS per GPU: Increase micro batch size until a single GPU runs out of memory.
- Decide GPU count: Based on the desired number of data parallel replicas, compute the required gradient accumulation steps as GA = GBS / (DP × MBS); a helper for this calculation is sketched below.
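Putting these steps together, a small helper (names and numbers are illustrative) that derives the gradient accumulation steps from a token-based global batch size target:

```python
def accumulation_steps(gbs_tokens, seq_len, mbs, dp):
    """Gradient accumulation steps needed to reach a target global batch size.

    gbs_tokens: target global batch size, in tokens
    seq_len:    sequence length, in tokens
    mbs:        micro batch size (sequences per GPU per forward/backward pass)
    dp:         number of data parallel replicas
    """
    gbs_sequences = gbs_tokens / seq_len
    ga = gbs_sequences / (mbs * dp)
    assert ga >= 1 and ga == int(ga), "adjust GBS, MBS, or DP so GA is a whole number"
    return int(ga)

# Example: a 4M-token global batch at 4k sequence length with MBS=2 on 128 GPUs
# requires GA=4.
print(accumulation_steps(gbs_tokens=4 * 2**20, seq_len=4096, mbs=2, dp=128))
```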
ZeRO: Memory-Efficient Optimizer Sharding
DeepSpeed's ZeRO reduces memory redundancy by sharding optimizer state, gradients, and parameters across data parallel replicas while still allowing full-model training. ZeRO has three stages:
ZeRO-1: Optimizer State Partitioning
ZeRO-1 shards the optimizer state across data parallel replicas so that each replica stores only 1/N_d of it (where N_d is the number of data parallel replicas) and updates only its portion during the optimization step, reducing memory duplication.
ZeRO-2: Add Gradient Partitioning
ZeRO-2 also shards gradients, enabling reduce-scatter in the backward pass instead of all-reduce. Each replica keeps only 1/N_d of the gradients, saving more memory.
ZeRO-3: Add Parameter Partitioning
ZeRO-3 (also known as Fully Sharded Data Parallel, FSDP) additionally shards model parameters. Each replica gathers required parameter shards on demand and releases them after use, minimizing memory usage further.
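Using the standard ZeRO accounting (Ψ parameters; 2 bytes each for BF16 weights and gradients; k = 12 bytes per parameter of FP32 optimizer state for Adam, counting the master weights, momentum, and variance; N_d data parallel replicas), the per-GPU memory for model state is roughly as follows. This is a sketch of the paper's formulas, not a measurement:

```python
def zero_model_state_bytes_per_gpu(psi, n_d, stage, k=12):
    """Approximate per-GPU memory for weights + gradients + optimizer state
    under ZeRO, following the ZeRO paper's accounting (activations excluded):
    params = 2*psi (BF16), grads = 2*psi (BF16), optimizer state = k*psi (FP32).
    """
    params, grads, opt = 2 * psi, 2 * psi, k * psi
    if stage == 0:   # plain data parallelism: everything replicated
        return params + grads + opt
    if stage == 1:   # shard optimizer state
        return params + grads + opt / n_d
    if stage == 2:   # shard optimizer state + gradients
        return params + grads / n_d + opt / n_d
    if stage == 3:   # shard parameters as well (FSDP-style)
        return (params + grads + opt) / n_d
    raise ValueError(f"unknown ZeRO stage: {stage}")

# Example: a 7B-parameter model on 64 replicas.
for stage in (0, 1, 2, 3):
    gb = zero_model_state_bytes_per_gpu(7e9, n_d=64, stage=stage) / 1e9
    print(f"ZeRO-{stage}: ~{gb:.0f} GB per GPU")
```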
Tensor Parallelism: Splitting Layers to Overcome Memory Limits
When a single layer cannot fit on one GPU, tensor parallelism partitions model parameters, gradients, optimizer states, and activations across multiple GPUs without requiring full-parameter collection before compute.
Basic Principle of Tensor Parallelism
Tensor parallelism leverages matrix multiplication properties. For a matrix product X × W:
- X is input or activations
- W is layer weights
Partitioning can be done by columns or rows:
Column partitioning
Split W into column blocks; each GPU multiplies the full X by its block, and the partial outputs are concatenated (an all-gather in the distributed setting).
Row partitioning
Split W into row blocks and X into matching column blocks; each GPU multiplies its pair, and the partial results are summed (an all-reduce in the distributed setting).
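A small single-device PyTorch sketch of both schemes; in a real tensor parallel setup each chunk would live on a different GPU and the concatenation or summation would be a collective (all-gather or all-reduce):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)   # activations: (batch, hidden_in)
W = torch.randn(8, 6)   # weights:     (hidden_in, hidden_out)
reference = X @ W

# Column parallelism: split W along its output (column) dimension.
W_cols = W.chunk(2, dim=1)                            # one chunk per "GPU"
col_out = torch.cat([X @ w for w in W_cols], dim=1)   # concatenate partial outputs

# Row parallelism: split W along its input (row) dimension and X along its
# last dimension to match; the partial products are summed.
W_rows = W.chunk(2, dim=0)
X_cols = X.chunk(2, dim=1)
row_out = sum(x @ w for x, w in zip(X_cols, W_rows))

assert torch.allclose(col_out, reference, atol=1e-5)
assert torch.allclose(row_out, reference, atol=1e-5)
```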
Tensor Parallelism in Multi-Head Attention
Multi-head attention uses multiple matrix multiplications (Q, K, V). Tensor parallelism applies naturally:
- Column-parallel: Split Q, K, V columns across GPUs so each GPU computes outputs for one or more attention heads independently.
- Row-parallel: Split output projection rows across GPUs to reduce per-GPU memory for the projection.
Sequence Parallelism
Sequence parallelism splits activations and computation along the input sequence dimension rather than the hidden dimension. This is useful for operations that require the full hidden dimension, such as LayerNorm and Dropout. LayerNorm needs the full hidden dimension to compute mean and variance, so partitioning along sequence length distributes activation memory across GPUs and reduces per-GPU memory demand.
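Because LayerNorm normalizes each token over the hidden dimension independently of other tokens, activations can be split along the sequence dimension and normalized chunk by chunk with identical results. A minimal single-device illustration (the chunks would live on different GPUs in a real sequence parallel setup):

```python
import torch

torch.manual_seed(0)
bs, seq, h = 2, 8, 16
x = torch.randn(bs, seq, h)
norm = torch.nn.LayerNorm(h)

full = norm(x)  # normalize over the hidden dimension for every token

# Split along the sequence dimension, as sequence parallelism does across GPUs,
# and normalize each chunk independently; the result matches the full computation.
chunks = x.chunk(4, dim=1)
parallel = torch.cat([norm(c) for c in chunks], dim=1)

assert torch.allclose(full, parallel, atol=1e-6)
```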