Overview
Recently, ByteDance and a research team from Peking University published a paper titled "MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs", which describes a production system for training large language models and addresses the efficiency and stability challenges encountered on clusters with tens of thousands of GPUs. The paper details the system's design, implementation, and deployment, as well as the problems encountered and the solutions adopted at cluster scales above 10,000 GPUs.
1. Two main challenges for 10,000+ GPU clusters
In the large-model era, massive compute resources are required: model size and training data size are key factors for model capability. Major players build clusters with tens of thousands of GPUs to train large language models. When GPU clusters reach the 10,000+ scale, how can efficient and stable training be achieved?
The first challenge is achieving high-efficiency training at large scale. Model FLOP utilization (MFU) is the ratio of actual throughput to theoretical maximum throughput and is a common metric for training efficiency. To train LLMs, models must be partitioned across many GPUs, requiring extensive communication between GPUs. Besides communication, factors such as operator optimizations, data preprocessing, and GPU memory consumption also significantly affect MFU.
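As a concrete illustration, the sketch below estimates MFU from observed token throughput using the common 6·N FLOPs-per-token approximation (forward plus backward, ignoring attention terms); the throughput and peak-FLOPS figures are illustrative assumptions, not numbers from the paper.

```python
# Minimal sketch: estimating MFU from observed throughput.
# Assumptions: the 6 * N FLOPs-per-token rule of thumb, and illustrative
# throughput / peak-FLOPS numbers (not figures reported in the paper).

def estimate_mfu(params: float, tokens_per_sec: float,
                 num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Ratio of achieved model FLOPs to the cluster's theoretical peak."""
    achieved_flops = 6 * params * tokens_per_sec   # FLOPs actually spent per second
    peak_flops = num_gpus * peak_flops_per_gpu     # theoretical maximum per second
    return achieved_flops / peak_flops

# Example: a 175B-parameter model on 12,288 GPUs with a 312 TFLOPS peak per GPU.
print(estimate_mfu(params=175e9, tokens_per_sec=2.0e6,
                   num_gpus=12288, peak_flops_per_gpu=312e12))
```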
The second challenge is maintaining training stability at large scale, i.e., preserving high efficiency throughout the run. Failures and slowdowns are common in large-model training, and their cost is high. Reducing fault-recovery time is critical, since even a single straggler can slow down an entire job running on tens of thousands of GPUs. To address these challenges, ByteDance developed the MegaScale system and deployed it in its data centers.
2. How to achieve efficient large-model training
Handling sharply increased compute demands without degrading model quality requires algorithmic optimizations, communication strategies, data pipeline management, and network tuning. The following sections summarize the methods used to optimize large-model training for high efficiency at scale.
Algorithmic optimizations
Algorithm-level changes improved training efficiency without affecting accuracy. Key techniques include a parallel transformer block, sliding window attention (SWA), and the LAMB optimizer.
Parallel transformer blocks replace the standard serialized formulation so that attention and MLP computations can be executed in parallel, reducing compute time. Prior work indicates this modification does not reduce the quality of models with hundreds of billions of parameters.
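A minimal PyTorch sketch of the parallel formulation is shown below: both branches read the same normalized input and their outputs are summed, rather than the usual serial attention-then-MLP chain. The dimensions are illustrative assumptions, and the paper's implementation, which can additionally fuse the two branches' input projections into one GEMM, will differ.

```python
import torch
import torch.nn as nn

class ParallelTransformerBlock(nn.Module):
    """Sketch of the parallel formulation: y = x + Attn(LN(x)) + MLP(LN(x)).
    Dimensions are illustrative assumptions, not the paper's configuration."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)                  # single shared LayerNorm
        attn_out, _ = self.attn(h, h, h)  # attention branch
        mlp_out = self.mlp(h)             # MLP branch, independent of the attention output
        # The two branches are independent, so an optimized implementation can
        # fuse or overlap their GEMMs instead of running them back to back.
        return x + attn_out + mlp_out
```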
Sliding window attention (SWA) is a sparse attention mechanism that restricts each token to a fixed-size window of neighboring tokens, which is far cheaper than full self-attention. Stacking several windowed-attention layers builds a large receptive field, so the model still captures long-range context, speeding up training without degrading accuracy.
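The sketch below builds a causal sliding-window mask to make the pattern concrete; it illustrates the masking only and is not the paper's attention kernel.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where True marks positions a query may attend to:
    only keys within `window` tokens to the left (causal, fixed-size window)."""
    idx = torch.arange(seq_len)
    dist = idx.unsqueeze(1) - idx.unsqueeze(0)   # query index minus key index
    return (dist >= 0) & (dist < window)

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.int())
# Stacking k such layers lets information propagate roughly k * window tokens,
# which is how the receptive field grows without full quadratic attention.
```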
The LAMB optimizer enables large batch sizes during training. Increasing batch size often harms convergence, but LAMB allows BERT-scale training batch sizes up to 64K without accuracy loss.
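A simplified sketch of the LAMB update follows: a layer-wise trust ratio rescales an Adam-style step so per-layer step sizes stay stable at very large batch sizes. `adam_step` is assumed to be the already-computed Adam direction; hyperparameters and edge cases are simplified relative to the full algorithm.

```python
import torch

def lamb_update(param: torch.Tensor, adam_step: torch.Tensor,
                lr: float, weight_decay: float) -> torch.Tensor:
    """One LAMB-style parameter update (simplified sketch)."""
    update = adam_step + weight_decay * param          # Adam direction plus decoupled weight decay
    w_norm = param.norm()                              # layer-wise weight norm
    u_norm = update.norm()                             # layer-wise update norm
    trust_ratio = torch.where((w_norm > 0) & (u_norm > 0),
                              w_norm / u_norm,
                              torch.ones_like(w_norm)) # fall back to 1 when either norm is zero
    return param - lr * trust_ratio * update           # per-layer rescaled step
```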
Communication overlap in 3D parallelism
3D parallelism refers to tensor, pipeline, and data parallelism. Data parallelism involves two main collective operations: all-gather and reduce-scatter. In 3D parallelism, a single device may host multiple model blocks, and overlap is implemented at the model-block level to maximize bandwidth utilization: all-gather operations are triggered before a block's forward pass, and reduce-scatter operations start after its backward pass. This leaves the first all-gather of each iteration with no preceding computation to hide behind. Inspired by PyTorch FSDP, that initial all-gather is prefetched at the start of each iteration so it overlaps with data loading, effectively reducing communication time.
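The sketch below illustrates this block-level overlap on the data-parallel path: the first block's all-gather is launched before the batch is loaded, and each subsequent all-gather is issued before computing on the current block. It assumes an initialized torch.distributed process group; the `blocks`/`sharded_params` layout and the `load_batch` callable are illustrative, not MegaScale's API.

```python
import torch
import torch.distributed as dist

def overlapped_forward(blocks, sharded_params, load_batch):
    """Sketch of block-level all-gather overlap (requires an initialized group)."""
    world = dist.get_world_size()
    gathered = [[torch.empty_like(s) for _ in range(world)] for s in sharded_params]

    # Prefetch block 0's parameters first, so this all-gather overlaps with
    # data loading rather than stalling the first forward pass.
    handle = dist.all_gather(gathered[0], sharded_params[0], async_op=True)
    out = load_batch()                       # runs while the prefetched all-gather is in flight

    for i, block in enumerate(blocks):
        next_handle = None
        if i + 1 < len(blocks):
            # Launch the next block's all-gather before computing on this block,
            # so communication hides behind the current forward pass.
            next_handle = dist.all_gather(gathered[i + 1], sharded_params[i + 1], async_op=True)
        handle.wait()                        # parameters for block i are ready
        out = block(out, torch.cat(gathered[i]))
        handle = next_handle
    return out
```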
In pipeline parallelism, MegaScale uses an interleaved 1F1B schedule to enable communication overlap. During warm-up, forward passes depend only on previous receives. The usual send-receive dependency is decoupled so that send operations can overlap with computation. For tensor/sequence parallelism, optimizations include fusing communication and computation and splitting GEMM kernels into smaller tiles that are executed in a pipeline with communication.
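As a rough illustration of decoupling sends from receives during warm-up, the sketch below blocks only on the receive that the forward pass actually needs and issues the send asynchronously. Ranks, shapes, and the `block` callable are illustrative assumptions; the interleaved 1F1B schedule itself is not shown.

```python
import torch
import torch.distributed as dist

def warmup_forward(block, prev_rank: int, next_rank: int, shape, device):
    """Sketch: the forward pass depends only on the activation received from the
    previous stage, so we block on recv but send the result asynchronously,
    letting the send overlap with the next microbatch's computation."""
    recv_buf = torch.empty(shape, device=device)
    dist.recv(recv_buf, src=prev_rank)            # must complete before compute
    out = block(recv_buf)                         # compute for this microbatch
    send_work = dist.isend(out, dst=next_rank)    # overlaps with subsequent work
    return out, send_work                         # caller waits on send_work later
```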
Efficient operators
Although the GEMM operators in Megatron-LM were already well optimized, other operators offered room for further gains. The attention implementation uses FlashAttention-2, which improves workload distribution across thread blocks and warps. LayerNorm and GeLU were previously implemented as fine-grained kernels; fusing these kernels reduces launch overhead and improves memory-access patterns, yielding better performance.
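To illustrate the idea of fusing fine-grained elementwise kernels, the sketch below scripts a bias-add plus tanh-approximated GeLU so the TorchScript fuser can emit a single kernel instead of several; it is a stand-in for the hand-written fused kernels described in the paper.

```python
import torch

@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # Bias-add and tanh-approximated GeLU written as one scripted function,
    # giving the fuser a chance to emit a single elementwise kernel.
    y = x + bias
    return 0.5 * y * (1.0 + torch.tanh(0.79788456 * (y + 0.044715 * y * y * y)))

x = torch.randn(4, 1024, device="cuda" if torch.cuda.is_available() else "cpu")
b = torch.zeros(1024, device=x.device)
out = fused_bias_gelu(x, b)
```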
Data pipeline optimizations
Data preprocessing and loading are often overlooked but cause non-negligible GPU idle time at the start of each training step. Optimizing these operations is essential for overall training efficiency.
Data preprocessing is performed asynchronously: since it is not on the critical path, preprocessing for upcoming steps can begin while the GPU workers synchronize gradients at the end of the current step, hiding the preprocessing overhead.
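A minimal sketch of the idea, assuming a simple iterator-based pipeline: a background thread keeps a small queue of preprocessed batches filled while the GPUs work on the current step. The `preprocess` function and queue depth are placeholders.

```python
import threading
import queue

def preprocess(raw):          # placeholder for tokenization / shuffling work
    return raw

def async_preprocessor(raw_batches, depth: int = 2):
    """Background thread prepares upcoming batches while the GPUs are busy
    (e.g. during gradient synchronization), keeping preprocessing off the
    critical path. Names and queue depth are illustrative."""
    ready: queue.Queue = queue.Queue(maxsize=depth)

    def worker():
        for raw in raw_batches:
            ready.put(preprocess(raw))   # runs concurrently with GPU work
        ready.put(None)                  # sentinel: no more batches

    threading.Thread(target=worker, daemon=True).start()
    while (batch := ready.get()) is not None:
        yield batch

# Usage: for batch in async_preprocessor(dataset_iterator): train_step(batch)
```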
Redundant data loaders are eliminated. In typical distributed training, each GPU worker runs its own data loader that reads training data into CPU memory and then to GPU, causing contention for disk bandwidth. In LLM settings, GPU workers on the same machine often belong to the same tensor-parallel group and therefore use essentially the same inputs each iteration. Leveraging this, a two-level tree approach is used: one dedicated data loader per machine reads training data into shared memory, and each GPU worker then copies required data from shared memory to its GPU. This removes redundant reads and improves data transfer efficiency.
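The sketch below shows the per-machine pattern under simplifying assumptions: the worker with `local_rank == 0` performs the single disk read into a shared-memory segment, and the other GPU workers on the same machine copy from that segment after a node-local barrier. The segment name, shapes, `read_from_disk`, and the barrier object are all illustrative.

```python
import numpy as np
from multiprocessing import shared_memory

def read_from_disk(shape):                 # placeholder for the real dataset reader
    return np.random.rand(*shape).astype(np.float32)

def load_batch_per_node(local_rank: int, barrier, shape=(1024, 2048), name="node_batch"):
    """One dedicated reader per machine fills a shared-memory buffer; every GPU
    worker on the machine then copies from shared memory instead of the disk."""
    nbytes = int(np.prod(shape)) * 4       # float32 buffer size
    if local_rank == 0:
        shm = shared_memory.SharedMemory(name=name, create=True, size=nbytes)
        buf = np.ndarray(shape, dtype=np.float32, buffer=shm.buf)
        buf[:] = read_from_disk(shape)     # single disk read per machine
    barrier.wait()                         # node-local barrier: buffer is ready
    if local_rank != 0:
        shm = shared_memory.SharedMemory(name=name)
    batch = np.ndarray(shape, dtype=np.float32, buffer=shm.buf).copy()
    return batch                           # each worker moves its copy to its GPU
```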
Collective communication group initialization
Distributed training initialization establishes NCCL communication groups between GPU workers. At small scales, torch.distributed is acceptable, but at scales beyond 10,000 GPUs, a naive implementation introduces intolerable overhead. Two issues were identified in torch.distributed initialization. First, the synchronization steps rely on a TCPStore-based barrier with single-threaded, blocking reads and writes; replacing TCPStore with a non-blocking, asynchronous Redis-backed store removes this bottleneck. Second, careless use of global barriers causes each process to perform a global barrier after initializing its communication groups; by designing the initialization order of the communication groups to minimize the number of global barriers, the overall initialization time is reduced. For example, an unoptimized 2,048-GPU cluster took 1,047 seconds to initialize; after optimization this fell below 5 seconds, and initialization for a 10,000+ GPU cluster can be kept under 30 seconds.
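To make the difference concrete, here is a rough sketch of a Redis-backed key-value store with TCPStore-like set/get/barrier semantics, written against the redis-py client; the endpoint, polling intervals, and the class itself are illustrative assumptions, not MegaScale's implementation.

```python
import time
import redis  # redis-py client; the endpoint below is an illustrative assumption

class RedisStore:
    """Sketch of a TCPStore-style key-value store backed by Redis. Unlike the
    single-threaded TCPStore, the Redis server handles many concurrent clients,
    which is the property exploited during large-scale group initialization."""

    def __init__(self, host="redis-master", port=6379):
        self.r = redis.Redis(host=host, port=port)

    def set(self, key: str, value: bytes) -> None:
        self.r.set(key, value)

    def get(self, key: str, timeout_s: float = 300.0) -> bytes:
        # Poll instead of blocking a dedicated server thread per rank.
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            value = self.r.get(key)
            if value is not None:
                return value
            time.sleep(0.01)
        raise TimeoutError(f"key {key!r} not set within {timeout_s}s")

    def barrier(self, name: str, world_size: int) -> None:
        self.r.incr(f"{name}:count")       # register this rank's arrival
        while int(self.r.get(f"{name}:count")) < world_size:
            time.sleep(0.01)               # wait until every rank has arrived
```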
Network performance tuning
The team analyzed inter-machine traffic in 3D parallelism and designed techniques to improve network performance, including topology design, reducing ECMP hash collisions, congestion control, and retransmission timeout settings.
Network topology. The data center network uses high-performance switches based on the Tomahawk 4 silicon. Each Tomahawk chip offers 25.6 Tbps total bandwidth with 64×400 Gbps ports. A three-layer CLOS-like topology connects over 10,000 GPUs. The downlink-to-uplink bandwidth ratio at each switch layer is 1:1, providing a low-diameter, high-bandwidth fabric where each node communicates with others within a limited number of hops.
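As a back-of-the-envelope check (assuming a standard three-tier fat-tree built from 64-port switches with a 1:1 downlink-to-uplink ratio at every layer, which may differ from the exact production topology), such a fabric has more than enough endpoint ports for a 10,000+ GPU cluster:

```python
ports = 64                           # ports per switch (Tomahawk 4: 64 x 400G)
max_endpoints = ports ** 3 // 4      # classic three-tier fat-tree capacity: k^3 / 4
print(max_endpoints)                 # 65536, comfortably above 10,000 GPUs
```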
Reducing ECMP hash collisions. The network topology and flow scheduling were designed to reduce ECMP hash collisions. On rack ToR switches, uplinks and downlinks are separated; one 400G downlink port is split into two 200G downlink ports via a specific AOC cable, effectively reducing collision rates.
Congestion control. At large scale, default DCQCN can produce congestion and increased use of PFC during all-to-all communication. Excessive PFC can cause head-of-line blocking, reducing throughput. To mitigate this, an algorithm combining Swift and DCQCN principles was developed that pairs accurate round-trip time measurements with fast ECN-based congestion responses. This approach substantially improves throughput while minimizing PFC-related congestion.
Retransmission timeout tuning. NCCL parameters controlling the retransmission timer and retry count were tuned to enable quick recovery from link jitter. In addition, the NICs' adaptive retransmission feature was enabled to support retransmission at shorter intervals, which aids rapid recovery when the jitter period is short.
3. Fault tolerance
At cluster scales of tens of thousands of GPUs, software and hardware failures are nearly inevitable. A robust training framework was designed for LLM training to enable automatic fault detection and rapid recovery with minimal human intervention and minimal impact on ongoing training jobs.
Upon receiving a training job, a driver process interacts with a custom Kubernetes interface to allocate compute resources and launch a pod for each executor, where each executor manages one node. After initialization, the executors spawn the training processes on each GPU and start a monitoring daemon that periodically sends heartbeats to the driver for real-time anomaly detection and alerting.

When an anomaly is detected or a status report is not received within the timeout, a recovery procedure is triggered: all running training tasks pause and perform self-check diagnostics. Once faulty nodes are identified, the driver submits the IPs of the nodes to be blocked, together with their pod information, to Kubernetes, which evicts the faulty nodes and replaces them with healthy ones. A UI is also available for manually removing problematic nodes. After recovery, the driver resumes training from the latest checkpoint; checkpointing and recovery were optimized to minimize the loss of training progress.

A millisecond-resolution monitoring system tracks metrics at different levels for stability and performance monitoring. The paper also covers fast checkpoint recovery, training troubleshooting, and MegaScale's deployment and operational experience in more detail; the full paper is available for download.
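The control flow of the heartbeat-and-watchdog loop can be sketched as follows; the intervals, message format, and function names are illustrative assumptions, and the actual driver/executor protocol is more involved.

```python
import time
import threading

HEARTBEAT_INTERVAL_S = 10      # illustrative values, not the paper's settings
HEARTBEAT_TIMEOUT_S = 60

def executor_heartbeat(send_fn, stop_event: threading.Event):
    """Executor side: periodically report node status to the driver."""
    while not stop_event.is_set():
        send_fn({"ts": time.time(), "status": "ok"})
        time.sleep(HEARTBEAT_INTERVAL_S)

def driver_watchdog(last_seen: dict, on_fault):
    """Driver side: if a node's heartbeat goes stale, hand the node to the
    recovery path (pause training, run self-checks, evict and replace the
    node, resume from the latest checkpoint). Control flow only."""
    while True:
        now = time.time()
        for node, ts in list(last_seen.items()):
            if now - ts > HEARTBEAT_TIMEOUT_S:
                on_fault(node)          # trigger the recovery procedure
        time.sleep(HEARTBEAT_INTERVAL_S)
```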
4. Conclusion
The paper examines MegaScale's design, implementation, and deployment. Through joint algorithm-system design, MegaScale optimizes training efficiency. When training a 175B-parameter LLM on 12,288 GPUs, MegaScale achieved 55.2% MFU, a 1.34× improvement compared with Megatron-LM.
Fault tolerance throughout the training lifecycle is emphasized, and a custom robust training framework was implemented to automatically locate and repair failures. A comprehensive monitoring toolset was provided to observe system components and events for root-cause analysis of complex anomalies. The work offers practical insights for large-scale LLM training and informs future research in this area.