Summary
Short answer: using an RTX 4090 for training large models is not practical, while using a 4090 for inference is feasible and can be cost-effective compared with H100. With extreme optimization, a 4090-based inference setup can approach or exceed the cost efficiency of H100 systems. The primary differences between H100/A100 and 4090 are memory capacity, memory bandwidth, and inter-GPU communication, not raw compute alone.
Key GPU Specifications
|  | H100 | A100 | RTX 4090 |
| --- | --- | --- | --- |
| Tensor FP16 peak (dense) | 989 Tflops | 312 Tflops | 330 Tflops |
| Tensor TF32 peak (dense) | 495 Tflops | 156 Tflops | 83 Tflops |
| Memory capacity | 80 GB | 80 GB | 24 GB |
| Memory bandwidth | 3.35 TB/s | 2 TB/s | 1 TB/s |
| Interconnect bandwidth | 900 GB/s | 900 GB/s | 64 GB/s |
| Interconnect latency | ~1 us | ~1 us | ~10 us |
| List price | $40,000 | $15,000 | $1,600 |
On Peak Numbers and Pricing
Vendor peak numbers can be misleading. For example, the H100 FP16 figure often quoted as 1979 Tflops assumes structured sparsity; the dense peak is roughly half that. RTX 4090 marketing may headline a much higher Tensor Core number that applies to int8, while the FP16 dense peak is 330 Tflops. Using the wrong specification leads to wrong conclusions.
Market prices for H100 include substantial margins compared with estimated manufacturing cost. Historical examples show large customers can negotiate favorable pricing with fabs. It is reasonable to assume significant markup in enterprise GPU pricing compared with commodity consumer GPUs.
Why 4090 Is Unsuitable for Large-Model Training
Training vs Inference: the communication and memory problem
Training large transformer models requires not only high compute but also high memory capacity, high memory bandwidth, and very high inter-GPU communication bandwidth with low latency. The 4090 is competitive in raw FP16 compute per card, but it is far behind in memory capacity, memory bandwidth, and especially in interconnect bandwidth and latency, which are critical for multi-GPU training.
Throughput and cost trade-offs
Single-node throughput benchmarks show the H100 achieves higher throughput than the 4090 on workloads that fit the device, roughly up to 2x in some cases. However, when factoring in list price, the cost-effectiveness of H100 versus 4090 looks very different: the H100 is far more expensive, which hurts its throughput-per-dollar.
How much compute does training need?
Estimate for total training FLOPs: total FLOPs = 6 * model parameter count * number of training tokens. The factor 6 comes from three multiply-add pairs per parameter per token: one in the forward pass and two in the backward pass (gradients with respect to activations and with respect to weights). This is a simplified but common approximation for rough capacity planning.
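As a quick sanity check, a minimal sketch of this estimate for a 70B-parameter model (the ~2T-token training-set size is an assumption based on the published LLaMA-2 figures):

```python
# Rough training-FLOPs estimate using total_flops = 6 * params * tokens.
params = 70e9    # model parameters (LLaMA-2 70B)
tokens = 2e12    # training tokens (assumed, roughly the published LLaMA-2 figure)
total_flops = 6 * params * tokens
print(f"total training FLOPs ~= {total_flops:.1e}")  # ~8.4e+23
```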
Example: LLaMA-2 70B training
Open data indicates training LLaMA-2 70B required about 1.7M A100 GPU-hours. That translates to roughly two centuries on a single GPU and implies thousands of GPUs to train within an acceptable wall-clock time. A 4090 has a similar FP16 peak to the A100, but its memory bandwidth is about half and its memory capacity is far smaller (24 GB vs 80 GB), so 4090 single-card training throughput is slightly lower than A100 in practice.
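Converting that GPU-hour figure into wall-clock terms makes the scale concrete (a back-of-envelope sketch using only the numbers above):

```python
# Convert ~1.7M A100 GPU-hours into single-GPU years and a GPU count for a ~1-month run.
gpu_hours = 1.7e6
single_gpu_years = gpu_hours / (24 * 365)      # ~194 years on one GPU
gpus_for_one_month = gpu_hours / (24 * 30)     # ~2361 GPUs to finish in about a month
print(f"~{single_gpu_years:.0f} years on one GPU; "
      f"~{gpus_for_one_month:.0f} GPUs for a one-month run")
```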
Parallelism Strategies and Their Limits
Model parallelism typically combines tensor parallelism, pipeline parallelism, and data parallelism. The product of these parallel degrees gives the total GPU count for a training job.
Data parallelism
Data parallelism divides mini-batches across GPUs. It requires gradient synchronization across devices, which is manageable if the model fits on each GPU. For large models that exceed single-GPU memory, data parallelism alone is insufficient.
Memory requirements during training
Training memory must hold model parameters, gradients, optimizer states, and forward activations. For Adam, optimizer state per parameter may require 12 bytes (FP32 param copy, momentum, variance). Activations scale with batch size and sequence length. Techniques like activation recomputation trade compute for memory by recomputing forward activations on demand during the backward pass, at the cost of extra compute.
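A minimal sketch of the per-parameter accounting under mixed-precision Adam (activations excluded; the 70B parameter count is used only to scale the example):

```python
# Per-parameter training memory for mixed-precision Adam:
# FP16 weights + FP16 gradients + FP32 master copy + FP32 momentum + FP32 variance.
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4   # = 16 bytes (12 of which are master copy + optimizer state)

params = 70e9
print(f"~{params * BYTES_PER_PARAM / 1e9:.0f} GB before activations")  # ~1120 GB
```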
Pipeline parallelism
Pipeline parallelism partitions layers across GPUs. Without careful batching, many GPUs sit idle at any one time. Pipelining with micro-batches increases utilization, but it also increases activation storage because activations from multiple in-flight micro-batches must be retained until their backward passes. Adding pipeline stages increases both memory pressure and communication volume and latency, so when memory permits, fewer pipeline stages are preferable.
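As a rough illustration of the idle-time problem, one common estimate of the pipeline "bubble" fraction for a GPipe-style schedule (the formula and numbers below are a standard approximation, not measurements of any particular system):

```python
# Pipeline bubble (idle) fraction for a GPipe-style schedule with p stages and m micro-batches:
#   bubble = (p - 1) / (m + p - 1)
def bubble_fraction(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

for p, m in [(8, 1), (8, 8), (8, 32)]:
    print(f"{p} stages, {m} micro-batches -> {bubble_fraction(p, m):.0%} idle")
# 8 stages, 1 micro-batch   -> 88% idle
# 8 stages, 8 micro-batches -> 47% idle
# 8 stages, 32 micro-batches -> 18% idle
```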
Tensor parallelism
Tensor parallelism partitions single-layer computations (e.g., attention heads, dense matmuls) across GPUs. It reduces per-GPU memory requirement and can lower the number of pipeline stages. However, tensor parallelism increases inter-GPU communication because partial outputs must be exchanged and reduced. The ratio of compute to communication determines viability.
For example, if tensor parallelism splits attention heads so each GPU computes only a subset, communication per layer depends on batch size, sequence length, embedding size, and number of GPUs. For RTX 4090, the compute-to-communication ratio means that to avoid being communication-bound, tensor parallel groups must be very small (in the numeric example given, at most 2 GPUs). By contrast, H100's much higher interconnect bandwidth and compute allow larger tensor-parallel groups (e.g., up to 11 GPUs in the example), making 8-GPU single-node training practical for large embedding sizes.
This is also why bandwidth-reduced variants such as the H800 (which has lower NVLink bandwidth than the H100) are effectively constrained: the weaker interconnect makes large tensor-parallel training configurations inefficient.
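As a rough check on the 2-GPU and 11-GPU limits quoted above, a minimal sketch of the balance condition. The constant and the hidden size are assumptions: the exact constant depends on how per-layer FLOPs and all-reduce traffic are counted, and 8192 is the LLaMA-2 70B embedding size.

```python
# A tensor-parallel group stays compute-bound roughly while per-GPU compute time exceeds
# per-GPU all-reduce time, which reduces to N <= c * hidden_size * interconnect_BW / peak_FLOPS.
# c = 1.5 is an assumed constant chosen to reproduce the ~2 and ~11 GPU limits quoted above.
def max_tensor_parallel(hidden: int, bw_bytes_per_s: float, peak_flops: float, c: float = 1.5) -> float:
    return c * hidden * bw_bytes_per_s / peak_flops

hidden = 8192  # LLaMA-2 70B embedding size
print(f"RTX 4090: ~{max_tensor_parallel(hidden, 64e9, 330e12):.1f} GPUs")   # ~2.4
print(f"H100:     ~{max_tensor_parallel(hidden, 900e9, 989e12):.1f} GPUs")  # ~11.2
```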
Why RTX 4090 Is Attractive for Inference
Training vs inference resource needs
Inference does not require storing gradients or optimizer states and does not need to preserve forward activations for backward passes. Therefore memory capacity and interconnect latency/throughput requirements are lower for inference than for training, though KV cache and activation storage for generation can still be substantial.
KV cache and its impact
KV cache stores per-layer K and V tensors for previously processed tokens to avoid recomputing them for autoregressive generation. KV cache trades additional memory for large compute savings, particularly when serving long context windows and many output tokens. For LLaMA-2 70B with sequence length 4096 and batch size 8, the KV cache across 80 layers can require on the order of 80 GB. If batch size increases, KV cache memory grows proportionally and can exceed parameter storage.
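A minimal sketch of that sizing, assuming FP16 and full multi-head attention (LLaMA-2 70B actually uses grouped-query attention, which would shrink the cache considerably, so this is an upper-bound style estimate):

```python
# KV cache size: 2 tensors (K and V) per layer, each of shape [batch, seq_len, hidden], FP16.
def kv_cache_bytes(layers: int, batch: int, seq_len: int, hidden: int, bytes_per_elem: int = 2) -> int:
    return 2 * layers * batch * seq_len * hidden * bytes_per_elem

gb = kv_cache_bytes(layers=80, batch=8, seq_len=4096, hidden=8192) / 1e9
print(f"KV cache ~= {gb:.0f} GB")  # ~86 GB, i.e. on the order of 80 GB as stated above
```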
Compute vs memory bandwidth for inference
Inference compute is roughly 2 * output_token_count * parameter_count FLOPs. However, memory bandwidth is often the bottleneck: each FP16 parameter read costs 2 bytes and yields only 2 FLOPs per generated token, so unless many tokens are processed per weight read (long prefill or large batches), the GPU is starved. For the RTX 4090, the ratio of peak compute to memory bandwidth is about 330 FLOPs per byte, meaning that if the effective arithmetic intensity (roughly, tokens processed per parameter read) stays below ~330, inference is memory-bound. For the H100 the break-even point is around 295. Batching multiple prompts together raises arithmetic intensity and shifts the bottleneck from memory to compute.
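A minimal sketch of the break-even point for decode, using the peak numbers from the table above:

```python
# Decode is memory-bound roughly while batch size < peak_FLOPS / memory_bandwidth,
# since each FP16 parameter (2 bytes) contributes 2 FLOPs per generated token.
def breakeven_batch(peak_flops: float, mem_bw_bytes_per_s: float) -> float:
    return peak_flops / mem_bw_bytes_per_s

print(f"RTX 4090: memory-bound below batch ~{breakeven_batch(330e12, 1.0e12):.0f}")   # ~330
print(f"H100:     memory-bound below batch ~{breakeven_batch(989e12, 3.35e12):.0f}")  # ~295
```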
How many cards to serve LLaMA-2 70B?
Model parameters for a 70B model occupy about 140 GB in FP16, so no single GPU (24 GB or 80 GB) can hold the whole model; a multi-GPU setup is required. For the H100 (80 GB), at least 3 GPUs are needed to hold the parameters plus space for the KV cache. For the 4090 (24 GB), around 8 GPUs are required to accommodate the parameters and a useful KV cache budget.
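A minimal sketch of that count, assuming FP16 weights and a ~43 GB KV-cache budget (e.g. batch 4 at 4K context from the earlier sizing) and ignoring activation and framework overhead:

```python
import math

# Minimum GPU count to hold FP16 weights plus a KV-cache budget.
def min_gpus(param_gb: float, kv_gb: float, per_gpu_gb: float) -> int:
    return math.ceil((param_gb + kv_gb) / per_gpu_gb)

weights_gb, kv_gb = 140, 43  # 70B FP16 weights; assumed KV budget (batch 4, 4K context)
print(f"H100 (80 GB): {min_gpus(weights_gb, kv_gb, 80)} GPUs")  # 3
print(f"4090 (24 GB): {min_gpus(weights_gb, kv_gb, 24)} GPUs")  # 8
```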
Pipeline vs tensor parallelism for inference
Pipeline parallelism increases single-prompt latency because the prompt is processed sequentially across stages, although pipelining multiple prompts reduces GPU idle time. With small batch sizes, memory bandwidth is often the bottleneck and pipelining can produce poor single-prompt throughput. With larger batch sizes, compute becomes the bottleneck and pipelining can be more effective, but communication between hosts can limit throughput unless interconnects are fast.
Tensor parallelism increases inter-GPU data exchange, but for inference the communicated payload is activations rather than gradients, and activation sizes are typically much smaller than gradient traffic during training. For small batch sizes and short contexts tensor-parallel latency can be acceptable even on PCIe-based systems, whereas for large batches or long contexts, NVLink or high-bandwidth interconnects reduce latency significantly.
Performance examples
Using the example numbers: 8x 4090 in a tensor-parallel configuration at batch size 1 could produce around 45 tokens per second for a single prompt once communication overhead and memory bandwidth are accounted for. Increasing batch size to saturate compute (e.g., batch size ~330 for the 4090) yields higher throughput but raises communication volumes and KV cache memory needs. H100 systems, with much higher NVLink and memory bandwidth, deliver much lower per-token latency: an 8x H100 system can be many times faster per prompt than 8x 4090 in latency-sensitive scenarios.
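A minimal sketch of where the ~45 tokens/s figure can come from, for 8x 4090 tensor parallel at batch 1. The per-layer communication overhead is an illustrative assumption chosen to account for PCIe-class all-reduces; the weight-read term follows directly from the memory-bandwidth argument above.

```python
# Per-token decode latency ~= time to stream this GPU's weight shard + per-layer all-reduce overhead.
def tokens_per_second(weight_bytes: float, n_gpus: int, mem_bw: float,
                      n_layers: int, per_layer_comm_s: float) -> float:
    mem_time = weight_bytes / n_gpus / mem_bw   # 140 GB / 8 GPUs at 1 TB/s ~= 17.5 ms
    comm_time = n_layers * per_layer_comm_s     # assumed ~60 us of all-reduce cost per layer
    return 1.0 / (mem_time + comm_time)

print(f"~{tokens_per_second(140e9, 8, 1e12, 80, 60e-6):.0f} tokens/s")  # ~45
```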
Cost and Operational Considerations for Inference
Example cost comparison for inference infrastructure:
- 8x 4090 server: GPU cost is about $12,800; the server, networking, and integration may bring the total to roughly $40,000. Depreciated over 3 years, the capital cost is around $1.52 per hour. Power and hosting add further cost, though a fully loaded rack can still reach relatively low per-hour costs. With ideal utilization and efficient batching, cost per token can be very low in theory, but utilization and workload variability reduce realized efficiency.
- 8x H100 server: hardware and networking costs are much higher; a high-end 8x H100 server with an InfiniBand network can cost around $300,000. Depreciated over 3 years, the capital cost per hour is correspondingly higher. H100 power draw and hosting constraints also differ. However, higher per-card performance and interconnect bandwidth allow significantly higher throughput per server, so tokens-per-dollar can be similar to a 4090 cluster depending on utilization and workload characteristics.
In practical terms, due to higher per-card compute and much higher interconnect bandwidth, an H100 cluster can deliver substantially more throughput per node. When accounting for server, network, power, and hosting costs, the overall cost-per-token for 8x H100 may be within a factor of two of 8x 4090, despite a much larger upfront GPU price gap.
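A minimal depreciation sketch reproducing the per-hour capital costs implied above (straight-line over 3 years, excluding power, hosting, and staffing):

```python
# Straight-line 3-year depreciation of server capital cost into an hourly rate.
HOURS_PER_3_YEARS = 3 * 365 * 24  # 26,280 hours

for name, capex_usd in [("8x 4090 server", 40_000), ("8x H100 server", 300_000)]:
    print(f"{name}: ${capex_usd / HOURS_PER_3_YEARS:.2f}/hour capital cost")
# 8x 4090 server: $1.52/hour
# 8x H100 server: $11.42/hour
```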
Designing Cost-Effective Inference Clusters
There are architecture choices to balance cost and performance when using consumer GPUs like 4090s for inference:
- Pipeline parallelism across hosts with modest network bandwidth (e.g., 10 Gbps) can be used to assemble budget clusters for latency-tolerant workloads, leveraging inexpensive desktop hardware and switches.
- Tensor parallelism within hosts combined with faster host-to-host links (e.g., 200 Gbps RoCE) can reduce per-prompt latency and increase throughput, at higher network hardware cost.
- Hybrid approaches combining intra-host tensor parallelism and inter-host pipeline parallelism can balance communication and latency trade-offs while using commodity components.
With careful engineering and lower-cost hosting, 4090-based inference clusters can achieve impressive tokens-per-dollar, but they generally require more servers, more power, and more engineering effort than purpose-built H100 systems.
Broader Remarks
There are many GPU options beyond A100/H100 and consumer 4090 cards, including datacenter cards like A10/A40 and consumer or prosumer models like 3090, as well as offerings from other vendors. In many cases, alternatives can offer better price/performance for specific workloads.
Quantization, model compression, and software-level optimizations can significantly change the hardware cost calculus. On-device unified memory with sufficiently high bandwidth also changes trade-offs and may enable single-device inference in more cases.
Licensing Note
The NVIDIA GeForce driver license restricts datacenter deployment of GeForce software: "No Datacenter Deployment. The SOFTWARE is not licensed for datacenter deployment, except that blockchain processing in a datacenter is permitted." Anyone deploying large consumer-GPU clusters should consider vendor license terms and compliance obligations.
Closing Observations
Training and inference have different bottlenecks. Training of very large models favors GPUs with large memory capacity, high memory bandwidth, and high-bandwidth, low-latency interconnects. Consumer GPUs can be cost-effective for inference with careful system design, batching, and KV cache management. Ultimately, total cost is governed by capital, power, network, hosting, and software efficiency, and future improvements in hardware cost, power efficiency, and network infrastructure will continue to change the economics of large-model deployment.
