
Popular LLM Inference Stacks and Setups

Author: Adrian | September 29, 2025

[Figure: LLM tech stack]

Overview

Choosing the right LLM inference stack means selecting a model appropriate for your task and running it with suitable inference code on appropriate hardware. This article summarizes popular LLM inference stacks and setups, explains the cost components of inference, discusses current open models and how to use them effectively, and highlights missing features in today's open-source service stacks as well as new capabilities future models may enable.

The material is based on a talk by Mistral AI CTO Timothée Lacroix. Lacroix joined Facebook AI Research as an engineer in 2015, worked with École des Ponts on tensor factorization for recommendation systems from 2016 to 2019, and co-founded Mistral AI in 2023. Mistral AI recently released the open MoE model Mixtral-8x7B.

Much of the talk builds on publicly available information and experiments performed on early LLaMA models. The focus here is on inference costs rather than training costs, covering cost components, throughput, latency, and influencing factors.

Key Inference Metrics

Important metrics include throughput, latency, and cost. Throughput, measured in queries per second, is what you maximize for batch workloads or to serve more users. Latency, often expressed in seconds per token, determines responsiveness; for interactive applications, lower is better. A practical target is to generate about 250 words per minute, which is roughly human reading speed: as long as tokens arrive at that rate or faster, users are unlikely to get frustrated. Cost, finally, should be minimized.
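As a quick sanity check on that target, the snippet below converts 250 words per minute into a per-token latency budget, assuming roughly 1.3 tokens per English word (a rule-of-thumb conversion, not a figure from the talk).

```python
# Back-of-the-envelope per-token latency budget for ~250 words/minute.
# Assumes roughly 1.3 tokens per English word (a common rule of thumb).
words_per_minute = 250
tokens_per_word = 1.3                      # assumed average
tokens_per_second = words_per_minute * tokens_per_word / 60
latency_budget = 1 / tokens_per_second     # seconds per generated token

print(f"target throughput: {tokens_per_second:.1f} tokens/s")
print(f"per-token latency budget: {latency_budget * 1000:.0f} ms/token")
# -> roughly 5.4 tokens/s, i.e. about 185 ms per generated token
```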

Factors Affecting Inference Metrics

This discussion focuses on autoregressive decoding, i.e., producing the next tokens from batches of tokens via a neural network. It does not cover prompt processing, sometimes called the prefill stage, which typically runs once per request and is already heavily optimized.

Consider a model with P parameters, for example P = 7B. One decoding step requires roughly 2 × P × batch_size FLOPs. During these FLOPs, the entire model must be streamed through the GPU performing the computation, so the memory traffic per step is roughly the size of the model in bytes: every parameter has to be read once, regardless of batch size.

Two times matter per decoding step. The first is the compute time, which is bounded by the hardware's floating-point throughput and grows linearly with batch size. The second is the model size divided by memory bandwidth: the fixed minimum time needed to stream the weights from memory, independent of batch size. The point where these two times cross defines a critical batch size B*. Below B*, compute resources are wasted because the step is memory-bound; above B*, the step becomes compute-bound and latency grows with batch size.

For commonly available GPUs such as the A10G and A100, this intersection puts B* on the order of a few hundred. B* is valuable because it is the batch size that gives the best latency/efficiency tradeoff without wasting FLOPs.
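As an illustration, B* can be estimated directly from the step-time reasoning above: compute time is 2 × P × B divided by peak FLOPs, weight-loading time is model size divided by memory bandwidth, and the two cross at B* = peak_FLOPs × bytes_per_param / (2 × bandwidth). The sketch below plugs in approximate published spec figures (treat them as assumptions; exact SKUs differ) and assumes FP16 weights.

```python
# Estimate the critical batch size B* where decoding shifts from memory-bound
# to compute-bound. Per step: compute_time ≈ 2*P*B / peak_flops,
# weight_load_time ≈ P*bytes_per_param / mem_bandwidth. Setting them equal
# gives B* = peak_flops * bytes_per_param / (2 * mem_bandwidth); P cancels out.
# Spec numbers below are approximate published figures, used only for illustration.

def critical_batch_size(peak_flops: float, mem_bandwidth: float,
                        bytes_per_param: float = 2.0) -> float:
    return peak_flops * bytes_per_param / (2 * mem_bandwidth)

gpus = {
    "A10G  (assumed ~125 TFLOPS FP16, ~600 GB/s)": (125e12, 600e9),
    "A100 80GB (assumed ~312 TFLOPS FP16, ~2,000 GB/s)": (312e12, 2000e9),
}

for name, (flops, bw) in gpus.items():
    print(f"{name}: B* ≈ {critical_batch_size(flops, bw):.0f}")
# Both land in the low hundreds, matching the "order of hundreds" estimate above.
```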

To put numbers on a LLaMA-like model: LLaMA 7B has a hidden dimension of about 4K and 32 layers. In FP16 each parameter takes 2 bytes, so the weights alone occupy on the order of 14 GB. The KV cache stores K and V for each layer in FP16, per batch element and per sequence position; for a maximum sequence length of 4K, this comes to roughly 2 GB per batch element. On an A10 with 24 GB of memory the maximum batch size is therefore about 5; on an A100 with 80 GB it is about 33. Both are far below the ideal B* of several hundred.
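A minimal sketch reproducing these rough numbers (hidden size, layer count, and sequence length are the approximate values quoted above):

```python
# Rough memory budget for a LLaMA-7B-like model in FP16.
GB = 1024**3

n_params    = 7e9
bytes_fp16  = 2
hidden_size = 4096
n_layers    = 32
max_seq_len = 4096

weights_bytes = n_params * bytes_fp16                      # ~14 GB of weights
# KV cache: one K and one V vector per layer, per token, in FP16.
kv_per_token = 2 * hidden_size * bytes_fp16 * n_layers     # bytes per token
kv_per_seq   = kv_per_token * max_seq_len                  # ~2 GB per batch element

for gpu_name, gpu_mem_gb in [("A10 (24 GB)", 24), ("A100 (80 GB)", 80)]:
    free_bytes = gpu_mem_gb * GB - weights_bytes
    max_batch = int(free_bytes // kv_per_seq)
    print(f"{gpu_name}: max batch ≈ {max_batch}")
# A10: ~5, A100: ~33 -- both far below a B* of a few hundred.
```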

Therefore, for practical use cases, decoding with a 7B model is severely memory-bandwidth bound. The memory footprint of the model plus KV cache limits the allowed maximum batch size, and that maximum batch size directly determines efficiency.

Practical Techniques

Below are practical techniques that can improve inference performance. Some are already used by Mistral, some are implemented elsewhere, and some are primarily software-level optimizations.

Grouped Query Attention

Grouped query attention reduces KV cache size by associating fewer key/value heads with the query heads. Instead of one key/value pair per query head, as in standard multi-head attention, a group of query heads shares a single key/value pair. Using one KV pair for every four query heads, for example, cuts KV cache memory by 4x while keeping FLOPs roughly the same. This approach is used in the larger LLaMA 2 variants and reduces memory overhead without significant accuracy loss.
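Below is a minimal PyTorch sketch of the idea (not Mistral's or LLaMA 2's actual implementation): with 32 query heads sharing 8 KV heads, only a quarter of the K/V tensors need to be cached.

```python
# Minimal grouped-query attention sketch: n_kv_heads < n_heads, so each
# K/V head is shared by a group of query heads. With 32 query heads and
# 8 KV heads, the KV cache shrinks by 4x.
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    # q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    n_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_heads // n_kv_heads
    # Repeat each KV head so every query head in a group sees the same K/V.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

batch, seq, head_dim = 2, 128, 128
q = torch.randn(batch, 32, seq, head_dim)   # 32 query heads
k = torch.randn(batch, 8, seq, head_dim)    # only 8 KV heads are cached
v = torch.randn(batch, 8, seq, head_dim)
out = grouped_query_attention(q, k, v)      # (2, 32, 128, 128)
```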

Quantization

Quantization transforms weights to lower bit widths such as int8 or int4. Int8 halves model size; int4 reduces it to one quarter. Quantization does not change the optimal batch size, which depends on hardware, but it increases effective throughput and reduces KV cache usage. In practice, speedups of around 1.5x are commonly observed versus the theoretical 2x for int8. Int8 generally introduces minimal accuracy loss; int4 may require recovery techniques such as QLoRA or task-specific fine-tuning. Quantization is a practical way to reduce serving cost when memory is constrained.
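For intuition, here is a toy weight-only int8 scheme with per-output-channel scales. It is a sketch of the general idea, not the exact method used by any particular serving stack, and a real kernel would dequantize inside the matmul so that the halved weight traffic actually saves bandwidth.

```python
# Illustrative weight-only int8 quantization with per-output-channel scales.
import torch

def quantize_int8(w: torch.Tensor):
    # w: (out_features, in_features); one scale per output channel.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    w_q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return w_q, scale

def int8_linear(x: torch.Tensor, w_q: torch.Tensor, scale: torch.Tensor):
    # Toy version: dequantize the whole matrix, then matmul. A production
    # kernel fuses the dequantization into the matmul instead.
    return x @ (w_q.to(x.dtype) * scale).t()

w = torch.randn(4096, 4096)
x = torch.randn(1, 4096)
w_q, scale = quantize_int8(w)
err = (int8_linear(x, w_q, scale) - x @ w.t()).abs().mean()
size_mib = w_q.element_size() * w_q.nelement() / 2**20
print(f"int8 weights: {size_mib:.0f} MiB (vs 32 MiB in FP16), mean abs error {err:.4f}")
```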

Paged Attention

Paged attention, proposed by the vLLM team, avoids allocating one large rectangular KV cache where one dimension is batch size and the other is max sequence length. That rectangular allocation wastes memory because many users send short prompts. Paged attention allocates memory in blocks inside GPU memory, filling free regions with blocks that hold a small number of tokens, e.g., 16 to 32. New sequences get blocks assigned and can grow as needed; completed sequences free their blocks. This block-based allocation increases the effective utilization of GPU memory and can significantly improve throughput in multi-user scenarios.
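The following toy allocator illustrates the bookkeeping (a sketch in the spirit of the approach, not vLLM's actual code): the KV cache is split into fixed-size blocks, each sequence holds a growing list of block indices, and finished sequences return their blocks to the pool.

```python
# Toy block allocator in the spirit of paged attention.
BLOCK_SIZE = 16  # tokens per block

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.seq_blocks = {}   # seq_id -> list of block indices

    def append_token(self, seq_id: str, position: int):
        blocks = self.seq_blocks.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:              # current blocks are full
            if not self.free_blocks:
                raise RuntimeError("out of KV-cache blocks; preempt or wait")
            blocks.append(self.free_blocks.pop())
        return blocks[-1], position % BLOCK_SIZE    # (block index, offset in block)

    def free(self, seq_id: str):
        self.free_blocks.extend(self.seq_blocks.pop(seq_id, []))

alloc = BlockAllocator(num_blocks=1024)
for pos in range(40):                    # a 40-token sequence needs ceil(40/16) = 3 blocks
    alloc.append_token("request-1", pos)
print(len(alloc.seq_blocks["request-1"]))  # 3
alloc.free("request-1")                  # blocks return to the pool for other users
```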

Sliding Window Attention

Sliding window attention limits the cache to the most recent W tokens, allowing a fixed-size cache implemented as a rolling (circular) buffer: once a sequence exceeds the window length, new tokens overwrite the oldest positions. Because position information is carried by positional embeddings, circular overwriting does not harm model behavior. Memory usage stays fixed regardless of sequence length, while stacked layers still let information propagate beyond the window.
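A minimal sketch of such a rolling-buffer cache for one layer and one sequence (shapes and class names are illustrative):

```python
# Rolling-buffer KV cache sketch: the cache holds only the last `window`
# positions, and new tokens overwrite the oldest slot via modular indexing.
import torch

class RollingKVCache:
    """Fixed-size KV cache for a single sequence and a single layer."""
    def __init__(self, window: int, n_kv_heads: int, head_dim: int, dtype=torch.float16):
        self.window = window
        self.k = torch.zeros(window, n_kv_heads, head_dim, dtype=dtype)
        self.v = torch.zeros(window, n_kv_heads, head_dim, dtype=dtype)

    def update(self, position: int, k_new: torch.Tensor, v_new: torch.Tensor):
        # Circular overwrite: once `position` exceeds the window, the oldest
        # entry is replaced. Positional embeddings keep token order intact.
        slot = position % self.window
        self.k[slot] = k_new
        self.v[slot] = v_new

    def valid_entries(self, position: int) -> int:
        # Number of cached tokens the current step can attend to.
        return min(position + 1, self.window)
```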

Continuous Batching

Continuous batching addresses the imbalance between prefill and decoding. A prefill step processes many tokens at once, and systems that run the full prefill of a long prompt in one step increase latency for every other request in the batch. Instead, the prefill can be chunked: only K prompt tokens are processed per step, which lets prefill and decode work be interleaved and improves latency and resource allocation.
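The toy scheduler below illustrates the idea of mixing a chunk of prefill work with ongoing decode steps; it is a sketch, not any particular engine's scheduler, and the chunk size is an assumed value.

```python
# Toy scheduler: each step serves every decoding request plus at most one
# chunk of pending prefill work, so long prompts do not stall other users.
from collections import deque

PREFILL_CHUNK = 256   # assumed max prompt tokens processed per step

def schedule_step(prefill_queue: deque, decode_batch: list):
    """Build the token budget for one forward pass."""
    work = [(seq_id, 1) for seq_id in decode_batch]    # each decoder adds 1 token
    if prefill_queue:
        seq_id, remaining = prefill_queue[0]
        chunk = min(remaining, PREFILL_CHUNK)          # only a slice of the prompt
        work.append((seq_id, chunk))
        if remaining - chunk == 0:
            prefill_queue.popleft()
            decode_batch.append(seq_id)                # prompt done, start decoding
        else:
            prefill_queue[0] = (seq_id, remaining - chunk)
    return work

prefills = deque([("long-prompt", 1000)])
decoding = ["chat-1", "chat-2"]
print(schedule_step(prefills, decoding))
# [('chat-1', 1), ('chat-2', 1), ('long-prompt', 256)] -- the 1,000-token prefill
# is spread over several steps instead of blocking the decoders.
```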

Code and Kernel-Level Optimizations

Software overheads, especially Python-level overhead, can be significant at these model scales. Techniques include using CUDA graphs to eliminate launch overhead, using inference runtimes like TensorRT that trace and optimize patterns, and creating custom fused kernels to reduce memory bandwidth by performing multiple operations without writing intermediate data back to memory. Libraries such as xFormers provide examples of low-overhead implementations. These optimizations reduce latency and improve throughput by lowering CPU/GPU coordination costs and memory movement.
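As one concrete example, the sketch below follows the documented PyTorch pattern for capturing a fixed-shape forward pass in a CUDA graph and replaying it; the linear layer stands in for a real decode step, and static shapes and a CUDA device are assumed.

```python
# Capture a fixed-shape forward pass in a CUDA graph to remove per-step
# kernel launch overhead. Assumes a CUDA device and static tensor shapes.
import torch

model = torch.nn.Linear(4096, 4096).half().cuda().eval()   # stand-in for a decode step
static_input = torch.zeros(8, 4096, dtype=torch.half, device="cuda")

# Warm up on a side stream before capture, as recommended by the PyTorch docs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        with torch.no_grad():
            model(static_input)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    with torch.no_grad():
        static_output = model(static_input)

# At serving time: copy new activations into static_input, then replay.
static_input.copy_(torch.randn_like(static_input))
graph.replay()                      # reruns the captured kernels without Python overhead
result = static_output.clone()
```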

Overall, the primary driver of performance is the ratio between the hardware's floating-point throughput and its memory bandwidth. That ratio defines the minimal batch size B* required to fully utilize compute without being memory-bound. Reaching B* is often difficult because device memory is limited, and open-source stacks sometimes add overhead from Python or other layers. Projects like FasterTransformer remove much of that overhead but can be harder to deploy.

Throughput, Latency, and Cost Tradeoffs

Throughput-latency plots are useful for evaluating setups. The x axis represents latency, the y axis throughput. The desirable region is upper-left: high throughput and low latency. For a fixed hardware configuration, increasing batch size moves the operating point from a memory-limited flat-latency region into a compute-limited region where latency grows with batch size. Upgrading hardware shifts the entire curve toward lower latency and higher throughput but at higher cost.

Software and model improvements matter most in the low-latency region, where they can increase throughput; they have less effect in very large-batch scenarios, where optimization is already easier.

In short tests using Mistral and LLaMA configurations with vLLM benchmarking scripts, deploying small open models on small instances is straightforward and yields good results without extensive engineering. Running a Mistral-7B model on an A10 can handle millions of requests at a modest daily cost. Changing model precision can significantly increase request throughput.
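For reference, a minimal offline-batch run with vLLM's Python API looks roughly like the following (model name and sampling parameters are illustrative; consult the vLLM documentation for current options).

```python
# Minimal vLLM offline-batch example; values are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-v0.1")          # fits on a single A10-class GPU
params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = ["Summarize the benefits of paged attention in two sentences."]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```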

Q&A Highlights

How to choose a processor for a given model?

Timothée Lacroix: I have mainly tested a range of GPUs rather than dedicated AI accelerators. For interactive single-user workloads, running locally on a laptop can be economical. For high-volume scenarios, A10 provides a good cost-performance point and is easy to deploy. Start with lower-cost hardware and scale up if throughput or latency requirements are not met. H100 can be more cost-effective for some workloads, but availability can be a limiting factor. I recommend trying candidate processors for a short benchmarking run to measure cost and performance for your specific use case.

Is Mojo recommended to reduce Python overhead? Any experience?

Timothée Lacroix: I have not tried Mojo. My first approach to reduce overhead was CUDA graphs, which can be tricky to debug but are effective. xFormers shows how CUDA graphs can remove overhead. In the future, tools like torch.compile might help reduce Python overhead, but their behavior on variable-length sequences is still evolving. For now, CUDA graphs are my preferred method to lower runtime overhead.

How to improve multilingual understanding when training data is mostly English?

Timothée Lacroix: Model capabilities come from data. To improve performance in a target language, obtain training data in that language. All LLMs benefit from multilingual corpora such as Wikipedia, which gives a baseline multilingual ability. Fine-tuning with non-English data can improve performance in a specific language, although there are tradeoffs where gains in one language may slightly affect others. In many cases, the overall benefit for the target language outweighs small regressions elsewhere.