Latency vs. Throughput

Latency and throughput pull in opposite directions in LLM serving: batching more concurrent requests raises total system throughput but slows each individual request. Providers choose where to sit on this curve, and that choice often determines the speed users experience more than the model itself does.

Where the trade-off comes from

At small batch sizes, token generation is bound by memory bandwidth: every decode step must stream the model weights from memory. Batching lets the hardware load the weights once and advance many requests together, so total tokens per second rises steeply with batch size. But each step must also read every request's KV cache and do more arithmetic, so steps take longer and each user's tokens arrive more slowly. Databricks measured it concretely on an A100: batch size 64 delivered 14x the throughput at 4x the per-request latency.
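The shape of this trade-off can be sketched with a toy cost model. The Python below assumes memory traffic is the only bottleneck and uses made-up inputs (a 7B-parameter model at 2 bytes per parameter, 0.5 GB of KV cache read per request per step, about 2 TB/s of bandwidth). The function names are hypothetical, and the output is illustrative rather than a reproduction of the Databricks measurement.

```python
def decode_step_seconds(
    batch_size,
    weight_bytes=14e9,             # assumed: 7B parameters at 2 bytes each
    kv_bytes_per_request=0.5e9,    # assumed: KV cache read per request per step
    bandwidth_bytes_per_s=2.0e12,  # assumed: roughly 2 TB/s of memory bandwidth
):
    """Time for one decode step if memory traffic is the only bottleneck."""
    # Weights are read once per step; KV cache traffic grows with the batch.
    bytes_moved = weight_bytes + batch_size * kv_bytes_per_request
    return bytes_moved / bandwidth_bytes_per_s


def tokens_per_second(batch_size):
    """Total system throughput: each request gains one token per step."""
    return batch_size / decode_step_seconds(batch_size)


for batch_size in (1, 8, 64):
    step_ms = decode_step_seconds(batch_size) * 1e3
    total = tokens_per_second(batch_size)
    print(f"batch={batch_size:>2}  step={step_ms:6.2f} ms  total={total:8.0f} tok/s")
```

Each request receives one token per step, so the step time is also the per-request inter-token latency. In this model it grows slowly with batch size while total throughput grows much faster, which is the pattern described above.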

The trade-off has a hard edge: once batches grow large enough that decode becomes compute-bound, throughput stops improving while latency keeps degrading. In Databricks' words, every doubling of batch size beyond that point just increases latency. Research systems like Sarathi-Serve (OSDI 2024) exist specifically to manage this curve, because naive scheduling lets one user's prefill stall every other user's generation.
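A roofline-style sketch shows where that edge sits. It assumes a step takes whichever is longer, streaming the weights or doing the arithmetic, and it ignores KV cache traffic, so it fits short contexts best. The hardware figures are rough A100-class assumptions and the function names are hypothetical.

```python
def step_seconds(
    batch_size,
    params=7e9,          # assumed model size
    bytes_per_param=2,   # assumed 16-bit weights
    bandwidth=2.0e12,    # assumed memory bandwidth, bytes per second
    peak_flops=312e12,   # assumed peak 16-bit throughput, FLOP per second
):
    """Decode step time as the slower of memory streaming and arithmetic."""
    memory_time = params * bytes_per_param / bandwidth
    # About 2 FLOPs per parameter per generated token.
    compute_time = batch_size * 2 * params / peak_flops
    return max(memory_time, compute_time)


def critical_batch_size(bytes_per_param=2, bandwidth=2.0e12, peak_flops=312e12):
    """Batch size at which memory_time equals compute_time."""
    return bytes_per_param * peak_flops / (2 * bandwidth)


print("critical batch size:", critical_batch_size())
for batch_size in (64, 128, 256, 512):
    seconds = step_seconds(batch_size)
    print(f"batch={batch_size:>3}  step={seconds * 1e3:6.2f} ms  "
          f"total={batch_size / seconds:8.0f} tok/s")
```

Below the critical batch size the step time is flat and throughput scales with the batch. Above it, doubling the batch doubles the step time and total throughput stays where it was.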

What this means when choosing a provider

Two providers running identical models on identical GPUs can deliver completely different experiences depending on how aggressively they batch. High utilization is good for the provider's economics; low latency is good for your users. Better scheduling (continuous batching, chunked prefill) moves the frontier outward. Different hardware changes its shape entirely: architectures that stay efficient at low batch sizes can offer high per-request speed without sacrificing as much capacity, which is the premise of the dataflow architecture behind our platform.
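To make chunked prefill concrete, here is a toy sketch of the idea, not the scheduler of any particular serving system. It assumes each iteration has a fixed token budget, that every in-flight request spends one token of it on decode, and that a newly arrived prompt is prefilled with whatever budget is left. The function name and numbers are hypothetical.

```python
def schedule_chunked_prefill(prompt_tokens, active_decodes, token_budget):
    """Return one (decode_tokens, prefill_tokens) pair per iteration
    until the new prompt has been fully prefilled."""
    if active_decodes >= token_budget:
        raise ValueError("token budget leaves no room for prefill")
    steps = []
    remaining = prompt_tokens
    while remaining > 0:
        # Running requests always get their decode token first...
        decode_tokens = active_decodes
        # ...and the prompt is processed in chunks that fit the leftover budget.
        chunk = min(remaining, token_budget - decode_tokens)
        steps.append((decode_tokens, chunk))
        remaining -= chunk
    return steps


for decode_tokens, prefill_tokens in schedule_chunked_prefill(
    prompt_tokens=1000, active_decodes=8, token_budget=256
):
    print(f"decode={decode_tokens}  prefill={prefill_tokens}")
```

Every iteration still advances the eight running requests, so the long prompt is spread over several steps instead of stalling them for one large prefill.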

Practical advice: benchmark providers under your real workload and concurrency, not just single requests at midnight. Watch how stable inter-token latency is across the day: it reveals how oversubscribed the capacity actually is.
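One way to track inter-token latency is to record the arrival time of each streamed token and summarize the gaps. The sketch below assumes you already collect those timestamps in seconds from your own client. The helper names and the sample data are hypothetical, and the p99 is a simple nearest-rank estimate.

```python
import statistics


def inter_token_latencies(arrival_times):
    """Gaps between consecutive token arrival timestamps, in seconds."""
    return [later - earlier for earlier, later in zip(arrival_times, arrival_times[1:])]


def summarize(gaps):
    """Median, nearest-rank p99, and worst-case gap."""
    ordered = sorted(gaps)
    p99_index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return {
        "median": statistics.median(ordered),
        "p99": ordered[p99_index],
        "max": ordered[-1],
    }


# Hypothetical timestamps for one streamed response.
arrivals = [0.00, 0.03, 0.06, 0.09, 0.25, 0.28]
print(summarize(inter_token_latencies(arrivals)))
```

Running the same measurement at several times of day and at your expected concurrency, then comparing the median with the tail, shows whether latency holds up when the provider is busy.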
