Where the trade-off comes from

Token generation is bound by memory bandwidth: every decode step must stream the model weights from memory. Batching lets the hardware load weights once and advance many requests together, so total tokens per second rises steeply with batch size. But all requests in the batch share the same bandwidth, so each user's tokens arrive more slowly. Databricks measured it concretely on an A100: batch size 64 delivered 14x the throughput at 4x the per-request latency.
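The shape of this trade-off can be sketched with a toy cost model. The sketch below assumes each decode step pays one fixed cost to stream the weights plus a small cost per request in the batch (for example, reading that request's KV cache). The function names and the millisecond constants are illustrative assumptions, not measurements from Databricks or any provider.

```python
# Toy model of one decode step. All constants are made-up illustrations.

def decode_step_ms(batch_size, weight_load_ms=10.0, per_request_ms=0.5):
    """Time for one decode step: weights are streamed once, and each
    request in the batch adds its own (assumed) memory traffic."""
    return weight_load_ms + batch_size * per_request_ms


def per_request_tokens_per_s(batch_size):
    """Each request receives one token per decode step."""
    return 1000.0 / decode_step_ms(batch_size)


def system_tokens_per_s(batch_size):
    """The whole batch advances together, one token per request per step."""
    return batch_size * per_request_tokens_per_s(batch_size)


if __name__ == "__main__":
    for b in (1, 8, 64):
        print(
            f"batch={b:3d}  step={decode_step_ms(b):6.1f} ms  "
            f"per-request={per_request_tokens_per_s(b):6.1f} tok/s  "
            f"system={system_tokens_per_s(b):7.1f} tok/s"
        )
```

In this model, system throughput climbs with batch size while every individual request slows down, which is the same direction as the measured result above. Real curves depend on the model, the hardware, and the serving stack.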

The trade-off has a hard edge: once batches grow large enough that decode becomes compute-bound, throughput stops improving while latency keeps degrading - in Databricks' words, every doubling of batch size beyond that point just increases latency. Research systems like Sarathi-Serve (OSDI 2024) exist specifically to manage this curve, because naive scheduling lets one user's prefill stall every other user's generation.
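The hard edge can be illustrated with a simplified roofline-style sketch: a decode step takes whichever is longer, streaming the weights (fixed) or doing the arithmetic (grows with batch size). The constants below are hypothetical and chosen only to make the knee visible; they do not describe any particular GPU.

```python
# Simplified roofline-style model of the compute-bound knee.
# memory_ms and compute_ms_per_request are illustrative assumptions.

def step_ms(batch_size, memory_ms=10.0, compute_ms_per_request=0.0625):
    """A decode step is limited by the slower of memory and compute."""
    return max(memory_ms, batch_size * compute_ms_per_request)


def system_tokens_per_s(batch_size):
    """One token per request per step, summed over the batch."""
    return batch_size * 1000.0 / step_ms(batch_size)


def knee_batch_size(memory_ms=10.0, compute_ms_per_request=0.0625):
    """Batch size at which compute time catches up with memory time."""
    return memory_ms / compute_ms_per_request


if __name__ == "__main__":
    print("knee at batch size", knee_batch_size())
    for b in (40, 80, 160, 320, 640):
        print(
            f"batch={b:4d}  step={step_ms(b):5.1f} ms  "
            f"system={system_tokens_per_s(b):8.1f} tok/s"
        )
```

Below the knee, doubling the batch doubles system throughput at no latency cost in this idealized model. Above it, throughput is flat and each doubling of the batch doubles the step time.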

What this means when choosing a provider

Two providers running identical models on identical GPUs can deliver very different experiences depending on how aggressively they batch. High utilization is good for the provider's economics; low latency is good for your users. Better scheduling (continuous batching, chunked prefill) moves the frontier outward. Different hardware changes its shape entirely: architectures that stay efficient at low batch sizes can offer high per-request speed without sacrificing as much capacity, which is the premise of the dataflow architecture behind our platform.
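To make the chunked prefill idea concrete, here is a deliberately simplified sketch. It assumes a fixed per-iteration token budget, gives every ongoing request its one decode token first, and spends whatever budget is left on a slice of the new prompt, so generation never pauses for a whole prefill. The function name, the budget, and the constant number of decoding requests are assumptions for illustration; this is not the algorithm of Sarathi-Serve or of any specific serving engine.

```python
# Simplified chunked-prefill planner. Hypothetical names and numbers.

def plan_iterations(prompt_tokens, decode_requests, token_budget):
    """Split one new prompt across iterations so that each iteration
    still serves one decode token per ongoing request."""
    if token_budget <= decode_requests:
        raise ValueError("token budget leaves no room for prefill")
    chunk_size = token_budget - decode_requests
    plan = []
    remaining = prompt_tokens
    while remaining > 0:
        take = min(chunk_size, remaining)
        plan.append({"decode_tokens": decode_requests, "prefill_tokens": take})
        remaining -= take
    return plan


if __name__ == "__main__":
    # A 1000-token prompt arrives while 8 requests are generating.
    for i, step in enumerate(plan_iterations(1000, 8, 264)):
        print(i, step)
```

Without chunking, the same 1000-token prompt would occupy a full iteration on its own while the eight generating requests wait. With it, they keep receiving tokens at the cost of a slightly longer time to first token for the new request.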

Practical advice: benchmark providers under your real workload and concurrency, not just single requests at midnight. Watch inter-token latency stability across the day - it reveals how oversubscribed the capacity actually is.
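One way to act on this advice is to record the arrival time of every streamed token and summarize the gaps. The sketch below assumes you already have a list of arrival timestamps in seconds for one response (for example, from calling `time.perf_counter()` each time your client receives a token). The helper names and the nearest-rank p99 are choices made for this illustration, not part of any provider's tooling.

```python
# Summarize inter-token latency (ITL) from token arrival timestamps.
import math
import statistics


def inter_token_latencies(arrival_times_s):
    """Gaps between consecutive token arrival times, in seconds."""
    return [b - a for a, b in zip(arrival_times_s, arrival_times_s[1:])]


def itl_summary(arrival_times_s):
    """Mean, median, p99 (nearest-rank) and max of the inter-token gaps."""
    gaps = inter_token_latencies(arrival_times_s)
    if not gaps:
        raise ValueError("need at least two token timestamps")
    ordered = sorted(gaps)
    p99_index = math.ceil(0.99 * len(ordered)) - 1
    return {
        "mean_s": statistics.fmean(gaps),
        "p50_s": statistics.median(gaps),
        "p99_s": ordered[p99_index],
        "max_s": ordered[-1],
    }


if __name__ == "__main__":
    # Hypothetical timestamps: steady output, then one stall.
    print(itl_summary([0.0, 0.5, 1.0, 2.0]))
```

Running the same measurement at your expected concurrency and at several times of day, then comparing the p99 and max values rather than only the mean, makes stalls caused by oversubscribed capacity easier to spot.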

Sources

Related terms

Throughput (LLM Serving)

Tokens per second in two senses: per-request output throughput vs. system-wide capacity - and how batching trades one against the other.

Inter-Token Latency (ITL)

The average time gap between consecutive tokens during generation - also called TPOT.

Dataflow Architecture

The execution model where data streams through operations as a pipeline - eliminating the kernel-by-kernel round trips of GPU execution.

