Throughput (LLM Serving)

In LLM serving, throughput measures tokens produced per second, in two distinct senses. Output throughput (also called single-stream or per-request throughput) is how fast tokens stream to one user; this is the sense used in our benchmarks. System throughput is the total across all concurrent users, the sense that drives serving capacity and economics.

Output throughput vs. system throughput

When a benchmark reports output throughput, such as the 713 tok/s we measure on gpt-oss-120b, it means the rate at which tokens stream to a single request after the first token. This is the number that determines how fast a user receives a complete response. System throughput is different: engineering literature defines it as the number of output tokens per second an inference server generates across all users and requests, and benchmarking docs call it "TPS per system" in contrast to "TPS per user." The two move in opposite directions under load: as concurrency rises, system throughput climbs toward hardware saturation while each user's output throughput falls.
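To make the two definitions concrete, here is a minimal Python sketch that computes both from per-request timing records. The `RequestTrace` record and both function names are hypothetical, introduced only for illustration; real benchmarking tools may count tokens or choose the time window differently.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    first_token_time: float  # seconds: arrival of the first output token
    last_token_time: float   # seconds: arrival of the final output token
    output_tokens: int       # total output tokens, including the first


def output_throughput(trace: RequestTrace) -> float:
    """Per-request tok/s, counting only tokens streamed after the first one."""
    decode_seconds = trace.last_token_time - trace.first_token_time
    if trace.output_tokens < 2 or decode_seconds <= 0:
        raise ValueError("need at least two tokens and a positive decode time")
    return (trace.output_tokens - 1) / decode_seconds


def system_throughput(traces, window_start: float, window_end: float) -> float:
    """Output tokens from all requests divided by the wall-clock window."""
    if window_end <= window_start:
        raise ValueError("window must have positive length")
    return sum(t.output_tokens for t in traces) / (window_end - window_start)


# Four identical concurrent requests (made-up numbers).
traces = [RequestTrace(0.5, 2.5, 101) for _ in range(4)]
per_request = output_throughput(traces[0])     # 100 tokens over 2.0 s
overall = system_throughput(traces, 0.0, 2.5)  # 404 tokens over 2.5 s
```

With these invented numbers each user sees 50 tok/s while the server as a whole produces about 162 tok/s, which is why the two figures should never be compared directly.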

Why batching creates the trade-off

The decode phase of inference is bound by memory bandwidth: for every token generated, the hardware must stream the model weights from memory. Batching amortizes that cost: the weights are loaded once and many users' requests advance in the same pass. Anyscale's continuous-batching benchmarks showed up to 23x higher throughput compared to naive static batching.
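A back-of-the-envelope version of this argument, as a Python sketch. It assumes decode time is dominated entirely by reading the weights once per pass and ignores KV-cache traffic, compute, and scheduling overhead, so the result is an upper bound rather than a prediction. The 14 GB model size and 2 TB/s bandwidth are made-up round numbers, not measurements of any particular hardware.

```python
def decode_pass_seconds(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to stream all model weights from memory once."""
    return weight_bytes / bandwidth_bytes_per_s


def max_system_tok_per_s(weight_bytes: float,
                         bandwidth_bytes_per_s: float,
                         batch_size: int = 1) -> float:
    """Upper bound: one pass over the weights yields one token per batched request."""
    return batch_size / decode_pass_seconds(weight_bytes, bandwidth_bytes_per_s)


# Hypothetical figures: 7B parameters at 2 bytes each, 2 TB/s memory bandwidth.
WEIGHT_BYTES = 14e9
BANDWIDTH = 2e12

single = max_system_tok_per_s(WEIGHT_BYTES, BANDWIDTH, batch_size=1)
batched = max_system_tok_per_s(WEIGHT_BYTES, BANDWIDTH, batch_size=8)
```

Under these assumptions a single stream tops out near 143 tok/s, and a batch of 8 raises the ceiling eightfold because the same weight read serves eight requests.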

The catch: every request in the batch shares the same hardware, and each one adds its own KV-cache reads and compute to the pass, so a larger batch means slower tokens for each user. Databricks measured the trade-off concretely: at batch size 64 on an A100, throughput rose 14x while each request's latency rose 4x. System throughput is ultimately the provider's concern, because it determines their cost per token and capacity planning. As a user, you experience only your own request's output throughput; the provider's batching policy decides where that lands.
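The shape of this trade-off can be sketched with a toy cost model in Python. It assumes each decode pass costs a fixed weight-read time plus a small per-request increment; both constants are illustrative and are not fitted to the Databricks or Anyscale measurements.

```python
def step_seconds(batch_size: int,
                 weight_read_s: float = 0.008,
                 per_request_s: float = 0.0005) -> float:
    """Time for one decode pass: fixed weight read plus a per-request cost."""
    return weight_read_s + batch_size * per_request_s


def per_user_tok_per_s(batch_size: int) -> float:
    """Each request receives one token per pass."""
    return 1.0 / step_seconds(batch_size)


def system_tok_per_s(batch_size: int) -> float:
    """The server emits batch_size tokens per pass."""
    return batch_size / step_seconds(batch_size)


for b in (1, 8, 64):
    print(b, round(per_user_tok_per_s(b), 1), round(system_tok_per_s(b), 1))
```

In this toy model, going from batch size 1 to 64 raises system throughput roughly 13.6x while each user's token rate drops by roughly 4.7x. The numbers are invented, but the opposing directions are the point.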

What to look for in practice

When evaluating providers, anchor on the per-request numbers: output throughput and inter-token latency under realistic load. A headline system-throughput figure says nothing about the experience your requests will get. The architecture matters too: hardware that maintains high utilization at low batch sizes can deliver high per-request speed without depending on aggressive batching, which is the design premise of the dataflow architecture our platform runs on.
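One way to collect the per-request numbers is to record the arrival time of every streamed token and summarize the gaps. The Python sketch below assumes you already have such a list of timestamps from your own client; `summarize_stream` and its field names are hypothetical, and no specific provider API is implied.

```python
import statistics


def inter_token_latencies(arrival_times):
    """Gaps in seconds between consecutive token arrivals."""
    return [later - earlier
            for earlier, later in zip(arrival_times, arrival_times[1:])]


def summarize_stream(arrival_times):
    """Mean, median, and worst inter-token latency for one streamed response."""
    gaps = inter_token_latencies(arrival_times)
    if not gaps:
        raise ValueError("need at least two token arrival times")
    return {
        "mean_itl_s": statistics.fmean(gaps),
        "median_itl_s": statistics.median(gaps),
        "max_itl_s": max(gaps),
    }


# Made-up arrival times with one stall in the middle of the stream.
summary = summarize_stream([0.50, 0.52, 0.54, 0.60, 0.62])
```

Running this against many concurrent requests, rather than a single idle-server request, is what exposes how a provider's batching policy affects your traffic. The worst-case gap often tells you more than the mean.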
