Output throughput vs. system throughput

When a benchmark reports output throughput - like the 713 tok/s we measure on gpt-oss-120b - it means the rate at which tokens stream to a single request after the first token arrives. This is the number that determines how fast a user receives a complete response. System throughput is different: engineering literature defines it as the number of output tokens per second an inference server generates across all users and requests; benchmarking docs call it "TPS per system" in contrast to "TPS per user." The two move in opposite directions under load: as concurrency rises, system throughput climbs toward hardware saturation while each user's output throughput falls.
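To make the two definitions concrete, here is a minimal Python sketch that computes both from per-token arrival times. The RequestTrace type, the function names, and the timing values are illustrative assumptions, not part of any benchmark tool or the source's measurement setup.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Arrival time (in seconds) of each output token for one request."""
    token_times: list[float]


def output_throughput(trace: RequestTrace) -> float:
    """Tokens per second for a single request, measured after the first token."""
    times = trace.token_times
    if len(times) < 2:
        raise ValueError("need at least two tokens to measure a rate")
    return (len(times) - 1) / (times[-1] - times[0])


def system_throughput(traces: list[RequestTrace]) -> float:
    """Output tokens per second summed over all requests in the window."""
    total_tokens = sum(len(t.token_times) for t in traces)
    start = min(t.token_times[0] for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return total_tokens / (end - start)


# Two concurrent requests, each receiving one token every 10 ms (made-up data).
a = RequestTrace([0.50 + 0.01 * i for i in range(101)])
b = RequestTrace([0.50 + 0.01 * i for i in range(101)])

print(f"per-request: {output_throughput(a):.1f} tok/s")
print(f"system:      {system_throughput([a, b]):.1f} tok/s")
```

In this toy case the system figure is roughly the sum of the per-request figures, because both requests overlap completely. It shows why a system number cannot be read as the speed any one user sees.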

Why batching creates the trade-off

The decode phase of inference is bound by memory bandwidth: for every token generated, the hardware must stream the model weights from memory. Batching amortizes that cost - load the weights once, advance many users' requests in the same pass. Anyscale's continuous-batching benchmarks showed up to 23x higher throughput compared to naive static batching.
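The amortization argument can be sketched with a toy cost model. It assumes each decode step pays a fixed cost to stream the weights plus a small per-request cost. Every number below (weight size, bandwidth, per-request cost) is a hypothetical placeholder chosen for illustration, not a measurement of any real system.

```python
def decode_step_seconds(batch_size, weight_bytes, bandwidth_bytes_per_s, per_seq_seconds):
    """Toy model: one decode step = stream the weights once + per-request work."""
    weight_time = weight_bytes / bandwidth_bytes_per_s  # paid once per step
    return weight_time + batch_size * per_seq_seconds   # grows with the batch


def throughputs(batch_size, **hw):
    """Return (per-user tok/s, system tok/s) under the toy model."""
    step = decode_step_seconds(batch_size, **hw)
    per_user = 1.0 / step        # each request gets one token per step
    system = batch_size / step   # the server emits batch_size tokens per step
    return per_user, system


# Hypothetical hardware and model parameters (assumptions, not measurements).
HW = dict(weight_bytes=14e9, bandwidth_bytes_per_s=1.5e12, per_seq_seconds=0.0005)

for b in (1, 8, 64):
    per_user, system = throughputs(b, **HW)
    print(f"batch={b:3d}  per-user={per_user:7.1f} tok/s  system={system:8.1f} tok/s")
```

Under these assumptions the system rate rises with batch size while the per-user rate falls. Real servers deviate from this model once compute or KV-cache traffic dominates.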

The catch: every request in the batch shares the same memory bandwidth and compute, so each decode step takes longer as the batch grows, and each user's tokens arrive more slowly. Databricks measured the trade-off concretely: at batch size 64 on an A100, throughput rose 14x - while each request's latency rose 4x. System throughput is ultimately the provider's concern - it determines their cost per token and capacity planning. As a user, you experience only your own request's output throughput; the provider's batching policy decides where that lands.
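A back-of-the-envelope reading of the Databricks figures, written as a few lines of Python. It assumes the 4x latency increase applies roughly uniformly to token generation, which the cited numbers do not guarantee. The variable names and the derived "fraction of ideal" figure are illustrative, not taken from the Databricks write-up.

```python
batch_size = 64
latency_multiplier = 4.0      # reported: each request takes about 4x as long
throughput_multiplier = 14.0  # reported: system throughput rose about 14x

# Assumption: if a request takes 4x as long, its user sees about 1/4 the speed.
per_request_speed = 1.0 / latency_multiplier

# Upper bound on aggregate gain if all 64 requests run at that reduced speed.
ideal_aggregate = batch_size / latency_multiplier

# How close the reported system gain comes to that bound.
fraction_of_ideal = throughput_multiplier / ideal_aggregate

print(f"per-request speed: {per_request_speed:.2f}x of the unbatched rate")
print(f"aggregate bound:   {ideal_aggregate:.0f}x, reported {throughput_multiplier:.0f}x")
print(f"fraction of bound: {fraction_of_ideal:.3f}")
```

Read this way, the provider gains an order of magnitude in capacity while each user gives up most of their individual speed.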

What to look for in practice

When evaluating providers, anchor on the per-request numbers: output throughput and inter-token latency under realistic load - a headline system-throughput figure says nothing about the experience your requests will get. The architecture matters too: hardware that maintains high utilization at low batch sizes can deliver high per-request speed without depending on aggressive batching - which is the design premise of the dataflow architecture our platform runs on.
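One way to capture the per-request numbers during an evaluation is to record the arrival time of each streamed token and summarize the gaps. The sketch below assumes you already have such timestamps. The function names and the sample data, including the single 200 ms stall, are hypothetical.

```python
import statistics
from itertools import accumulate


def inter_token_latencies(token_times):
    """Gaps in seconds between consecutive output tokens of one request."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]


def summarize(token_times):
    """Per-request output throughput plus inter-token latency statistics."""
    gaps = inter_token_latencies(token_times)
    if not gaps:
        raise ValueError("need at least two tokens")
    ordered = sorted(gaps)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "output_tok_per_s": len(gaps) / (token_times[-1] - token_times[0]),
        "itl_median_s": statistics.median(gaps),
        "itl_p95_s": ordered[p95_index],
        "itl_max_s": ordered[-1],
    }


# Made-up trace: steady 10 ms gaps with one 200 ms stall in the middle.
sample_gaps = [0.01] * 50 + [0.2] + [0.01] * 49
sample_times = [0.0] + list(accumulate(sample_gaps))

for name, value in summarize(sample_times).items():
    print(f"{name}: {value:.4f}")
```

Reporting the median and tail gaps alongside the average rate helps separate a consistently slow stream from a fast one with occasional stalls. Under load, the two feel different to a user.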

Related terms

Tokens per Second

The standard unit for LLM generation speed - and why the same number can mean two different things.

Latency vs. Throughput

The fundamental serving trade-off: total system output vs. each user's speed.

Prefill vs. Decode

The two phases of LLM inference - parallel prompt processing vs. token-by-token generation.
