The anatomy of a response

Every LLM response has two phases. First, the model processes your entire prompt in one parallel pass (prefill) - this determines time to first token (TTFT). Then it generates output one token at a time (decode) - the pace of this phase is the inter-token latency, usually reported as its inverse, output tokens per second. Total response time is approximately TTFT plus the number of generated tokens multiplied by the time per token.
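The approximation above can be written as a short calculation. This is a minimal sketch: the function names and the example figures (0.5 s TTFT, 200 output tokens, 20 ms per token) are illustrative assumptions, not measurements.

```python
def estimate_response_time(ttft_s: float, output_tokens: int, time_per_token_s: float) -> float:
    """Approximate total response time: TTFT + generated tokens x time per token."""
    return ttft_s + output_tokens * time_per_token_s


def tokens_per_second(time_per_token_s: float) -> float:
    """Output speed is the inverse of the time per token (inter-token latency)."""
    return 1.0 / time_per_token_s


# Hypothetical request: 0.5 s TTFT, 200 output tokens, 20 ms per token.
total_s = estimate_response_time(ttft_s=0.5, output_tokens=200, time_per_token_s=0.020)
speed = tokens_per_second(0.020)
print(f"{total_s:.2f} s total at {speed:.0f} tok/s")
```

With these assumed figures, the decode phase accounts for about 4 of the roughly 4.5 seconds, which is why output length shifts the balance between the two phases so strongly.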

The phases stress hardware differently: prefill is typically compute-bound, while decode is typically bound by memory bandwidth at common batch sizes - for each new token, the hardware must read the model weights from memory again, so the token rate is limited by how fast those bytes can be moved rather than by arithmetic throughput. This is why the same hardware can have excellent prefill performance and mediocre generation speed.
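A rough way to see the bandwidth limit is to divide memory bandwidth by the bytes of weights read per token. The sketch below is a back-of-the-envelope ceiling only: the function name and all figures are hypothetical, and it ignores KV-cache reads, batching (which shares one weight read across several sequences), and multi-device parallelism.

```python
def decode_ceiling_tokens_per_s(weight_bytes_read_per_token: float,
                                memory_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-sequence decode speed when weight reads are the bottleneck."""
    return memory_bandwidth_bytes_per_s / weight_bytes_read_per_token


# Hypothetical figures: 10 billion parameters read per token at 2 bytes each,
# on hardware with 2e12 bytes/s of memory bandwidth.
weight_bytes = 10e9 * 2
ceiling = decode_ceiling_tokens_per_s(weight_bytes, 2e12)
print(f"ceiling: {ceiling:.0f} tok/s")
```

Under these assumptions the ceiling is 100 tok/s regardless of how much compute the hardware has, which is the sense in which decode is bandwidth-bound.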

Which metric matters for which workload

For interactive chat, TTFT dominates perception - users notice the silent gap before output starts far more than the streaming speed. For voice agents, both matter and budgets are tight. For agentic workloads - coding agents, tool-calling pipelines, autonomous workflows - output speed dominates: the agent must receive every token of every step before it can act, so generation speed compounds across the whole chain.
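To illustrate how generation speed compounds across an agent chain, the sketch below sums the per-step response time over a sequence of steps. The step count, token counts, and speeds are hypothetical, and tool execution time between steps is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Step:
    output_tokens: int  # tokens the model generates in this step


def chain_time_s(steps: list[Step], ttft_s: float, tokens_per_s: float) -> float:
    """Total model time for sequential steps: each step waits for TTFT plus full generation."""
    return sum(ttft_s + step.output_tokens / tokens_per_s for step in steps)


# Hypothetical agent run: 10 sequential steps of 500 output tokens each.
steps = [Step(output_tokens=500) for _ in range(10)]
slow = chain_time_s(steps, ttft_s=0.4, tokens_per_s=50)
fast = chain_time_s(steps, ttft_s=0.4, tokens_per_s=500)
print(f"50 tok/s: {slow:.0f} s, 500 tok/s: {fast:.0f} s")
```

In this made-up run, TTFT contributes 4 seconds either way, while generation time drops from 100 seconds to 10 seconds at the higher output speed.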

On our Munich infrastructure we publish all three numbers per model: for gpt-oss-120b, 388 ms TTFT, 713 tok/s output throughput, and 1.789 s end-to-end for a 10,000-token input / 1,000-token output request (server-side p50).
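As a consistency check, the three published figures can be related through the approximation from the first section. The helper name below is an assumption for illustration. Medians of separately measured metrics do not have to add up exactly, so a small gap is expected.

```python
def implied_decode_rate(e2e_s: float, ttft_s: float, output_tokens: int) -> float:
    """Decode speed implied by end-to-end time once TTFT is subtracted."""
    return output_tokens / (e2e_s - ttft_s)


# Published gpt-oss-120b figures from the paragraph above.
rate = implied_decode_rate(e2e_s=1.789, ttft_s=0.388, output_tokens=1000)
print(f"implied decode rate: {rate:.1f} tok/s")
```

The implied rate comes out just under 714 tok/s, in line with the published 713 tok/s output throughput.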

Measuring it honestly

Speed numbers are only comparable when the workload is stated: prompt length changes TTFT, output length changes the TTFT/generation balance, and concurrency affects both, because simultaneous requests share the same hardware. End-to-end latency additionally includes contributors beyond the model - network round-trips (distance to the datacenter matters), gateway overhead, and queue time on shared capacity - so client-measured numbers always differ from server-side ones. Independent benchmarks like Artificial Analysis publish their exact workloads (1k and 10k input-token tests, measured 8 times daily, reported as 72-hour medians) - the standard our own published benchmarks follow.
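On the client side, these metrics can be derived from timestamps recorded while a response streams in. The sketch below assumes you have captured the send time and one arrival time per token (for example with `time.perf_counter()`); the function name, dictionary keys, and sample timings are hypothetical. Real streaming APIs may deliver several tokens per chunk, in which case per-chunk gaps overstate the per-token latency.

```python
from statistics import median


def stream_metrics(request_sent_s: float, token_arrival_s: list[float]) -> dict[str, float]:
    """Derive TTFT, mean inter-token latency, and end-to-end time from arrival timestamps."""
    if not token_arrival_s:
        raise ValueError("no tokens received")
    gaps = [later - earlier for earlier, later in zip(token_arrival_s, token_arrival_s[1:])]
    return {
        "ttft_s": token_arrival_s[0] - request_sent_s,
        "mean_itl_s": sum(gaps) / len(gaps) if gaps else 0.0,
        "e2e_s": token_arrival_s[-1] - request_sent_s,
    }


# Three hypothetical runs of the same workload, each sent at t = 0.
runs = [
    stream_metrics(0.0, [0.30, 0.32, 0.34]),
    stream_metrics(0.0, [0.50, 0.52, 0.54]),
    stream_metrics(0.0, [0.40, 0.42, 0.44]),
]
p50_ttft = median(run["ttft_s"] for run in runs)
print(f"p50 TTFT: {p50_ttft:.2f} s")
```

Reporting a median over repeated runs of a stated workload, rather than a single best run, keeps the figure comparable with server-side and third-party numbers.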

Related terms

TTFT (Time to First Token)

How long a user waits between sending a request and seeing the first token of the response.

Inter-Token Latency (ITL)

The average time gap between consecutive tokens during generation - also called TPOT.

Tokens per Second

The standard unit for LLM generation speed - and why the same number can mean two different things.

Prefill vs. Decode

The two phases of LLM inference - parallel prompt processing vs. token-by-token generation.

Parameters

A model's learned weights - the rough measure of its size and capacity, and the direct driver of its memory, speed, and cost.
