Inference Speed

Inference speed describes how fast an LLM system turns a request into a complete response. It is not a single number: it decomposes into time to first token (TTFT), inter-token latency (ITL), and end-to-end latency - and which metric matters depends on whether a human or a machine consumes the output.

The anatomy of a response

Every LLM response has two phases. First the model processes your entire prompt in one parallel pass (prefill) - this largely determines time to first token. Then it generates output one token at a time (decode) - the gap between consecutive tokens is the inter-token latency, usually reported as its inverse, output tokens per second. Total response time is approximately TTFT plus the number of generated tokens multiplied by the time per token.
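
The approximation above can be sketched as a small helper. This is a minimal illustration, not a benchmark tool: the function name and the example numbers are made up for the sketch, and it assumes a constant decode speed with no queueing or network time.

```python
def estimate_total_latency(ttft_s: float, output_tokens: int, tokens_per_second: float) -> float:
    """Approximate response time as TTFT + output_tokens * time per token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    time_per_token_s = 1.0 / tokens_per_second  # inter-token latency
    return ttft_s + output_tokens * time_per_token_s


# Illustrative numbers only: 0.5 s to first token, 200 output tokens at 50 tok/s.
example_latency_s = estimate_total_latency(ttft_s=0.5, output_tokens=200, tokens_per_second=50.0)
print(f"estimated end-to-end latency: {example_latency_s:.2f} s")
```

With these inputs the decode phase contributes 4 s and prefill only 0.5 s, which is why output length shifts the balance between the two metrics.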

The phases stress hardware differently: prefill is typically compute-bound, while decode is typically bound by memory bandwidth at common batch sizes - for each new token, the hardware must read the model weights from memory again, so the compute units spend much of their time waiting on data. This is why the same hardware can have excellent prefill performance and mediocre generation speed.
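
A back-of-the-envelope sketch of the memory-bandwidth limit on decode follows. It assumes a dense model at batch size 1 in which every weight is read once per generated token, and it ignores KV-cache traffic, batching, and sparse (mixture-of-experts) architectures. The function name and the hardware numbers are hypothetical and chosen only to keep the arithmetic easy.

```python
def decode_tokens_per_second_upper_bound(weight_bytes: float, memory_bandwidth_bytes_per_s: float) -> float:
    """Rough ceiling on single-stream decode speed when weight reads dominate.

    Each generated token requires streaming the weights from memory once,
    so tokens/s cannot exceed bandwidth / bytes read per token.
    """
    if weight_bytes <= 0 or memory_bandwidth_bytes_per_s <= 0:
        raise ValueError("inputs must be positive")
    return memory_bandwidth_bytes_per_s / weight_bytes


# Hypothetical example: 100 GB of weights on an accelerator with 2 TB/s of bandwidth.
ceiling = decode_tokens_per_second_upper_bound(weight_bytes=100e9, memory_bandwidth_bytes_per_s=2e12)
print(f"decode ceiling: about {ceiling:.0f} tokens/s per stream")
```

Under these assumptions the ceiling is about 20 tokens per second regardless of how much compute the chip has, which is the sense in which decode is bandwidth-bound.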

Which metric matters for which workload

For interactive chat, TTFT dominates perception - users notice the silent gap before output starts far more than the streaming speed. For voice agents, both matter and budgets are tight. For agentic workloads - coding agents, tool-calling pipelines, autonomous workflows - output speed dominates: the agent must receive every token of every step before it can act, so generation speed compounds across the whole chain.
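
To make the compounding concrete, here is a small sketch that sums generation time over a sequential agent chain. The step sizes and speeds are invented for illustration, and the model deliberately ignores tool execution time and growing prompt length, both of which add to real-world totals.

```python
from typing import Sequence


def chain_generation_time(step_output_tokens: Sequence[int], ttft_s: float, tokens_per_second: float) -> float:
    """Total model time for steps that run strictly one after another.

    Each step pays its own TTFT plus the time to generate all of its tokens,
    because the agent cannot act on a partial step.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return sum(ttft_s + tokens / tokens_per_second for tokens in step_output_tokens)


# Hypothetical agent run: 10 sequential steps of 500 output tokens each.
steps = [500] * 10
slow_s = chain_generation_time(steps, ttft_s=0.4, tokens_per_second=50.0)
fast_s = chain_generation_time(steps, ttft_s=0.4, tokens_per_second=500.0)
print(f"50 tok/s: about {slow_s:.0f} s, 500 tok/s: about {fast_s:.0f} s")
```

In this example TTFT is identical in both runs, yet the chain takes roughly 104 s at the lower speed and roughly 14 s at the higher one, so the difference comes entirely from output speed.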

On our Munich infrastructure we publish all three numbers per model: for gpt-oss-120b, 388 ms TTFT, 713 tok/s output throughput, and 1.789 s end-to-end for a 10,000-token input / 1,000-token output request (server-side p50).
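
As a sanity check, the three published figures can be compared against the approximation TTFT + output tokens / throughput. The sketch below uses only the numbers quoted above; the variable names are illustrative. Since each figure is a separately measured median, a small gap between the predicted and published totals is expected.

```python
# Published server-side p50 figures for gpt-oss-120b quoted in the text.
ttft_s = 0.388
output_tokens_per_s = 713.0
output_tokens = 1_000
published_e2e_s = 1.789

decode_s = output_tokens / output_tokens_per_s  # time spent generating output
predicted_e2e_s = ttft_s + decode_s             # TTFT + tokens * time per token

prefill_share = ttft_s / predicted_e2e_s
decode_share = decode_s / predicted_e2e_s

print(f"predicted end-to-end: {predicted_e2e_s:.3f} s (published: {published_e2e_s:.3f} s)")
print(f"share before first token: {prefill_share:.0%}, share in decode: {decode_share:.0%}")
```

The prediction lands within a few milliseconds of the published value, and it shows that even with a 10,000-token prompt roughly four fifths of this request's time is spent in decode.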

Measuring it honestly

Speed numbers are only comparable when the workload is stated: prompt length changes TTFT, output length changes the TTFT/generation balance, and concurrency changes everything. End-to-end latency additionally includes contributors beyond the model - network round-trips (distance to the datacenter matters), gateway overhead, and queue time on shared capacity - so client-measured numbers always differ from server-side ones. Independent benchmarks like Artificial Analysis publish their exact workloads (1k and 10k input-token tests, measured 8 times daily, reported as 72-hour medians) - the standard our own published benchmarks follow.
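
A minimal client-side measurement sketch is shown below. It times any iterable of streamed chunks and reports medians over several runs. `fake_stream` is a hypothetical stand-in for a real streaming client, and the sketch treats each chunk as one token, which real APIs do not guarantee. Numbers measured this way include network and gateway time, so they will not match server-side figures.

```python
import statistics
import time
from typing import Dict, Iterable, Iterator


def measure_stream(chunks: Iterable[str]) -> Dict[str, float]:
    """Time a streamed response: TTFT, mean inter-chunk gap, and end-to-end."""
    start = time.perf_counter()
    arrivals = [time.perf_counter() - start for _ in chunks]  # one timestamp per chunk
    if not arrivals:
        raise ValueError("stream produced no chunks")
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return {
        "ttft_s": arrivals[0],
        "mean_itl_s": statistics.mean(gaps) if gaps else 0.0,
        "e2e_s": arrivals[-1],
    }


def fake_stream(n_chunks: int, first_delay_s: float, gap_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming API: sleeps instead of calling a model."""
    time.sleep(first_delay_s)  # simulated prefill, queueing, and network time
    yield "tok"
    for _ in range(n_chunks - 1):
        time.sleep(gap_s)      # simulated decode step
        yield "tok"


# Repeat the same workload several times and report medians, not single runs.
runs = [measure_stream(fake_stream(n_chunks=5, first_delay_s=0.05, gap_s=0.01)) for _ in range(3)]
median_ttft_s = statistics.median(run["ttft_s"] for run in runs)
median_e2e_s = statistics.median(run["e2e_s"] for run in runs)
print(f"median TTFT: {median_ttft_s * 1000:.0f} ms, median end-to-end: {median_e2e_s * 1000:.0f} ms")
```

When publishing results from a harness like this, state the prompt length, output length, and concurrency alongside the medians so that the numbers can be compared with others.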
