What ITL actually measures

After the first token arrives (the wait measured by TTFT), the model generates the rest of the response one token at a time - the decode phase. Inter-token latency (ITL) is the average gap between those tokens. Benchmarking tools commonly compute it as end-to-end latency minus TTFT, divided by the number of output tokens minus one. The first token is excluded explicitly so that ITL reflects decode speed alone.
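The common formula can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's implementation: the function name, parameter names, and the example request are assumptions made for this sketch.

```python
def inter_token_latency(e2e_latency_s: float, ttft_s: float, output_tokens: int) -> float:
    """Average gap between output tokens in seconds, excluding the first token."""
    if output_tokens < 2:
        # With a single token there is no gap to measure.
        raise ValueError("ITL needs at least two output tokens")
    if ttft_s > e2e_latency_s:
        raise ValueError("TTFT cannot exceed end-to-end latency")
    # (n - 1) gaps separate n tokens; the time before the first token belongs to TTFT.
    return (e2e_latency_s - ttft_s) / (output_tokens - 1)


# Hypothetical request: 2.0 s end to end, 0.5 s to first token, 101 output tokens.
itl_s = inter_token_latency(2.0, 0.5, 101)
print(f"{itl_s * 1000:.1f} ms per token")  # prints "15.0 ms per token"
```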

ITL and output speed are two views of the same thing: per-request tokens per second approaches 1 divided by ITL as the output grows longer. Our measured 713 tokens per second on gpt-oss-120b corresponds to an average inter-token gap of roughly 1.4 milliseconds; 428 tokens per second on MiniMax M2.7 Ultraspeed corresponds to roughly 2.3 milliseconds.
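The conversion is simple enough to check by hand, but a short Python sketch makes the "approaches" part concrete. It assumes per-request speed is computed as output tokens divided by end-to-end latency, and the 0.3-second TTFT is an invented value for illustration. Only the two throughput figures come from the text.

```python
def itl_ms_from_tps(tokens_per_second: float) -> float:
    """Average inter-token gap in milliseconds implied by a steady output speed."""
    return 1000.0 / tokens_per_second


def request_tps(output_tokens: int, ttft_s: float, itl_s: float) -> float:
    """Per-request tokens per second, assuming speed = output tokens / end-to-end latency."""
    e2e_s = ttft_s + (output_tokens - 1) * itl_s
    return output_tokens / e2e_s


# Throughput figures quoted in the text.
for model, tps in [("gpt-oss-120b", 713), ("MiniMax M2.7 Ultraspeed", 428)]:
    print(f"{model}: {itl_ms_from_tps(tps):.1f} ms between tokens")

# Hypothetical request with a 0.3 s TTFT (assumed) and a 1.4 ms gap.
# The measured speed climbs toward 1 / ITL as the output grows longer.
for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: {request_tps(n, 0.3, 0.0014):.0f} tokens/s")
```

Short responses are dominated by TTFT, which is why per-request speed on a 100-token reply understates the decode rate.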

Why it matters

For a human reading a chat response, almost any modern provider is fast enough - people read at a few tokens per second. ITL becomes decisive in agentic workloads, where software consumes the output: a coding agent waiting for a 3,000-token diff waits for every single token. At a 30-millisecond gap that is 90 seconds of generation; at 1.4 milliseconds it is about 4 seconds. Because agents loop - generate, run tools, generate again - inter-token latency compounds across every step of the chain.
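The arithmetic behind those figures, and the way it compounds over an agent loop, can be sketched as follows. The ten-step loop is a hypothetical workload chosen for illustration, and the sketch approximates decode time as one gap per output token, ignoring TTFT and tool execution time.

```python
def generation_seconds(output_tokens: int, itl_ms: float) -> float:
    """Approximate decode time: one gap per output token (TTFT not included)."""
    return output_tokens * itl_ms / 1000.0


DIFF_TOKENS = 3_000  # diff size used in the text
AGENT_STEPS = 10     # hypothetical number of generate/run-tools rounds

for itl_ms in (30.0, 1.4):
    per_step_s = generation_seconds(DIFF_TOKENS, itl_ms)
    total_s = AGENT_STEPS * per_step_s
    print(f"ITL {itl_ms:>4} ms: {per_step_s:5.1f} s per step, "
          f"{total_s:6.1f} s over {AGENT_STEPS} steps")
```

Under these assumptions the slower gap costs 15 minutes of pure generation across the loop, while the faster one costs well under a minute.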

ITL also reveals infrastructure quality under load. The decode phase is typically bound by memory bandwidth rather than compute, so when a provider packs too many concurrent requests onto the same hardware, each user's ITL degrades. Stable ITL across the day is a signal that capacity is genuinely provisioned, not oversubscribed.

Nuances worth knowing

Benchmarking tools disagree on the exact formula: some exclude the first token from the average, others include it - so ITL numbers from different tools are not directly comparable. ITL is also an average: real token gaps vary during a response as the KV cache grows. You may also see the metric labeled TPOT (time per output token); the two terms are often used interchangeably, although some tools report them as separate metrics with slightly different definitions.
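A short Python sketch shows how far the two conventions can diverge on the same request, and how an average hides variation in the individual gaps. The "including" variant here (end-to-end latency divided by all output tokens) is one plausible reading, not a specific tool's formula, and the request figures and arrival timestamps are invented for illustration.

```python
from statistics import mean


def itl_excluding_first(e2e_s: float, ttft_s: float, output_tokens: int) -> float:
    """Average gap over the decode phase only."""
    return (e2e_s - ttft_s) / (output_tokens - 1)


def itl_including_first(e2e_s: float, output_tokens: int) -> float:
    """Assumed alternative: the first token's wait is folded into the average."""
    return e2e_s / output_tokens


def token_gaps(arrival_times_s: list[float]) -> list[float]:
    """Individual gaps between consecutive token arrival times."""
    return [later - earlier for earlier, later in zip(arrival_times_s, arrival_times_s[1:])]


# Hypothetical request: 2.0 s end to end, 0.5 s TTFT, 101 output tokens.
e2e_s, ttft_s, n = 2.0, 0.5, 101
print(f"excluding first token: {itl_excluding_first(e2e_s, ttft_s, n) * 1000:.1f} ms")
print(f"including first token: {itl_including_first(e2e_s, n) * 1000:.1f} ms")

# Hypothetical arrival times whose gaps widen as the response grows.
arrivals_s = [0.50, 0.51, 0.52, 0.535, 0.555]
gaps_s = token_gaps(arrivals_s)
print(f"mean gap {mean(gaps_s) * 1000:.2f} ms, largest gap {max(gaps_s) * 1000:.2f} ms")
```

When comparing published ITL numbers, it helps to check which convention was used and, where per-token timestamps are available, to look at the spread of gaps as well as the mean.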

Related terms

TTFT (Time to First Token)

How long a user waits between sending a request and seeing the first token of the response.

Tokens per Second

The standard unit for LLM generation speed - and why the same number can mean two different things.

Prefill vs. Decode

The two phases of LLM inference - parallel prompt processing vs. token-by-token generation.
