What ITL actually measures
After the first token (measured by TTFT), the model generates the rest of the response one token at a time - the decode phase. ITL is the average gap between those tokens. Benchmarking tools commonly compute it as end-to-end latency minus TTFT, divided by the number of output tokens minus one - explicitly excluding the first token so that ITL reflects pure generation speed.
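The formula above can be sketched in a few lines of Python. This is a minimal illustration, not any particular benchmarking tool's implementation; the function name and the sample request values are assumptions made up for the example.

```python
def inter_token_latency(e2e_latency_s: float, ttft_s: float, output_tokens: int) -> float:
    """Average gap between consecutive output tokens, in seconds.

    Subtracts TTFT from end-to-end latency and divides by the number of
    gaps, which is one fewer than the number of output tokens.
    """
    if output_tokens < 2:
        raise ValueError("ITL needs at least two output tokens")
    return (e2e_latency_s - ttft_s) / (output_tokens - 1)


# Hypothetical request: 0.25 s to first token, 4.45 s end to end, 3,001 output tokens.
itl = inter_token_latency(e2e_latency_s=4.45, ttft_s=0.25, output_tokens=3001)
print(f"ITL: {itl * 1000:.2f} ms")
```

Dividing by the number of gaps rather than the number of tokens is what keeps the first token's wait out of the result.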
ITL and output speed are two views of the same thing: per-request tokens per second approaches 1 divided by ITL as the output grows longer. Our measured 713 tokens per second on gpt-oss-120b corresponds to an average inter-token gap of roughly 1.4 milliseconds; 428 tokens per second on MiniMax M2.7 Ultraspeed corresponds to roughly 2.3 milliseconds.
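A short sketch shows why the two views converge. It assumes per-request speed is computed as output tokens divided by end-to-end latency, and it uses an illustrative TTFT of 0.3 seconds that is not a measured figure; only the 713 and 428 tokens-per-second values come from the text above.

```python
def tokens_per_second(ttft_s: float, itl_s: float, output_tokens: int) -> float:
    """Per-request output speed, with the wait for the first token included."""
    total_s = ttft_s + itl_s * (output_tokens - 1)
    return output_tokens / total_s


ITL_S = 1 / 713   # about 1.4 ms, derived from the 713 tokens/s figure
TTFT_S = 0.3      # assumed value, for illustration only

# As the output grows, TTFT matters less and speed climbs toward 1 / ITL.
for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} tokens: {tokens_per_second(TTFT_S, ITL_S, n):7.1f} tokens/s")

# Converting the quoted speeds back into gaps, in milliseconds.
gap_713_ms = 1000 / 713
gap_428_ms = 1000 / 428
```

For short responses the fixed TTFT dominates, which is why a per-request tokens-per-second figure on a short output can sit far below 1 divided by ITL.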
Why it matters
For a human reading a chat response, almost any modern provider is fast enough - people read at a few tokens per second. ITL becomes decisive in agentic workloads, where software consumes the output: a coding agent waiting for a 3,000-token diff waits for every single token. At a 30-millisecond gap that is 90 seconds of generation; at 1.4 milliseconds it is about 4 seconds. Because agents loop - generate, run tools, generate again - inter-token latency compounds across every step of the chain.
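The arithmetic, and the way it compounds over an agent loop, can be checked with a small sketch. The step count of 10 is an assumption chosen for illustration, as is the simplification that every step emits the same 3,000 tokens; tool-execution time and TTFT are ignored.

```python
def generation_time_s(output_tokens: int, itl_s: float) -> float:
    """Approximate decode time: one inter-token gap per output token."""
    return output_tokens * itl_s


TOKENS_PER_STEP = 3_000
AGENT_STEPS = 10  # hypothetical loop length

for label, itl_s in (("30 ms gap", 0.030), ("1.4 ms gap", 0.0014)):
    per_step = generation_time_s(TOKENS_PER_STEP, itl_s)
    whole_run = per_step * AGENT_STEPS
    print(f"{label}: {per_step:.1f} s per step, {whole_run:.0f} s over {AGENT_STEPS} steps")
```

Under these assumptions the difference per step is roughly 86 seconds, and it is paid again on every iteration of the loop.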
ITL also reveals infrastructure quality under load. The decode phase is typically bound by memory bandwidth: every step has to read the model weights and each request's KV cache. When a provider packs more concurrent users onto the same hardware, each step takes longer and each user's ITL degrades. Stable ITL across the day is a signal that capacity is genuinely provisioned, not oversubscribed.
Nuances worth knowing
Benchmarking tools disagree on the exact formula: some exclude the first token from the average, others include it - so ITL numbers from different tools are not directly comparable. ITL is also an average: real token gaps vary during a response as the KV cache grows. You may also see the metric labeled TPOT (time per output token); the two terms are used interchangeably across the industry's engineering documentation.
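Both caveats can be seen in a small sketch. The request figures and token arrival times are invented for illustration, and the "first token included" variant shown here (end-to-end latency divided by all output tokens) is only one plausible reading of how a tool might fold the first token in.

```python
from statistics import mean

# Hypothetical request: 0.5 s to first token, 2.0 s end to end, 101 output tokens.
E2E_S, TTFT_S, N = 2.0, 0.5, 101

itl_excl = (E2E_S - TTFT_S) / (N - 1)  # first token excluded
itl_incl = E2E_S / N                   # first token included (assumed variant)
print(f"excluded: {itl_excl * 1000:.1f} ms, included: {itl_incl * 1000:.1f} ms")

# Hypothetical arrival times (seconds) in which the gaps widen as the response grows.
arrivals = [0.500, 0.510, 0.520, 0.532, 0.546, 0.562]
gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
print(f"mean gap: {mean(gaps) * 1000:.1f} ms, "
      f"min: {min(gaps) * 1000:.1f} ms, max: {max(gaps) * 1000:.1f} ms")
```

The same request yields two different ITL values depending on the formula, and the mean gap hides the spread between the fastest and slowest token.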
Related terms
TTFT (Time to First Token)
How long a user waits between sending a request and seeing the first token of the response.
Tokens per Second
The standard unit for LLM generation speed - and why the same number can mean two different things.
Prefill vs. Decode
The two phases of LLM inference - parallel prompt processing vs. token-by-token generation.