What TTFT actually measures

When a request arrives, the model first processes the entire prompt in a single parallel pass - the prefill phase - to build its internal state (the KV cache) before it can emit the first token. TTFT (time to first token) captures this prompt processing time plus everything around it: time spent waiting in the provider's queue and the network round-trip. Standard benchmarking definitions measure it from submitting the query to receiving the first token, so it includes request queuing, prefill, and network latency.
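As a rough illustration of the client-side definition, the sketch below starts a clock before the request is sent and stops it when the first token arrives. The names `measure_ttft` and `fake_stream` are hypothetical and exist only for this example; with a real provider, `start_stream` would wrap whatever streaming call your client library offers.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttft(start_stream: Callable[[], Iterable[str]]) -> Tuple[float, str]:
    """Return (seconds until the first token, the first token).

    `start_stream` is a hypothetical callable that sends the request and
    returns an iterable of tokens. The clock starts before it is called,
    so queuing, prefill, and network time are all included.
    """
    t0 = time.perf_counter()
    stream = iter(start_stream())
    first_token = next(stream)  # blocks until the first token arrives
    return time.perf_counter() - t0, first_token


def fake_stream(delay_s: float = 0.05) -> Iterator[str]:
    """Stand-in for a real streaming API: waits, then yields tokens."""
    time.sleep(delay_s)  # simulates queue + prefill + network
    yield "Hello"
    yield " world"


ttft_s, token = measure_ttft(fake_stream)
print(f"TTFT: {ttft_s * 1000:.0f} ms, first token: {token!r}")
```

Because the timer wraps the whole call, the measured value depends on where the client runs, not only on the model.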

Because prefill processes every input token, TTFT scales with prompt length: the longer the prompt, the longer the model takes to produce its first token. Under heavy load, queuing becomes the dominant factor - if more requests arrive than the system can batch, TTFT rises even though the model itself is no faster or slower.
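The two effects can be put into a back-of-the-envelope model. The sketch below assumes prefill time grows linearly with prompt length, which is a simplification, and every number in it is made up for illustration; `estimate_ttft_ms` is a hypothetical helper, not a measurement tool.

```python
def estimate_ttft_ms(prompt_tokens: int,
                     prefill_tokens_per_s: float,
                     queue_ms: float = 0.0,
                     network_ms: float = 0.0) -> float:
    """Rough TTFT estimate: queue wait + prefill + network.

    Assumes prefill cost is linear in prompt length (a simplification).
    """
    if prefill_tokens_per_s <= 0:
        raise ValueError("prefill_tokens_per_s must be positive")
    prefill_ms = prompt_tokens * 1000.0 / prefill_tokens_per_s
    return queue_ms + prefill_ms + network_ms


# Illustrative values only: same prompt and hardware, different load.
idle = estimate_ttft_ms(10_000, 50_000, queue_ms=0, network_ms=20)
busy = estimate_ttft_ms(10_000, 50_000, queue_ms=500, network_ms=20)
print(f"idle: {idle:.0f} ms, under load: {busy:.0f} ms")
```

In this toy example the prefill term is identical in both cases; only the queue term changes, which is the situation described above.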

Why it matters

TTFT is the "responsiveness" metric. In chat interfaces, it determines how long the screen stays empty after the user hits enter - the single biggest factor in whether an AI application feels fast. For voice agents, TTFT is even more critical: a conversational pause longer than a second feels broken.

On our production infrastructure in Munich, we measure a p50 (median) TTFT of 388 ms for gpt-oss-120b with a 10,000-token input. That figure is measured server-side, on a long prompt. Artificial Analysis, the independent benchmarking organization, defines TTFT the same way we report it: the time between sending a request and receiving the first token of the response.
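For readers unfamiliar with the "p50" notation, here is a minimal sketch of how a percentile can be computed from a set of TTFT samples using the nearest-rank method. The sample values are invented for illustration and are not our measurements; `percentile` is a hypothetical helper, and other percentile definitions (for example, interpolating ones) give slightly different results.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Invented TTFT samples in milliseconds, including one slow outlier.
ttft_samples_ms = [310, 325, 340, 360, 375, 390, 420, 480, 650, 1200]

p50 = percentile(ttft_samples_ms, 50)
p95 = percentile(ttft_samples_ms, 95)
print(f"p50: {p50} ms, p95: {p95} ms")
```

The median says little about the slowest requests, so it is worth asking for a tail percentile alongside any p50 figure.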

Nuances worth knowing

Client-measured TTFT and server-measured TTFT differ: the client sees queue time plus prefill plus network, while server-side metrics typically separate queue time from prefill time. When comparing providers, check which one is being reported. For reasoning models there is a further distinction - the first token may be a "thinking" token, so benchmarks track time to first token and time to first answer token separately.
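The difference between the two reasoning-model metrics can be sketched as follows. The event format here, a tuple of elapsed milliseconds, a kind of either "thinking" or "answer", and the token text, is an assumption made for this example; real APIs expose reasoning output in different ways, and `first_token_times` is a hypothetical helper.

```python
from typing import Iterable, Optional, Tuple

# Assumed event shape: (elapsed_ms, kind, text), kind is "thinking" or "answer".
Event = Tuple[float, str, str]


def first_token_times(events: Iterable[Event]) -> Tuple[Optional[float], Optional[float]]:
    """Return (time to first token, time to first answer token) in ms.

    Either value is None if no matching token was seen.
    """
    ttft: Optional[float] = None
    ttfat: Optional[float] = None
    for elapsed_ms, kind, _text in events:
        if ttft is None:
            ttft = elapsed_ms  # first token of any kind
        if kind == "answer":
            ttfat = elapsed_ms  # first token the user actually reads
            break
    return ttft, ttfat


# Invented timeline: the model "thinks" before it starts answering.
timeline = [
    (400.0, "thinking", "Let"),
    (430.0, "thinking", " me"),
    (2100.0, "answer", "The"),
    (2130.0, "answer", " result"),
]

ttft_ms, ttfat_ms = first_token_times(timeline)
print(f"time to first token: {ttft_ms} ms, time to first answer token: {ttfat_ms} ms")
```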

A complete picture of response speed needs TTFT together with output speed: total latency is roughly TTFT plus the number of generated tokens multiplied by the time per output token. And what your users actually experience - end-to-end latency - includes contributors outside the model entirely: network round-trips (which hit twice, request and response, and grow with geographic distance to the datacenter), gateway overhead for authentication and routing, and queue time on shared infrastructure. A provider advertising fast TTFT from another continent can still feel slow in Europe; this is one reason we publish server-side figures and note that client-side results vary by location.
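The rule of thumb above works out as follows in a small sketch. The TTFT value is the 388 ms figure quoted earlier; the output length and the time per output token are made-up values for illustration, and `estimate_total_latency_ms` is a hypothetical helper.

```python
def estimate_total_latency_ms(ttft_ms: float,
                              output_tokens: int,
                              time_per_output_token_ms: float) -> float:
    """Approximate total latency: TTFT + tokens * time per output token."""
    return ttft_ms + output_tokens * time_per_output_token_ms


ttft_ms = 388.0    # p50 TTFT quoted in this article
output_tokens = 500  # assumed response length
tpot_ms = 10.0       # assumed time per output token

total_ms = estimate_total_latency_ms(ttft_ms, output_tokens, tpot_ms)
ttft_share = ttft_ms / total_ms
print(f"total: {total_ms:.0f} ms, TTFT share: {ttft_share:.1%}")
```

With these assumed values, generation time dominates the total, which is why TTFT alone says little about long responses. Network and gateway overhead would come on top of this estimate for a client-side measurement.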

Related terms

Inter-Token Latency (ITL)

The average time gap between consecutive tokens during generation - also called TPOT.

Inference Speed

The umbrella term: TTFT, inter-token latency, and throughput - and which one matters when.

Prefill vs. Decode

The two phases of LLM inference - parallel prompt processing vs. token-by-token generation.

Context Window

The maximum amount of text, in tokens, a model can consider at once - prompt plus output. Its length directly shapes inference speed and cost.
