Tokens per Second

Tokens per second (tok/s) measures how many tokens an LLM system produces each second. It is the standard unit of inference speed, but the same unit describes two different measurements: per-request output speed (what one user experiences) and system throughput (what the hardware produces across all users combined).

Per-request speed: what benchmarks report

When independent benchmarks such as Artificial Analysis report "output speed," they measure the average number of tokens received per second after the first token arrives, for a single request and with time-to-first-token (TTFT) deliberately excluded. This is the number that determines user experience: how fast the response streams.
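
As a minimal sketch of that definition, the snippet below computes TTFT and per-request output speed from the arrival times of streamed chunks. The chunk format, the function names, and the sample stream are all assumptions made for illustration, and leaving the first chunk's tokens out of the count is one reasonable reading of "after the first token arrives," not a confirmed description of any benchmark's implementation.

```python
from typing import List, Tuple

# Each chunk is (arrival_time_in_seconds, tokens_in_chunk), in arrival order.
Chunk = Tuple[float, int]


def time_to_first_token(request_sent_at: float, chunks: List[Chunk]) -> float:
    """Seconds between sending the request and receiving the first chunk."""
    return chunks[0][0] - request_sent_at


def per_request_output_speed(chunks: List[Chunk]) -> float:
    """Tokens per second after the first chunk arrives (TTFT excluded)."""
    if len(chunks) < 2:
        raise ValueError("need at least two chunks to measure a streaming rate")
    first_arrival = chunks[0][0]
    last_arrival = chunks[-1][0]
    elapsed = last_arrival - first_arrival
    if elapsed <= 0:
        raise ValueError("chunk timestamps must be increasing")
    # The first chunk's tokens are not counted: the clock starts when they arrive.
    tokens_after_first = sum(tokens for _, tokens in chunks[1:])
    return tokens_after_first / elapsed


# Hypothetical stream: request sent at t=0.0, first chunk arrives at t=0.5.
chunks = [(0.5, 1), (1.0, 50), (1.5, 50)]
ttft = time_to_first_token(0.0, chunks)
speed = per_request_output_speed(chunks)
print(f"TTFT: {ttft:.2f} s, output speed: {speed:.1f} tok/s")
```

In this made-up stream, 100 tokens arrive in the one second after the first chunk, so the output speed is 100 tok/s regardless of how long the first chunk took to show up.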

On our EU infrastructure we measure per-request output speed of 713 tok/s on gpt-oss-120b and 428 tok/s on MiniMax M2.7 Ultraspeed (p50, 10,000-token input, single request). For comparability across models with different tokenizers, Artificial Analysis standardizes all speed metrics in OpenAI tokens counted with the tiktoken o200k_base tokenizer.
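
The "p50" qualifier means the figure is a median over repeated measurements rather than a mean or a best case. A small sketch of the difference, using Python's standard statistics module; the run values are invented for illustration and are not measurements from any system.

```python
import statistics

# Hypothetical per-request output speeds (tok/s) from repeated runs of the
# same prompt. Illustrative values only, not real measurements.
run_speeds = [90.0, 100.0, 140.0, 95.0, 105.0]

p50 = statistics.median(run_speeds)  # middle value once the runs are sorted
mean = statistics.mean(run_speeds)   # pulled upward by the single fast run

print(f"p50: {p50:.1f} tok/s, mean: {mean:.1f} tok/s")
```

The median is less sensitive to a single unusually fast or slow run, which is one reason benchmark figures are often quoted as p50.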

System throughput: what providers' economics depend on

The second meaning is aggregate throughput: the total tokens a system produces per second across all concurrent requests. Benchmarking literature distinguishes these explicitly as "TPS per user" versus "TPS per system." A GPU server might deliver 30 tok/s to each of 100 concurrent users: 3,000 tok/s of system throughput, but only 30 tok/s for any single request.

This distinction is where speed claims get murky: a vendor can truthfully advertise thousands of tokens per second while each individual request crawls. When you see a tok/s figure, the first question to ask is: per request, or across the whole system?
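
The 100-user example can be worked through in a few lines. This is a sketch under a simplifying assumption: all requests are generating at the same moment, so system throughput is just the sum of the per-request speeds. The function name is hypothetical.

```python
from typing import List


def system_throughput(per_request_speeds: List[float]) -> float:
    """Aggregate tok/s, assuming every listed request is generating at once."""
    return sum(per_request_speeds)


concurrent_users = 100
per_user_speed = 30.0  # tok/s seen by each individual request

speeds = [per_user_speed] * concurrent_users
aggregate = system_throughput(speeds)

print(f"per request: {per_user_speed:.0f} tok/s")
print(f"whole system: {aggregate:.0f} tok/s")
```

Both numbers are accurate descriptions of the same server; they differ by a factor equal to the number of concurrent requests.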

How to read a tok/s claim

Check three things: whether it is per-request or aggregate, whether time-to-first-token (TTFT) is included or excluded (Artificial Analysis excludes it by definition), and the workload shape, since prompt length and output length both change the number. Our published benchmarks state all three: per-request, server-side p50, 10,000 input / 1,000 output tokens.
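
To see why the TTFT question matters, the sketch below computes both variants for the same request. The TTFT and generation time are hypothetical values chosen to make the arithmetic easy, and the function names are illustrative.

```python
def speed_excluding_ttft(output_tokens: int, generation_seconds: float) -> float:
    """Output tokens divided by streaming time only."""
    return output_tokens / generation_seconds


def speed_including_ttft(
    output_tokens: int, ttft_seconds: float, generation_seconds: float
) -> float:
    """Output tokens divided by total time from request to last token."""
    return output_tokens / (ttft_seconds + generation_seconds)


# Hypothetical request: 1,000 output tokens, 0.5 s to first token,
# then 2.0 s of streaming.
output_tokens = 1000
ttft_seconds = 0.5
generation_seconds = 2.0

excluded = speed_excluding_ttft(output_tokens, generation_seconds)
included = speed_including_ttft(output_tokens, ttft_seconds, generation_seconds)

print(f"TTFT excluded: {excluded:.0f} tok/s")
print(f"TTFT included: {included:.0f} tok/s")
```

With these assumed timings the same request reads as 500 tok/s or 400 tok/s depending on the convention. The gap grows as outputs get shorter, because a fixed TTFT takes up a larger share of the total time.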
