Inference Performance Glossary
Clear definitions of the metrics and concepts behind LLM inference performance - from TTFT to dataflow architecture. Every entry is backed by published sources and real benchmark data from our EU infrastructure.
Performance Metrics
TTFT (Time to First Token)
How long a user waits between sending a request and seeing the first token of the response.
Inter-Token Latency (ITL)
The average time gap between consecutive tokens during generation - also called TPOT (time per output token).
Tokens per Second
The standard unit for LLM generation speed - and why the same number can mean two different things.
Inference Speed
The umbrella term covering TTFT, inter-token latency, and throughput - and which one matters when.
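As a rough illustration of how the metrics above relate, the sketch below derives TTFT, inter-token latency, and a per-request tokens-per-second figure from client-side timestamps. The names `summarize_request` and `RequestMetrics` are hypothetical, and the timestamps are assumed to be seconds read from a single monotonic clock; it is not taken from any particular benchmark harness.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestMetrics:
    ttft_s: float               # time to first token, in seconds
    itl_s: float                # mean gap between consecutive tokens, in seconds
    output_tokens_per_s: float  # decode-phase rate for this one request


def summarize_request(sent_at: float, token_times: Sequence[float]) -> RequestMetrics:
    """Summarize one streamed response.

    sent_at     -- when the request was sent (assumed: seconds, monotonic clock)
    token_times -- arrival time of each output token, in order
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to compute inter-token latency")

    # TTFT: wait between sending the request and seeing the first token.
    ttft = token_times[0] - sent_at

    # ITL: average gap between consecutive tokens after the first one.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps)

    # Per-request generation speed during decode is the inverse of the mean gap.
    return RequestMetrics(ttft_s=ttft, itl_s=itl, output_tokens_per_s=1.0 / itl)
```

For a request sent at t = 0 whose tokens arrive at 0.5, 0.75, 1.0, 1.25, and 1.5 seconds, this gives a TTFT of 0.5 s, an ITL of 0.25 s, and 4 output tokens per second.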
Serving Concepts
Inference
Running a trained AI model to produce outputs - the production workload of AI, and the one whose cost and speed matter more with every increase in usage.
Throughput (LLM Serving)
Tokens per second in two senses: per-request output throughput vs. system-wide capacity - and how batching trades one against the other.
Prefill vs. Decode
The two phases of LLM inference - parallel prompt processing vs. token-by-token generation.
Latency vs. Throughput
The fundamental serving trade-off: total system output vs. each user's speed.
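The trade-off between per-request speed and system-wide capacity can be sketched with a toy model. It assumes, purely for illustration, that one decode step emits one token for every sequence in the batch and that the step time grows linearly with batch size; the function name and the default timings are made-up values, not measurements.

```python
from typing import Tuple


def throughput_at_batch(
    batch_size: int,
    base_step_s: float = 0.02,      # assumed fixed cost of one decode step
    per_seq_step_s: float = 0.001,  # assumed extra cost per sequence in the batch
) -> Tuple[float, float]:
    """Return (per-request tokens/s, system-wide tokens/s) under the toy model."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # Time for one decode step, which yields one token per batched sequence.
    step_s = base_step_s + per_seq_step_s * batch_size

    per_request = 1.0 / step_s      # what a single user sees
    system = batch_size / step_s    # what the server delivers in total
    return per_request, system


if __name__ == "__main__":
    for batch in (1, 8, 32):
        per_request, system = throughput_at_batch(batch)
        print(f"batch={batch:>2}  per-request={per_request:6.1f} tok/s  system={system:7.1f} tok/s")
```

Under these assumptions, a larger batch raises the system-wide figure while lowering the figure each individual user sees, which is why a single "tokens per second" number needs to say which of the two it reports.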
Architecture
RDU (Reconfigurable Dataflow Unit)
SambaNova's purpose-built AI processor - a chip designed for dataflow execution instead of instruction-by-instruction processing.
Dataflow Architecture
The execution model where data streams through operations as a pipeline - eliminating the kernel-by-kernel round trips of GPU execution.