Inference vs. training
Machine learning has two fundamentally different phases. Training adjusts a model's parameters against large datasets until its behavior fits the patterns in the data - a massive, one-time (or periodic) computation. Inference applies the finished model: the weights are frozen, an input goes in, an output comes out. IBM's definition captures it well: any instance of an AI model actually generating outputs or making decisions in a real-world application constitutes inference.
You will sometimes see it called "inferencing", but the standard term among practitioners is simply inference. For LLMs specifically, inference means generating tokens: the model first processes your whole prompt (prefill), then produces the response one token at a time (decode).
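The prefill-then-decode loop can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular serving stack is implemented: `generate`, `toy_next_token`, and the end-of-sequence value are hypothetical names invented for the example, and the "model" is a trivial function standing in for a real network.

```python
from typing import Callable, List


def generate(prompt_tokens: List[int],
             next_token: Callable[[List[int]], int],
             max_new_tokens: int,
             eos_token: int) -> List[int]:
    """Toy autoregressive loop: one prefill pass, then token-by-token decode."""
    context = list(prompt_tokens)
    output: List[int] = []
    for step in range(max_new_tokens):
        # Step 0 plays the role of prefill: the full prompt is consumed to
        # produce the first output token. Every later step is decode: one
        # more token per pass. (A real server would reuse cached attention
        # state instead of reprocessing the whole context each time.)
        token = next_token(context)
        if token == eos_token:
            break
        output.append(token)
        context.append(token)
    return output


def toy_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for a model: counts upward, wrapping to 0."""
    return (context[-1] + 1) % 5


# Token 0 is treated as end-of-sequence in this toy setup.
print(generate([1, 2], toy_next_token, max_new_tokens=10, eos_token=0))
```

The point of the sketch is the shape of the work: the prompt is handled once, while the number of decode passes grows with the length of the response.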
Why inference is the workload that matters operationally
Training gets the headlines, but inference is where AI meets production - and where cost and speed compound. A model is trained once (or retrained periodically), but it can go on to serve millions of requests. Every user interaction, every agent step, every pipeline run pays inference's latency and cost again. As AI shifts toward agentic workloads that generate far more tokens per task, the economics of serving - tokens per second, cost per token, energy per token - increasingly dominate the economics of AI overall.
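To make the compounding concrete, here is a minimal back-of-the-envelope sketch. All numbers are made up for illustration (the request volume, token counts, and price per million tokens are assumptions, not figures from any provider), and `serving_cost` is a hypothetical helper.

```python
def serving_cost(requests: int,
                 tokens_per_request: int,
                 price_per_million_tokens: float) -> float:
    """Total cost of serving, given a flat price per million tokens."""
    total_tokens = requests * tokens_per_request
    return total_tokens * price_per_million_tokens / 1_000_000


# Illustrative assumption: 1M requests at 500 tokens each, $2.00 per 1M tokens.
chat_cost = serving_cost(1_000_000, 500, 2.0)

# Same traffic, but an agentic workload that generates 10x the tokens per task.
agent_cost = serving_cost(1_000_000, 5_000, 2.0)

print(f"chat: ${chat_cost:,.2f}  agentic: ${agent_cost:,.2f}")
```

Under these assumptions the cost scales linearly with tokens per task, which is why token-heavy workloads shift attention from training cost to cost per token.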
Inference also stresses hardware differently than training. Training is compute-bound, highly parallel work that GPUs excel at. LLM inference is dominated by the decode phase, which is memory-bandwidth-bound: each generated token requires reading the model's weights from memory, so at small batch sizes the limit is how fast data can be moved, not how fast it can be computed. That is why hardware designed specifically for inference, like the RDU architecture our platform runs on, can outperform general-purpose accelerators on speed and efficiency for this workload.
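A rough way to see why decode is memory-bandwidth-bound is to estimate an upper bound on single-stream decode speed. The sketch below assumes a dense model, a batch size of 1, every weight read once per generated token, and no other memory traffic (it ignores the attention cache and activations). The function name and the hardware figures are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def decode_tokens_per_second_bound(param_count: int,
                                   bytes_per_param: int,
                                   memory_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on batch-1 decode speed if weights must be re-read per token."""
    weight_bytes = param_count * bytes_per_param
    return memory_bandwidth_bytes_per_s / weight_bytes


# Illustrative assumption: a 70-billion-parameter model stored at 2 bytes per
# parameter, on a device with 2 TB/s of memory bandwidth.
bound = decode_tokens_per_second_bound(
    param_count=70_000_000_000,
    bytes_per_param=2,
    memory_bandwidth_bytes_per_s=2e12,
)
print(f"at most ~{bound:.1f} tokens/s per stream")
```

Note that raw compute throughput does not appear in the formula at all: under these assumptions, adding arithmetic units does not raise the bound, while adding memory bandwidth or shrinking the weights does.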
Measuring inference
Inference performance is measured along the metrics this glossary covers: time to first token (responsiveness), inter-token latency and output throughput (generation speed), and end-to-end latency (total completion time). Where the model runs matters too - inference processes your actual production data on every request, which is why data residency and jurisdiction are inference questions: the model serving your users handles everything they submit.
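As a minimal sketch of how these metrics relate, the function below derives them from the timestamps at which each output token arrived. The function name, dictionary keys, and sample timestamps are hypothetical, and the throughput figure uses one possible definition (output tokens divided by total request time); other definitions exist.

```python
from typing import Dict, List


def inference_metrics(request_sent: float,
                      token_times: List[float]) -> Dict[str, float]:
    """Derive latency and throughput metrics from token arrival times (seconds)."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    n = len(token_times)
    ttft = token_times[0] - request_sent   # time to first token
    e2e = token_times[-1] - request_sent   # end-to-end latency
    if e2e <= 0:
        raise ValueError("token timestamps must come after the request")
    # Mean gap between consecutive tokens; undefined for a single token,
    # reported as 0.0 here.
    itl = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {
        "ttft_s": ttft,
        "inter_token_latency_s": itl,
        "output_tokens_per_s": n / e2e,
        "e2e_latency_s": e2e,
    }


# Made-up example: request sent at t=0, five tokens arriving 0.1 s apart.
metrics = inference_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example the time to first token accounts for more than half of the end-to-end latency, which illustrates why a single number rarely describes inference performance on its own.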
Related terms
Inference Speed
The umbrella term: TTFT, inter-token latency, and throughput - and which one matters when.
Prefill vs. Decode
The two phases of LLM inference - parallel prompt processing vs. token-by-token generation.
Tokens per Second
The standard unit for LLM generation speed - and why the same number can mean two different things.
RDU (Reconfigurable Dataflow Unit)
SambaNova's AI processor - a purpose-built chip designed for dataflow execution instead of instruction-by-instruction processing.