Inference vs. training

Machine learning has two fundamentally different phases. Training adjusts a model's parameters against large datasets until its behavior fits the patterns in the data - a massive, one-time (or periodic) computation. Inference applies the finished model: the weights are frozen, an input goes in, an output comes out. IBM's definition captures it well: any instance of an AI model actually generating outputs or making decisions in a real-world application constitutes inference.

You will sometimes see it called "inferencing", but the standard term among practitioners is simply inference. For LLMs specifically, inference means generating tokens: the model first processes your whole prompt (prefill), then produces the response one token at a time (decode).
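The two phases can be sketched with a toy stand-in for a model. Everything here is hypothetical and for illustration only: `ToyModel`, its `prefill` and `decode_step` methods, and the arithmetic it uses to "predict" tokens are assumptions, not any real library's API. The running sum plays the role that cached prompt state plays in a real LLM.

```python
from typing import List, Sequence, Tuple


class ToyModel:
    """Hypothetical stand-in for an LLM with frozen 'weights' (here: vocab_size)."""

    def __init__(self, vocab_size: int = 10) -> None:
        self.vocab_size = vocab_size

    def prefill(self, prompt_tokens: Sequence[int]) -> int:
        # Prefill: the whole prompt is processed in one pass and summarized
        # as state (a toy substitute for the cache a real model would build).
        return sum(prompt_tokens)

    def decode_step(self, state: int) -> Tuple[int, int]:
        # Decode: produce exactly one token from the state, then fold that
        # token back into the state so the next step can depend on it.
        token = state % self.vocab_size
        return token, state + token


def generate(model: ToyModel, prompt_tokens: Sequence[int],
             max_new_tokens: int, eos_token: int) -> List[int]:
    state = model.prefill(prompt_tokens)          # one pass over the prompt
    output: List[int] = []
    for _ in range(max_new_tokens):               # one iteration per output token
        token, state = model.decode_step(state)
        if token == eos_token:                    # stop at the end-of-sequence token
            break
        output.append(token)
    return output


tokens = generate(ToyModel(), prompt_tokens=[1, 2, 3], max_new_tokens=10, eos_token=8)
print(tokens)
```

The point of the sketch is the shape of the loop: prefill runs once per request, while the decode step runs once for every token in the response.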

Why inference is the workload that matters operationally

Training gets the headlines, but inference is where AI meets production - and where cost and speed compound. A model is trained once (or periodically); it then serves millions of requests. Every user interaction, every agent step, every pipeline run incurs inference latency and cost again. As AI shifts toward agentic workloads that generate far more tokens per task, the economics of serving - tokens per second, cost per token, energy per token - increasingly dominate the economics of AI overall.
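A back-of-the-envelope calculation shows how per-token cost compounds. This is a minimal sketch: the function name `inference_cost` is made up, and the request counts, token counts, and price are hypothetical numbers chosen only to make the arithmetic easy to follow, not measured or quoted figures.

```python
def inference_cost(requests: int, tokens_per_request: int,
                   price_per_million_tokens: float) -> float:
    """Total serving cost when every request pays for its tokens again."""
    total_tokens = requests * tokens_per_request
    return total_tokens * price_per_million_tokens / 1_000_000


# Hypothetical workloads at the same (assumed) price per million tokens.
PRICE = 1.0  # currency units per million tokens, illustrative only

chat_cost = inference_cost(requests=1_000_000, tokens_per_request=500,
                           price_per_million_tokens=PRICE)
agent_cost = inference_cost(requests=1_000_000, tokens_per_request=20_000,
                            price_per_million_tokens=PRICE)

print(f"chat-style workload:  {chat_cost:,.2f}")
print(f"agent-style workload: {agent_cost:,.2f}")
print(f"ratio: {agent_cost / chat_cost:.0f}x")
```

With the request volume and price held fixed, total cost scales linearly with tokens per task, which is why token-heavy agentic workloads shift attention to serving economics.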

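Inference also stresses hardware differently than training. Training is compute-bound, highly parallel work that GPUs excel at. LLM inference is dominated by the decode phase, which is memory-bandwidth-bound: producing each new token requires reading the model's weights from memory, so at small batch sizes speed is limited by how fast data moves rather than by arithmetic throughput. That is why hardware designed specifically for inference, like the RDU architecture our platform runs on, can outperform general-purpose accelerators on speed and efficiency for this workload.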
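A rough estimate makes "memory-bandwidth-bound" concrete. The sketch below assumes a dense model, a batch size of one, and that every weight is read from memory once per generated token; it ignores cache traffic, activations, and any overlap or compression. The function name and the example numbers are hypothetical and do not describe any specific chip.

```python
def decode_tokens_per_s_upper_bound(param_count: float,
                                    bytes_per_param: float,
                                    mem_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode speed under the stated assumptions.

    If each token requires streaming all weights through memory once, the
    token rate cannot exceed bandwidth divided by the size of the weights.
    """
    if param_count <= 0 or bytes_per_param <= 0:
        raise ValueError("param_count and bytes_per_param must be positive")
    weight_bytes = param_count * bytes_per_param
    return mem_bandwidth_bytes_per_s / weight_bytes


# Illustrative numbers only: a 70-billion-parameter model stored at
# 2 bytes per parameter, on hardware with 2 TB/s of memory bandwidth.
bound = decode_tokens_per_s_upper_bound(
    param_count=70e9,
    bytes_per_param=2,
    mem_bandwidth_bytes_per_s=2e12,
)
print(f"at most about {bound:.1f} tokens/s per stream")
```

Under these assumptions the bound depends only on memory bandwidth and model size, not on peak arithmetic throughput, which is the sense in which decode is bandwidth-bound.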

Measuring inference

Inference performance is measured along the metrics this glossary covers: time to first token (responsiveness), inter-token latency and output throughput (generation speed), and end-to-end latency (total completion time). Where the model runs matters too - inference processes your actual production data on every request, which is why data residency and jurisdiction are inference questions: the model serving your users handles everything they submit.
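The metrics above can all be derived from timestamps recorded on the client side. The following is a minimal sketch, assuming you have the time the request was sent and the arrival time of each output token; the names `InferenceMetrics` and `summarize_request` are hypothetical. Output throughput is computed here as tokens after the first divided by decode time, which is one common convention among several.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class InferenceMetrics:
    ttft_s: float               # time to first token (responsiveness)
    mean_itl_s: float           # mean inter-token latency
    output_tokens_per_s: float  # generation speed during decode
    e2e_latency_s: float        # total completion time


def summarize_request(request_sent_s: float,
                      token_times_s: Sequence[float]) -> InferenceMetrics:
    """Derive per-request metrics from the arrival time of each output token."""
    if not token_times_s:
        raise ValueError("need at least one output token timestamp")
    first, last = token_times_s[0], token_times_s[-1]

    # Gaps between consecutive tokens; empty if only one token arrived.
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0

    # Convention used here: tokens generated after the first, per second of decode.
    decode_time = last - first
    throughput = (len(token_times_s) - 1) / decode_time if decode_time > 0 else 0.0

    return InferenceMetrics(
        ttft_s=first - request_sent_s,
        mean_itl_s=mean_itl,
        output_tokens_per_s=throughput,
        e2e_latency_s=last - request_sent_s,
    )


# Made-up timestamps in seconds: request sent at t=0, five tokens arrive.
metrics = summarize_request(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
print(metrics)
```

Keeping the four numbers separate matters because they can diverge: a request can have a long time to first token and fast generation afterwards, or the reverse.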

Related terms

Inference Speed

The umbrella term: TTFT, inter-token latency, and throughput - and which one matters when.

Prefill vs. Decode

The two phases of LLM inference - parallel prompt processing vs. token-by-token generation.

Tokens per Second

The standard unit for LLM generation speed - and why the same number can mean two different things.

RDU (Reconfigurable Dataflow Unit)

SambaNova's AI processor - a purpose-built chip designed for dataflow execution instead of instruction-by-instruction processing.

Open-Weight Model

A model whose trained parameters are published so anyone can run it themselves - the technical basis for sovereign inference.
