Inference

Inference is the use of a trained AI model to produce outputs on new inputs: every chat response, code completion, or API call to an LLM is inference. It is distinct from training, which creates the model in the first place. Training happens once or periodically; inference runs on every single request.

Inference vs. training

Machine learning has two fundamentally different phases. Training adjusts a model's parameters against large datasets until its behavior fits the patterns in the data - a massive, one-time (or periodic) computation. Inference applies the finished model: the weights are frozen, an input goes in, an output comes out. IBM's definition captures it well: any instance of an AI model actually generating outputs or making decisions in a real-world application constitutes inference.

You will sometimes see it called "inferencing", but the standard term among practitioners is simply inference. For LLMs specifically, inference means generating tokens: the model first processes your prompt (prefill), then produces the response one token at a time (decode).
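The two phases can be sketched as a simple loop. This is a minimal illustration, not a real serving stack: `toy_next_token` is a hypothetical stand-in for a model forward pass, and the token ids are arbitrary integers.

```python
from typing import Callable, List


def toy_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for a model forward pass (not a real LLM)."""
    return (sum(context) + len(context)) % 50


def generate(
    prompt: List[int],
    max_new_tokens: int,
    eos_id: int = 0,
    next_token: Callable[[List[int]], int] = toy_next_token,
) -> List[int]:
    """Generate up to max_new_tokens (assumed >= 1) token ids for a prompt."""
    # Prefill: the whole prompt is processed in one pass,
    # which yields the first output token.
    context = list(prompt)
    output = [next_token(context)]

    # Decode: every later token is produced one at a time, each step
    # conditioned on the prompt plus everything generated so far.
    # (This toy recomputes from the full context on every step; it does
    # not model any caching a real serving system might use.)
    while len(output) < max_new_tokens and output[-1] != eos_id:
        context.append(output[-1])
        output.append(next_token(context))
    return output
```

The weights never change inside this loop, which is the "frozen model" property described above: only the context grows as tokens are appended.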

Why inference is the workload that matters operationally

Training gets the headlines, but inference is where AI meets production - and where cost and speed compound. A model is trained once; it serves millions of requests. Every user interaction, every agent step, every pipeline run pays inference's latency and cost again. As AI shifts toward agentic workloads that generate far more tokens per task, the economics of serving - tokens per second, cost per token, energy per token - increasingly dominate the economics of AI overall.
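To make the compounding concrete, here is a small back-of-the-envelope sketch. All numbers (tokens per step, step counts, and the price per million tokens) are hypothetical values chosen for illustration, not figures from any provider.

```python
def cost_per_task(
    tokens_per_step: int,
    steps: int,
    price_per_million_tokens: float,
) -> float:
    """Cost of one task when every step pays for its tokens again."""
    total_tokens = tokens_per_step * steps
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical workloads: a single chat reply vs. a 40-step agent run,
# both at an assumed price of 2.0 currency units per million tokens.
chat_cost = cost_per_task(tokens_per_step=500, steps=1, price_per_million_tokens=2.0)
agent_cost = cost_per_task(tokens_per_step=500, steps=40, price_per_million_tokens=2.0)

# Ratio of the two costs; with equal tokens per step it equals the step count.
cost_ratio = agent_cost / chat_cost
```

Under these assumptions the agent task costs 40 times as much as the chat reply, purely because it generates 40 times as many tokens. Multiplied across millions of requests, the per-token price becomes the dominant term.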

Inference also stresses hardware differently than training. Training is compute-bound, highly parallel work that GPUs excel at. LLM inference is dominated by the decode phase, which is memory-bandwidth-bound: each generated token requires reading the model's weights from memory, so speed is limited by how fast data moves rather than by raw compute. This is why hardware designed specifically for inference, like the RDU architecture our platform runs on, can outperform general-purpose accelerators on speed and efficiency for this workload.
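A rough way to see why memory bandwidth sets the ceiling is the estimate below. It is a simplified sketch under stated assumptions: a single sequence, every weight read from memory once per generated token, and no other memory traffic. The parameter count, precision, and bandwidth figures are illustrative, not measurements of any specific device.

```python
def decode_tokens_per_second_ceiling(
    n_params: float,
    bytes_per_param: float,
    bandwidth_bytes_per_s: float,
) -> float:
    """Upper bound on single-sequence decode speed if weight reads dominate."""
    # Bytes that must be moved from memory to produce one token.
    weight_bytes = n_params * bytes_per_param
    # If each token needs one full pass over the weights, tokens/s cannot
    # exceed how many such passes the memory system can deliver per second.
    return bandwidth_bytes_per_s / weight_bytes


# Illustrative assumption: 70 billion parameters at 2 bytes each,
# on memory delivering 2 TB/s.
ceiling = decode_tokens_per_second_ceiling(70e9, 2, 2e12)
```

With these assumed figures the ceiling is roughly 14 tokens per second for one sequence, regardless of how much arithmetic throughput the chip has. Real systems raise effective throughput by batching requests so that one pass over the weights serves many sequences.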

Measuring inference

Inference performance is measured along the metrics this glossary covers: time to first token (responsiveness), inter-token latency and output throughput (generation speed), and end-to-end latency (total completion time). Where the model runs matters too - inference processes your actual production data on every request, which is why data residency and jurisdiction are inference questions: the model serving your users handles everything they submit.
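The timing metrics named above can all be derived from the arrival times of streamed tokens. The sketch below assumes you have recorded a timestamp when the request was sent and one per received token. The names `InferenceMetrics` and `compute_metrics` are hypothetical, and output throughput is defined here as tokens divided by total time; other definitions exclude the time to first token.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InferenceMetrics:
    ttft: float         # time to first token, in seconds
    mean_itl: float     # mean inter-token latency, in seconds
    output_tps: float   # output tokens per second over the whole request
    e2e_latency: float  # end-to-end latency, in seconds


def compute_metrics(request_sent: float, token_times: List[float]) -> InferenceMetrics:
    """Derive latency metrics from a send time and per-token arrival times."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")

    ttft = token_times[0] - request_sent
    e2e_latency = token_times[-1] - request_sent

    # Gaps between consecutive tokens; empty if only one token arrived.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0

    output_tps = len(token_times) / e2e_latency
    return InferenceMetrics(ttft, mean_itl, output_tps, e2e_latency)


# Example: request sent at t=0.0, five tokens arriving every 0.25 s from t=0.5.
metrics = compute_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
```

In this example the time to first token is 0.5 s, the mean inter-token latency is 0.25 s, and end-to-end latency is 1.5 s, which shows how a request can feel responsive (low time to first token) yet still take a while to complete.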
