All terms
Foundations
Inference
Running a trained model on new inputs to produce outputs.
Definition
Inference is the act of running a trained model on new inputs to produce results, such as generating a reply to a prompt. It is distinct from training, and because a deployed model serves requests continuously, inference usually dominates the total compute cost at scale. Key measures include latency, throughput, and cost per token.