Triton Inference Server
NVIDIA's open-source server for deploying models from many frameworks at scale.
Definition
Triton Inference Server is NVIDIA's open-source serving system for deploying trained models in production. It supports multiple backends (TensorRT, PyTorch, ONNX Runtime, TensorRT-LLM, and more), dynamic batching (grouping individual requests into larger batches on the server to raise throughput), concurrent execution of multiple models or model instances, and standard HTTP and gRPC APIs based on the KServe inference protocol. It is a core building block underneath NVIDIA NIM. (Not to be confused with OpenAI's Triton GPU programming language.)
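To make the HTTP API concrete, here is a minimal sketch of how a client might construct an inference request using only the Python standard library. It builds the request but does not send it. The model name `my_model`, the input tensor name `INPUT0`, and the helper `build_infer_request` are hypothetical placeholders; the real names come from the deployed model's configuration. The sketch assumes a server on Triton's default HTTP port (8000) and a single FP32 input of shape `[1, N]`.

```python
import json
from urllib.request import Request


def build_infer_request(base_url, model_name, input_name, values):
    """Build (but do not send) an HTTP inference request.

    The JSON layout follows the KServe v2 inference protocol that
    Triton's HTTP endpoint accepts. All argument values used below
    are illustrative assumptions, not names from a real deployment.
    """
    body = {
        "inputs": [
            {
                "name": input_name,           # must match the model's input name
                "shape": [1, len(values)],    # one row of len(values) elements
                "datatype": "FP32",           # 32-bit float tensor
                "data": [float(v) for v in values],
            }
        ]
    }
    url = f"{base_url.rstrip('/')}/v2/models/{model_name}/infer"
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical model and tensor names, for illustration only.
req = build_infer_request("http://localhost:8000", "my_model", "INPUT0", [1.0, 2.0, 3.0])
payload = json.loads(req.data)
```

With a server actually running and a matching model loaded, passing `req` to `urllib.request.urlopen` would submit the request. In practice many clients use NVIDIA's `tritonclient` package instead of hand-built JSON, since it also covers gRPC and binary tensor transfer.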