Inference & Serving

TensorRT

NVIDIA's deep-learning inference optimizer and runtime for GPUs.

Definition

TensorRT is NVIDIA's software toolkit for running trained models fast on NVIDIA GPUs. It takes a trained model and compiles it into an optimized engine tuned for a specific GPU: it fuses adjacent layers into single operations, selects the fastest available kernel for each operation, and can use lower-precision number formats such as FP16 or INT8 (which trade a little accuracy for more speed). TensorRT-LLM builds on it specifically for large language models.
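
The precision trade-off mentioned above can be illustrated with a small, self-contained Python sketch. It is not TensorRT code and does not reproduce TensorRT's calibration procedure. It only shows the general idea of symmetric 8-bit quantization: floats are mapped to integers with a single scale factor, then mapped back, and the round trip loses a small amount of accuracy. The function names `quantize_int8` and `dequantize` and the sample weights are hypothetical, chosen for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization (illustrative, not TensorRT's algorithm).

    Maps each float to an integer in [-127, 127] using one shared scale.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


# Hypothetical weights, standing in for one layer of a trained model.
weights = [0.82, -1.27, 0.031, 0.5, -0.004]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The largest round-trip error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each value now fits in one byte instead of four, and integer arithmetic is typically faster on hardware that supports it, which is where the speed gain comes from. The cost is the rounding error, which here is at most half of one quantization step.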

Related terms

TensorRT-LLM, CUDA, Quantization, NIM