Inference & Serving

TensorRT-LLM

NVIDIA's library for compiling LLMs into highly optimized GPU inference engines.

Definition

TensorRT-LLM (often abbreviated TRT-LLM) is NVIDIA's open-source library for optimizing and running LLM inference on NVIDIA GPUs. It builds on TensorRT to compile a model into an engine tuned for a specific GPU, applying optimizations such as kernel fusion (merging adjacent operations into a single GPU kernel), quantization (storing numbers in smaller, lower-precision formats to save memory and time), and in-flight batching (adding new requests to a running batch as earlier ones finish). The result is low latency and high throughput on NVIDIA hardware, at the cost of a build step that must be run ahead of time for each model and target GPU. It is one of the engines underneath NVIDIA NIM and is available as a backend for Triton Inference Server.
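
To make the quantization idea concrete, here is a minimal sketch in plain Python of symmetric 8-bit quantization: each value is mapped to an integer code in [-127, 127] plus one shared scale factor. This is a generic illustration only, not TensorRT-LLM's implementation or API; the function names `quantize_int8` and `dequantize` and the sample weights are made up for this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization (illustrative, not TensorRT-LLM code).

    Returns integer codes in [-127, 127] and the shared scale factor.
    """
    max_abs = max(abs(v) for v in values)
    # Avoid dividing by zero when every value is 0.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floating-point values from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical sample weights, chosen only for illustration.
weights = [0.5, -1.27, 0.03, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

for original, code, approx in zip(weights, codes, restored):
    print(f"{original:+.4f} -> code {code:+4d} -> {approx:+.4f}")
```

Each code fits in one byte instead of the two or four bytes of a 16-bit or 32-bit float, which is where the memory saving comes from. The restored values differ from the originals by at most half a scale step, which is the precision given up in exchange.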

Related terms

TensorRT, NIM, vLLM, Quantization, CUDA