Safety & Alignment

Guardrails

Safety controls that filter inputs and outputs to keep model behavior within policy.

Definition

Guardrails are the checks placed around a model at inference time to keep its behavior within policy. They can act on inputs (blocking harmful prompts), outputs (filtering or rewriting unsafe responses), or both, and range from keyword rules to classifier models and dedicated libraries. Operating across the prompt, model, and application layers, they complement training-time alignment with runtime enforcement.

Guardrails

Definition

Related terms