All terms
Safety & Alignment
Toxicity
Harmful model output such as hate speech, harassment, slurs, threats, or graphic violence.
Definition
Toxicity refers to a class of harmful model outputs including hate speech, harassment, slurs, threats, and graphic violence that can harm or offend users. Measuring and reducing it is a core safety task, typically using classifiers trained on annotated harmful-content datasets. Common mitigations include training on human preferences, guardrail classifiers, and constitutional methods, though defining what counts as toxic varies across contexts and cultures.