Evaluation

GSM8K

A dataset of grade-school math word problems testing multi-step arithmetic reasoning.

Definition

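GSM8K (Grade School Math 8K) is a dataset of roughly 8,500 grade-school math word problems, each requiring between two and eight basic arithmetic steps to solve. Each reference solution is written out step by step in natural language and ends with a single numeric answer, so models are typically scored on whether their final number matches. It became a key benchmark for multi-step numerical reasoning in language models and helped demonstrate the value of chain-of-thought prompting (having the model show its work step by step). Strong reasoning models now score close to the ceiling on it, so it serves more as a baseline than a frontier test.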
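As a rough illustration of how final-answer scoring can work, the sketch below extracts the number that follows the `####` marker closing each GSM8K reference solution and computes exact-match accuracy. The helper names `extract_answer` and `exact_match_accuracy` are hypothetical, and the sketch assumes the model has been prompted to end its output in the same `#### <number>` format; real evaluation harnesses may parse answers differently.

```python
import re
from typing import Optional, Sequence

# GSM8K reference solutions end with a line such as "#### 18".
# Assumption: model outputs are prompted to follow the same convention.
FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_answer(solution: str) -> Optional[str]:
    """Return the final numeric answer as a string, or None if no marker is found."""
    match = FINAL_ANSWER.search(solution)
    if match is None:
        return None
    # Drop thousands separators so "1,250" and "1250" compare equal.
    return match.group(1).replace(",", "")


def exact_match_accuracy(
    model_outputs: Sequence[str], reference_solutions: Sequence[str]
) -> float:
    """Fraction of problems where the model's final number equals the reference."""
    if not reference_solutions:
        raise ValueError("reference_solutions must not be empty")
    correct = 0
    for output, reference in zip(model_outputs, reference_solutions):
        predicted = extract_answer(output)
        gold = extract_answer(reference)
        # Compare numerically so "16" and "16.0" count as the same answer.
        if predicted is not None and gold is not None:
            if float(predicted) == float(gold):
                correct += 1
    return correct / len(reference_solutions)


# Toy example written for illustration; it is not taken from the dataset.
reference = (
    "The baker makes 3 * 12 = 36 muffins.\n"
    "After selling 20, 36 - 20 = 16 muffins are left.\n"
    "#### 16"
)
good_output = "3 trays of 12 is 36. 36 minus 20 is 16.\n#### 16"
bad_output = "3 plus 12 is 15. 20 minus 15 is 5.\n#### 5"

accuracy = exact_match_accuracy([good_output, bad_output], [reference, reference])
print(accuracy)
```

Scoring only the final number keeps evaluation simple, but it gives no credit for correct intermediate steps and can reward a right answer reached through flawed reasoning.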