
Web Scraping

Programmatically collecting data from websites.

Definition

Web scraping is the programmatic collection of data from websites, in which automated tools fetch pages and extract their text, links, or other content. It is a common way to assemble large corpora for pretraining language models, often building on broad crawls of the public web. Scraped data usually needs heavy cleaning, deduplication, and filtering before it is suitable for training.
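The fetch, extract, and deduplicate steps described above can be illustrated with a minimal sketch that uses only the Python standard library. All names here (`PageExtractor`, `extract`, `fetch`, `deduplicate`, and the `example-scraper/0.1` user agent) are hypothetical and chosen for illustration; this is not a production pipeline.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class PageExtractor(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""

    SKIPPED_TAGS = {"script", "style"}  # content that is not readable text

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html, base_url):
    """Return (text, links) extracted from an HTML string."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


def fetch(url, timeout=10.0):
    """Download one page and decode it to a string (hypothetical user agent)."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def deduplicate(texts):
    """Drop empty texts and exact duplicates after case/whitespace normalization."""
    seen = set()
    unique = []
    for text in texts:
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if normalized and digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

A page would be processed with something like `text, links = extract(fetch(url), url)`, and the collected texts passed through `deduplicate`. Hashing catches only exact copies after normalization; near-duplicate detection and quality filtering, which the definition also mentions, would need additional steps not shown here.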