Module: Corpus Construction

Sampling Strategies

How we choose the texts that define our understanding of language.

Having explored the diverse sources of texts, the question arises: how do we choose? Your sampling strategy shapes the whole corpus. It determines whether your data is a faithful mirror of the language or a distorted reflection.

In this guide, we will look at the three pillars of sampling: Random, Stratified, and Purposive.


Random Sampling

The "Equal Opportunity" approach. Every text in the sampling frame (the list of texts that stands in for the population) has an equal probability of being chosen. It is the gold standard for minimizing selection bias, and its results generalize to the population the frame covers.

Bias Level: Very Low
Example: Population Frame (N=100), P(select) = 1/N

Why use it?

Use this when you need to make generalizations about a whole language or population (e.g., "What is the average sentence length in English news?"). Because every text's selection probability is known, you can attach standard errors and confidence intervals to your estimates.
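As a minimal sketch of simple random sampling, the snippet below draws texts without replacement from a frame of 100 items. The function name `simple_random_sample`, the placeholder text IDs, and the sample size of 10 are assumptions made for illustration; only the frame size N=100 comes from the example above.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n items from frame without replacement, each equally likely."""
    rng = random.Random(seed)  # a private generator keeps the draw reproducible
    return rng.sample(frame, n)


# Hypothetical sampling frame: 100 placeholder text IDs (N = 100).
frame = [f"text_{i:03d}" for i in range(100)]

# Sample size 10 is an arbitrary choice for the illustration.
sample = simple_random_sample(frame, 10, seed=42)
print(sample)
```

Fixing the seed is a design choice that makes the corpus selection repeatable, which helps when others need to reconstruct the same sample.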

Stratified Sampling

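The "Structured" approach. The population is first divided into subgroups (strata) based on known characteristics (like genre, year, or author gender). We then draw a random sample from each stratum, usually in proportion to its share of the population (or in equal amounts when the strata are to be compared), so that every subgroup is represented.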

Complexity: High
Example: Mixed Population of Science (30%), Fiction (50%), News (20%)

Why use it?

Essential for comparative studies. If you want to compare 19th-century and 20th-century English, stratifying by period guarantees that you have enough data from both, rather than leaving the balance to chance.
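A rough sketch of proportional stratified sampling, using the Science/Fiction/News mix from the example above. The function name `stratified_sample`, the `genre` field, and the sample size of 20 are assumptions for illustration, not part of any particular toolkit.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(texts, key, n, seed=None):
    """Sample about n texts, allocating draws to strata in proportion to size."""
    rng = random.Random(seed)

    # Step 1: divide the population into strata.
    strata = defaultdict(list)
    for text in texts:
        strata[key(text)].append(text)

    # Step 2: draw a simple random sample inside each stratum.
    sample = []
    for members in strata.values():
        # Rounding means the total can differ slightly from n for awkward sizes.
        k = round(n * len(members) / len(texts))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


# Hypothetical population of 100 texts: 30 science, 50 fiction, 20 news.
population = (
    [{"id": f"sci_{i}", "genre": "science"} for i in range(30)]
    + [{"id": f"fic_{i}", "genre": "fiction"} for i in range(50)]
    + [{"id": f"news_{i}", "genre": "news"} for i in range(20)]
)

sample = stratified_sample(population, key=lambda t: t["genre"], n=20, seed=1)
counts = Counter(t["genre"] for t in sample)
print(counts)
```

For a comparison between strata, one option is to replace the proportional `k` with a fixed number per stratum, so that each subgroup contributes the same amount of data.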

Purposive Sampling

The "Judgmental" approach. A non-probability method where the researcher deliberately selects texts based on specific criteria. It sacrifices generalizability for depth and relevance to a specific research question.

Specificity: Very High
Example: Population with Hidden Traits, scanned for specific criteria (Criteria: High Density)

Why use it?

Well suited to investigating rare or specific phenomena. If you want to study "passive voice usage to obscure blame," you would purposively select legal contracts and political apologies, ignoring other genres.
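Purposive selection can be sketched as a filter over text metadata. In the snippet below the tiny corpus, its `genre` labels, and the function name `purposive_sample` are all made up for illustration; the target genres follow the legal-contract and political-apology example above.

```python
def purposive_sample(texts, criterion):
    """Keep every text that satisfies the researcher's criterion (no randomness)."""
    return [text for text in texts if criterion(text)]


# Hypothetical mini-corpus with invented IDs and genre labels.
corpus = [
    {"id": "t1", "genre": "legal_contract"},
    {"id": "t2", "genre": "sports_report"},
    {"id": "t3", "genre": "political_apology"},
    {"id": "t4", "genre": "recipe"},
    {"id": "t5", "genre": "legal_contract"},
]

# The researcher's judgment is encoded explicitly as the set of target genres.
TARGET_GENRES = {"legal_contract", "political_apology"}

selected = purposive_sample(corpus, lambda t: t["genre"] in TARGET_GENRES)
print([t["id"] for t in selected])
```

Because inclusion depends on the researcher's criterion rather than on chance, findings from such a sample describe the selected texts and should not be generalized to the language as a whole.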