Sources of Text

Where does the data come from?

Text for corpus construction comes from a wide range of materials. These sources are generally grouped into three main pillars: Written Texts, Spoken Language, and Specialized Domains. Each offers distinct linguistic insights and presents its own challenges.

Written Texts

Written texts serve as a "gold mine" of linguistic information. They encapsulate a vast range of topics, genres, and styles, from historical accounts to modern opinion editorials.

Physical Sources

Books, essays, and newspapers sourced primarily from libraries.

Digital Archives

Online journals, blogs, social media, and digitized historical documents.

The Challenge

Collecting written text is not just a matter of copying and pasting.

Copyright Law

Many texts are intellectual property.

Permissions required
Negotiations & Fees
Fair Use limitations
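
One way to keep the copyright question visible during collection is to store a rights status with every source. The following is a minimal sketch in Python, not a standard schema: the `SourceRecord` class, its fields, and the status labels are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical rights labels, chosen for illustration only.
CLEARED_STATUSES = {"public_domain", "permission_granted"}


@dataclass
class SourceRecord:
    title: str
    kind: str           # e.g. "book", "newspaper", "blog"
    rights_status: str  # e.g. "public_domain", "permission_granted", "unknown"


def cleared_for_corpus(records):
    """Return only the records whose rights status is in CLEARED_STATUSES."""
    return [r for r in records if r.rights_status in CLEARED_STATUSES]


records = [
    SourceRecord("Digitized 19th-century letters", "historical document", "public_domain"),
    SourceRecord("Recent opinion editorial", "newspaper", "unknown"),
    SourceRecord("Licensed essay collection", "book", "permission_granted"),
]

# Print the titles that could be added to the corpus under these assumptions.
for record in cleared_for_corpus(records):
    print(record.title)
```

Sources marked "unknown" are held back rather than discarded, so they can be added later if permission is obtained.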

Domain-Specific Corpora

Sometimes a general view isn't enough. Specialized corpora focus on specific fields to capture unique terminologies, jargon, and structural conventions.

Legal Texts

Examples: Statutes, contracts, judicial opinions.

Focus: Technical jargon & precise conventions

Medical Literature

Examples: Research articles, patient histories, clinical guidelines.

Focus: Dense terminology & structural norms

Technical Docs

Examples: User manuals, software docs, patent descriptions.

Focus: Procedural language & acronyms

Academic Research

Examples: Lab reports, literary critiques, journals.

Focus: Argument structure & citation
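
To illustrate how a specialized corpus surfaces domain terminology, the sketch below compares word rates in a small domain sample against a general sample and returns the words most over-represented in the domain text. It is an assumption-laden toy in Python: the function names, the sample sentences, and the simple add-one adjustment for unseen words are all illustrative choices, not an established keyword-extraction method.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def distinctive_terms(domain_text, general_text, top_n=3):
    """Rank domain words by how much more frequent they are than in general text."""
    domain_counts = Counter(tokenize(domain_text))
    general_counts = Counter(tokenize(general_text))
    domain_total = sum(domain_counts.values())
    general_total = sum(general_counts.values())

    scores = {}
    for word, count in domain_counts.items():
        domain_rate = count / domain_total
        # Add-one adjustment (an illustrative choice) so that words
        # absent from the general sample do not cause division by zero.
        general_rate = (general_counts[word] + 1) / (general_total + 1)
        scores[word] = domain_rate / general_rate
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Made-up sample sentences, for illustration only.
legal_sample = "the lessee shall indemnify the lessor and the lessee shall notify the lessor"
general_sample = "the cat sat on the mat and the dog sat on the rug"

print(distinctive_terms(legal_sample, general_sample))
```

Common words such as "the" appear in both samples and score low, while terms found only in the legal sample rise to the top.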

Test Your Knowledge

1. What is a major legal challenge when collecting written texts?

2. Which source captures nuances like intonation and hesitations?

3. Which domain-specific corpus contains "User Manuals" and "Patent Descriptions"?

Spoken Language Corpora

Conversations, interviews, recorded dialogs, and broadcasts.
