Where does the data come from?
Text for a corpus is drawn from a wide variety of materials. These sources are generally grouped into three main categories: Written Texts, Spoken Language, and Specialized Domains. Each offers distinct linguistic insights and presents its own challenges.
Written Texts
Written texts are a "gold mine" of linguistic information. They span a wide range of topics, genres, and styles, from historical accounts to modern opinion editorials.
Physical Sources
Books, essays, and newspapers sourced primarily from libraries.
Digital Archives
Online journals, blogs, social media, and digitized historical documents.
The Challenge
Collecting written text involves more than copying and pasting.
Copyright Law
Many texts are protected as intellectual property, so adding them to a corpus can involve:
- Permissions required
- Negotiations & Fees
- Fair Use limitations
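One way to keep track of these constraints while collecting is to record a clearance status next to every candidate text and only admit cleared items into the corpus. The sketch below is a minimal illustration, not a standard tool: the `Clearance` categories, the `SourceText` fields, and the sample titles are all hypothetical, and whether a given use actually qualifies as fair use is a legal question that code cannot decide.

```python
from dataclasses import dataclass
from enum import Enum


class Clearance(Enum):
    # Hypothetical bookkeeping categories mirroring the list above.
    PERMISSION_GRANTED = "permission granted"
    LICENSED_FOR_FEE = "licensed for a fee"
    FAIR_USE_EXCERPT = "fair use excerpt"
    NOT_CLEARED = "not cleared"


@dataclass
class SourceText:
    title: str
    kind: str  # "physical" or "digital"
    clearance: Clearance


def cleared_for_corpus(sources):
    """Keep only the texts whose recorded status is something other than NOT_CLEARED."""
    return [s for s in sources if s.clearance is not Clearance.NOT_CLEARED]


# Made-up candidates for illustration only.
candidates = [
    SourceText("Library newspaper scan", "physical", Clearance.PERMISSION_GRANTED),
    SourceText("Online journal article", "digital", Clearance.NOT_CLEARED),
    SourceText("Blog post excerpt", "digital", Clearance.FAIR_USE_EXCERPT),
]

usable = cleared_for_corpus(candidates)
print([s.title for s in usable])
```

Storing the status with each text, rather than in a separate list, means the record of why a text was admitted travels with it as the corpus grows.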
Test Your Knowledge
1. What is a major legal challenge when collecting written texts?
2. Which source captures nuances like intonation and hesitations?
3. Which domain-specific corpus contains "User Manuals" and "Patent Descriptions"?