Unit 6.1

Sources of Text

Building the Foundation of Corpus Linguistics

Where does the data come from?

Textual data for corpus construction comes from a wide range of materials. These sources are generally grouped into three main categories: Written Texts, Spoken Language, and Specialized Domains. Each offers unique linguistic insights and presents its own challenges.

Written Texts

Written texts are a "gold mine" of linguistic information. They cover a wide range of topics, genres, and styles, from historical accounts to modern opinion editorials.

Physical Sources

Books, essays, and newspapers sourced primarily from libraries.

Digital Archives

Online journals, blogs, social media, and digitized historical documents.

The Challenge

Collecting written text involves more than copying and pasting.

Copyright Law

Many texts are intellectual property.

  • Permissions required
  • Negotiations and fees
  • Fair use limitations
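
One way to keep copyright status visible while collecting texts is to record it alongside each candidate source. The following is a minimal sketch, not a prescribed workflow: the `TextSource` record, the `split_by_clearance` helper, the rights labels, and the sample titles are all hypothetical names invented for illustration, and a real project would define its own categories with appropriate legal advice.

```python
from dataclasses import dataclass

# Hypothetical rights labels, assumed for this sketch only.
# Anything outside this set is held back for manual review.
CLEARED_STATUSES = {"public_domain", "permission_granted"}


@dataclass
class TextSource:
    title: str          # placeholder title, not a real document
    source_type: str    # "physical" or "digital"
    rights_status: str  # one of the hypothetical labels used below


def split_by_clearance(sources):
    """Return (cleared, needs_review) lists based on rights_status."""
    cleared = [s for s in sources if s.rights_status in CLEARED_STATUSES]
    needs_review = [s for s in sources if s.rights_status not in CLEARED_STATUSES]
    return cleared, needs_review


# Illustrative candidates only; the statuses are made up for the example.
candidates = [
    TextSource("Digitized historical document", "digital", "public_domain"),
    TextSource("Newspaper article", "physical", "permission_pending"),
    TextSource("Blog post", "digital", "permission_granted"),
    TextSource("Book excerpt", "physical", "fair_use_claimed"),
]

cleared, needs_review = split_by_clearance(candidates)
print([s.title for s in cleared])
print([s.title for s in needs_review])
```

In this sketch a fair use claim is deliberately routed to manual review rather than treated as cleared, since whether fair use applies depends on the specific text and how it is used.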

Test Your Knowledge

1. What is a major legal challenge when collecting written texts?

2. Which source captures nuances like intonation and hesitations?

3. Which domain-specific corpus contains "User Manuals" and "Patent Descriptions"?