The Process
Acquiring textual data is not a single task but a multi-stage process that calls for techniques tailored to the nature of the source material. Whether the source is an ancient manuscript or a short-lived tweet, the methodology must adapt.
Written Texts
Written texts are the mainstay of many corpora. Sourcing them means navigating both physical shelves and digital collections.
Repositories
Libraries, archives, and online databases hosting everything from classical literature to government records.
Legal Considerations
Copyright is a major hurdle. Obtaining permissions can be complex and time-consuming, and it often requires careful negotiation.
Digital Archives
Massive collections of digitized texts that are ready for analysis once access is granted, though many sit behind paywalls or licensing restrictions.
Web Data Acquisition
The collection process becomes more technically involved on the internet.
Web Scraping
Targeted Extraction
Using tools or scripts to extract specific data (text, metadata, dates) from web pages.
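As an illustration of targeted extraction, the sketch below pulls a title, a publication date, and body paragraphs out of a single page using only Python's standard library. The sample HTML, the `ArticleExtractor` class, and the `scrape` function are hypothetical names invented for this example, and the assumption that the date lives in a `<time datetime="...">` element will not hold for every site.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collects the <title>, the first <time datetime=...> value, and <p> text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.date = None
        self.paragraphs = []
        self._in_title = False
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "time" and self.date is None:
            # Assumption: the page marks its date with <time datetime="...">.
            self.date = dict(attrs).get("datetime")
        elif tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            self._in_p = False
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._buf.append(data)


def scrape(html):
    """Return the targeted fields from one HTML document as a dict."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "date": parser.date,
        "text": parser.paragraphs,
    }


# Hypothetical page used in place of a live download.
SAMPLE_PAGE = """
<html><head><title>Sample Post</title></head>
<body>
  <nav>Home | About</nav>
  <time datetime="2024-01-15">15 January 2024</time>
  <p>First   paragraph of <b>body</b> text.</p>
  <p>Second paragraph.</p>
</body></html>
"""

record = scrape(SAMPLE_PAGE)
print(record)
```

Navigation text is ignored because only `<title>`, `<time>`, and `<p>` content is collected. In practice the selectors would need adjusting for each site's markup.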
Web Crawling
Broad Indexing
Using automated bots to browse and index large amounts of information across many websites.
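A crawler differs from a scraper mainly in that it follows links to discover new pages. The sketch below shows one common approach, a breadth-first traversal restricted to a single host. The `crawl` function, its `fetch` parameter, and the `FAKE_SITE` dictionary are hypothetical: the in-memory dictionary stands in for real HTTP requests so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Gathers the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl limited to the host of start_url.

    fetch(url) is assumed to return the page's HTML as a string,
    or None if the page cannot be retrieved.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop "#fragment" suffixes.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Hypothetical four-page site used in place of live HTTP requests.
FAKE_SITE = {
    "https://example.org/": (
        '<a href="/a">A</a> <a href="b.html">B</a> '
        '<a href="https://other.example/x">X</a>'
    ),
    "https://example.org/a": '<a href="/">home</a> <a href="/c#top">C</a>',
    "https://example.org/b.html": "<p>no links</p>",
    "https://example.org/c": '<a href="/missing">gone</a>',
}

visited = crawl("https://example.org/", FAKE_SITE.get)
print(list(visited))
```

A crawler run against real sites would also need to honor `robots.txt` (Python ships `urllib.robotparser` for this), pause between requests, and handle network errors, all of which this sketch leaves out.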
Spoken Language Corpora
1. Recording
Capturing interactions: conversations, interviews, and verbal exchanges in diverse settings.
2. Transcribing
Transforming audio into written form. This is labor-intensive and requires meticulous attention to detail.
3. Analysis
Capturing speech features, pauses, and non-verbal cues for accurate linguistic study.
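To make pauses and non-verbal cues available for analysis, a transcript has to encode them in a form software can read. The sketch below assumes a notation loosely modelled on conversation-analytic conventions: `(0.5)` for a timed pause in seconds, `(.)` for an unmeasured micropause, and `((laughs))` for a non-verbal event. The source does not prescribe any particular scheme, so the notation, the `Segment` class, and the `parse_utterance` function are all illustrative assumptions.

```python
import re
from dataclasses import dataclass

# Assumed notation: ((event)), (seconds), and (.) for a micropause.
TOKEN_RE = re.compile(
    r"\(\((?P<event>[^()]+)\)\)"
    r"|\((?P<pause>\d+(?:\.\d+)?)\)"
    r"|\((?P<micro>\.)\)"
)


@dataclass
class Segment:
    kind: str      # "speech", "pause", or "event"
    value: object  # text, pause length in seconds, or None for a micropause


def parse_utterance(line):
    """Split 'SPEAKER: text' into a speaker label and a list of segments."""
    speaker, _, body = line.partition(":")
    segments = []
    pos = 0
    for match in TOKEN_RE.finditer(body):
        speech = body[pos:match.start()].strip()
        if speech:
            segments.append(Segment("speech", speech))
        if match.group("event"):
            segments.append(Segment("event", match.group("event").strip()))
        elif match.group("pause"):
            segments.append(Segment("pause", float(match.group("pause"))))
        else:
            # Micropause: audible but too short to time.
            segments.append(Segment("pause", None))
        pos = match.end()
    tail = body[pos:].strip()
    if tail:
        segments.append(Segment("speech", tail))
    return speaker.strip(), segments


speaker, segments = parse_utterance(
    "A: well (0.5) I suppose so ((laughs)) (.) yes"
)
for seg in segments:
    print(seg.kind, seg.value)
```

Keeping pauses and events as separate segments, rather than stripping them, lets later analysis count or time them without re-reading the audio.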
Domain-Specific & Collaborative
Technical, legal, medical, or academic documents often reside in private repositories or subscription services. Acquiring this data requires a collaborative approach with subject matter experts to ensure:
- Access to appropriate sources
- Accurate understanding & annotation
- Rigorous selection for quality