The Challenge
Dictionaries and thesauri are the backbone of language technology, but language is alive: it evolves every day. Lexical acquisition is the systematic process of collecting new lexical data to keep these digital resources robust and up to date.
Continuous Enrichment
Keeping lexical resources dynamic rather than static as the language they describe changes.
Corpus-Based Analysis
The primary engine for discovery is the corpus: a large, structured collection of texts representing real-world language usage.
Sources
Collections range from classic literature and spoken transcripts to vast aggregations of web-based content.
Observations
Lexicographers study word patterns, frequency, and semantic associations hidden within the text.
Collocations
Identifying how words group together to find idiomatic usage: native speakers say "strong coffee" and "powerful computer", but rarely "powerful coffee", even though the adjectives are near-synonyms.
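A minimal sketch of how collocation candidates can be surfaced by counting adjacent word pairs; the toy corpus below is invented for illustration, and real systems run the same loop over millions of tokenized sentences:

    from collections import Counter

    # Toy corpus; a real one would hold millions of sentences.
    corpus = [
        "she drinks strong coffee every morning",
        "he bought a powerful computer",
        "strong coffee keeps me awake",
        "a powerful computer renders this quickly",
    ]

    bigrams = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        bigrams.update(zip(tokens, tokens[1:]))

    # Frequent adjacent pairs are collocation candidates.
    for pair, count in bigrams.most_common(3):
        print(" ".join(pair), count)

Raw counts alone favour common words, which is why real tools also score pairs against chance, as described in the next section.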
Techniques of Extraction
The Power of Numbers
Statistical analysis computes metrics that quantify how words behave in a corpus:
- Frequency Counts: identify common patterns and gauge word importance.
- Collocation Strength: find words that co-occur more often than chance would predict, a strong signal of idioms and fixed expressions (see the scoring sketch after the example output below).
- Distribution: measure how a word spreads across different genres and registers.
Example output from a frequency and collocation analysis (illustrative figures):

"internet": 14,050 occurrences (high frequency)
"floppy disk": 12 occurrences (low frequency)

> analyze_collocation("make")
"make sense" (score: 0.98)
"make decision" (score: 0.95)
"make homework" (score: 0.05; rejected, since English prefers "do homework")
The Intelligent Approach
Machine learning (ML) scales acquisition to large datasets and captures context-dependent meanings that raw counts miss.
Supervised Learning
Learns from annotated examples. Used for part-of-speech (POS) tagging, Named Entity Recognition (NER), and Word Sense Disambiguation (WSD).
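A minimal sketch of supervised models in use, assuming spaCy and its small English model are installed (pip install spacy, then python -m spacy download en_core_web_sm); the pretrained pipeline supplies the POS tagger and NER model:

    import spacy

    # Assumes the en_core_web_sm model has been downloaded.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Apple released a new laptop in California last week.")

    # POS tagging: each token receives a part-of-speech label.
    for token in doc:
        print(token.text, token.pos_)

    # NER: spans of text labelled with entity types (ORG, GPE, DATE, ...).
    for ent in doc.ents:
        print(ent.text, ent.label_)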
Unsupervised Learning
Finds hidden patterns in unlabelled data, grouping words with similar contexts and surfacing semantic relationships.
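A minimal sketch of the unsupervised idea: represent each word by its co-occurrence profile and cluster the profiles, so words used in similar contexts land in the same group. The toy sentences and cluster count are invented for illustration, and scikit-learn is assumed to be installed:

    import numpy as np
    from sklearn.cluster import KMeans

    sentences = [
        "the cat chased the mouse",
        "the dog chased the cat",
        "i drank hot coffee",
        "she drank cold tea",
    ]
    tokenized = [s.split() for s in sentences]
    vocab = sorted({w for s in tokenized for w in s})
    index = {w: i for i, w in enumerate(vocab)}

    # Word-by-word co-occurrence counts within each sentence.
    M = np.zeros((len(vocab), len(vocab)))
    for s in tokenized:
        for w in s:
            for c in s:
                if w != c:
                    M[index[w], index[c]] += 1

    # Cluster words by their co-occurrence profiles.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(M)
    for w in vocab:
        print(w, labels[index[w]])

Real systems use dense embeddings (e.g., word2vec) rather than raw counts, but the clustering principle is the same.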
The Lexical Acquisition Pipeline
1. Input
Raw text from the web, books, and speech transcripts.
2. Processing
Statistical analysis and machine-learning extraction.
3. Output
Curated, standardized, and enriched dictionaries.
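A minimal end-to-end sketch of the three stages, with hypothetical collect, extract, and publish functions standing in for real crawling, statistical/ML extraction, and dictionary publishing:

    from collections import Counter

    def collect(sources):
        """1. Input: gather raw text (an in-memory stand-in for web/books/speech)."""
        return " ".join(sources).lower().split()

    def extract(tokens, min_count=2):
        """2. Processing: keep words frequent enough to merit an entry."""
        counts = Counter(tokens)
        return {w: c for w, c in counts.items() if c >= min_count}

    def publish(entries):
        """3. Output: emit standardized dictionary records."""
        return [{"lemma": w, "frequency": c} for w, c in sorted(entries.items())]

    records = publish(extract(collect([
        "the internet changed everything",
        "the internet keeps growing",
    ])))
    print(records)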
Knowledge Check
Which method is best for identifying "idiomatic language use" based on words appearing together?