COMPUTATIONAL LINGUISTICS 4.1

Decomposing Texts with
Bag of Words Model

Discover how machines perceive text as an unordered collection of words—ignoring grammar to focus on frequency.

The "Bag" Metaphor

Imagine shaking a sentence in a bag. "The cat sat on the mat" becomes a jumble of words. The order is lost, but we know "cat" is in there.

The Core Limitation

"The cat sat on the mat" and "The mat sat on the cat" look exactly the same to this model.

4.1.1 Vector Representation

In NLP, vectors don't represent physical forces. They are Frequency Vectors—organized lists where each slot corresponds to a word from a dictionary.

  1. The Dictionary: A unique list of all words found in your corpus. Its size determines the vector length.

  2. The Count: The value in each vector slot is the number of times that word appears in the specific document.

// Example Dictionary
["the", "cat", "sat", "on", "mat"]
// Sentence: "The cat sat on the mat"
// Counts: the = 2, cat = 1, sat = 1, on = 1, mat = 1
Vector = [2, 1, 1, 1, 1]
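The counting above can be written as a minimal sketch in plain Python, using the example dictionary and sentence (variable names are illustrative):

```python
# Count each dictionary word's occurrences in the sentence.
dictionary = ["the", "cat", "sat", "on", "mat"]
sentence = "The cat sat on the mat"

# Lowercase and split on whitespace so "The" and "the" match.
tokens = sentence.lower().split()
vector = [tokens.count(word) for word in dictionary]

print(vector)  # [2, 1, 1, 1, 1]
```

Each slot of `vector` lines up with the same slot in `dictionary`, which is exactly what makes the representation fixed-length.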

Why Use It? (And Why Not)

Simplicity & Speed

Effortlessly transforms variable-length text into fixed-length vectors suitable for ML algorithms. Great for rapid text analysis of large volumes.

Text Classification

Excellent for identifying dominant themes. If "dog" and "run" appear often, the document is likely about dogs running, regardless of grammar.

The Semantic Gap

It ignores context and word order. "Dog bites man" and "Man bites dog" are treated identically. It misses sarcasm, emotion, and nuance.
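This limitation is easy to verify: a quick sketch using Python's `collections.Counter` shows both sentences collapsing to the same bag.

```python
from collections import Counter

# Word order is discarded, so the two bags are indistinguishable.
bag_a = Counter("dog bites man".split())
bag_b = Counter("man bites dog".split())

print(bag_a == bag_b)  # True
```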

4.1.3 Beyond Basic BoW

N-Gram Models

Instead of single words, consider sequences of 'n' words.

Bigram (n=2): ["Dog bites", "bites man"]
Result: Preserves local order. "Man bites" ≠ "Dog bites".
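A minimal sketch of an n-gram extractor (the function name is illustrative): it slides a window of `n` tokens across the sentence, preserving local order.

```python
def ngrams(tokens, n):
    """Return all in-order sequences of n consecutive tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("dog bites man".split(), 2))  # ['dog bites', 'bites man']
```

Bigrams from "man bites dog" would be `['man bites', 'bites dog']`, so the two sentences are no longer identical under this model.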
TF-IDF Weighting

Term Frequency - Inverse Document Frequency. Not all words are equal.

  • Down-weights common words like "the", "is", "and".
  • Amplifies unique, topic-specific terms.
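A minimal sketch of the weighting on a toy corpus (corpus contents and function name are illustrative), using the common formulation tf × log(N / df):

```python
import math

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)           # term frequency in this document
    df = sum(term in d for d in docs)         # documents containing the term
    idf = math.log(len(docs) / df)            # rarer terms get a larger idf
    return tf * idf

# "the" appears in every document, so idf = log(3/3) = 0 and its weight vanishes.
print(tf_idf("the", docs[0], docs))  # 0.0
# "cat" appears in only two documents, so it keeps a positive weight.
print(tf_idf("cat", docs[0], docs))
```

Production code typically adds smoothing to avoid division by zero for unseen terms, but the core idea is the same: ubiquitous words are down-weighted toward zero.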

Implementation Workflow

1. Tokenization

Breaking sentences into pieces (tokens). "The cat sat" → ["The", "cat", "sat"].

2. Create Vocabulary

Build a list of all unique words encountered across all documents.

3. Feature Vectors

Tally marks against the vocabulary. Convert text to numbers for the machine.
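The three steps above can be sketched end to end in a few lines of Python (document strings are illustrative):

```python
docs = ["The cat sat", "The cat sat on the mat"]

# 1. Tokenization: lowercase and split each document into tokens.
tokenized = [d.lower().split() for d in docs]

# 2. Create vocabulary: all unique words, sorted for a stable slot order.
vocab = sorted({tok for doc in tokenized for tok in doc})

# 3. Feature vectors: one count per vocabulary slot, per document.
vectors = [[doc.count(word) for word in vocab] for doc in tokenized]

print(vocab)    # ['cat', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Note that both documents map to vectors of the same length (the vocabulary size), which is what lets variable-length text feed fixed-input ML algorithms.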
