Discover how machines perceive text as an unordered collection of words—ignoring grammar to focus on frequency.
Imagine shaking a sentence in a bag. "The cat sat on the mat" becomes a jumble of words. The order is lost, but we know "cat" is in there.
"The cat sat on the mat" and "The mat sat on the cat" look exactly the same to this model.
In NLP, vectors don't represent physical forces. They are Frequency Vectors—organized lists where each slot corresponds to a word from a dictionary.
The Dictionary: A unique list of all words found in your corpus. Its size determines the vector length.
The Count: The value in each vector slot is the number of times that word appears in the specific document.
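The Dictionary and the Count can be sketched in a few lines of plain Python (names here are illustrative, not a standard API). Note how the two "cat/mat" sentences from earlier produce identical vectors:

```python
# A minimal Bag of Words sketch: build a dictionary from a tiny corpus,
# then turn each document into a frequency vector.
corpus = ["the cat sat on the mat", "the mat sat on the cat"]

# The Dictionary: sorted list of all unique words; its size fixes the vector length.
dictionary = sorted({word for doc in corpus for word in doc.split()})

def to_vector(doc):
    # The Count: how many times each dictionary word appears in this document.
    words = doc.split()
    return [words.count(word) for word in dictionary]

print(dictionary)            # ['cat', 'mat', 'on', 'sat', 'the']
print(to_vector(corpus[0]))  # [1, 1, 1, 1, 2]
print(to_vector(corpus[1]))  # [1, 1, 1, 1, 2] -- same vector: order is lost
```

Because only counts survive, both sentences map to the same point in vector space, which is exactly the limitation described above.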
Type a sentence below to see how the Bag of Words model "sees" it.
Effortlessly transforms variable-length text into fixed-length vectors suitable for ML algorithms. Great for rapid analysis of large volumes of text.
Excellent for identifying dominant themes. If "dog" and "run" appear often, the document is likely about dogs running, regardless of grammar.
It ignores context and word order. "Dog bites man" and "Man bites dog" are treated identically. It misses sarcasm, emotion, and nuance.
Instead of single words, consider sequences of 'n' consecutive words, called n-grams. Bigrams (n=2) preserve some local word order.
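A sketch of n-gram extraction as a sliding window over the tokens (the function name is illustrative):

```python
def ngrams(tokens, n):
    # Slide a window of size n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```

With bigrams as features, "dog bites man" and "man bites dog" no longer look identical, since ('dog', 'bites') and ('man', 'bites') are different entries.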
Term Frequency - Inverse Document Frequency. Not all words are equal: a word that appears in nearly every document (like "the") carries little information, so it gets downweighted.
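One common TF-IDF variant can be sketched as follows (real libraries add smoothing terms, so exact scores differ; the function names are illustrative):

```python
import math

# Tiny corpus of pre-tokenized documents.
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]

def tf(word, doc):
    # Term frequency: share of the document occupied by this word.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Inverse document frequency: rarer across documents -> larger weight.
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "the" appears in every document, so its IDF is log(3/3) = 0.
print(tfidf("the", docs[0], docs))  # 0.0
# "cat" appears in 2 of 3 documents, so it keeps some weight.
print(round(tfidf("cat", docs[0], docs), 3))  # 0.135
```

The effect: ubiquitous filler words score zero, while distinctive words dominate the vector.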
Breaking sentences into pieces (tokens). "The cat sat" → ["The", "cat", "sat"].
Build a list of all unique words encountered across all documents.
Tally marks against the vocabulary. Convert text to numbers for the machine.
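The three steps above, tokenize, build the vocabulary, and count, string together into one small pipeline (a sketch with illustrative names, lowercasing during tokenization so "The" and "the" count as one word):

```python
def tokenize(text):
    # Step 1: break the sentence into tokens.
    return text.lower().split()

def build_vocab(texts):
    # Step 2: all unique words encountered across all documents.
    return sorted({w for t in texts for w in tokenize(t)})

def vectorize(text, vocab):
    # Step 3: tally each vocabulary word's count in this document.
    tokens = tokenize(text)
    return [tokens.count(w) for w in vocab]

texts = ["The cat sat", "The dog sat"]
vocab = build_vocab(texts)
print(vocab)                       # ['cat', 'dog', 'sat', 'the']
print(vectorize(texts[0], vocab))  # [1, 0, 1, 1]
print(vectorize(texts[1], vocab))  # [0, 1, 1, 1]
```

Every document, whatever its length, comes out as a vector of the same fixed length, ready for an ML algorithm.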