Step 10 — Detecting the Most Frequent Multi-Word Phrases (N-grams) in Text with Python
Introduction
You’ve explored word co-occurrence with bigrams and visualized word-pair networks. Now it’s time to deepen your analysis by identifying the most common multi-word phrases (also known as n-grams) in a piece of text. This skill helps you catch recurring expressions, idioms, and multi-word entities that single-word token counts would miss. In this step, we’ll walk through building a simple n-gram detector for any “n”—from bigrams to trigrams and beyond—using only basic Python tools.
Main concept explained clearly
An n-gram is any sequence of “n” consecutive tokens (words) in your text. For example:
- 2-gram (bigram): “machine learning”, “deep learning”
- 3-gram (trigram): “natural language processing”
Detecting frequent n-grams in text involves:
- Tokenizing and cleaning your text as in previous steps.
- Sliding a window of size “n” across your token list to extract all possible n-grams.
- Counting how many times each n-gram appears.
- Displaying the most common ones.
By examining n-grams, you discover not just common words but common phrases, which often carry more semantics.
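The sliding-window idea can be sketched in a few lines of plain Python before we build the full pipeline (the tiny token list here is just an illustration):

```python
# Slide a window of size n over the token list to collect every n-gram.
tokens = ["natural", "language", "processing", "rocks"]
n = 2
ngrams = [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
print(ngrams)
# -> [('natural', 'language'), ('language', 'processing'), ('processing', 'rocks')]
```

Note the `len(tokens) - n + 1` bound: it stops the window before it would run off the end of the list.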
Why this matters in NLP
- Captures multi-word expressions: Many key concepts in text are phrases, not single words (“data science”, “New York City”).
- Improves text analysis: N-gram features boost performance in classification and information retrieval.
- Essential for search and topic modeling: Common n-grams reveal what users or documents are really about.
- Foundation for named-entity recognition and language modeling: Modern NLP tools often start with n-gram features.
Python example
Let’s build and test a script that finds the most common bigrams (2-grams) and trigrams (3-grams) in a sample text.
Step 10.1 — Tokenize and filter the text
```python
import string

text = "Python makes it easy to work with natural language. Natural language processing is both fun and practical."

stop_words = set([
    'the', 'is', 'in', 'it', 'by', 'and', 'a', 'of', 'to', 'for', 'on',
    'o', 'de', 'em', 'para', 'with'
])

def preprocess(text):
    clean = text.strip().lower()
    translator = str.maketrans('', '', string.punctuation)
    no_punct = clean.translate(translator)
    tokens = no_punct.split()
    return [w for w in tokens if w not in stop_words]

tokens = preprocess(text)
print("Tokens:", tokens)
```
Step 10.2 — Extract n-grams from token list
```python
def get_ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

bigrams = get_ngrams(tokens, 2)
trigrams = get_ngrams(tokens, 3)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)
```
Step 10.3 — Count n-gram frequencies
```python
from collections import Counter

bigram_freq = Counter(bigrams)
trigram_freq = Counter(trigrams)
print("Bigram frequencies:", bigram_freq)
print("Trigram frequencies:", trigram_freq)
```
Step 10.4 — Display top N n-grams
```python
top_n = 3

print("Top bigrams:")
for phrase, count in bigram_freq.most_common(top_n):
    print(f"{' '.join(phrase)}: {count}")

print("\nTop trigrams:")
for phrase, count in trigram_freq.most_common(top_n):
    print(f"{' '.join(phrase)}: {count}")
```
Step 10.5 — Complete code for Step 10
```python name=step10_ngrams_frequent_phrases.py
import string
from collections import Counter

text = "Python makes it easy to work with natural language. Natural language processing is both fun and practical."

stop_words = set([
    'the', 'is', 'in', 'it', 'by', 'and', 'a', 'of', 'to', 'for', 'on',
    'o', 'de', 'em', 'para', 'with'
])

def preprocess(text):
    clean = text.strip().lower()
    translator = str.maketrans('', '', string.punctuation)
    no_punct = clean.translate(translator)
    tokens = no_punct.split()
    return [w for w in tokens if w not in stop_words]

def get_ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

tokens = preprocess(text)
bigrams = get_ngrams(tokens, 2)
trigrams = get_ngrams(tokens, 3)
bigram_freq = Counter(bigrams)
trigram_freq = Counter(trigrams)

top_n = 3
print("Top bigrams:")
for phrase, count in bigram_freq.most_common(top_n):
    print(f"{' '.join(phrase)}: {count}")

print("\nTop trigrams:")
for phrase, count in trigram_freq.most_common(top_n):
    print(f"{' '.join(phrase)}: {count}")
```
Line-by-line explanation of the code
- `preprocess(...)`: Cleans, lowercases, removes punctuation and stop words, and returns the token list.
- `get_ngrams(tokens, n)`: Generates a list of n-gram tuples by sliding a window of size n over the tokens.
- `Counter(...)`: Provides fast frequency counts and returns the most common n-grams via `most_common`.
- `for phrase, count in ...`: Loops over and prints the top n-grams, nicely formatted.
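One detail worth knowing about `Counter.most_common`: it sorts by count in descending order, and since Python 3.7 it breaks ties by insertion order, so n-grams with equal counts appear in the order they were first seen:

```python
from collections import Counter

# Counts sort in descending order; equal counts keep first-seen order (Python 3.7+).
freq = Counter([("natural", "language"), ("python", "makes"), ("natural", "language")])
print(freq.most_common(2))
# -> [(('natural', 'language'), 2), (('python', 'makes'), 1)]
```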
Practical notes
- Try different values of n (e.g., 2 for bigrams, 3 for trigrams, 4 for four-word phrases).
- Longer n-grams become rarer but are more likely to be true “phrases”.
- Update your stop word list for Portuguese or other languages; edit the text to suit.
- For longer texts, increase `top_n` or print all results.
- Combine bigram/trigram ranking with your bar chart/visualization skills from Step 6 for more insight.
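To make experimenting with different values of n easier, the tokenizing, windowing, and counting steps can be wrapped into a single helper. This is a minimal sketch: the function name `top_ngrams` is an illustrative choice (not from the script above), and the stop word set is trimmed to the English entries for brevity:

```python
import string
from collections import Counter

stop_words = {'the', 'is', 'in', 'it', 'by', 'and', 'a', 'of', 'to', 'for', 'on', 'with'}

def top_ngrams(text, n, k=3):
    """Return the k most common n-grams of a text (hypothetical helper)."""
    clean = text.strip().lower().translate(str.maketrans('', '', string.punctuation))
    tokens = [w for w in clean.split() if w not in stop_words]
    ngrams = [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
    return Counter(ngrams).most_common(k)

sample = "Python makes it easy to work with natural language. Natural language processing is both fun and practical."
print(top_ngrams(sample, 2))
# -> [(('natural', 'language'), 2), (('python', 'makes'), 1), (('makes', 'easy'), 1)]
```

With a helper like this, trying `n=4` on a longer text is a one-line change.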
Suggested mini exercise
- Use a Portuguese text (with a matched stop words set) and extract the top bigrams/trigrams.
- Compare the top n-grams in two different paragraphs.
- Challenge: Display all bigrams that appear more than once.
- Try an n-gram of length 4 or higher on a larger text and see what comes up.
Conclusion
Detecting frequent n-grams lets you find and quantify the most important multi-word phrases in any text—a massive leap beyond simple word counts! This practice is fundamental in NLP, powering real-world tools from autocorrect to web search and topic modeling. In the next step, you’ll learn how to “lemmatize” and “stem” words—letting you count word families together for a deeper, language-aware analysis.
