Step 10 — Detecting the Most Frequent Multi-Word Phrases (N-grams) in Text with Python
Introduction
You’ve explored word co-occurrence with bigrams and visualized word-pair networks. Now it’s time to deepen your analysis by identifying the most common multi-word phrases (also known as n-grams) in a piece of text. This skill helps you catch recurring expressions, idioms, and multi-word entities that single-word token counts would miss. In this step, we’ll walk through building a simple n-gram detector for any “n”—from bigrams to trigrams and beyond—using only basic Python tools.
Main concept explained clearly
An n-gram is any sequence of “n” consecutive tokens (words) in your text. For example:
- 2-gram (bigram): “machine learning”, “deep learning”
- 3-gram (trigram): “natural language processing”
Detecting frequent n-grams in text involves:
- Tokenizing and cleaning your text as in previous steps.
- Sliding a window of size “n” across your token list to extract all possible n-grams.
- Counting how many times each n-gram appears.
- Displaying the most common ones.
By examining n-grams, you discover not just common words but common phrases, which often carry more semantics.
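The sliding-window idea can be sketched in a few lines of plain Python before we build the full pipeline (the tiny token list here is just an illustration):

```python
# Slide a window of size n over the token list to collect every n-gram.
tokens = ["natural", "language", "processing", "rocks"]
n = 2
ngrams = [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
print(ngrams)
# -> [('natural', 'language'), ('language', 'processing'), ('processing', 'rocks')]
```

Note the `len(tokens) - n + 1` bound: it stops the window before it would run off the end of the list.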
Why this matters in NLP
- Captures multi-word expressions: Many key concepts in text are phrases, not single words (“data science”, “New York City”).
- Improves text analysis: N-gram features boost performance in classification and information retrieval.
- Essential for search and topic modeling: Common n-grams reveal what users or documents are really about.
- Foundation for named-entity recognition and language modeling: Modern NLP tools often start with n-gram features.
Python example
Let’s build and test a script that finds the most common bigrams (2-grams) and trigrams (3-grams) in a sample text.
Step 10.1 — Tokenize and filter the text
```python
import string

text = "Python makes it easy to work with natural language. Natural language processing is both fun and practical."

stop_words = set([
    'the', 'is', 'in', 'it', 'by', 'and', 'a', 'of', 'to', 'for', 'on',
    'o', 'de', 'em', 'para', 'with'
])

def preprocess(text):
    clean = text.strip().lower()
    translator = str.maketrans('', '', string.punctuation)
    no_punct = clean.translate(translator)
    tokens = no_punct.split()
    return [w for w in tokens if w not in stop_words]

tokens = preprocess(text)
print("Tokens:", tokens)
```
Step 10.2 — Extract n-grams from token list
```python
def get_ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

bigrams = get_ngrams(tokens, 2)
trigrams = get_ngrams(tokens, 3)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)
```
Step 10.3 — Count n-gram frequencies
```python
from collections import Counter

bigram_freq = Counter(bigrams)
trigram_freq = Counter(trigrams)
print("Bigram frequencies:", bigram_freq)
print("Trigram frequencies:", trigram_freq)
```
Step 10.4 — Display top N n-grams
```python
top_n = 3

print("Top bigrams:")
for phrase, count in bigram_freq.most_common(top_n):
    print(f"{' '.join(phrase)}: {count}")

print("\nTop trigrams:")
for phrase, count in trigram_freq.most_common(top_n):
    print(f"{' '.join(phrase)}: {count}")
```
Step 10.5 — Complete code for Step 10
```python name=step10_ngrams_frequent_phrases.py
import string
from collections import Counter

text = "Python makes it easy to work with natural language. Natural language processing is both fun and practical."

stop_words = set([
    'the', 'is', 'in', 'it', 'by', 'and', 'a', 'of', 'to', 'for', 'on',
    'o', 'de', 'em', 'para', 'with'
])

def preprocess(text):
    clean = text.strip().lower()
    translator = str.maketrans('', '', string.punctuation)
    no_punct = clean.translate(translator)
    tokens = no_punct.split()
    return [w for w in tokens if w not in stop_words]

def get_ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

tokens = preprocess(text)
bigrams = get_ngrams(tokens, 2)
trigrams = get_ngrams(tokens, 3)
bigram_freq = Counter(bigrams)
trigram_freq = Counter(trigrams)

top_n = 3
print("Top bigrams:")
for phrase, count in bigram_freq.most_common(top_n):
    print(f"{' '.join(phrase)}: {count}")

print("\nTop trigrams:")
for phrase, count in trigram_freq.most_common(top_n):
    print(f"{' '.join(phrase)}: {count}")
```
Line-by-line explanation of the code
- `preprocess(...)`: Cleans, lowercases, removes punctuation and stop words, and returns the token list.
- `get_ngrams(tokens, n)`: Generates a list of n-gram tuples by sliding a window of size n over the tokens.
- `Counter(...)`: Provides fast frequency counts and returns the most common n-grams via `most_common`.
- `for phrase, count in ...`: Loops over and prints the top n-grams, nicely formatted.
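One detail worth knowing about `Counter.most_common`: it sorts by count in descending order, and since Python 3.7 it breaks ties by insertion order, so n-grams with equal counts appear in the order they were first seen:

```python
from collections import Counter

# Counts sort in descending order; equal counts keep first-seen order (Python 3.7+).
freq = Counter([("natural", "language"), ("python", "makes"), ("natural", "language")])
print(freq.most_common(2))
# -> [(('natural', 'language'), 2), (('python', 'makes'), 1)]
```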
Practical notes
- Try different values of n (e.g., 2 for bigrams, 3 for trigrams, 4 for four-word phrases).
- Longer n-grams become rarer but are more likely to be true “phrases”.
- Update your stop word list for Portuguese or other languages; edit the text to suit.
- For longer texts, increase `top_n` or print all results.
- Combine bigram/trigram ranking with your bar chart/visualization skills from Step 6 for more insight.
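To make experimenting with different values of n easier, the tokenizing, windowing, and counting steps can be wrapped into a single helper. This is a minimal sketch: the function name `top_ngrams` is an illustrative choice (not from the script above), and the stop word set is trimmed to the English entries for brevity:

```python
import string
from collections import Counter

stop_words = {'the', 'is', 'in', 'it', 'by', 'and', 'a', 'of', 'to', 'for', 'on', 'with'}

def top_ngrams(text, n, k=3):
    """Return the k most common n-grams of a text (hypothetical helper)."""
    clean = text.strip().lower().translate(str.maketrans('', '', string.punctuation))
    tokens = [w for w in clean.split() if w not in stop_words]
    ngrams = [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
    return Counter(ngrams).most_common(k)

sample = "Python makes it easy to work with natural language. Natural language processing is both fun and practical."
print(top_ngrams(sample, 2))
# -> [(('natural', 'language'), 2), (('python', 'makes'), 1), (('makes', 'easy'), 1)]
```

With a helper like this, trying `n=4` on a longer text is a one-line change.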
Suggested mini exercise
- Use a Portuguese text (with a matched stop words set) and extract the top bigrams/trigrams.
- Compare the top n-grams in two different paragraphs.
- Challenge: Display all bigrams that appear more than once.
- Try an n-gram of length 4 or higher on a larger text and see what comes up.
Conclusion
Detecting frequent n-grams lets you find and quantify the most important multi-word phrases in any text—a massive leap beyond simple word counts! This practice is fundamental in NLP, powering real-world tools from autocorrect to web search and topic modeling. In the next step, you’ll learn how to “lemmatize” and “stem” words—letting you count word families together for a deeper, language-aware analysis.
