Step 10 — Detecting the Most Frequent Multi-Word Phrases (N-grams) in Text with Python

Introduction

You’ve explored word co-occurrence with bigrams and visualized word-pair networks. Now it’s time to deepen your analysis by identifying the most common multi-word phrases (also known as n-grams) in a piece of text. This skill helps you catch recurring expressions, idioms, and multi-word entities that single-word token counts would miss. In this step, we’ll walk through building a simple n-gram detector for any “n”—from bigrams to trigrams and beyond—using only basic Python tools.


Main concept explained clearly

An n-gram is any sequence of “n” consecutive tokens (words) in your text. For example:

  • 2-gram (bigram): “machine learning”, “deep learning”
  • 3-gram (trigram): “natural language processing”

Detecting frequent n-grams in text involves:

  1. Tokenizing and cleaning your text as in previous steps.
  2. Sliding a window of size “n” across your token list to extract all possible n-grams.
  3. Counting how many times each n-gram appears.
  4. Displaying the most common ones.

By examining n-grams, you discover not just common words but common phrases, which often carry more meaning than individual words.


Why this matters in NLP

  • Captures multi-word expressions: Many key concepts in text are phrases, not single words (“data science”, “New York City”).
  • Improves text analysis: N-gram features boost performance in classification and information retrieval.
  • Essential for search and topic modeling: Common n-grams reveal what users or documents are really about.
  • Foundation for named-entity recognition and language modeling: Modern NLP tools often start with n-gram features.

Python example

Let’s build and test a script that finds the most common bigrams (2-grams) and trigrams (3-grams) in a sample text.

Step 10.1 — Tokenize and filter the text

```python
import string

text = "Python makes it easy to work with natural language. Natural language processing is both fun and practical."

# English stop words plus a few Portuguese ones ('o', 'de', 'em', 'para'),
# so the same list also works for the Portuguese exercise later on
stop_words = set([
    'the', 'is', 'in', 'it', 'by', 'and', 'a', 'of', 'to', 'for', 'on',
    'o', 'de', 'em', 'para', 'with'
])

def preprocess(text):
    clean = text.strip().lower()
    translator = str.maketrans('', '', string.punctuation)
    no_punct = clean.translate(translator)
    tokens = no_punct.split()
    return [w for w in tokens if w not in stop_words]

tokens = preprocess(text)
print("Tokens:", tokens)
```

Step 10.2 — Extract n-grams from token list

```python
def get_ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

bigrams = get_ngrams(tokens, 2)
trigrams = get_ngrams(tokens, 3)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)
```

Step 10.3 — Count n-gram frequencies

```python
from collections import Counter

bigram_freq = Counter(bigrams)
trigram_freq = Counter(trigrams)
print("Bigram frequencies:", bigram_freq)
print("Trigram frequencies:", trigram_freq)
```

Step 10.4 — Display top N n-grams

```python
top_n = 3

print("Top bigrams:")
for phrase, count in bigram_freq.most_common(top_n):
    print(f"{' '.join(phrase)}: {count}")

print("\nTop trigrams:")
for phrase, count in trigram_freq.most_common(top_n):
    print(f"{' '.join(phrase)}: {count}")
```

Step 10.5 — Complete code for Step 10

```python name=step10_ngrams_frequent_phrases.py
import string
from collections import Counter

text = "Python makes it easy to work with natural language. Natural language processing is both fun and practical."

stop_words = set([
    'the', 'is', 'in', 'it', 'by', 'and', 'a', 'of', 'to', 'for', 'on',
    'o', 'de', 'em', 'para', 'with'
])

def preprocess(text):
    clean = text.strip().lower()
    translator = str.maketrans('', '', string.punctuation)
    no_punct = clean.translate(translator)
    tokens = no_punct.split()
    return [w for w in tokens if w not in stop_words]

def get_ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

tokens = preprocess(text)
bigrams = get_ngrams(tokens, 2)
trigrams = get_ngrams(tokens, 3)

bigram_freq = Counter(bigrams)
trigram_freq = Counter(trigrams)

top_n = 3

print("Top bigrams:")
for phrase, count in bigram_freq.most_common(top_n):
    print(f"{' '.join(phrase)}: {count}")

print("\nTop trigrams:")
for phrase, count in trigram_freq.most_common(top_n):
    print(f"{' '.join(phrase)}: {count}")
```


Line-by-line explanation of the code

  • preprocess(...): Cleans, lowercases, removes punctuation and stop words, returns token list.
  • get_ngrams(tokens, n): Generates a list of n-gram tuples by sliding a window of size n over the tokens.
  • Counter(...): Fast frequency counts, returning the most common n-grams.
  • for phrase, count in ...: Loops over and prints the top n-grams, nicely formatted.
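As an aside, the sliding-window list comprehension in get_ngrams can also be written with zip, a common Python idiom that avoids index arithmetic. This is just an equivalent alternative sketch (the hypothetical name get_ngrams_zip is ours), not something the rest of the step depends on:

```python
def get_ngrams_zip(tokens, n):
    # zip n progressively shifted views of the token list; zip stops at
    # the shortest view, which yields exactly len(tokens) - n + 1 n-grams
    return list(zip(*(tokens[i:] for i in range(n))))

print(get_ngrams_zip(["natural", "language", "processing", "rocks"], 2))
```

Both versions return the same list of tuples, so they are interchangeable in the frequency-counting code above.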

Practical notes

  • Try different values of n (e.g., 2 for bigrams, 3 for trigrams, 4 for four-word phrases).
  • Longer n-grams become rarer but are more likely to be true “phrases”.
  • Update your stop word list for Portuguese or other languages; edit the text to suit.
  • For longer texts, increase top_n or print all results.
  • Combine bigram/trigram ranking with your bar chart/visualization skills from Step 6 for more insight.
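To see the "longer n-grams become rarer" effect concretely, here is a small self-contained sketch (the sample sentence is made up for illustration, chosen so that some phrases repeat) that reports the highest frequency found at each n:

```python
from collections import Counter

def get_ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample text in which "natural language" repeats often
tokens = ("natural language processing makes natural language tasks easier "
          "because natural language processing reuses patterns").split()

for n in (2, 3, 4):
    phrase, count = Counter(get_ngrams(tokens, n)).most_common(1)[0]
    print(f"{n}-gram: '{' '.join(phrase)}' appears {count} time(s)")
```

On this sample, the top bigram appears three times, the top trigram twice, and no 4-gram repeats at all, which is exactly the rarity trend described above.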

Suggested mini exercise

  1. Use a Portuguese text (with a matched stop words set) and extract the top bigrams/trigrams.
  2. Compare the top n-grams in two different paragraphs.
  3. Challenge: Display all bigrams that appear more than once.
  4. Try an n-gram of length 4 or higher on a larger text and see what comes up.

Conclusion

Detecting frequent n-grams lets you find and quantify the most important multi-word phrases in any text—a massive leap beyond simple word counts! This practice is fundamental in NLP, powering real-world tools from autocorrect to web search and topic modeling. In the next step, you’ll learn how to “lemmatize” and “stem” words—letting you count word families together for a deeper, language-aware analysis.

Published by Edvaldo Guimarães Filho