Step 7 — Comparing Word Frequencies Between Two Texts in Python
Introduction
By now, you can read, clean, tokenize, filter, and visualize word frequencies in a single text. But in real NLP projects, comparison is as important as analysis. How do two different texts differ in their vocabulary? Which words are unique to one text or more frequent in another? In this step, you’ll learn how to practically compare the word frequencies of two texts side-by-side—laying the groundwork for author attribution, document classification, and deeper stylistic or content analysis.
Main concept explained clearly
Comparing texts boils down to three major skills:
- Generating a filtered frequency dictionary for each text (learned in previous steps).
- Aligning the vocabularies, so you can compare counts for each word across both texts.
- Visualizing or printing these comparisons in a clear way.
By creating a table (or two-column printout), you can quickly spot:
- Which words are shared or unique?
- Are certain terms unusually frequent in one vs. the other text?
- Where is there overlap or big differences?
This is foundational for advanced approaches like document similarity and clustering.
Why this matters in NLP
- Many NLP tasks (e.g., spam detection, sentiment analysis, authorship determination) are about differences and similarities between documents.
- Quick word frequency comparison reveals stylistic or topical differences, jargon, or even writing habits.
- Foundation for building term-document matrices, pairwise distance metrics, and machine learning features.
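The jump from two frequency dictionaries to a term-document matrix is small. Here is a minimal sketch using `collections.Counter` and toy token lists (the names `docs`, `counts`, `vocab`, and `matrix` are illustrative, not from the script below):

```python
from collections import Counter

# Toy mini-corpus: any list of token lists works here.
docs = [
    ["nlp", "step", "step", "practical"],
    ["python", "step", "data", "science"],
]

# Count each document, then build a vocabulary shared by all documents.
counts = [Counter(doc) for doc in docs]
vocab = sorted(set().union(*counts))

# Term-document matrix: one row per word, one column per document.
matrix = {word: [c.get(word, 0) for c in counts] for word in vocab}

print(matrix["step"])  # [2, 1] — "step" appears twice in doc 0, once in doc 1
```

Each row of `matrix` is already a tiny feature vector, which is exactly the shape that distance metrics and machine learning models expect.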
Python example
Let’s compare two simple English (or Portuguese/English mixed) texts, side by side, step by step.
Step 7.1 — Prepare and tokenize both texts
```python
import string

text_a = "NLP is practical. Learning step by step makes it easier to learn."
text_b = "Learning Python for data science is both rewarding and challenging. Step up your skills!"

stop_words = set([
    'the', 'is', 'in', 'it', 'by', 'and', 'a', 'of', 'to', 'for', 'on',
    'o', 'de', 'em', 'para', 'your'
])

def preprocess(text):
    clean = text.strip().lower()
    translator = str.maketrans('', '', string.punctuation)
    no_punct = clean.translate(translator)
    tokens = no_punct.split()
    filtered = [w for w in tokens if w not in stop_words]
    return filtered

tokens_a = preprocess(text_a)
tokens_b = preprocess(text_b)
print("Text A tokens:", tokens_a)
print("Text B tokens:", tokens_b)
```
Step 7.2 — Build frequency dictionaries for both
```python
def word_freq(tokens):
    freq = {}
    for w in tokens:
        freq[w] = freq.get(w, 0) + 1
    return freq

freq_a = word_freq(tokens_a)
freq_b = word_freq(tokens_b)
```
Step 7.3 — Combine vocabularies and print comparison
```python
all_words = sorted(set(freq_a.keys()).union(freq_b.keys()))

print("{:<12} {:>5} {:>5}".format("Word", "A", "B"))
print("-" * 25)
for w in all_words:
    count_a = freq_a.get(w, 0)
    count_b = freq_b.get(w, 0)
    print("{:<12} {:>5} {:>5}".format(w, count_a, count_b))
```
Step 7.4 — Complete script for Step 7
```python name=step07_compare_frequencies.py
import string

text_a = "NLP is practical. Learning step by step makes it easier to learn."
text_b = "Learning Python for data science is both rewarding and challenging. Step up your skills!"

stop_words = set([
    'the', 'is', 'in', 'it', 'by', 'and', 'a', 'of', 'to', 'for', 'on',
    'o', 'de', 'em', 'para', 'your'
])

def preprocess(text):
    clean = text.strip().lower()
    translator = str.maketrans('', '', string.punctuation)
    no_punct = clean.translate(translator)
    tokens = no_punct.split()
    filtered = [w for w in tokens if w not in stop_words]
    return filtered

def word_freq(tokens):
    freq = {}
    for w in tokens:
        freq[w] = freq.get(w, 0) + 1
    return freq

tokens_a = preprocess(text_a)
tokens_b = preprocess(text_b)
freq_a = word_freq(tokens_a)
freq_b = word_freq(tokens_b)

all_words = sorted(set(freq_a.keys()).union(freq_b.keys()))
print("{:<12} {:>5} {:>5}".format("Word", "A", "B"))
print("-" * 25)
for w in all_words:
    count_a = freq_a.get(w, 0)
    count_b = freq_b.get(w, 0)
    print("{:<12} {:>5} {:>5}".format(w, count_a, count_b))
```
Line-by-line explanation of the code
- `import string`: uses punctuation constants for text cleaning.
- `text_a`, `text_b`: two example texts for comparison.
- `stop_words`: the words to ignore, covering both languages.
- `preprocess(...)`: cleans, tokenizes, and stop-word-filters each text.
- `word_freq(...)`: counts frequencies for each set of tokens.
- `tokens_a`, `tokens_b`, `freq_a`, `freq_b`: token lists and word frequency dictionaries for each text.
- `all_words = sorted(set(freq_a.keys()).union(freq_b.keys()))`: combines all unique words from both texts, sorted alphabetically.
- The print loop: outputs a formatted table with each word and its count in both texts, filling in 0 where a word appears in only one text.
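As a side note, the manual counting in `word_freq` can also be written with the standard library's `collections.Counter`, which is a `dict` subclass, so `.get(word, 0)` keeps working downstream unchanged:

```python
from collections import Counter

def word_freq(tokens):
    # Counter counts hashable items; it behaves like a dict of word -> count.
    return Counter(tokens)

freq = word_freq(["step", "by", "step"])
print(freq.get("step", 0))     # 2
print(freq.get("missing", 0))  # 0
```

The manual version in the script is kept for teaching purposes; in production code, `Counter` is the idiomatic choice.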
Practical notes
- This is a “side-by-side” comparison; you can also graph differences for larger datasets.
- Frequency dictionaries allow fast lookups (using `.get(key, 0)` for absent words).
- The approach works for any number of texts, with small code changes.
- Adapt the stop word list for your language and context!
- With larger real data, outputting or plotting the “top N differences” may be more useful than a full list.
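A "top N differences" report is a short extension of the comparison loop. Here is a minimal sketch using toy frequency dicts in place of the `freq_a` / `freq_b` built by the script (the values are illustrative):

```python
# Toy frequency dicts standing in for freq_a / freq_b from the script above.
freq_a = {"step": 4, "nlp": 1, "learning": 1}
freq_b = {"step": 1, "python": 2, "learning": 1}

all_words = set(freq_a) | set(freq_b)

# Rank words by absolute count difference, largest first.
diffs = sorted(
    all_words,
    key=lambda w: abs(freq_a.get(w, 0) - freq_b.get(w, 0)),
    reverse=True,
)
# diffs → ['step', 'python', 'nlp', 'learning']

top_n = 3
for w in diffs[:top_n]:
    print(w, freq_a.get(w, 0), freq_b.get(w, 0))
```

The same ranked list feeds directly into a bar chart if you want a visual summary instead of a table.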
Suggested mini exercise
- Swap in your own texts (in English, Portuguese, or both).
- Add more stop words based on the analysis output.
- Try highlighting which words are unique to only one text (hint: look for counts of 0).
- Optionally, plot the results for the top 5 differences as a bar chart.
- Compare two texts from different domains (e.g., one about sports, one about science).
Conclusion
Comparing word frequencies between texts is a practical and foundational move in any NLP workflow. It powers tasks like topic analysis, document clustering, authorship comparison, and much more. By mastering this “side-by-side” comparison, you set yourself up for richer and more meaningful NLP tasks in the future—including document similarity, more advanced feature engineering, and even deep learning projects.
Onwards: In Step 8, you’ll take token analysis further by exploring word co-occurrence (counting which words appear together)—a foundational element for context, sentiment, and word embedding models. When you’re ready, just say “next”!
