Step 8 — Exploring Word Co-occurrence: Counting Word Pairs Appearing Together in Python
Introduction
You’ve learned to compare word frequencies in different texts. Now, it’s time to look deeper into how words relate to each other within the same text. In this step, you’ll build your first co-occurrence matrix—a foundational structure for capturing which words are likely to appear together. Understanding word co-occurrence is critical for representing meaning, analyzing context, and laying the groundwork for modern NLP techniques like word embeddings and topic modeling.
Main concept explained clearly
Co-occurrence describes how often two words appear together within a certain window in a text. For example, in the phrase “data science is fun”, “data” and “science” are neighbors and thus co-occur. We typically count co-occurrences in two ways:
- Bigram-based: Looking for all consecutive word pairs (bigrams).
- Window-based: Counting word pairs within a sliding window (e.g., ±2 words).
For beginners, we’ll focus on bigram co-occurrence, as it’s easier to compute and understand. The result is a mapping of all word pairs (bigrams) to the number of times they appear together in the text.
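As a quick illustration of the bigram view, here is a minimal sketch that lists the consecutive word pairs in the example phrase from above:

```python
# Minimal illustration: consecutive word pairs (bigrams) in a short phrase
words = "data science is fun".split()
bigrams = list(zip(words, words[1:]))
print(bigrams)  # [('data', 'science'), ('science', 'is'), ('is', 'fun')]
```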
Why this matters in NLP
- Co-occurrence is the basis for capturing word associations, context, and structure in language.
- It powers algorithms for keyword extraction, topic modeling, and recommendation.
- Modern word embeddings (like Word2Vec) are based on these statistics.
- Understanding how words group and cluster helps you move beyond isolated word counts.
Python example
Let’s walk through a simple bigram co-occurrence counter for a text with stop words removed. We’ll use everything you learned up to this point.
Step 8.1 — Clean, tokenize, and filter the text
```python
import string

text = "Learning NLP is practical and fun. Fun projects make learning NLP easier!"

stop_words = set([
    'the', 'is', 'in', 'it', 'by', 'and', 'a', 'of', 'to', 'for', 'on',
    'o', 'de', 'em', 'para'
])

def preprocess(text):
    clean = text.strip().lower()
    translator = str.maketrans('', '', string.punctuation)
    no_punct = clean.translate(translator)
    tokens = no_punct.split()
    filtered = [w for w in tokens if w not in stop_words]
    return filtered

tokens = preprocess(text)
print("Filtered tokens:", tokens)
```
Step 8.2 — Count co-occurring (bigram) pairs
```python
bigram_counts = {}
for i in range(len(tokens) - 1):
    pair = (tokens[i], tokens[i+1])
    bigram_counts[pair] = bigram_counts.get(pair, 0) + 1
print("Bigram co-occurrences:", bigram_counts)
```
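If you prefer the standard library's counting helpers, the same tally can be written with `collections.Counter` and `zip`. This is an equivalent alternative to the dictionary loop above, not a change to the method; the token list below is the output you'd get from the preprocessing step:

```python
from collections import Counter

# Tokens as produced by the preprocessing step above
tokens = ['learning', 'nlp', 'practical', 'fun', 'fun', 'projects',
          'make', 'learning', 'nlp', 'easier']

# zip pairs each token with its right neighbor; Counter tallies the pairs
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_counts[('learning', 'nlp')])  # 2
```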
Line-by-line explanation of the code
- Import and define stop words: as before, punctuation is stripped and common words are filtered out.
- `preprocess` function: cleans and tokenizes the text, then removes stop words.
- Bigram counting:
  - Loop over `range(len(tokens) - 1)` so each token can be paired with its immediate neighbor.
  - For each index `i`, make a tuple (pair) of tokens: `tokens[i]`, `tokens[i+1]`.
  - Update a dictionary with pairs as keys and counts as values.
- Printing results: shows a dictionary where keys are bigram word pairs and values are their frequencies.
Practical notes
- This method only finds bigrams (adjacent words). For wider context, increase the window size (e.g., all pairs within ±2 positions).
- You can adapt this to texts in Portuguese or any other language—just update your sample and stop word set.
- The result can be used to build network graphs, visualize common word pairs, or feed algorithms like Markov models.
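To make the wider-window idea from the first note concrete, here is a hedged sketch of window-based co-occurrence counting. The `window_cooccurrences` name and the unordered-pair convention are our own choices for illustration, not part of the step's code:

```python
from collections import Counter

def window_cooccurrences(tokens, window_size=2):
    """Count unordered word pairs within +/- window_size positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Pair each word with the ones that follow it inside the window;
        # sorting the pair makes (a, b) and (b, a) count as the same pair.
        for j in range(i + 1, min(i + window_size + 1, len(tokens))):
            pair = tuple(sorted((word, tokens[j])))
            counts[pair] += 1
    return counts

tokens = ['learning', 'nlp', 'practical', 'fun']
print(window_cooccurrences(tokens, window_size=2))
```

With `window_size=1` this reduces to the bigram counting above, except that pair order is ignored.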
Suggested mini exercise
- Change the example text and observe new bigrams.
- Try with a short Portuguese sentence, using Portuguese stop words.
- Alter the script to skip over bigrams if either word is a stop word (extra challenge: filter both before counting).
- Print the top N most common bigrams from a longer text.
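For the last exercise, one possible approach (a sketch, not the only solution) uses `Counter.most_common` from the standard library; the short token list here is just placeholder data:

```python
from collections import Counter

# Placeholder tokens; substitute the output of preprocess() on a longer text
tokens = ['nlp', 'fun', 'nlp', 'fun', 'nlp', 'rocks']
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Print the N most frequent bigrams, highest count first
top_n = 2
for pair, count in bigram_counts.most_common(top_n):
    print(pair, count)
```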
Conclusion
You’ve now moved beyond simple statistics and learned to analyze text structure: which pairs of words appear together, and how often. Co-occurrence analysis opens the door to advanced NLP techniques and richer representations of text. In the next step, you’ll start to visualize these bigrams, building the intuition needed for context-aware and semantic text analysis.
