Step 8 — Exploring Word Co-occurrence: Counting Word Pairs Appearing Together in Python
Introduction
You’ve learned to compare word frequencies in different texts. Now, it’s time to look deeper into how words relate to each other within the same text. In this step, you’ll build your first co-occurrence matrix—a foundational structure for capturing which words are likely to appear together. Understanding word co-occurrence is critical for representing meaning, analyzing context, and laying the groundwork for modern NLP techniques like word embeddings and topic modeling.
Main concept explained clearly
Co-occurrence describes how often two words appear together within a certain window in a text. For example, in the phrase “data science is fun”, “data” and “science” are neighbors and thus co-occur. We typically count co-occurrences in two ways:
- Bigram-based: Looking for all consecutive word pairs (bigrams).
- Window-based: Counting word pairs within a sliding window (e.g., ±2 words).
For beginners, we’ll focus on bigram co-occurrence, as it’s easier to compute and understand. The result is a mapping of all word pairs (bigrams) to the number of times they appear together in the text.
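As a quick illustration of the bigram view, here is a minimal sketch that lists the consecutive word pairs in the example phrase from above:

```python
# Minimal illustration: consecutive word pairs (bigrams) in a short phrase
words = "data science is fun".split()
bigrams = list(zip(words, words[1:]))
print(bigrams)  # [('data', 'science'), ('science', 'is'), ('is', 'fun')]
```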
Why this matters in NLP
- Co-occurrence is the basis for capturing word associations, context, and structure in language.
- It powers algorithms for keyword extraction, topic modeling, and recommendation.
- Modern word embeddings (like Word2Vec) are based on these statistics.
- Understanding how words group and cluster helps you move beyond isolated word counts.
Python example
Let’s walk through a simple bigram co-occurrence counter for a text with stop words removed. We’ll use everything you learned up to this point.
Step 8.1 — Clean, tokenize, and filter the text
```python
import string

text = "Learning NLP is practical and fun. Fun projects make learning NLP easier!"

stop_words = set([
    'the', 'is', 'in', 'it', 'by', 'and', 'a', 'of', 'to', 'for', 'on',
    'o', 'de', 'em', 'para'
])

def preprocess(text):
    clean = text.strip().lower()
    translator = str.maketrans('', '', string.punctuation)
    no_punct = clean.translate(translator)
    tokens = no_punct.split()
    filtered = [w for w in tokens if w not in stop_words]
    return filtered

tokens = preprocess(text)
print("Filtered tokens:", tokens)
```
Step 8.2 — Count co-occurring (bigram) pairs
```python
bigram_counts = {}
for i in range(len(tokens) - 1):
    pair = (tokens[i], tokens[i+1])
    bigram_counts[pair] = bigram_counts.get(pair, 0) + 1
print("Bigram co-occurrences:", bigram_counts)
```
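If you prefer the standard library's counting helpers, the same tally can be written with `collections.Counter` and `zip`. This is an equivalent alternative to the dictionary loop above, not a change to the method; the token list below is the output you'd get from the preprocessing step:

```python
from collections import Counter

# Tokens as produced by the preprocessing step above
tokens = ['learning', 'nlp', 'practical', 'fun', 'fun', 'projects',
          'make', 'learning', 'nlp', 'easier']

# zip pairs each token with its right neighbor; Counter tallies the pairs
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_counts[('learning', 'nlp')])  # 2
```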
Line-by-line explanation of the code
- Import and define stop words: as before, punctuation is stripped and common words are filtered out.
- `preprocess` function: cleans and tokenizes the text, then removes stop words.
- Bigram counting:
  - Loop over `range(len(tokens) - 1)` so each token can be paired with its immediate neighbor.
  - For each index `i`, make a tuple (pair) of tokens: `tokens[i]`, `tokens[i+1]`.
  - Update a dictionary with pairs as keys and counts as values.
- Printing results: shows a dictionary where keys are bigram word pairs and values are their frequencies.
Practical notes
- This method only finds bigrams (adjacent words). For wider context, increase the window size (e.g., all pairs within ±2 positions).
- You can adapt this to texts in Portuguese or any other language—just update your sample and stop word set.
- The result can be used to build network graphs, visualize common word pairs, or feed algorithms like Markov models.
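To make the wider-window idea from the first note concrete, here is a hedged sketch of window-based co-occurrence counting. The `window_cooccurrences` name and the unordered-pair convention are our own choices for illustration, not part of the step's code:

```python
from collections import Counter

def window_cooccurrences(tokens, window_size=2):
    """Count unordered word pairs within +/- window_size positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Pair each word with the ones that follow it inside the window;
        # sorting the pair makes (a, b) and (b, a) count as the same pair.
        for j in range(i + 1, min(i + window_size + 1, len(tokens))):
            pair = tuple(sorted((word, tokens[j])))
            counts[pair] += 1
    return counts

tokens = ['learning', 'nlp', 'practical', 'fun']
print(window_cooccurrences(tokens, window_size=2))
```

With `window_size=1` this reduces to the bigram counting above, except that pair order is ignored.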
Suggested mini exercise
- Change the example text and observe new bigrams.
- Try with a short Portuguese sentence, using Portuguese stop words.
- Alter the script to skip over bigrams if either word is a stop word (extra challenge: filter both before counting).
- Print the top N most common bigrams from a longer text.
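For the last exercise, one possible approach (a sketch, not the only solution) uses `Counter.most_common` from the standard library; the short token list here is just placeholder data:

```python
from collections import Counter

# Placeholder tokens; substitute the output of preprocess() on a longer text
tokens = ['nlp', 'fun', 'nlp', 'fun', 'nlp', 'rocks']
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Print the N most frequent bigrams, highest count first
top_n = 2
for pair, count in bigram_counts.most_common(top_n):
    print(pair, count)
```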
Conclusion
You’ve now moved beyond simple statistics and learned to analyze text structure: which pairs of words appear together, and how often. Co-occurrence analysis opens the door to advanced NLP techniques and richer representations of text. In the next step, you’ll start to visualize these bigrams, building the intuition needed for context-aware and semantic text analysis.
