Step 3 — Counting Word Frequencies in Text with Python
Introduction
After learning how to normalize text and break it down into reliable tokens, the next essential milestone on your NLP journey is counting the frequency of each word. Counting word occurrences is a classic analysis task—used everywhere from simple datasets to advanced language modeling. In this step, we’ll create a compact Python script to transform a text into a frequency dictionary, preparing you for dataset exploration and more complex NLP workflows in the future.
Main concept explained clearly
Word frequency analysis means measuring how many times each unique word appears in a text. The result is often stored as a dictionary (also called a “hash map”) where each key is a word, and the value is the count.
Why is this step useful?
- It immediately shows what terms are most prominent in your corpus.
- Many NLP techniques (like stop-word filtering, keyword extraction, and even machine learning features) start from frequency statistics.
- It’s a baseline for comparing text samples, identifying patterns, or even plotting word clouds.
We’ll use a standard approach:
- Store a text.
- Normalize and tokenize (reviewing what you learned in Step 2).
- Build a word frequency dictionary using basic Python tools.
Why this matters in NLP
Understanding which words dominate your data is central for:
- Summarizing and exploring datasets.
- Filtering out “stop words” (like “the”, “and”, “de”, “a”, etc.).
- Building vector representations of text for algorithms.
- Understanding context, focus, and even bias in documents.
This is usually the very first analysis in any text-mining project, both because it is simple and because it delivers actionable insights fast.
Python example
Let’s go step-by-step—you can copy and try each block as you learn.
Step 3.1 — Prepare your normalized token list
If you completed Step 2, you already have code to normalize and tokenize:
```python
import string

text = "NLP, in 2024! NLP is practical. Learning NLP step by step makes it easier."

# 1. Normalize
clean_text = text.strip().lower()
translator = str.maketrans('', '', string.punctuation)
no_punct_text = clean_text.translate(translator)

# 2. Tokenize
tokens = no_punct_text.split()
print("Tokens:", tokens)
```
Step 3.2 — Count word frequencies (the simple way)
Loop through your tokens and accumulate counts in a dictionary:
```python
word_freq = {}  # Dictionary to hold word counts
for word in tokens:
    if word in word_freq:
        word_freq[word] += 1
    else:
        word_freq[word] = 1
print("Word frequencies:", word_freq)
```
Step 3.3 — All together
Here’s the complete script for Step 3:
```python name=step03_word_frequency.py
import string

text = "NLP, in 2024! NLP is practical. Learning NLP step by step makes it easier."

# 1. Normalize
clean_text = text.strip().lower()
translator = str.maketrans('', '', string.punctuation)
no_punct_text = clean_text.translate(translator)

# 2. Tokenize
tokens = no_punct_text.split()

# 3. Count frequencies
word_freq = {}
for word in tokens:
    if word in word_freq:
        word_freq[word] += 1
    else:
        word_freq[word] = 1

print("Token list:", tokens)
print("Word frequencies:", word_freq)
```
Line-by-line explanation of the code
- `import string`: gives access to `string.punctuation` for punctuation removal.
- `text = ...`: the sample text to process.
- `clean_text = text.strip().lower()`: trims surrounding whitespace and standardizes capitalization.
- `translator = str.maketrans('', '', string.punctuation)`: prepares a removal table for all punctuation characters.
- `no_punct_text = clean_text.translate(translator)`: removes punctuation for reliable tokenization.
- `tokens = no_punct_text.split()`: splits the cleaned text into basic word tokens.
- `word_freq = {}`: initializes an empty dictionary to store the word counts.
- The `for word in tokens:` loop processes each token: if the word is seen for the first time, it is added to the dictionary with value 1; if it already exists, its count is incremented by 1.
- `print(...)`: displays the token list and the resulting frequency dictionary.
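As a small variation on the loop above (optional at this stage), the dictionary method `dict.get` can replace the `if`/`else` branch: its second argument is the default returned for a word that has not been seen yet. A minimal sketch, using a short token list for illustration:

```python
# dict.get(word, 0) returns the current count, or 0 for an unseen word,
# so the branch collapses into a single line.
tokens = ["nlp", "is", "practical", "nlp"]

word_freq = {}
for word in tokens:
    word_freq[word] = word_freq.get(word, 0) + 1

print(word_freq)  # {'nlp': 2, 'is': 1, 'practical': 1}
```

Both versions produce identical results; the `if`/`else` form simply makes the logic more explicit while you are learning.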
Practical notes
- The dictionary stores frequencies efficiently, e.g. `{'nlp': 3, 'in': 1, '2024': 1, ...}`.
- With small adjustments, you could sort the words by frequency (we’ll cover this soon).
- Numbers, single-character words, and stop words are included for now; this is intentional. We’ll filter them out in later steps.
- For larger corpora, Python’s `collections.Counter` provides a shortcut, but mastering the basic logic teaches you more at this stage.
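To illustrate the `collections.Counter` shortcut mentioned above, here is a sketch using the tokens that the script in this step produces for the sample text:

```python
from collections import Counter

# Tokens produced by normalizing and splitting the sample sentence
tokens = ['nlp', 'in', '2024', 'nlp', 'is', 'practical', 'learning',
          'nlp', 'step', 'by', 'step', 'makes', 'it', 'easier']

word_freq = Counter(tokens)       # builds the frequency mapping in one call
print(word_freq['nlp'])           # 3
print(word_freq.most_common(2))   # [('nlp', 3), ('step', 2)]
```

`Counter` behaves like a dictionary (missing keys simply return 0), and `most_common(n)` gives you the top-`n` words already sorted by frequency.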
Suggested mini exercise
- Change the sample `text` to a sentence or short paragraph in Portuguese, or blend both languages. Try repeated words and some punctuation.
- Add more unique words and see the counts update.
- Try a text with numbers. Observe whether numbers are counted as words.
- (Challenge) Print only the most frequent word(s) by looping through `word_freq`.
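If you get stuck on the challenge, here is one possible approach (a sketch; the `word_freq` values below assume the sample text from this step):

```python
# Frequencies produced by the script in this step
word_freq = {'nlp': 3, 'in': 1, '2024': 1, 'is': 1, 'practical': 1,
             'learning': 1, 'step': 2, 'by': 1, 'makes': 1, 'it': 1,
             'easier': 1}

# First find the highest count, then collect every word that reaches it
# (a list handles ties, when several words share the top frequency).
max_count = max(word_freq.values())
most_frequent = [word for word, count in word_freq.items() if count == max_count]

print("Most frequent:", most_frequent, "with count", max_count)
```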
Conclusion
You’ve added a new, vital tool to your NLP arsenal: measuring word frequencies. This forms the basis of almost all further text analytics—the very first thing professionals look at when they receive new datasets. In future steps, you’ll learn to filter out irrelevant words (“stop words”), sort and visualize frequencies, and move toward more scalable statistics.
