Step 3 — Counting Word Frequencies in Text with Python

Introduction

After learning how to normalize text and break it down into reliable tokens, the next essential milestone on your NLP journey is counting the frequency of each word. Counting word occurrences is a classic analysis task—used everywhere from simple datasets to advanced language modeling. In this step, we’ll create a compact Python script to transform a text into a frequency dictionary, preparing you for dataset exploration and more complex NLP workflows in the future.


Main concept explained clearly

Word frequency analysis means measuring how many times each unique word appears in a text. The result is often stored as a dictionary (also called a “hash map”) where each key is a word, and the value is the count.
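For instance, a toy frequency dictionary for the phrase "to be or not to be" (an illustrative example, separate from the script we build below) looks like this:

```python
# A frequency dictionary maps each unique word (key) to its count (value).
freq = {"to": 2, "be": 2, "or": 1, "not": 1}

# Lookups and membership tests are fast (average constant time).
print(freq["to"])     # 2
print("be" in freq)   # True
print(len(freq))      # 4 unique words
```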

Why is this step useful?

  • It immediately shows what terms are most prominent in your corpus.
  • Many NLP techniques (like stop-word filtering, keyword extraction, and even machine learning features) start from frequency statistics.
  • It’s a baseline for comparing text samples, identifying patterns, or even plotting word clouds.

We’ll use a standard approach:

  1. Store a text.
  2. Normalize and tokenize (reviewing what you learned in Step 2).
  3. Build a word frequency dictionary using basic Python tools.

Why this matters in NLP

Understanding which words dominate your data is central for:

  • Summarizing and exploring datasets.
  • Filtering out “stop words” (like “the”, “and”, “de”, “a”, etc.).
  • Building vector representations of text for algorithms.
  • Understanding context, focus, and even bias in documents.

This is the most common first analysis in any text analytics work—both because it’s simple and because it gives actionable insights fast.


Python example

Let’s go step-by-step—you can copy and try each block as you learn.

Step 3.1 — Prepare your normalized token list

If you completed Step 2, you already have code to normalize and tokenize:

import string
text = "NLP, in 2024! NLP is practical. Learning NLP step by step makes it easier."
# 1. Normalize
clean_text = text.strip().lower()
translator = str.maketrans('', '', string.punctuation)
no_punct_text = clean_text.translate(translator)
# 2. Tokenize
tokens = no_punct_text.split()
print("Tokens:", tokens)

Step 3.2 — Count word frequencies (the simple way)

Loop through your tokens and accumulate counts in a dictionary:

word_freq = {}  # Dictionary to hold word counts
for word in tokens:
    if word in word_freq:
        word_freq[word] += 1
    else:
        word_freq[word] = 1
print("Word frequencies:", word_freq)

Step 3.3 — All together

Here’s the complete script for Step 3:

```python name=step03_word_frequency.py
import string

text = "NLP, in 2024! NLP is practical. Learning NLP step by step makes it easier."

# 1. Normalize
clean_text = text.strip().lower()
translator = str.maketrans('', '', string.punctuation)
no_punct_text = clean_text.translate(translator)

# 2. Tokenize
tokens = no_punct_text.split()

# 3. Count frequencies
word_freq = {}
for word in tokens:
    if word in word_freq:
        word_freq[word] += 1
    else:
        word_freq[word] = 1

print("Token list:", tokens)
print("Word frequencies:", word_freq)
```


Line-by-line explanation of the code

  • import string
    To access string.punctuation for punctuation removal.
  • text = ...
    The sample text to process.
  • clean_text = text.strip().lower()
    Trims spaces and standardizes capitalization.
  • translator = str.maketrans('', '', string.punctuation)
    Prepares a removal table for all punctuation.
  • no_punct_text = clean_text.translate(translator)
    Removes punctuation for reliable tokenization.
  • tokens = no_punct_text.split()
    Splits the cleaned text into basic word tokens.
  • word_freq = {}
    Initializes an empty dictionary to store the word counts.
  • The for word in tokens: loop
    Processes each token:
    • If the word is seen for the first time, add to the dictionary with value 1.
    • If it exists, increment the count by 1.
  • print(...)
    Displays the token list and the resulting frequency dictionary.

Practical notes

  • The dictionary stores frequencies efficiently. E.g.:
    { 'nlp': 3, 'in': 1, '2024': 1, ... }
  • With small adjustments, you could sort the words by frequency (we’ll cover this soon).
  • Numbers, single-character words, or stop-words will be included for now—this is intentional. We’ll filter them out in later steps.
  • For larger corpora, Python’s collections.Counter provides a shortcut; but mastering the basic logic teaches you more at this stage.
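To preview the Counter shortcut mentioned above, here is a short sketch using the standard library’s collections module, along with the sorting idea from the earlier note:

```python
from collections import Counter

# Sample tokens, as produced by the normalization step in this tutorial
tokens = ["nlp", "in", "2024", "nlp", "is", "practical", "nlp"]

word_freq = Counter(tokens)      # counts every token in one call
print(word_freq["nlp"])          # 3
print(word_freq.most_common(1))  # [('nlp', 3)]

# Sorting a plain dict by frequency (descending) works the same way:
top = sorted(word_freq.items(), key=lambda kv: kv[1], reverse=True)
print(top[0])                    # ('nlp', 3)
```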

Suggested mini exercise

  1. Change the sample text to a sentence or short paragraph in Portuguese, or blend both languages. Try repeated words and some punctuation.
  2. Add more unique words and see the counts update.
  3. Try a text with numbers. Observe if numbers are counted as words.
  4. (Challenge) Print only the most frequent word(s) by looping through word_freq.
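If you get stuck on the challenge, one possible approach (a hint, not the only solution) is to find the maximum count first and then collect every word that reaches it:

```python
word_freq = {"nlp": 3, "step": 2, "in": 1}  # example counts

max_count = max(word_freq.values())  # the highest frequency found
most_frequent = [w for w, c in word_freq.items() if c == max_count]

print(most_frequent, max_count)  # ['nlp'] 3
```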

Conclusion

You’ve added a new, vital tool to your NLP arsenal: measuring word frequencies. This forms the basis of almost all further text analytics—the very first thing professionals look at when they receive new datasets. In future steps, you’ll learn to filter out irrelevant words (“stop words”), sort and visualize frequencies, and move toward more scalable statistics.

Published by Edvaldo Guimarães Filho