Step 5 — Filtering Out Stop Words in Your Python NLP Pipeline

Introduction

Welcome to Step 5 of our practical NLP journey in Python! In previous steps, you learned how to normalize text, tokenize it, and count word frequencies. Now, it’s time to upgrade your analysis by filtering out “stop words.” These are extremely common words (like “the”, “and”, “is”, “a”) that don’t carry much meaning and often clutter frequency analyses. By removing stop words, you sharpen your results and focus on the informative words that give your texts their true meaning.


Main concept explained clearly

Stop words are words that appear frequently in a language but add little value to the meaning or context—examples in English include “the”, “a”, “to”, “in”, “for”, “on”, etc. In Portuguese, you have words like “o”, “a”, “de”, “em”, “para”. Filtering them out is a classic NLP process.

For beginners, it is enough to define a small stop word list yourself. Later, libraries such as NLTK or spaCy provide comprehensive lists; our goal for Step 5, though, is to understand the process practically, with minimal dependencies.
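If you do happen to have NLTK installed, its English stop word list can stand in for a hand-written one. The sketch below is illustrative: it tries NLTK first and falls back to a small custom set when the library (or its stopwords corpus) isn't available, so it runs either way.

```python
# A sketch: prefer NLTK's English stop word list if available,
# otherwise fall back to a small hand-written set.
try:
    from nltk.corpus import stopwords
    # NLTK raises LookupError if the corpus was never downloaded
    # (nltk.download('stopwords') fetches it).
    stop_words = set(stopwords.words('english'))
except (ImportError, LookupError):
    stop_words = {'the', 'is', 'in', 'it', 'by', 'and', 'a', 'of', 'to', 'for', 'on'}

print('the' in stop_words)  # True with either source
```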

The task is:

  1. Build or use a stop word set (a Python set or list).
  2. Remove these words from your tokenized text, keeping only “meaningful” words.
  3. Optionally, recalculate word frequencies with the filtered tokens.

Why this matters in NLP

  • Stop words dominate frequency counts and can hide significant terms.
  • Filtering stop words makes keyword extraction, document clustering, search, and classification much more accurate.
  • The process lays the groundwork for “text vectorization”—a core step in preparing text for machine learning.
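To see why this matters for vectorization, here is a minimal bag-of-words sketch (the two token lists are made up for illustration): each document becomes a vector of counts over a shared vocabulary, and without filtering, stop words would dominate those counts.

```python
# Minimal bag-of-words sketch over already-filtered token lists.
docs = [
    ['nlp', 'practical', 'learning'],
    ['learning', 'nlp', 'step', 'step'],
]

# Build a fixed vocabulary from all documents, then turn each
# document into a vector of per-word counts.
vocab = sorted({word for doc in docs for word in doc})
vectors = [[doc.count(word) for word in vocab] for doc in docs]

print(vocab)    # ['learning', 'nlp', 'practical', 'step']
print(vectors)  # [[1, 1, 1, 0], [1, 1, 0, 2]]
```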

Python example

Let’s enhance your script step by step, focusing on clarity. We’ll use a small, custom stop word set for demonstration.

Step 5.1 — Define a stop word set

For practice, start with a short list; later you can expand.

stop_words = {
    'the', 'is', 'in', 'it', 'by', 'and', 'a', 'of', 'to', 'for', 'on',
    'o', 'de', 'em', 'para'  # 'a' (already above) is a stop word in both English and Portuguese
}

Step 5.2 — Filter tokens

tokens = ['nlp', 'in', '2024', 'nlp', 'is', 'practical', 'learning', 'nlp', 'step', 'by', 'step', 'makes', 'it', 'easier']
filtered_tokens = [word for word in tokens if word not in stop_words]
print("Filtered token list:", filtered_tokens)

Step 5.3 — Count frequencies for filtered tokens

word_freq = {}
for word in filtered_tokens:
    word_freq[word] = word_freq.get(word, 0) + 1
print("Word frequencies (stop words removed):", word_freq)

Step 5.4 — Complete script

```python name=step05_filter_stop_words.py
import string

# Sample text (can be in English or Portuguese)
text = "NLP, in 2024! NLP is practical. Learning NLP step by step makes it easier."

# Normalize and tokenize
clean_text = text.strip().lower()
translator = str.maketrans('', '', string.punctuation)
no_punct_text = clean_text.translate(translator)
tokens = no_punct_text.split()

# Define stop words ('a' is a stop word in both English and Portuguese)
stop_words = {
    'the', 'is', 'in', 'it', 'by', 'and', 'a', 'of', 'to', 'for', 'on',
    'o', 'de', 'em', 'para'
}

# Filter stop words
filtered_tokens = [word for word in tokens if word not in stop_words]

# Calculate word frequencies
word_freq = {}
for word in filtered_tokens:
    word_freq[word] = word_freq.get(word, 0) + 1

print("Original tokens:", tokens)
print("Filtered token list:", filtered_tokens)
print("Word frequencies (stop words removed):", word_freq)
```


Line-by-line explanation of the code

  • import string: gives access to string.punctuation, a string of common punctuation characters.
  • text = ...: provides a sample text for analysis.
  • clean_text = text.strip().lower(): prepares the text by trimming spaces and lowercasing.
  • translator = ...: sets up punctuation removal.
  • no_punct_text = clean_text.translate(translator): removes punctuation.
  • tokens = no_punct_text.split(): splits text into individual tokens.
  • stop_words = set([...]): creates a set of stop words to exclude.
  • filtered_tokens = [word for word in tokens if word not in stop_words]: creates a list of tokens not present in the stop word set.
  • Frequency calculation: creates a dictionary counting each word’s occurrence in filtered_tokens.
  • print(...): shows the results at each stage.

Practical notes

  • The set of stop words can easily be expanded or replaced by a library later.
  • Sets are efficient for membership checking (avoid lists for large stop word collections).
  • Filtering stop words before frequency analysis leads to more meaningful results.
  • For Portuguese texts, add more relevant stop words—experiment according to your task!
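To make the efficiency point concrete, here is a quick, illustrative timing comparison between a list and a set with the same contents (the word list is synthetic, and exact numbers vary by machine):

```python
import timeit

# Membership checks are O(n) for a list but O(1) on average for a set.
words_list = [f'word{i}' for i in range(10_000)]
words_set = set(words_list)

# Look up a word near the end of the list (worst case for the list).
t_list = timeit.timeit(lambda: 'word9999' in words_list, number=1_000)
t_set = timeit.timeit(lambda: 'word9999' in words_set, number=1_000)

print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")  # the set is typically far faster
```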

Suggested mini exercise

  1. Extend the stop word list with 10 common Portuguese words (e.g., “como”, “mas”, “eu”, “meu”, “ela”, etc.).
  2. Modify the text variable to include both English and Portuguese stop words—see them disappear!
  3. Try printing the top 3 most common words in the filtered frequency dictionary.
  4. Change the script so it shows both filtered and unfiltered frequency results side by side.
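As a hint for exercise 3: the standard library's collections.Counter has a most_common method that does exactly this. Using the filtered tokens produced by the script above:

```python
from collections import Counter

# Filtered tokens from the complete script in Step 5.4.
filtered_tokens = ['nlp', '2024', 'nlp', 'practical', 'learning', 'nlp',
                   'step', 'step', 'makes', 'easier']

# Counter counts occurrences; most_common(3) returns the top 3 (word, count) pairs.
word_freq = Counter(filtered_tokens)
print(word_freq.most_common(3))  # starts with ('nlp', 3), ('step', 2), ...
```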

Conclusion

Removing stop words is a simple but powerful move to focus your NLP analyses on meaningful content. You’ve now learned how to identify, filter, and re-count word frequencies for a sharper, cleaner dataset. The skills in this step prepare you for key tasks in information retrieval, search engines, and text classification.

In Step 6, you’ll learn how to visualize word frequencies using bar charts or simple plots—a big leap toward more insightful data analysis. Say “next” when you’re ready!
