Step 2 — Basic Text Normalization and Reliable Token Lists in Python

Introduction

After creating a safe Python environment and running our first basic script, it’s time to address the heart of many NLP tasks: breaking raw text into usable units (tokens) and applying simple cleaning to ensure consistency for analysis. In this step, we will take a sample text, normalize it, and produce an initial list of tokens (words), laying the foundation for frequency analysis and further NLP processing.


Main concept explained clearly

In natural language processing, tokenization is the process of splitting text into smaller pieces, usually words, called “tokens.” However, to avoid inconsistent results, we also need to standardize the text beforehand. This often includes:

  1. Lowercasing — Treats “Python” and “python” as the same word.
  2. Removing punctuation — Prevents “Python!” and “Python” from counting as different.
  3. Stripping extra whitespace — Removes stray leading, trailing, and repeated spaces so splitting behaves predictably.
  4. Splitting into tokens — Converts the text into a list of words.

Most real-world texts contain punctuation marks, numbers, and other non-alphabetic characters that can confuse basic tokenization. Our task is to implement a simple normalization routine using only built-in Python tools, no extra libraries yet.


Why this matters in NLP

Without normalization and tokenization, textual data remains inconsistent and full of noise. For example:

  • “NLP, in 2024!” and “nlp in 2024” would be treated as different unless cleaned and lowercased.
  • Counting word frequencies, searching, or any analysis becomes unreliable if the text isn’t regularized.
  • Many machine learning algorithms expect input as a list of tokens or features, not a messy block of text.

Laying a solid normalization foundation in these early steps ensures your future NLP explorations are reliable and reproducible.
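A quick sanity check (with sample strings of our own, not from the main script) shows how cleaning collapses surface variants of the same word:

```python
import string

# Without cleaning, variants of the same word count as distinct strings.
variants = ["NLP,", "nlp", "NLP!"]
print(len(set(variants)))  # 3 distinct strings

# After lowercasing and stripping punctuation, they collapse to one token.
table = str.maketrans('', '', string.punctuation)
cleaned = {v.lower().translate(table) for v in variants}
print(cleaned)  # {'nlp'}
```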


Python example

Let’s develop the normalization and tokenization step by step. This approach helps you understand and copy each code block in sequence.

Step 2.1 — Lowercase and strip whitespace

text = " NLP, in 2024! Is REWARDING. "
clean_text = text.strip().lower()
print("After strip and lower:", clean_text)

Step 2.2 — Remove punctuation with str.translate

import string
# Create a translation table that maps punctuation to None
translator = str.maketrans('', '', string.punctuation)
no_punct_text = clean_text.translate(translator)
print("After removing punctuation:", no_punct_text)

Step 2.3 — Split into tokens

tokens = no_punct_text.split()
print("Token list:", tokens)

Step 2.4 — All together

```python name=step02_basic_normalization.py
import string

text = " NLP, in 2024! Is REWARDING. "

# 1. Lowercase and strip
clean_text = text.strip().lower()

# 2. Remove punctuation
translator = str.maketrans('', '', string.punctuation)
no_punct_text = clean_text.translate(translator)

# 3. Split into tokens (words)
tokens = no_punct_text.split()

print("Original text: ", repr(text))
print("Cleaned text:  ", repr(no_punct_text))
print("Token list:    ", tokens)
```


Line-by-line explanation of the code

  • import string
    Imports a built-in Python module whose constant string.punctuation lists all ASCII punctuation characters.
  • text = ...
    Stores our sample sentence, complete with extra spaces and punctuation.
  • clean_text = text.strip().lower()
    Removes leading and trailing whitespace and converts all letters to lowercase for uniformity.
  • translator = str.maketrans('', '', string.punctuation)
    Prepares a translation table to remove all punctuation. This is a safe, built-in way to erase characters like commas, periods, exclamation marks, etc.
  • no_punct_text = clean_text.translate(translator)
    Applies the translation table, resulting in text with no punctuation.
  • tokens = no_punct_text.split()
    Splits the cleaned text on spaces to produce a simple list of words (tokens).
  • print(...)
    Outputs each stage: the original, cleaned, and tokenized text.
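As a convenience, the whole pipeline can be wrapped in a single reusable function. This is a small sketch of our own (the name normalize_tokens is not part of the original script), using only the built-in string module:

```python
import string

def normalize_tokens(text: str) -> list[str]:
    """Lowercase, strip whitespace, remove ASCII punctuation, and split
    into tokens. Wraps steps 2.1-2.3 in one call."""
    table = str.maketrans('', '', string.punctuation)
    return text.strip().lower().translate(table).split()

print(normalize_tokens(" NLP, in 2024! Is REWARDING. "))
# ['nlp', 'in', '2024', 'is', 'rewarding']
```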

Practical notes

  • string.punctuation contains these ASCII characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
  • Numbers are not removed in this step; "2024" is kept as a token.
  • This method does not handle complex languages or contractions (e.g., “don’t”), but it works well for initial, English-like text normalization.
  • For many tasks, this level of cleaning is sufficient—further precision can be added with libraries in future steps.
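To see the contraction caveat concretely: string.punctuation covers only ASCII characters, so an ASCII apostrophe is stripped (fusing the contraction), while a typographic apostrophe (U+2019) is not removed at all. A small illustrative snippet, not part of the main script:

```python
import string

table = str.maketrans('', '', string.punctuation)

# ASCII apostrophe is in string.punctuation, so it gets stripped:
print("don't".lower().translate(table))   # dont

# The curly apostrophe (U+2019) is non-ASCII and survives untouched:
print("don’t".lower().translate(table))   # don’t
```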

Suggested mini exercise

  1. Change text to a sentence in Portuguese (or a mix of Portuguese and English) with punctuation. Example:
    " Aprender NLP em 2024! É muito útil. "