Step 2 — Basic Text Normalization and Reliable Token Lists in Python
Introduction
After creating a safe Python environment and running our first basic script, it’s time to address the heart of many NLP tasks: breaking raw text into usable units (tokens) and applying simple cleaning to ensure consistency for analysis. In this step, we will take a simple text, clean it more robustly, and produce an initial list of tokens (words), preparing the foundation for frequency analysis and further NLP processing.
Main concept explained clearly
In natural language processing, tokenization is the process of splitting text into smaller pieces, usually words, called “tokens.” However, to avoid inconsistent results, we also need to standardize the text beforehand. This often includes:
- Lowercasing — Treats “Python” and “python” as the same word.
- Removing punctuation — Prevents “Python!” and “Python” from counting as different.
- Stripping extra whitespace — Removes stray leading and trailing spaces so the text splits cleanly.
- Splitting into tokens — Converts the text into a list of words.
Most real-world texts contain punctuation marks, numbers, and other non-alphabetic characters that can confuse basic tokenization. Our task is to implement a simple normalization routine using only built-in Python tools, no extra libraries yet.
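To see the problem concretely, here is a minimal sketch (the sample sentence is our own illustration) of how a naive whitespace split keeps punctuation attached to words:

```python
raw = "Python is fun. I love Python!"

# A naive whitespace split leaves punctuation glued to the words
naive_tokens = raw.split()
print(naive_tokens)  # ['Python', 'is', 'fun.', 'I', 'love', 'Python!']

# "Python!" and "Python" are now different strings, so a simple
# count misses one of the two occurrences of the word
print(naive_tokens.count("Python"))  # 1, even though "Python" appears twice
```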
Why this matters in NLP
Without normalization and tokenization, textual data remains inconsistent and full of noise. For example:
- “NLP, in 2024!” and “nlp in 2024” would be treated as different unless cleaned and lowercased.
- Counting word frequencies, searching, or any analysis becomes unreliable if the text isn’t regularized.
- Many machine learning algorithms expect input as a list of tokens or features, not a messy block of text.
Laying a solid normalization foundation in these early steps ensures your future NLP explorations are reliable and reproducible.
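As a quick illustration of the frequency-counting point above, here is a small sketch using `collections.Counter` (the sentence is invented for this example):

```python
import string
from collections import Counter

raw = "NLP, in 2024! I enjoy NLP. nlp is everywhere."

# Without normalization, the same word fragments into several spellings:
# 'NLP,', 'NLP.', and 'nlp' are each counted separately
print(Counter(raw.split()))

# With lowercasing and punctuation removal, all three variants
# collapse into a single token
table = str.maketrans('', '', string.punctuation)
normalized = raw.lower().translate(table).split()
print(Counter(normalized)["nlp"])  # 3
```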
Python example
Let’s develop the normalization and tokenization step by step. This approach helps you understand and copy each code block in sequence.
Step 2.1 — Lowercase and strip whitespace
```python
text = " NLP, in 2024! Is REWARDING. "
clean_text = text.strip().lower()
print("After strip and lower:", clean_text)
```
Step 2.2 — Remove punctuation with str.translate
```python
import string

# Create a translation table that maps punctuation to None
translator = str.maketrans('', '', string.punctuation)
no_punct_text = clean_text.translate(translator)
print("After removing punctuation:", no_punct_text)
```
Step 2.3 — Split into tokens
```python
tokens = no_punct_text.split()
print("Token list:", tokens)
```
Step 2.4 — All together
```python name=step02_basic_normalization.py
import string

text = " NLP, in 2024! Is REWARDING. "

# 1. Lowercase and strip
clean_text = text.strip().lower()

# 2. Remove punctuation
translator = str.maketrans('', '', string.punctuation)
no_punct_text = clean_text.translate(translator)

# 3. Split into tokens (words)
tokens = no_punct_text.split()

print("Original text: ", repr(text))
print("Cleaned text:  ", repr(no_punct_text))
print("Token list:    ", tokens)
```
Line-by-line explanation of the code
`import string`
Imports a built-in Python module whose constant `string.punctuation` lists all common ASCII punctuation characters.

`text = ...`
Stores our sample sentence, complete with extra spaces and punctuation.

`clean_text = text.strip().lower()`
Removes leading and trailing whitespace and converts all letters to lowercase for uniformity.

`translator = str.maketrans('', '', string.punctuation)`
Prepares a translation table that maps every punctuation character to None. This is a safe, built-in way to erase characters like commas, periods, and exclamation marks.

`no_punct_text = clean_text.translate(translator)`
Applies the translation table, producing text with no punctuation.

`tokens = no_punct_text.split()`
Splits the cleaned text on whitespace to produce a simple list of words (tokens).

`print(...)`
Outputs each stage: the original, cleaned, and tokenized text.
Practical notes
- `string.punctuation` includes: `` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``
- Numbers are not removed in this step — "2024" is kept as a token.
- This method does not handle complex languages or contractions (e.g., “don’t”), but it works well for initial, English-like text normalization.
- For many tasks, this level of cleaning is sufficient—further precision can be added with libraries in future steps.
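A small sketch of the limitations noted above: the apostrophe is part of `string.punctuation`, so contractions are simply fused into one token, while digits are not punctuation and pass through untouched.

```python
import string

table = str.maketrans('', '', string.punctuation)

# The apostrophe is stripped, fusing the contraction into one token
print("don't stop".translate(table).split())  # ['dont', 'stop']

# Digits are not punctuation, so numbers survive as tokens
print("since 2024".translate(table).split())  # ['since', '2024']
```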
Suggested mini exercise
- Change `text` to a sentence in Portuguese (or a mix of Portuguese and English) with punctuation. Example: `" Aprender NLP em 2024! É muito útil. "`
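If you try the Portuguese sentence, note that the pipeline handles accented letters without any changes: `str.lower()` is Unicode-aware (so "É" becomes "é"), and `string.punctuation` only covers ASCII symbols, so accented letters are never stripped. A sketch of the expected run:

```python
import string

text = " Aprender NLP em 2024! É muito útil. "

# Same three steps as before: strip/lower, remove punctuation, split
clean_text = text.strip().lower()
table = str.maketrans('', '', string.punctuation)
tokens = clean_text.translate(table).split()

print(tokens)  # ['aprender', 'nlp', 'em', '2024', 'é', 'muito', 'útil']
```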
