# Step 01 — First Contact with Text in Python (Strings) + Repository Scaffold (pyproject.toml)
## Introduction
This 50-step project is a practical, beginner-friendly path to learning Natural Language Processing (NLP) in Python by building one small app per step. In Step 01, we start from the real foundation: handling text as data.
In Python, text is represented by the type str (string). Before we talk about tokenization, stopwords, TF‑IDF, or machine learning, you need to feel comfortable storing, printing, combining, and transforming strings.
This article also includes a repository scaffold using pyproject.toml, so you can keep all 50 steps organized and runnable in a consistent way.
## Main concept explained clearly

### What is a string in Python?
A string (str) is Python’s built-in type for text. It’s essentially a sequence of characters, including:
- letters (`a`, `b`, `c`)
- spaces (`" "`)
- punctuation (`.`, `,`, `!`)
- accents and Unicode characters (`ç`, `á`)
- symbols (`@`, `#`, `$`)
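
Because a string is a sequence, you can index it, slice it, and loop over it like any other Python sequence. The following is a minimal sketch; the variable `sample` and its value are illustrative and not part of the Step 01 app.

```python
# Illustrative sample text containing an accented character
sample = "Olá, NLP!"

# len() counts characters (Unicode code points), not bytes
length = len(sample)

# Indexing and slicing work as on any other sequence
first_char = sample[0]
first_three = sample[:3]

# Iterating over a string yields one character at a time
letters = [ch for ch in sample if ch.isalpha()]

print(length)
print(first_char, first_three)
print(letters)
```

Note that `str.isalpha()` treats accented letters such as `á` as alphabetic, so they are kept alongside the ASCII letters.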
### Key properties you should know
- Strings are immutable: operations like `.lower()` do not change the original string; they return a new string.
- Strings can be created with single or double quotes: `'text'` or `"text"`.
- For combining text with variables, f-strings are the cleanest approach: `f"Hello {name}"`.
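
A short sketch of these properties in action. The variable names and sample values here are illustrative assumptions, chosen only to make immutability visible.

```python
original = "  Hello NLP  "

# Each method returns a new string; `original` itself is left untouched
lowered = original.lower()
stripped = original.strip()

# repr() makes the leading and trailing spaces visible
print(repr(original))
print(repr(lowered))
print(repr(stripped))

# Single and double quotes create the same kind of object
same = 'text' == "text"
print(same)

# An f-string inserts a variable's value into new text
name = "Ada"
greeting = f"Hello {name}"
print(greeting)
```

If you want to keep a transformed version, assign the result to a name (as with `lowered` and `stripped`); calling the method without using its return value has no visible effect.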
### The “NLP mindset” at Step 01
At this stage, your job is to treat text as something you can:
- store in a variable
- inspect and print
- transform (case changes, trimming)
- measure (character length)
These simple actions are the base of almost every NLP pipeline.
## Why this matters in NLP
All NLP systems—simple or advanced—begin with raw text input:
- user comments
- product reviews
- customer support messages
- documents (PDF text, web pages)
- chat logs
Before you can extract meaning, you need to reliably do basic operations like:
- normalizing case (`Python` vs `python`)
- trimming extra whitespace
- preparing text for later steps like tokenization
- computing basic metrics (length, counts)
If you skip this foundation, later steps will feel confusing because most NLP “magic” is really just systematic text transformations.
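
To make the point about normalization concrete, here is a minimal sketch. The helper `normalize` and the list `raw_inputs` are hypothetical names introduced for illustration; they are not part of the Step 01 app.

```python
def normalize(text: str) -> str:
    """Hypothetical helper: trim both ends, then lowercase."""
    return text.strip().lower()


# Three spellings a user might type for the same word
raw_inputs = ["Python", "python ", "  PYTHON"]
normalized = [normalize(item) for item in raw_inputs]

# Before normalization, Python sees three different strings
print(len(set(raw_inputs)))

# After normalization, they collapse into a single value
print(len(set(normalized)))
print(normalized)
```

Without this kind of cleanup, a simple word count would treat `Python` and `python` as unrelated items.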
## Python example
This is your Step 01 mini app. It accepts text from the command line (or uses a default sentence), then prints basic information and transformations.
```python name=apps/step_01_first_contact/app.py
import sys


def main() -> None:
    """
    Step 01 – First contact with text in Python.

    Run examples:
        python -m apps.step_01_first_contact.app
        python -m apps.step_01_first_contact.app "NLP starts with strings"
    """
    # Use the command-line text if provided, otherwise a default sentence
    if len(sys.argv) > 1:
        text = " ".join(sys.argv[1:])
    else:
        text = "NLP starts with strings in Python."

    print("=== Step 01: First Contact with Text (Strings) ===")
    print(f"Raw text: {text}")
    print(f"Python type: {type(text)}")

    prefix = "Input received:"
    combined = f"{prefix} {text}"

    print("\n--- Basic operations ---")
    print(f"Concatenation (f-string): {combined}")
    print(f"Lowercase: {text.lower()}")
    print(f"Uppercase: {text.upper()}")
    print(f"Trimmed (strip): {text.strip()}")

    print("\n--- Quick preview metric ---")
    print(f"Character count (len): {len(text)}")


if __name__ == "__main__":
    main()
```
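
The app's input handling can also be tried without a terminal. The sketch below pulls that logic into a hypothetical helper, `resolve_text`, which takes a list shaped like `sys.argv`; the helper name and the `DEFAULT_TEXT` constant are assumptions made for illustration, not part of the app above.

```python
DEFAULT_TEXT = "NLP starts with strings in Python."


def resolve_text(argv: list[str]) -> str:
    """Hypothetical helper mirroring the app's input logic.

    argv[0] is the program name, so user text starts at argv[1].
    """
    if len(argv) > 1:
        return " ".join(argv[1:])
    return DEFAULT_TEXT


# No arguments: fall back to the default sentence
print(resolve_text(["app.py"]))

# Quoted input arrives as one argument
print(resolve_text(["app.py", "hello world"]))

# Unquoted input arrives as two arguments and is joined with a space
print(resolve_text(["app.py", "hello", "world"]))
```

The last two calls return the same string, which is why the app works with or without quotes around the text.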

---

## Line-by-line explanation of the code

- `import sys`
  Imports the `sys` module so the program can read command-line arguments (`sys.argv`).
- `def main() -> None:`
  Defines the main function. This structure makes your scripts easier to grow and keep clean.
- `if len(sys.argv) > 1:`
  Checks whether the user passed text on the command line.
- `text = " ".join(sys.argv[1:])`
  Joins all command-line tokens into one string. This lets you run:
  - `python -m ... "hello world"`
  - or even `python -m ... hello world` (without quotes)
- `print(f"Python type: {type(text)}")`
  Confirms that the input is a `str` (string).
- `combined = f"{prefix} {text}"`
  Demonstrates the recommended way to combine strings with variables (f-strings).
- `text.lower()` and `text.upper()`
  Show case transformations, which are essential later for normalization.
- `text.strip()`
  Removes whitespace at the start and end, a very common step in real-world text cleaning.
- `len(text)`
  Counts characters (including spaces and punctuation). This leads naturally to Step 02.

---

## Practical notes

- Prefer **f-strings** over `"a" + b + "c"` because they are more readable and less error-prone.
- Remember: `.lower()`, `.upper()`, `.strip()` return **new strings**.
- Don’t worry about “NLP libraries” yet. Steps 01–20 are intentionally built with **pure Python** so you learn the fundamentals first.

---

## Suggested mini exercise

1. Run the app with different inputs:
   - `python -m apps.step_01_first_contact.app "Python is great"`
   - `python -m apps.step_01_first_contact.app " too many spaces here "`
2. Modify the app to also print:
   - The first 10 characters: `text[:10]`
   - Whether the text contains `"Python"` (case-sensitive): `"Python" in text`
3. Try passing text with punctuation:
   - `python -m apps.step_01_first_contact.app "Hello, NLP!"`

   Observe how punctuation affects `len(text)`.

---

## Conclusion

In Step 01, you learned the essential foundation for NLP in Python:

- text is a **string (`str`)**
- you can store it, print it, combine it, transform it, and measure it
- these operations form the base for every text processing pipeline

Next, in **Step 02**, we will go deeper into a simple but important metric: **counting characters** and understanding what text length really means.

---

# Repository scaffold (pyproject.toml)

Each step produces a Python app, and the whole series lives in a single repository. Here is a minimal, clean scaffold using `pyproject.toml`.

## Folder structure

```text name=repo-structure.txt
nlp-50-steps-python/
  pyproject.toml
  README.md
  .gitignore
  apps/
    __init__.py
    step_01_first_contact/
      __init__.py
      app.py
      README.md
```

## `pyproject.toml`

```toml name=pyproject.toml
[project]
name = "nlp-50-steps-python"
version = "0.1.0"
description = "A 50-step practical learning project for NLP in Python (one mini app per step)."
readme = "README.md"
requires-python = ">=3.10"
authors = [
  { name = "EdvaldoBarbosaGuimaraesFilho" }
]
dependencies = []

[tool.ruff]
line-length = 100
target-version = "py310"

[tool.ruff.lint]
select = ["E", "F", "I"]
```

## `.gitignore`

```gitignore name=.gitignore
__pycache__/
*.pyc
.venv/
.DS_Store
.idea/
.vscode/
```

## Top-level `README.md`

````markdown name=README.md
# NLP in Python — 50 Steps (Practical Project)

This repository is a structured, step-by-step learning journey for Natural Language Processing (NLP) in Python.

## How to run a step

Example (Step 01):

    python -m apps.step_01_first_contact.app "NLP starts here"

## Steps

- Step 01 — First contact with text in Python (strings)
````

## Step README

````markdown name=apps/step_01_first_contact/README.md
# Step 01 — First Contact with Text in Python (Strings)

Run:

    python -m apps.step_01_first_contact.app "NLP is fun"
````

## `__init__.py` files

```python name=apps/__init__.py
```

```python name=apps/step_01_first_contact/__init__.py
```
