Installing NLTK and Building Your First NLP Script in Python

When people first hear about Natural Language Processing, the subject can sound far more complicated than it really is. In practice, your first step into NLP with Python can be very simple: install a library, load a sentence, split that sentence into words, and inspect the result.

That is exactly why NLTK is such a good starting point.

In this article, we will install NLTK, download the resources it needs, and create a first Python script that performs a basic text-processing task. The goal is not to build an advanced AI model yet. The goal is to create your first working NLP program and understand what each line is doing.

This is one of the most important moments in the learning journey, because it is where NLP stops being theory and starts becoming code.


Why Start with a Simple Script?

Many beginners make the mistake of jumping too early into complex models, sentiment analysis, or machine learning pipelines. That usually creates confusion.

A much better approach is to begin with a small script that teaches three essential ideas:

  • text is data
  • text can be split into smaller parts
  • those parts can be analyzed by Python

Once you understand this, the rest of NLP becomes much easier to learn.

That is why your first script matters. It teaches the logic that everything else will build on later.
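These three ideas can be previewed with nothing but plain Python, before NLTK even enters the picture. A minimal sketch using str.split, which is cruder than what NLTK will give us:

```python
# "Text is data": a sentence is just a string that Python can inspect.
text = "Python is amazing for natural language processing."

# "Text can be split into smaller parts": split on whitespace.
parts = text.split()

# "Those parts can be analyzed by Python": count and display them.
print(len(parts))
print(parts)
```

Notice that the last element is 'processing.' with the period still attached. That is exactly the kind of detail a real tokenizer handles better, which is where NLTK comes in.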


What You Need Before Starting

To follow this tutorial, you only need:

  • Python installed on your machine
  • access to the terminal or command prompt
  • a code editor such as VS Code, PyCharm, or even IDLE

You do not need advanced math, machine learning knowledge, or a large dataset. A single sentence is enough for this first exercise.


Step 1: Install NLTK

The first step is installing the NLTK library.

Open your terminal or command prompt and run:

pip install nltk

This command downloads and installs the core NLTK package in your Python environment.

If the installation works correctly, Python will now recognize the nltk module.
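One quick way to confirm this (a small optional check, not required by the tutorial) is to ask Python whether it can locate the package:

```python
import importlib.util

# Ask Python whether it can find the nltk package in the current
# environment, without actually importing it.
spec = importlib.util.find_spec("nltk")
if spec is not None:
    print("nltk is installed")
else:
    print("nltk is NOT installed - run: pip install nltk")
```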


Step 2: Understand That NLTK Also Needs Language Resources

Installing the library is only part of the setup.

NLTK uses additional resources for tasks such as:

  • tokenization
  • stopword handling
  • lexical databases
  • grammatical tagging

These resources are downloaded separately. This is one reason NLTK is educational: it helps you see that the library code and the language data are not the same thing.

For your first script, the most important resource is usually the tokenizer data.


Step 3: Download the Tokenizer Resource

Create a small temporary Python file or open a Python interpreter and run:

import nltk
nltk.download('punkt')

This downloads the tokenizer package commonly used for sentence and word tokenization.

If everything works, NLTK will save the required resource locally so your scripts can use it later.

In many setups, this is enough for your first basic tokenization examples.
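One caveat worth knowing: newer NLTK releases look up a resource named punkt_tab instead of punkt for word tokenization, so a script that only downloads punkt can still hit a missing-resource error on a fresh install. A defensive sketch that downloads both names, covering either version of the library:

```python
import nltk

# Newer NLTK versions look for 'punkt_tab'; older ones use 'punkt'.
# Downloading both is harmless and covers either case.
for resource in ("punkt", "punkt_tab"):
    nltk.download(resource, quiet=True)
```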


Step 4: Create Your First Python NLP Script

Now let us write a very small script.

Create a file called:

first_nltk_script.py

Inside it, write the following code:

import nltk
from nltk.tokenize import word_tokenize
text = "Python is amazing for natural language processing."
tokens = word_tokenize(text)
print("Original text:")
print(text)
print("\nTokens:")
print(tokens)

This is your first NLP script with NLTK.


Understanding the Script Line by Line

Let us break it down carefully.

import nltk

This imports the NLTK library into your program.

Strictly speaking, this first script never calls nltk directly after this line, but importing it is part of the standard setup and reminds you that everything here is built around this toolkit.

from nltk.tokenize import word_tokenize

This imports the word_tokenize function, which is one of the most useful beginner tools in NLP.

Its job is to split text into smaller pieces called tokens.

A token is usually a word, but in practice it may also be punctuation.

text = "Python is amazing for natural language processing."

Here we define a string. At this stage, it is just ordinary Python text.

tokens = word_tokenize(text)

This is the key line.

The function takes the raw sentence and transforms it into a list of tokens.

Instead of seeing one large string, Python now sees individual pieces that can be counted, compared, filtered, or classified.

print(...)

These lines simply display the original text and the list of tokens, so you can see the transformation clearly.


Expected Output

When you run the script, the output should look similar to this:

Original text:
Python is amazing for natural language processing.
Tokens:
['Python', 'is', 'amazing', 'for', 'natural', 'language', 'processing', '.']

This output is extremely important for a beginner.

Why?

Because it shows the first major idea in NLP:

a sentence can be turned into a structured list of elements.

That is the beginning of computational language analysis.


What Tokenization Really Means

Tokenization is one of the most fundamental operations in NLP.

When humans read a sentence, we naturally perceive words and punctuation. A computer does not naturally do that in the same way. To the computer, text is initially just a sequence of characters.

Tokenization is the step where we tell the computer how to separate that stream into meaningful units.

For example:

Python is amazing for natural language processing.

becomes:

['Python', 'is', 'amazing', 'for', 'natural', 'language', 'processing', '.']

Now each piece can be handled individually.

That means you can later:

  • count words
  • remove punctuation
  • convert words to lowercase
  • search for specific terms
  • build word frequency tables
  • remove stopwords
  • prepare data for machine learning

So even though tokenization looks simple, it is one of the foundation stones of NLP.
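Each of those follow-up steps is ordinary Python list work. Here is a small sketch, using the token list from the example sentence (hard-coded so it runs even without the NLTK data installed):

```python
from collections import Counter
import string

# The token list word_tokenize produced earlier, hard-coded here.
tokens = ['Python', 'is', 'amazing', 'for', 'natural',
          'language', 'processing', '.']

# Count tokens.
print(len(tokens))

# Remove punctuation tokens.
words = [t for t in tokens if t not in string.punctuation]
print(words)

# Convert every word to lowercase.
lowered = [w.lower() for w in words]
print(lowered)

# Build a simple word frequency table.
print(Counter(lowered).most_common(3))
```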


Running the Script

To run the file, open your terminal in the folder where the script is saved and execute:

python first_nltk_script.py

Depending on your environment, you may need:

python3 first_nltk_script.py

If the output displays correctly, then you have successfully completed your first practical step in NLP with Python.

That is a real milestone.


Common Beginner Error: Missing Resource

A very common problem happens when the script is correct, but the required tokenizer data has not been downloaded yet.

In that case, Python may show an error indicating that a resource is missing.

The solution is usually to run:

import nltk
nltk.download('punkt')

and then try again.

This is one of the first practical lessons in working with NLP libraries: sometimes the code is fine, but the environment still needs language resources.
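A common defensive pattern (one approach among several) is to check for the resource first and download it only when it is missing. nltk.data.find raises a LookupError when a resource is not on disk, which is exactly the error you would otherwise see at tokenization time:

```python
import nltk

# Look for the tokenizer data; download it only if it is missing.
try:
    nltk.data.find("tokenizers/punkt")
except LookupError:
    nltk.download("punkt")
```

With this guard at the top of a script, the download happens once and later runs skip it.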


Improving the First Script

Once the basic version works, you can create a slightly more informative version.

import nltk
from nltk.tokenize import word_tokenize
text = "Python is amazing for natural language processing."
tokens = word_tokenize(text)
print("Original text:")
print(text)
print("\nNumber of characters:")
print(len(text))
print("\nNumber of tokens:")
print(len(tokens))
print("\nTokens:")
print(tokens)

Now the script does a little more:

  • prints the original sentence
  • counts the number of characters
  • counts the number of tokens
  • shows the token list

This turns the example into a miniature text-analysis program.


Why This Small Program Matters

Some beginners may look at this and think, “This is too simple.”

It is simple. And that is exactly why it is valuable.

This tiny program teaches several important principles at once:

1. Python can read language as data

The sentence is no longer just text for a human. It becomes something Python can process.

2. NLP begins with transformation

Before analysis, the text must be converted into a more usable structure.

3. Lists are central in text processing

After tokenization, the result is a Python list. That means all your normal Python list knowledge becomes useful in NLP.

4. Small steps build strong foundations

Advanced NLP systems are made of many small processing stages. This first script introduces that mindset.


A Slightly More Interactive Version

You can also let the user type their own sentence.

import nltk
from nltk.tokenize import word_tokenize
text = input("Enter a sentence: ")
tokens = word_tokenize(text)
print("\nOriginal text:")
print(text)
print("\nTokens:")
print(tokens)
print("\nNumber of tokens:")
print(len(tokens))

This version is useful because it turns the script into a tiny terminal app.

Now you can test different sentences and observe how tokenization behaves.

This is a great learning habit: do not just run one example. Try many.

Test things like:

  • short sentences
  • long sentences
  • punctuation
  • repeated words
  • uppercase and lowercase text
  • questions and exclamations

The more you experiment, the better you understand how the tokenizer works.


What You Have Learned So Far

By this point, you have already touched several core NLP ideas:

  • installing a Python NLP library
  • importing specific tools from a module
  • preparing the environment
  • processing raw text
  • tokenizing a sentence
  • turning unstructured text into a structured list

This may look small, but it is a strong beginning.

A learner who truly understands this step will have a much easier time with later concepts such as:

  • stopword removal
  • frequency analysis
  • stemming
  • lemmatization
  • part-of-speech tagging
  • sentiment analysis
  • text classification

Suggested Practice Exercises

To make this lesson stronger, try a few simple variations on your own.

Exercise 1

Change the sentence and test the output with punctuation.

Example:

text = "Hello, world! Python is fun."

Exercise 2

Print each token on a separate line.

for token in tokens:
    print(token)

Exercise 3

Count how many times a specific word appears.

You can start exploring this after tokenization.
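A sketch of how that count can work, using a hard-coded token list with a repeated word (in your script the list would come from word_tokenize):

```python
# A token list with repetition, as word_tokenize might produce it.
tokens = ['the', 'cat', 'saw', 'the', 'other', 'cat', '.']

# Python lists already know how to count occurrences.
print(tokens.count('cat'))
print(tokens.count('the'))
print(tokens.count('dog'))
```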

Exercise 4

Convert the whole sentence to lowercase before tokenizing.

This introduces the idea of normalization.
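A sketch of why normalization matters. Plain split() stands in for word_tokenize here so the example runs without the NLTK data, but the idea is identical:

```python
text = "Python is great. PYTHON is popular. python is fun."

# Without normalization, the same word shows up in three forms.
raw = text.split()
print(raw.count("Python"), raw.count("PYTHON"), raw.count("python"))

# Lowercasing first unifies them into one countable form.
normalized = text.lower().split()
print(normalized.count("python"))
```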


A Good Beginner Mindset

At this stage, do not worry about “real AI” yet.

Your main goal should be to build comfort with the workflow:

  • write a script
  • test a sentence
  • inspect the output
  • change the input
  • observe what changes

This is how technical intuition is built.

Many people stay stuck because they keep reading theory without writing small working programs. This first script is important because it breaks that barrier.

You are no longer only reading about NLP. You are doing NLP.


Conclusion

Installing NLTK and writing a first tokenization script is one of the best ways to begin learning Natural Language Processing with Python. It is simple, practical, and conceptually rich.

With just a few lines of code, you learn that text can be transformed into tokens, that NLP begins with structure, and that even a short sentence can become analyzable data.

This first script may look small, but it teaches the core habit that will support everything later: breaking language into steps that Python can understand.

That is the real beginning of NLP.


Full Code

Here is the final beginner version again for easy copy and use:

import nltk
from nltk.tokenize import word_tokenize
text = "Python is amazing for natural language processing."
tokens = word_tokenize(text)
print("Original text:")
print(text)
print("\nNumber of characters:")
print(len(text))
print("\nNumber of tokens:")
print(len(tokens))
print("\nTokens:")
print(tokens)

Published by Edvaldo Guimrães Filho