Step 4 — Sorting and Displaying the Most Common Words in Text (with Python)

Introduction

In the previous step, you learned how to count word frequencies and store the results in a dictionary. Now, you’ll tackle a new and practical skill for text analytics: sorting and displaying the most common words in your data. This is crucial for anyone who wants to quickly summarize, visualize, or compare texts. In this step, you’ll write a Python script to find and display the “top N” words—something you’ll use in all kinds of NLP explorations.


Main concept explained clearly

A frequency dictionary gives you each unique word and how many times it appeared, but a dictionary stores items in insertion order (guaranteed since Python 3.7), not in order of frequency. To answer, “What are the most (or least) frequent words?”, you need to:

  1. Extract items (word/count pairs) from the dictionary.
  2. Sort them by the count (usually descending, for most common first).
  3. Display the top N (such as top 5 or top 10).

This step uses two powerful Python skills:

  • The sorted() function with the key parameter.
  • The items() method of dictionaries to loop over (word, count) pairs.
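To see these two building blocks in isolation before combining them, here is a tiny sketch on a made-up dictionary (the fruit words are just illustrative, not from our text):

```python
# items() turns a dict into (key, value) pairs you can iterate over
scores = {'apple': 2, 'banana': 5, 'cherry': 1}
pairs = list(scores.items())
print(pairs)  # [('apple', 2), ('banana', 5), ('cherry', 1)]

# sorted() with a key function orders those pairs by the second
# element of each tuple (the count), ascending by default
by_count = sorted(pairs, key=lambda item: item[1])
print(by_count)  # [('cherry', 1), ('apple', 2), ('banana', 5)]
```

Note that `sorted()` returns a new list and leaves the dictionary untouched; we will add `reverse=True` below to get the most common words first.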

Why this matters in NLP

  • Top-word analysis is a classic first “exploration” for any new dataset, giving an immediate sense of document focus and vocabulary.
  • Plotting, word clouds, and keyword extraction all start with sorted frequency lists.
  • Removing, filtering, or flagging the most frequent (or rare) terms is a starting point for preprocessing, stop-word filtering, or feature selection.
  • Comparing top words between corpora (e.g., spam emails vs. real emails) is a basic and powerful data science move.

Python example

We’ll use the word frequency code from Step 3, and add the logic to sort and display the top N items.

Step 4.1 — Existing frequency dictionary (from last step)

word_freq = {'nlp': 3, 'in': 1, '2024': 1, 'is': 1, 'practical': 1, 'learning': 1, 'step': 2, 'by': 1, 'makes': 1, 'it': 1, 'easier': 1}

Step 4.2 — Sort by frequency (descending)

# sorted() sorts list of tuples (word, freq), highest first
sorted_words = sorted(word_freq.items(), key=lambda item: item[1], reverse=True)
print(sorted_words)

Step 4.3 — Display the top N words

top_n = 5
print(f"Top {top_n} most common words:")
for word, count in sorted_words[:top_n]:
    print(f"{word}: {count}")
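As an aside (not required for this step), Python’s standard library offers collections.Counter, whose most_common() method does the same sort-and-slice in one call. A minimal sketch, using a subset of the Step 4.1 dictionary:

```python
from collections import Counter

# Counter accepts an existing frequency dict directly
word_freq = {'nlp': 3, 'in': 1, '2024': 1, 'is': 1, 'step': 2}

# most_common(n) returns the n highest-count (word, count) pairs
print(Counter(word_freq).most_common(2))  # [('nlp', 3), ('step', 2)]
```

Writing the sort yourself, as we do in this step, is still worth learning: the sorted()/key pattern works on any data, not just counts.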

Step 4.4 — All steps together in a reusable script

```python name=step04_sorting_top_words.py
import string

text = "NLP, in 2024! NLP is practical. Learning NLP step by step makes it easier."

# 1. Normalize and tokenize (review of steps 2/3)
clean_text = text.strip().lower()
translator = str.maketrans('', '', string.punctuation)
no_punct_text = clean_text.translate(translator)
tokens = no_punct_text.split()

# 2. Count word frequencies
word_freq = {}
for word in tokens:
    word_freq[word] = word_freq.get(word, 0) + 1

# 3. Sort by frequency (descending)
sorted_words = sorted(word_freq.items(), key=lambda item: item[1], reverse=True)

# 4. Display top N words
top_n = 5
print(f"Top {top_n} most common words:")
for word, count in sorted_words[:top_n]:
    print(f"{word}: {count}")
```


Line-by-line explanation of the code

  • import string
    Imports standard string manipulation tools.
  • text = ...
    The sample text to process.
  • clean_text = text.strip().lower()
    Removes outer spaces and standardizes capitalization.
  • translator = str.maketrans('', '', string.punctuation)
    Builds a table for removing punctuation.
  • no_punct_text = clean_text.translate(translator)
    Removes punctuation, making split reliable.
  • tokens = no_punct_text.split()
    Splits the normalized, clean text into words.
  • word_freq[word] = word_freq.get(word, 0) + 1
    Adds 1 to a word’s count, starting from 0 if the word is new (the counting pattern from Step 3).
  • sorted_words = sorted(word_freq.items(), key=lambda item: item[1], reverse=True)
    Converts the dictionary to a list of (word, count) tuples and sorts it by count, descending. The key argument tells sorted() to compare pairs by their frequency (item[1]).
  • for word, count in sorted_words[:top_n]
    Slicing with [:top_n] shows only the top N items.

Practical notes

  • If N is larger than the number of unique words, all words are shown.
  • You can show the least common words by changing reverse=True to reverse=False.
  • For real text, high-frequency words may be “stop words” (e.g., “the”, “is”, “and”)—in future steps, you’ll learn to filter those.
  • You can easily adapt the code to print out both the top and bottom words, or to display percentages.
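The “display percentages” idea above can be sketched like this (a minimal sketch; the counts are a subset of the Step 4.1 dictionary):

```python
word_freq = {'nlp': 3, 'step': 2, 'in': 1, 'is': 1}

# Total tokens = sum of all counts; each word's share is count / total
total = sum(word_freq.values())
sorted_words = sorted(word_freq.items(), key=lambda item: item[1], reverse=True)

for word, count in sorted_words:
    # The :.1% format code multiplies by 100 and appends a percent sign
    print(f"{word}: {count} ({count / total:.1%})")
```

Percentages make it easier to compare texts of different lengths, since raw counts grow with document size.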

Suggested mini exercise

  1. Replace text with a paragraph in Portuguese (or mixed language).
  2. Change top_n to see more or fewer top words.
  3. Try displaying the least frequent words (set reverse=False).
  4. For a fun experiment, try a really short text and see how sorting changes.
  5. For extra learning, print all counts tied for the most frequent word (if there’s a tie).
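If you get stuck on exercise 5, one possible approach (a sketch, not the only solution) is to read the highest count off the sorted list and keep every pair that matches it:

```python
# Toy dictionary with a deliberate tie for first place
word_freq = {'nlp': 3, 'go': 3, 'step': 2, 'in': 1}

sorted_words = sorted(word_freq.items(), key=lambda item: item[1], reverse=True)
top_count = sorted_words[0][1]  # the highest frequency

# Keep all (word, count) pairs whose count equals the maximum
tied = [(w, c) for w, c in sorted_words if c == top_count]
print(tied)  # [('nlp', 3), ('go', 3)]
```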

Conclusion

You now have a mini toolkit for frequency-based word analysis—a high-impact routine that’s useful for EDA (Exploratory Data Analysis) in NLP, digital humanities, data science, and “real world” text mining. By learning to sort and display the most common words, you are ready to begin exploring, summarizing, and comparing documents effectively.

In the next step, you’ll address one of the classic challenges: removing “stop words” (very common, uninformative words) from your analysis to sharpen your results even more. When you’re ready, just say “next” to proceed!

Published by Edvaldo Guimrães Filho