Step 4 — Sorting and Displaying the Most Common Words in Text (with Python)
Introduction
In the previous step, you learned how to count word frequencies and store the results in a dictionary. Now, you’ll tackle a new and practical skill for text analytics: sorting and displaying the most common words in your data. This is crucial for anyone who wants to quickly summarize, visualize, or compare texts. In this step, you’ll write a Python script to find and display the “top N” words—something you’ll use in all kinds of NLP explorations.
Main concept explained clearly
A frequency dictionary gives you each unique word and how many times it appeared, but a dictionary keeps entries in insertion order (and, before Python 3.7, in no guaranteed order at all), not in order of frequency. To answer, “What are the most (or least) frequent words?”, you need to:
- Extract items (word/count pairs) from the dictionary.
- Sort them by the count (usually descending, for most common first).
- Display the top N (such as top 5 or top 10).
This step uses two powerful Python skills:
- The `sorted()` function with the `key` parameter.
- The `items()` method of dictionaries to loop over (word, count) pairs.
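As a quick illustration of how these two pieces combine, here is a minimal sketch using a tiny made-up dictionary (the words and counts are invented for this example):

```python
# A tiny made-up frequency dictionary for illustration
freq = {"apple": 2, "banana": 5, "cherry": 3}

# items() yields (word, count) pairs; sorted() orders them by the
# second tuple element (the count), largest first
pairs = sorted(freq.items(), key=lambda item: item[1], reverse=True)
print(pairs)  # [('banana', 5), ('cherry', 3), ('apple', 2)]
```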
Why this matters in NLP
- Top-word analysis is a classic first “exploration” for any new dataset, giving an immediate sense of document focus and vocabulary.
- Plotting, word clouds, and keyword extraction all start with sorted frequency lists.
- Removing, filtering, or flagging the most frequent (or rare) terms is a starting point for preprocessing, stop-word filtering, or feature selection.
- Comparing top words between corpora (e.g., spam emails vs. real emails) is a basic and powerful data science move.
Python example
We’ll use the word frequency code from Step 3, and add the logic to sort and display the top N items.
Step 4.1 — Existing frequency dictionary (from last step)
word_freq = {'nlp': 3, 'in': 1, '2024': 1, 'is': 1, 'practical': 1, 'learning': 1, 'step': 2, 'by': 1, 'makes': 1, 'it': 1, 'easier': 1}
Step 4.2 — Sort by frequency (descending)
# sorted() returns a list of (word, freq) tuples, highest count first
sorted_words = sorted(word_freq.items(), key=lambda item: item[1], reverse=True)
print(sorted_words)
Step 4.3 — Display the top N words
top_n = 5
print(f"Top {top_n} most common words:")
for word, count in sorted_words[:top_n]:
    print(f"{word}: {count}")
Step 4.4 — All steps together in a reusable script
```python name=step04_sorting_top_words.py
import string

text = "NLP, in 2024! NLP is practical. Learning NLP step by step makes it easier."

# 1. Normalize and tokenize (review of steps 2/3)
clean_text = text.strip().lower()
translator = str.maketrans('', '', string.punctuation)
no_punct_text = clean_text.translate(translator)
tokens = no_punct_text.split()

# 2. Count word frequencies
word_freq = {}
for word in tokens:
    word_freq[word] = word_freq.get(word, 0) + 1

# 3. Sort by frequency
sorted_words = sorted(word_freq.items(), key=lambda item: item[1], reverse=True)

# 4. Display top N words
top_n = 5
print(f"Top {top_n} most common words:")
for word, count in sorted_words[:top_n]:
    print(f"{word}: {count}")
```
Line-by-line explanation of the code
- `import string`: imports the standard library's string constants, including `string.punctuation`.
- `text = ...`: the sample text to process.
- `clean_text = text.strip().lower()`: removes outer spaces and standardizes capitalization.
- `translator = str.maketrans('', '', string.punctuation)`: builds a translation table for removing punctuation.
- `no_punct_text = clean_text.translate(translator)`: removes punctuation, making the split reliable.
- `tokens = no_punct_text.split()`: splits the normalized, clean text into words.
- `word_freq[word] = word_freq.get(word, 0) + 1`: word frequency calculation, as in Step 3; adds 1 to a word's count (if new, starts at 0).
- `sorted_words = sorted(word_freq.items(), key=lambda item: item[1], reverse=True)`: converts the dictionary to a list of (word, count) tuples and sorts by count, descending. The `key` argument tells `sorted()` to use the frequency (`item[1]`).
- Looping through `sorted_words[:top_n]` shows only the top `N` items.
Practical notes
- If `N` is larger than the number of unique words, all words are shown.
- You can show the least common words by changing `reverse=True` to `reverse=False`.
- For real text, high-frequency words may be “stop words” (e.g., “the”, “is”, “and”); in future steps, you'll learn to filter those.
- You can easily adapt the code to print out both the top and bottom words, or to display percentages.
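As a side note, the standard library's `collections.Counter` does the counting and sorting in one go via its `most_common()` method. The sketch below also shows the percentages idea; the token list here is a made-up example:

```python
from collections import Counter

# Made-up token list for illustration
tokens = ["nlp", "step", "nlp", "by", "step", "nlp"]

word_freq = Counter(tokens)          # counts each token
total = sum(word_freq.values())      # total number of tokens

# most_common(n) returns the n highest-count (word, count) pairs
for word, count in word_freq.most_common(2):
    print(f"{word}: {count} ({count / total:.0%})")
```

Using `Counter` keeps the hand-rolled dictionary version above valuable for learning, but is the idiomatic choice in production code.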
Suggested mini exercise
- Replace `text` with a paragraph in Portuguese (or mixed language).
- Change `top_n` to see more or fewer top words.
- Try displaying the least frequent words (set `reverse=False`).
- For a fun experiment, try a really short text and see how sorting changes.
- For extra learning, print all words tied for the most frequent count (if there's a tie).
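For the tie exercise, one possible approach is sketched below (the small dictionary is invented so that a tie actually occurs): take the count of the first entry in the sorted list, then keep every word that matches it.

```python
# Made-up frequencies with a deliberate tie at the top
word_freq = {"nlp": 3, "step": 3, "by": 1, "easier": 1}
sorted_words = sorted(word_freq.items(), key=lambda item: item[1], reverse=True)

# The first entry holds the highest count; collect every word matching it
top_count = sorted_words[0][1]
tied = [word for word, count in sorted_words if count == top_count]
print(f"Most frequent ({top_count}x): {', '.join(tied)}")
```

Because `sorted()` is stable, tied words stay in their original dictionary order.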
Conclusion
You now have a mini toolkit for frequency-based word analysis—a high-impact routine that’s useful for EDA (Exploratory Data Analysis) in NLP, digital humanities, data science, and “real world” text mining. By learning to sort and display the most common words, you are ready to begin exploring, summarizing, and comparing documents effectively.
In the next step, you’ll address one of the classic challenges: removing “stop words” (very common, uninformative words) from your analysis to sharpen your results even more. When you’re ready, just say “next” to proceed!
