Natural Language Processing (NLP) is an intricate field focused on the challenge of understanding human language. One of its core aspects is handling ‘stop words’ – words which, due to their high frequency in text, often don’t offer significant insights on their own.
Stop words like ‘the’, ‘and’, and ‘I’, although common, don’t usually provide meaningful information about a document’s specific topic. By eliminating these words from a corpus, we can more easily identify unique and relevant terms.
It’s important to note that there isn’t a universally accepted list of stop words in NLP. However, the Natural Language Toolkit, or NLTK, does offer a list for researchers and practitioners to utilize.
Throughout this guide, you’ll discover how to efficiently remove stop words using the nltk module, streamlining your text data for better analysis.
We’ll be building upon code from a prior tutorial that dealt with tokenizing words.
Despite being crucial for sentence structure, most stop words don’t enhance our understanding of sentence semantics. Below is a small sample of frequently used English words:
common_words = ['the', 'to', 'and', 'a', 'in', 'it', 'is', 'I', 'that', 'had', 'on', 'for', 'were', 'was']
Thankfully, with NLTK, you don’t have to manually define every stop word. The library already includes a predefined list of common words that typically don’t carry much semantic weight. NLTK’s default English list contains well over a hundred such words, for example: “a”, “an”, “the”, and “of”.
Stop words are those which, due to their ubiquity, aren’t typically used to describe a document’s main topic. If you’re setting up nltk for the first time and encounter an error stating “NLTK stop words not found”, make sure to download the necessary resources as shown below:
To get a glimpse of the stop words that NLTK offers for English (or other languages), you can use the following snippet:
For those working with languages other than English, NLTK provides stop word lists for several other languages, such as German, Indonesian, Portuguese, and Spanish:
from nltk.corpus import stopwords
stops = set(stopwords.words('german'))
For the purpose of this demonstration, we’ll use a predefined string, but the same method applies just as easily to text read from a file:

with open("shakespeare.txt") as f:
    text = f.read().lower()
Now, let’s observe how to filter out the stop words:
from nltk.tokenize import sent_tokenize, word_tokenize
In the provided code, we first imported the necessary nltk modules, retrieved the set of English stop words, tokenized our text, and then created a list, wordsFiltered, which contains only the words not present in the stop word list.
This approach streamlines the data and focuses on terms that are more likely to offer unique insights about the text’s topic.