NLTK stop words


Natural Language Processing (NLP) is an intricate field focused on the challenge of understanding human language. One of its core aspects is handling ‘stop words’ – words which, due to their high frequency in text, often don’t offer significant insights on their own.

Stop words like ‘the’, ‘and’, and ‘I’, although common, don’t usually provide meaningful information about a document’s specific topic. By eliminating these words from a corpus, we can more easily identify unique and relevant terms.

It’s important to note that there isn’t a universally accepted list of stop words in NLP. However, the Natural Language Toolkit, or NLTK, does offer a list for researchers and practitioners to utilize.

Throughout this guide, you’ll discover how to efficiently remove stop words using the nltk module, streamlining your text data for better analysis.


Removing Stop Words in NLP

We’ll be building upon code from a prior tutorial that dealt with tokenizing words.

Despite being crucial for sentence structure, most stop words don’t enhance our understanding of sentence semantics. Below is a small sample of frequently used English words:

N = ['stop', 'the', 'to', 'and', 'a', 'in', 'it', 'is', 'I', 'that', 'had', 'on', 'for', 'were', 'was']
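To make the idea concrete, here is a minimal sketch that filters a sentence against the sample list above using plain Python. The example sentence and the whitespace split are assumptions made for illustration; NLTK's tokenizer, used later in this guide, handles punctuation properly.

```python
# Sample stop words from the list above (hand-picked, not NLTK's list).
N = ['stop', 'the', 'to', 'and', 'a', 'in', 'it', 'is', 'I', 'that', 'had', 'on', 'for', 'were', 'was']

# Lower-case the list so that 'The' and 'the' are treated alike.
stop_set = {w.lower() for w in N}

sentence = "The cat was in the garden and it had a nap"  # illustrative sentence
kept = [w for w in sentence.split() if w.lower() not in stop_set]

print(kept)  # ['cat', 'garden', 'nap']
```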

Thankfully, with NLTK, you don’t have to manually define every stop word. The library already includes a predefined list of common words that typically don’t carry much semantic weight. NLTK’s English list contains well over a hundred such words (the exact number varies between NLTK versions), for example: “a”, “an”, “the”, and “of”.

How to Access NLTK’s Stopword List

Stop words are those which, due to their ubiquity, aren’t typically used to describe a document’s main topic. If you’re setting up nltk for the first time, loading the list may raise a LookupError reporting that the stopwords resource was not found. In that case, download the resource as shown below:

>>> import nltk
>>> nltk.download('stopwords')
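One way to make the first-run case less brittle is to attempt the download only when the lookup fails. The sketch below is an assumption about how you might wrap this, not part of NLTK: load_english_stop_words, its allow_download flag, and the tiny fallback tuple are hypothetical names chosen for illustration.

```python
def load_english_stop_words(allow_download=True, fallback=("a", "an", "the", "of")):
    """Return NLTK's English stop words as a set.

    Falls back to a tiny hand-written set (an illustrative assumption)
    when NLTK or its 'stopwords' resource cannot be loaded.
    """
    try:
        import nltk
        from nltk.corpus import stopwords
    except ImportError:
        return set(fallback)  # NLTK is not installed

    try:
        return set(stopwords.words("english"))
    except LookupError:
        if not allow_download:
            return set(fallback)

    # The resource is missing: fetch it once, then retry the lookup.
    nltk.download("stopwords", quiet=True)
    try:
        return set(stopwords.words("english"))
    except LookupError:
        return set(fallback)  # download failed (for example, no network)
```

Calling stops = load_english_stop_words() then gives you a set to filter against, whether or not the resource had already been downloaded.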

To get a glimpse of the stop words that NLTK offers for English (or other languages), you can use the following snippet:

import nltk
from nltk.corpus import stopwords

stops = set(stopwords.words('english'))
print(stops)

For those working with languages other than English, NLTK provides stop word lists for several other languages, such as German, Indonesian, Portuguese, and Spanish:

stops = set(stopwords.words('german'))
stops = set(stopwords.words('indonesian'))
stops = set(stopwords.words('portuguese'))
stops = set(stopwords.words('spanish'))

Implementing Stopword Filtering with NLTK

For the purpose of this demonstration, we’ll use a predefined string. However, this method can easily be applied to a text file:

with open("shakespeare.txt") as f:
    text = f.read().lower()

Now, let’s observe how to filter out the stop words:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))
# word_tokenize needs NLTK's Punkt tokenizer data to be downloaded
words = word_tokenize(data)
# NLTK's stop words are lower-case, so compare lower-cased tokens
wordsFiltered = [w for w in words if w.lower() not in stopWords]

print(wordsFiltered)

In the provided code, we first imported the necessary nltk modules, retrieved the set of English stop words, tokenized our text, and then created a list, wordsFiltered, which only contains words not present in the stop word list. Each token is lower-cased before the comparison because the list is all lower-case; otherwise a capitalized “All” would slip through the filter.

This approach streamlines the data and focuses on terms that are more likely to offer unique insights about the text’s topic.
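To see the effect on a simple frequency count, the sketch below compares the most common terms before and after filtering. It uses only the standard library so it runs without NLTK data: top_terms is a hypothetical helper, the regular expression is a crude stand-in for word_tokenize, and small_stops is a hand-written stand-in for set(stopwords.words('english')).

```python
import re
from collections import Counter

def top_terms(text, stop_words, n=3):
    """Return the n most frequent words in text that are not stop words."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
small_stops = {"all", "and", "no", "a"}  # hand-written stand-in for NLTK's list

print(top_terms(data, set()))        # unfiltered: stop words crowd the top
print(top_terms(data, small_stops))  # filtered: content words remain
```

Without filtering, “all” and “and” appear among the top terms; with the stop words removed, the count starts with “work”, “play”, and “makes”.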


Posted in nltk

2021-07-22

Leave a Reply:

Haris saeed • 2022-09-03T11:51:43.891Z

i want roman urdu stop words how can i do that?

Frank • 2022-09-03T11:51:44.891Z

You can create your own list of stop words

newWords = ['word1','word2']
stopWords = stopwords.words('english')
stopWords.extend(newWords)