
NLTK stop words


Natural language processing (NLP) is a research field that presents many challenges, such as natural language understanding.

Text may contain stop words like ‘the’, ‘is’, ‘are’. Stop words can be filtered from the text to be processed. There is no universal list of stop words in NLP research; however, the nltk module contains a list of stop words.

In this article you will learn how to remove stop words with the nltk module.

Natural Language Processing: remove stop words


We start with the code from the previous tutorial, which tokenized words.

Stop words are words that are very common but don’t provide useful information for most text analysis procedures.

While they are helpful for understanding the structure of a sentence, they do not help you understand its semantics. Here’s a list of some of the most commonly used words in English:

N = [ 'stop', 'the', 'to', 'and', 'a', 'in', 'it', 'is', 'I', 'that', 'had', 'on', 'for', 'were', 'was']

With nltk you don’t have to define every stop word manually. Stop words are frequently used words that carry very little meaning, so many text-processing pipelines drop them right after tokenization.

NLTK (Natural Language Toolkit) ships with a predefined list of English stop words, including: “a”, “an”, “the”, “of”, “in”, etc. The exact number of entries depends on the NLTK version.

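The stop words in nltk are among the most common words in English text. They are words that you do not want to use to describe the topic of your content. The list is predefined, but stopwords.words() returns an ordinary Python list, so you can add or remove entries in your own copy.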
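As a minimal sketch of adjusting the list: the helper name customize and the short stand-in list below are assumptions made so the example runs without NLTK data. With nltk installed you would pass stopwords.words('english') as the base instead. NLTK's English list includes negations such as "no" and "not", which you may want to keep in your text.

```python
def customize(base_words, add=(), remove=()):
    """Return a new stop-word set; the input list is left untouched."""
    return (set(base_words) | set(add)) - set(remove)


# Stand-in for stopwords.words('english'), so the sketch runs without NLTK data.
base = ["a", "and", "no", "not", "the"]

# Treat "jack" as a stop word, but keep the negations in the text.
custom = customize(base, add=["jack"], remove=["no", "not"])
print(sorted(custom))  # ['a', 'and', 'jack', 'the']
```

Working on a copy means the list that NLTK returns is never modified, so other code that loads it still sees the original entries.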


from nltk.tokenize import sent_tokenize, word_tokenize

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
words = word_tokenize(data)
print(words)

Getting rid of stop words makes sense for many natural language processing tasks. In this code you will see how to remove them from your texts.

First let’s import a few packages that we will need:

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

The last one is key here: it contains all the stop words.

from nltk.corpus import stopwords

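The English list it provides consists mostly of function words such as articles, prepositions and pronouns. These words are commonly filtered out before tasks such as keyword extraction, search indexing or topic modeling. Tasks that depend on sentence structure, such as part-of-speech tagging and parsing, usually keep them.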

NLTK Stopword List

So stop words are words that are very common in human language, such as “the”, “of”, and “to”, but are generally not useful for analysis because they carry little meaning on their own.

If you get a LookupError saying that the stopwords resource was not found, make sure to download the stop words after installing nltk.

>>> import nltk
>>> nltk.download('stopwords')

You can view the list of included stop words in NLTK using the code below:

import nltk
from nltk.corpus import stopwords

stops = set(stopwords.words('english'))
print(stops)

You can do the same for other languages by passing the name of the language you need.

stops = set(stopwords.words('german'))
stops = set(stopwords.words('indonesian'))
stops = set(stopwords.words('portuguese'))
stops = set(stopwords.words('spanish'))
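To check which language names are accepted, you can ask the corpus for its file IDs. The sketch below assumes the stopwords corpus has already been downloaded; the function name available_stopword_languages is made up for this example.

```python
def available_stopword_languages():
    """Return the language names that stopwords.words() accepts."""
    from nltk.corpus import stopwords  # needs nltk.download('stopwords') first

    # Each file ID in the corpus is the name of one language.
    return sorted(stopwords.fileids())
```

Calling print(available_stopword_languages()) shows the exact spelling of every supported language name on your installation.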

Filter stop words nltk

We will use a string (data) as text. Of course you can also do this with a text file as input. If you want to use a text file instead, you can do this:

with open("shakespeare.txt") as f:
    text = f.read().lower()

The program below filters stop words from the data.


from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))
words = word_tokenize(data)
wordsFiltered = []

# keep only the tokens that are not stop words
for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)

print(wordsFiltered)

A module has been imported:


from nltk.corpus import stopwords

We get a set of English stop words using the line:


stopWords = set(stopwords.words('english'))

The set stopWords contains 153 stop words on my computer; the number varies between NLTK versions.
You can view the size or contents of this set with the lines:


print(len(stopWords))
print(stopWords)

We create a new list called wordsFiltered, which contains all words that are not stop words.
To create it, we iterate over the list of words and add each word only if it’s not in the stopWords set.


for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)
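The same loop can be written as a list comprehension. The sketch below also shows a variant that compares lowercased tokens and drops punctuation, because NLTK's English stop words are all lowercase (so "All" is not matched) and word_tokenize returns "." as a separate token. The small stop_words set, the hard-coded token list and the helper is_punctuation are stand-ins so the example runs without NLTK data.

```python
import string

# Stand-ins for set(stopwords.words('english')) and word_tokenize(data).
stop_words = {"all", "and", "no", "a"}
words = ["All", "work", "and", "no", "play", "makes", "jack", "a", "dull", "boy", "."]

# Same behaviour as the for loop, written as a list comprehension.
words_filtered = [w for w in words if w not in stop_words]
print(words_filtered)
# ['All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.']


def is_punctuation(token):
    """True if the token consists only of punctuation characters."""
    return all(ch in string.punctuation for ch in token)


# Variant: compare in lowercase and skip punctuation-only tokens.
normalized = [
    w for w in words
    if w.lower() not in stop_words and not is_punctuation(w)
]
print(normalized)
# ['work', 'play', 'makes', 'jack', 'dull', 'boy']
```

Whether to lowercase before comparing depends on your task; the original casing of the kept words is preserved either way.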
