NLTK stop words

Natural language processing (NLP) is a research field that presents many challenges, such as natural language understanding.

Text may contain stop words such as 'the', 'is', and 'are'. Stop words can be filtered out of the text before it is processed. There is no universal list of stop words in NLP research; however, the NLTK module ships with one.

In this article you will learn how to remove stop words with the NLTK module.

Natural Language Processing: remove stop words

We start with the code from the previous tutorial, which tokenized words.

from nltk.tokenize import sent_tokenize, word_tokenize
 
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
words = word_tokenize(data)
print(words)

We modify it to:

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
 
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))
words = word_tokenize(data)
wordsFiltered = []
 
for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)
 
print(wordsFiltered)
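One thing to watch for: the entries in stopwords.words('english') are all lowercase, so a capitalized token such as 'All' will not match 'all' and will survive the filter. A common refinement is to lowercase each token before the membership test. Here is a minimal sketch of that idea, using a small hand-picked stop word subset rather than NLTK's full list so it runs on its own:

```python
# NLTK's English stop words are lowercase, so lowercase each token
# before comparing. A small hardcoded subset stands in for the real list.
stop_words = {"all", "and", "no", "a"}
words = ["All", "work", "and", "no", "play"]
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)  # ['work', 'play']
```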

A new import has been added:

from nltk.corpus import stopwords

We get a set of English stop words using the line below. Note that the stop words corpus must be downloaded once with nltk.download('stopwords') before this call will work.

stopWords = set(stopwords.words('english'))

The resulting set stopWords contains 153 stop words on my computer (the exact count varies between NLTK versions).
You can view the size or contents of this set with the lines:

print(len(stopWords))
print(stopWords)
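Wrapping the word list in set() is not cosmetic: testing w in stopWords against a set is an average-case O(1) hash lookup, whereas the same test against a list scans every element. A self-contained sketch of the membership test, again with a hardcoded subset in place of NLTK's list:

```python
# Membership tests are fast on a set (hash lookup) rather than a
# linear scan over a list. Hardcoded subset used for illustration.
stop_words = {"the", "is", "are", "and"}
print("the" in stop_words)   # True
print("play" in stop_words)  # False
```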

We create a new list called wordsFiltered that contains all words which are not stop words.
To build it, we iterate over the list of words and append each word only if it is not in the stopWords set.

for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)
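The loop above can also be written as a list comprehension, which is the more idiomatic Python form. In this sketch a plain set stands in for set(stopwords.words('english')) so the example runs without NLTK:

```python
# Equivalent of the filtering loop, written as a list comprehension.
stop_words = {"and", "no", "a", "."}  # stand-in for NLTK's stop word set
words = ["All", "work", "and", "no", "play", "makes", "jack",
         "a", "dull", "boy", "."]
words_filtered = [w for w in words if w not in stop_words]
print(words_filtered)  # ['All', 'work', 'play', 'makes', 'jack', 'dull', 'boy']
```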