Text may contain stop words such as ‘the’, ‘is’ and ‘are’. Stop words can be filtered out of the text before it is processed. There is no universal list of stop words in NLP research; however, the NLTK module includes one.
In this article you will learn how to remove stop words with the NLTK module.
Natural Language Processing: remove stop words
We start with the code from the previous tutorial, which tokenized words.
We modify it so that stop words are removed after the text has been tokenized.
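A minimal sketch of such a program is shown here. It assumes NLTK is installed and that the tokenizer and `stopwords` data packages have been fetched with `nltk.download()`. The function name `remove_stop_words` is only illustrative, and the imports sit inside the function so that a missing install fails only when the function is called.

```python
def remove_stop_words(text):
    """Tokenize text with NLTK and return the tokens that are not stop words."""
    # Assumption: NLTK plus its tokenizer and 'stopwords' data are available.
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # English stop words as a set, for fast membership tests.
    stopWords = set(stopwords.words("english"))

    # Split the text into word and punctuation tokens.
    words = word_tokenize(text)

    # Keep every token that is not a stop word.
    wordsFiltered = []
    for w in words:
        if w not in stopWords:
            wordsFiltered.append(w)
    return wordsFiltered
```

Calling `remove_stop_words("All work and no play makes jack a dull boy.")` drops tokens such as "and", "no" and "a". The comparison is case-sensitive and the NLTK list is lowercase, so a capitalized token like "All" is kept unless the text is lowercased first.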
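One more name has been imported: `stopwords`, from `nltk.corpus`.
We get a set of English stop words by passing `stopwords.words('english')` to `set()`.
The resulting set, stopWords, contains 153 stop words on my computer; the number may differ with the version of the NLTK data installed.
You can view the length or contents of this set with `len(stopWords)` and `print(stopWords)`.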
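As a small illustration, the snippet below inspects a set in this way. To stay runnable without NLTK data it uses a hand-picked stand-in set, which is an assumption for the example; in the tutorial, stopWords comes from NLTK.

```python
# Stand-in for the NLTK stop word set (hypothetical sample, not the real list).
stopWords = {"the", "is", "are", "a", "and"}

# Number of stop words in the set.
print(len(stopWords))

# Contents of the set, sorted because sets have no fixed order.
print(sorted(stopWords))
```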
We create a new list called wordsFiltered, which contains all words that are not stop words.
To create it, we iterate over the list of words and add each word only if it is not in stopWords.
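The filtering step can be sketched on its own as follows. The token list and the three-word stop set are hypothetical stand-ins chosen so the loop runs without NLTK; with the real data, words would come from the tokenizer and stopWords from the NLTK corpus.

```python
# Hypothetical stand-ins for the tokenizer output and the NLTK stop word set.
words = ["All", "work", "and", "no", "play", "makes", "jack", "a", "dull", "boy", "."]
stopWords = {"a", "and", "no"}

# Add a word to the result only if it is not a stop word.
wordsFiltered = []
for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)

print(wordsFiltered)
# ['All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.']

# The same filter written as a list comprehension.
wordsFilteredShort = [w for w in words if w not in stopWords]
```

Keeping stopWords as a set rather than a list makes each `not in` check a constant-time lookup on average, which matters once the text is long.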