Natural language processing (NLP) is a research field that presents many challenges, such as natural language understanding.
Text may contain stop words such as 'the', 'is', and 'are'. Stop words can be filtered out before the text is processed. There is no universal list of stop words in NLP research, but the NLTK module ships with one.
In this article you will learn how to remove stop words with the NLTK module.
Natural Language Processing: remove stop words
We start with the code from the previous tutorial, which tokenized words.
from nltk.tokenize import sent_tokenize, word_tokenize

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
words = word_tokenize(data)
print(words)
We modify it to:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."

stopWords = set(stopwords.words('english'))
words = word_tokenize(data)
wordsFiltered = []

for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)

print(wordsFiltered)
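Note that the NLTK stop words are all lowercase, while the membership test `w not in stopWords` is case-sensitive, so a capitalized token like "All" is not removed. A minimal sketch of this behavior, using a hypothetical hard-coded stop-word set in place of NLTK's list so it runs standalone:

```python
# Hypothetical mini stop-word set standing in for NLTK's English list.
stop_words = {"all", "and", "no", "a"}
words = ["All", "work", "and", "no", "play"]

# Direct membership test is case-sensitive: "All" != "all", so it survives.
filtered = [w for w in words if w not in stop_words]

# Lowercasing each token before the test removes "All" as well.
filtered_ci = [w for w in words if w.lower() not in stop_words]

print(filtered)     # capitalized stop words kept
print(filtered_ci)  # capitalized stop words removed
```

Whether you want case-insensitive filtering depends on your task; lowercasing the tokens first is the usual approach.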
A new module has been imported:

from nltk.corpus import stopwords

If the stop words corpus is not yet on your machine, download it once with nltk.download('stopwords').
We get a set of English stop words using the line:
stopWords = set(stopwords.words('english'))
stopwords.words('english') returns a list of stop words; on my computer it contains 153 entries. Wrapping it in set() gives stopWords fast membership lookups.
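Wrapping the list in a set also collapses any duplicates and makes each `in` test constant-time, which matters when filtering long texts. A small standalone sketch, using a made-up word list instead of the NLTK corpus:

```python
# Stand-in for stopwords.words('english'), which returns a plain list.
word_list = ["the", "is", "are", "the"]

# set() drops the duplicate "the" and gives O(1) membership tests.
stop_set = set(word_list)

print("is" in stop_set)  # membership test used by the filter loop
print(len(stop_set))     # duplicates collapsed
```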
You can view the size or contents of this set with the lines:

print(len(stopWords))
print(stopWords)
We create a new list called wordsFiltered which contains all words that are not stop words.
To build it, we iterate over the list of words and append each word only if it is not in the stopWords set.
for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)
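The append loop above can also be written as a single list comprehension. Sketched here with a hypothetical hard-coded stop-word set and whitespace splitting standing in for word_tokenize, so the snippet runs without NLTK:

```python
# Hypothetical stop-word set; with NLTK you would use
# set(stopwords.words('english')) instead.
stop_words = {"and", "no", "a"}

data = "All work and no play makes jack dull boy."

# str.split is a crude stand-in for word_tokenize (it keeps "boy."
# as one token instead of separating the period).
words = data.split()

# Same filter as the loop, expressed as a list comprehension.
words_filtered = [w for w in words if w not in stop_words]

print(words_filtered)
```

Both forms produce the same list; the comprehension is the more idiomatic Python style for a simple filter.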