NLTK stop words

Stop words are common words like ‘the’, ‘and’, ‘I’, etc. that are very frequent in text, and so don’t convey insights into the specific topic of a document. We can remove these stop words from the text in a given corpus to clean up the data, and identify words that are more rare and potentially more relevant to what we’re interested in.
Text may contain stop words such as ‘the’, ‘is’, ‘are’. Stop words can be filtered from the text to be processed. There is no universal list of stop words in NLP research; however, the NLTK module contains one.
In this article you will learn how to remove stop words with the NLTK module.
Related course
Natural Language Processing: remove stop words
We start with the code from the previous tutorial, which tokenized words.
The stop words are words that are very common but don’t provide useful information for most text analysis procedures.
While such words help form the structure of sentences, they do not help you understand their semantics. Here’s a list of some of the most commonly used words in English:
N = ['stop', 'the', 'to', 'and', 'a', 'in', 'it', 'is', 'I', 'that', 'had', 'on', 'for', 'were', 'was']
With NLTK you don’t have to define every stop word manually: the module ships with ready-made lists of these frequently used, low-information words.
By default, NLTK (Natural Language Toolkit) includes a list of English stop words, including: “a”, “an”, “the”, “of”, “in”, etc.
The stop words in NLTK are the most common words in data. They are words that you do not want to use to describe the topic of your content. The lists are pre-defined, but since they are returned as plain Python lists you can extend or trim them for your own task.
Getting rid of stop words makes sense for many Natural Language Processing tasks. In the code below you will see how to remove these stop words from your texts.
First let’s import a few packages that we will need:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

The last import is key here: it contains all the stop words.
This gives access to the list of lexical stop words in English; that is, words that are usually filtered out of the text before further natural language processing.
NLTK Stopword List
So stop words are words that are very common in human language but carry little topical meaning, such as “the”, “of”, and “to”.
If you get the error NLTK stop words not found, make sure to download the stop words after installing nltk.
import nltk
nltk.download('stopwords')
You can view the list of included stop words in NLTK using the code below:
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))
You can do this for other languages too; just pass the language you need:
stops = set(stopwords.words('german'))
Filter stop words nltk
We will use a string (data) as the text. Of course you can also use a text file as input:
text = open("shakespeare.txt").read().lower()
The program below filters stop words from the data.
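A minimal sketch of that program (the sample sentence is just an illustrative placeholder):

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

data = "All work and no play makes Jack a dull boy."  # illustrative sample text

stopWords = set(stopwords.words('english'))
words = word_tokenize(data)

wordsFiltered = []
for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)

print(wordsFiltered)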
A module has been imported:
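from nltk.corpus import stopwords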
We get a set of English stop words using the line:
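stopWords = set(stopwords.words('english'))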
The returned set stopWords contains 153 stop words on my computer (the exact number can vary between NLTK versions).
You can view the length or contents of this set with the lines:
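print(len(stopWords))
print(stopWords)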
We create a new list called wordsFiltered which contains all words that are not stop words.
To create it we iterate over the list of words and only add a word if it is not in the stopWords set.
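wordsFiltered = []
for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)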
sent_tokenize

In this article you will learn how to tokenize data (by words and sentences).
Related course:
Easy Natural Language Processing (NLP) in Python
Install NLTK
Install NLTK with Python 2.x using:
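sudo pip install nltk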
Install NLTK with Python 3.x using:
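sudo pip3 install nltk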
Installation is not complete after these commands. Open Python and type:
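import nltk
nltk.download()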
A graphical interface will be presented:
Click all and then click download. It will download all the required packages, which may take a while; the bar at the bottom shows the progress.
Tokenize words
A sentence or data can be split into words using the method word_tokenize():
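For example (the sample sentence is illustrative):

from nltk.tokenize import word_tokenize

data = "All work and no play makes jack a dull boy, all work and no play"
print(word_tokenize(data))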
This will output:
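['All', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull', 'boy', ',', 'all', 'work', 'and', 'no', 'play']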
All of them are words except the comma. Special characters are treated as separate tokens.
Tokenizing sentences
The same principle can be applied to sentences. Simply change word_tokenize() to sent_tokenize().
We have added two sentences to the variable data:
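from nltk.tokenize import sent_tokenize

data = "All work and no play makes jack a dull boy. All work and no play makes jack a dull boy."
print(sent_tokenize(data))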
Outputs:
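['All work and no play makes jack a dull boy.', 'All work and no play makes jack a dull boy.']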
NLTK and arrays
If you wish, you can store the words and sentences in lists:
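from nltk.tokenize import sent_tokenize, word_tokenize

data = "All work and no play makes jack a dull boy. All work and no play makes jack a dull boy."
phrases = sent_tokenize(data)
words = word_tokenize(data)

print(phrases)
print(words)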
nltk stemming
A word stem is the base part of a word. Stemming is a kind of normalization, but a linguistic one.
For example, the stem of the word waiting is wait.

Given words, NLTK can find the stems.
Related course
Easy Natural Language Processing (NLP) in Python
NLTK - stemming
Start by defining some words:
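words = ["game", "gaming", "gamed", "games"]  # any example words will do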
We import the module:
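from nltk.stem import PorterStemmer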
And stem the words in the list using:
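ps = PorterStemmer()
for word in words:
    print(ps.stem(word))  # prints "game" four times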

You can do word stemming for sentences too:
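A sketch (the sentence is illustrative): tokenize the sentence first, then stem each word:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
sentence = "gaming, the gamers play games"
words = word_tokenize(sentence)
for word in words:
    print(word + ":" + ps.stem(word))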

There are more stemming algorithms, but Porter (PorterStemmer) is the most popular.
nltk tags
The NLTK module can automatically tag parts of speech.
Given a sentence or paragraph, it can label words such as verbs, nouns and so on.
NLTK - speech tagging example
The example below automatically tags words with a corresponding class.
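A minimal sketch (the example sentence is an illustrative assumption; pos_tag needs the punkt and averaged_perceptron_tagger resources):

import nltk

# requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
document = "Whether you are new to programming or an experienced developer, Python is easy to learn."
tokens = nltk.word_tokenize(document)
tagged = nltk.pos_tag(tokens)
print(tagged)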
Related course
Easy Natural Language Processing (NLP) in Python
This will output a tuple for each word, where the second element of the tuple is the word class (tag).
The meanings of these speech codes follow the Penn Treebank tag set; some common ones are:
- NN: noun, singular
- NNS: noun, plural
- NNP: proper noun, singular
- VB: verb, base form
- VBD: verb, past tense
- VBG: verb, gerund or present participle
- JJ: adjective
- RB: adverb
- DT: determiner
We can filter this data based on the type of word:
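One way to do this (a sketch building on the tagged list above) is to keep only words whose tag starts with a given prefix:

# keep only the nouns: Penn Treebank noun tags all start with 'NN'
nouns = [word for (word, tag) in tagged if tag.startswith('NN')]
print(nouns)  # exact output depends on the tagger model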
which outputs only the words of the selected class. The classes include past and present tense verbs, singular and plural nouns, adjectives and more. Using this technique we can quickly derive meaning from a text.
python prediction
We can use natural language processing to make predictions.
Example: Given a product review, a computer can predict if it's positive or negative based on the text.
In this article you will learn how to make a prediction program based on natural language processing.
Related course: Natural Language Processing with Python
nlp prediction example
Given a name, the classifier will predict if it’s a male or female.
To create our analysis program, we have several steps:
- Data preparation
- Feature extraction
- Training
- Prediction
Data preparation
The first step is to prepare data.
We use the names corpus included with NLTK.
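A sketch of the data preparation (the variable name labeled_names is mine; the corpus needs nltk.download('names')):

import random
from nltk.corpus import names

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)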
This dataset is simply a collection of tuples. To give you an idea of what the dataset looks like:
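[('Aaron', 'male'), ('Andrew', 'male'), ..., ('Zoe', 'female'), ...]

(an illustrative excerpt; the exact entries and order depend on the corpus and the shuffle)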
You can define your own set of tuples if you wish; it's simply a list containing many tuples.
Feature extraction
Based on the dataset, we prepare our features. The feature we will use is the last letter of a name.
We define a featureset using:
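# gender_features is defined in the next snippet
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]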
and the features (last letters) are extracted using:
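def gender_features(word):
    return {'last_letter': word[-1]}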
Training and prediction
We train and predict using:
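import nltk

train_set = featuresets
classifier = nltk.NaiveBayesClassifier.train(train_set)

# predict the gender of a new name from its last letter
print(classifier.classify(gender_features('Frank')))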
Example
A classifier has a training and a test phase.
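Putting it all together (a sketch; the name at the end is just an example):

import random
import nltk
from nltk.corpus import names

def gender_features(word):
    return {'last_letter': word[-1]}

# data preparation: requires nltk.download('names')
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

# feature extraction and training
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set = featuresets
classifier = nltk.NaiveBayesClassifier.train(train_set)

# prediction
name = 'Frank'
print(classifier.classify(gender_features(name)))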
If you want to give the name during runtime, change the last line to:
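name = input("Enter a name: ")
print(classifier.classify(gender_features(name)))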
For Python 2, use raw_input.
sentiment analysis python
Sentiment Analysis
In Natural Language Processing there is a concept known as Sentiment Analysis.
Given a movie review or a tweet, it can be automatically classified in categories.
These categories can be user defined (positive, negative) or whichever classes you want.
(Figure: sentiment analysis, example flow)
Sentiment Analysis Example
Classification is done in two steps: training and prediction.
The training phase needs training data: labeled examples that define each class. The classifier will use the training data to make predictions.

We start by defining 3 classes: positive, negative and neutral.
Each of these is defined by a vocabulary:
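For example (these tiny vocabularies are illustrative):

positive_vocab = ['awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)']
negative_vocab = ['bad', 'terrible', 'useless', 'hate', ':(']
neutral_vocab = ['movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know', 'words', 'not']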
Every word is converted into a feature using a simplified bag of words model:
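def word_feats(words):
    return dict([(word, True) for word in words])

# each vocabulary word is wrapped in a list so that the whole word,
# not its individual characters, becomes a feature
positive_features = [(word_feats([w]), 'pos') for w in positive_vocab]
negative_features = [(word_feats([w]), 'neg') for w in negative_vocab]
neutral_features = [(word_feats([w]), 'neu') for w in neutral_vocab]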
Our training set is then the sum of these three feature sets:
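train_set = negative_features + positive_features + neutral_features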
We train the classifier:
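from nltk.classify import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(train_set)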
And make predictions.
Code example
This example classifies sentences according to the training set.
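A complete sketch (the input sentence is illustrative, and the word-by-word vote is one simple way to score a sentence):

from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    return dict([(word, True) for word in words])

# tiny illustrative vocabularies
positive_vocab = ['awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)']
negative_vocab = ['bad', 'terrible', 'useless', 'hate', ':(']
neutral_vocab = ['movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know', 'words', 'not']

positive_features = [(word_feats([w]), 'pos') for w in positive_vocab]
negative_features = [(word_feats([w]), 'neg') for w in negative_vocab]
neutral_features = [(word_feats([w]), 'neu') for w in neutral_vocab]

train_set = negative_features + positive_features + neutral_features
classifier = NaiveBayesClassifier.train(train_set)

# classify each word of the sentence and tally the votes
sentence = "Awesome movie, I liked it"
words = sentence.lower().split()

pos = sum(1 for word in words if classifier.classify(word_feats([word])) == 'pos')
neg = sum(1 for word in words if classifier.classify(word_feats([word])) == 'neg')

print('Positive: ' + str(float(pos) / len(words)))
print('Negative: ' + str(float(neg) / len(words)))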
To enter the input sentence manually, use the input function (or raw_input in Python 2).
The better your training data is, the more accurate your predictions will be. In this example our training data is very small.
Training sets
There are many training sets available online. A good dataset will increase the accuracy of your classifier.