How to remove stop words using NLTK or Python

Python, NLTK, Stop Words

Python Problem Overview


So I have a dataset that I would like to remove stop words from using

stopwords.words('english')

I'm struggling with how to use this within my code to simply remove these words. I already have a list of the words from this dataset; the part I'm struggling with is comparing against this list and removing the stop words. Any help is appreciated.

Python Solutions


Solution 1 - Python

from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
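
Note that stopwords.words('english') is re-evaluated for every word in the comprehension above. If word_list is large, it is usually faster to build the stop word set once and test membership against it; a minimal sketch (word_list is assumed to already exist):

from nltk.corpus import stopwords

stop_set = set(stopwords.words('english'))  # build the stop word set once; membership tests are O(1)
filtered_words = [word for word in word_list if word not in stop_set]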

Solution 2 - Python

You could also do a set difference, for example:

list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
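
For illustration, here is how the one-liner might be called end to end; sentence and pattern are not defined in the answer, so the values below are assumptions. Keep in mind that the set difference drops duplicate tokens, does not preserve word order, and the comparison is case-sensitive:

import nltk
from nltk.corpus import stopwords

sentence = "this is a sample sentence showing off the stop words filtration"  # hypothetical input
pattern = r'\s+'  # hypothetical pattern: with gaps=True, regex matches are treated as separators
tokens = nltk.regexp_tokenize(sentence, pattern, gaps=True)
filtered = list(set(tokens) - set(stopwords.words('english')))
print(filtered)  # unique non-stop-word tokens, in arbitrary order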

Solution 3 - Python

To exclude all types of stop words, including the NLTK stop words, you could do something like this:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))        # about 900 stop words
nltk_words = list(stopwords.words('english'))  # about 150 stop words
stop_words.extend(nltk_words)

output = [w for w in word_list if w not in stop_words]
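
A small, hypothetical word_list illustrates the effect of the combined list (reusing stop_words from the snippet above):

word_list = ['this', 'is', 'a', 'sample', 'sentence']  # hypothetical input
print([w for w in word_list if w not in stop_words])  # the function words are removed, leaving ['sample', 'sentence']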

Solution 4 - Python

I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:

from nltk.corpus import stopwords

filtered_word_list = word_list[:]  # make a copy of word_list
for word in word_list:  # iterate over the original list
    if word in stopwords.words('english'):
        filtered_word_list.remove(word)  # remove the stop word from the copy

Solution 5 - Python

There's a very simple, lightweight Python package called stop-words made just for this purpose.

First install the package using: pip install stop-words

Then you can remove your words in one line using a list comprehension:

from stop_words import get_stop_words

filtered_words = [word for word in dataset if word not in get_stop_words('english')]

This package is very lightweight to download (unlike nltk), works for both Python 2 and Python 3, and has stop words for many other languages (a short sketch for one of them follows the list below):

    Arabic
    Bulgarian
    Catalan
    Czech
    Danish
    Dutch
    English
    Finnish
    French
    German
    Hungarian
    Indonesian
    Italian
    Norwegian
    Polish
    Portuguese
    Romanian
    Russian
    Spanish
    Swedish
    Turkish
    Ukrainian
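
For example, the same one-liner works for any of the languages above; a hedged German sketch, where the sample words are assumptions chosen for illustration:

from stop_words import get_stop_words

german_words = ['das', 'ist', 'ein', 'kleines', 'Beispiel']  # hypothetical input
filtered = [w for w in german_words if w not in get_stop_words('german')]
print(filtered)  # the German function words are removed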

Solution 6 - Python

Use the textcleaner library to remove stop words from your data.

Follow this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds

Follow these steps to do so with this library.

pip install textcleaner

After installing:

import textcleaner as tc
data = tc.document(<file_name>) 
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default

Use the above code to remove the stop words.
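
Since the comment above says a list of sentences can also be passed to the document constructor, a hedged sketch of that variant (assuming inplace=False means the cleaned data is returned rather than modified in place):

import textcleaner as tc

data = tc.document(['blue car and blue window', 'black crow in the window'])  # hypothetical sentences
cleaned = data.remove_stpwrds()  # with inplace=False (the default), the result is returned
print(cleaned)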

Solution 7 - Python

Here is my take on this, in case you want to immediately get the answer into a string (instead of a list of filtered words):

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in STOPWORDS])  # delete stop words from text
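
For example, with a small hypothetical input string (the comparison is case-sensitive, so lowercasing the text first may be useful):

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
text = "this is a sample sentence showing off stop word filtration"  # hypothetical input
print(' '.join([word for word in text.split() if word not in STOPWORDS]))  # stop words removed, word order preserved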

Solution 8 - Python

You can use this function; note that you need to lowercase all the words first:

from nltk.corpus import stopwords

def remove_stopwords(word_list):
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lowercased already
        if word not in stopwords.words("english"):
            processed_word_list.append(word)
    return processed_word_list
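
A quick hypothetical call shows that mixed-case stop words are caught, because every word is lowercased before the check:

words = ['This', 'is', 'a', 'Sample', 'sentence']  # hypothetical input
print(remove_stopwords(words))  # ['sample', 'sentence']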

Solution 9 - Python

Using filter():

from nltk.corpus import stopwords
# ...  
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))

Solution 10 - Python

Although the question is a bit old, here is a newer library worth mentioning that can do some extra tasks.

In some cases, you don't want only to remove stop words. Rather, you want to find the stop words in the text data and store them in a list, so that you can see where the noise is in the data and inspect it more interactively.

The library is called 'textfeatures'. You can use it as follows:

! pip install textfeatures
import textfeatures as tf
import pandas as pd

For example, suppose you have the following set of strings:

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window"]

df = pd.DataFrame(texts) # Convert to a dataframe
df.columns = ['text'] # give a name to the column
df

Now, call the stopwords() function and pass the parameters you want:

tf.stopwords(df,"text","stopwords") # extract stop words into a new 'stopwords' column
df[["text","stopwords"]].head() # show the text alongside the extracted stop words

The result is going to be:

 	text 	                             stopwords
0 	blue car and blue window 	         [and]
1 	black crow in the window 	         [in, the]
2 	i see my reflection in the window 	 [i, my, in, the]

As you can see, the last column contains the stop words found in that document (record).

Solution 11 - Python

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# Option 1: list comprehension
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Option 2: the equivalent explicit loop
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

Solution 12 - Python

I will show you an example. First, I extract the text data from the data frame (twitter_df) for further processing, as follows:

     tweetText = twitter_df['text']

Then, to tokenize the text, I use the following method:

     from nltk.tokenize import word_tokenize
     tweetText = tweetText.apply(word_tokenize)

Then, to remove stop words,

     import nltk
     from nltk.corpus import stopwords
     nltk.download('stopwords')

     stop_words = set(stopwords.words('english'))
     tweetText = tweetText.apply(lambda x: [word for word in x if word not in stop_words])
     tweetText.head()

I think this will help you.

Solution 13 - Python

In case your data are stored as a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stop word list by default.

import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])
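
If df does not exist yet, a minimal hypothetical setup looks like this (the column name 'text' and the sample rows are assumptions):

import pandas as pd
import texthero as hero

df = pd.DataFrame({'text': ['blue car and blue window', 'black crow in the window']})  # hypothetical data
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])
print(df['text_without_stopwords'])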

Solution 14 - Python

print("enter the string from which you want to remove the stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]   # a small custom stop word list
another_list = []
for x in userstring:
    if x not in stop_list:          # keep only words that are not in the stop list
        another_list.append(x)      # it is also possible to use .remove instead
for x in another_list:
    print(x, end=' ')

# 2) An alternative that uses .remove; note that removing items from userstring
#    while iterating over it can skip adjacent stop words (see the safer sketch below)
print("enter the string from which you want to remove the stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]
for x in userstring:
    if x in stop_list:
        userstring.remove(x)
for x in userstring:
    print(x, end=' ')
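
A small sketch of the safer variant mentioned above, which iterates over a copy of the list so that .remove() cannot skip a stop word that immediately follows another (the sample input is hypothetical):

words = "the cat sat in the garden".split(" ")  # hypothetical input
stop_list = ["a", "an", "the", "in"]
for x in words[:]:  # words[:] is a shallow copy of the list
    if x in stop_list:
        words.remove(x)
print(' '.join(words))  # cat sat garden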

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type         | Original Author  | Original Content on Stackoverflow
Question             | Alex             | View Question on Stackoverflow
Solution 1 - Python  | Daren Thomas     | View Answer on Stackoverflow
Solution 2 - Python  | David Lemphers   | View Answer on Stackoverflow
Solution 3 - Python  | sumitjainjr      | View Answer on Stackoverflow
Solution 4 - Python  | das_weezul       | View Answer on Stackoverflow
Solution 5 - Python  | user_3pij        | View Answer on Stackoverflow
Solution 6 - Python  | Yugant Hadiyal   | View Answer on Stackoverflow
Solution 7 - Python  | justadev         | View Answer on Stackoverflow
Solution 8 - Python  | Mohammed_Ashour  | View Answer on Stackoverflow
Solution 9 - Python  | Saeid BK         | View Answer on Stackoverflow
Solution 10 - Python | Taie             | View Answer on Stackoverflow
Solution 11 - Python | H M              | View Answer on Stackoverflow
Solution 12 - Python | Yasuni Chamodya  | View Answer on Stackoverflow
Solution 13 - Python | Jonathan Besomi  | View Answer on Stackoverflow
Solution 14 - Python | Muhammad Yusuf   | View Answer on Stackoverflow