Faster way to remove stop words in Python
Tags: Python, Regex, Stop Words

Python Problem Overview
I am trying to remove stopwords from a string of text:
from nltk.corpus import stopwords
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in (stopwords.words('english'))])
I am processing 6 million such strings, so speed is important. Profiling my code shows that the slowest part is the lines above. Is there a better way to do this? I'm thinking of using something like regex's re.sub,
but I don't know how to write the pattern for a set of words. Can someone give me a hand? I'm also happy to hear about other, possibly faster, methods.
Note: I tried someone's suggestion of wrapping stopwords.words('english') with set(), but that made no difference.
Thank you.
Python Solutions
Solution 1 - Python
Try caching the stopwords object, as shown below. Constructing this each time you call the function seems to be the bottleneck.
from nltk.corpus import stopwords
cachedStopWords = stopwords.words("english")
def testFuncOld():
    text = 'hello bye the the hi'
    text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

def testFuncNew():
    text = 'hello bye the the hi'
    text = ' '.join([word for word in text.split() if word not in cachedStopWords])

if __name__ == "__main__":
    for i in range(10000):  # xrange in Python 2
        testFuncOld()
        testFuncNew()
I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.
nCalls Cumulative Time
10000 7.723 words.py:7(testFuncOld)
10000 0.140 words.py:11(testFuncNew)
So, caching the stopwords instance gives a ~70x speedup.
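Combining this caching with the set() the question mentions also makes each membership test O(1). A minimal sketch of the cached version, using a small hard-coded stopword list so it runs without the NLTK corpus download:

```python
# Sketch: build the stopword set once, then reuse it for every string.
# The hard-coded list below is a stand-in for stopwords.words("english").
cached_stop_words = set(["the", "a", "an", "is", "and"])

def remove_stop_words(text):
    # Set membership is O(1), vs an O(n) scan of a list.
    return " ".join(word for word in text.split()
                    if word not in cached_stop_words)

print(remove_stop_words("hello bye the the hi"))  # hello bye hi
```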
Solution 2 - Python
Use a regexp to remove all words which do not match:
import re
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
text = pattern.sub('', text)
This will probably be way faster than looping yourself, especially for large input strings.
If the last word in the text is a stopword, the substitution can leave trailing whitespace; I'd suggest handling that separately.
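To illustrate, a .strip() after the substitution takes care of the trailing whitespace; the short word list below is a stand-in for stopwords.words('english'):

```python
import re

# Stand-in stopword list (stopwords.words('english') needs an NLTK download).
stop_words = ["the", "a", "an"]
pattern = re.compile(r'\b(' + '|'.join(stop_words) + r')\b\s*')

text = pattern.sub('', 'hi bye the')
print(repr(text))          # 'hi bye ' -- trailing space left behind
print(repr(text.strip()))  # 'hi bye'
```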
Solution 3 - Python
Sorry for the late reply; this may prove useful for new users.
- Create a dictionary of stopwords using the collections library.
- Use that dictionary for very fast lookups (time = O(1)) rather than searching a list (time = O(number of stopwords)).

from collections import Counter
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)
text = ' '.join([word for word in text.split() if word not in stopwords_dict])
Solution 4 - Python
First, you're creating the stop words for each string. Create them once. A set would indeed be great here:

forbidden_words = set(stopwords.words('english'))

Later, get rid of the [] inside join and use a generator instead. Replace

' '.join([x for x in ['a', 'b', 'c']])

with

' '.join(x for x in ['a', 'b', 'c'])

The next thing to deal with would be making .split() yield values instead of returning an array, although I believe s.split() is actually fast. (See this thread for why regex would be a good replacement here.)
Lastly, do such a job in parallel (removing stop words in 6m strings). That is a whole different topic.
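A rough sketch of that parallel step with multiprocessing.Pool (the worker count and chunksize here are placeholders to tune, and the stopword set is a stand-in for the NLTK list):

```python
from multiprocessing import Pool

STOP_WORDS = frozenset(["the", "a", "an", "is"])  # stand-in for the NLTK list

def strip_stop_words(text):
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

if __name__ == "__main__":
    docs = ["hello bye the the hi", "a cat is here"] * 100
    with Pool(processes=4) as pool:  # placeholder worker count
        cleaned = pool.map(strip_stop_words, docs, chunksize=10)
    print(cleaned[0])  # hello bye hi
```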
Solution 5 - Python
Try avoiding the loop and instead use a regex to remove the stopwords:
import re
from nltk.corpus import stopwords
cachedStopWords = stopwords.words("english")
pattern = re.compile(r'\b(' + r'|'.join(cachedStopWords) + r')\b\s*')
text = pattern.sub('', text)
Solution 6 - Python
Using just a regular dict seems to be the fastest solution by far, surpassing even the Counter solution by about 10%.
from nltk.corpus import stopwords
stopwords_dict = {word: 1 for word in stopwords.words("english")}
text = 'hello bye the the hi'
text = " ".join([word for word in text.split() if word not in stopwords_dict])
Tested using the cProfile profiler
You can find the test code used here: https://gist.github.com/maxandron/3c276924242e7d29d9cf980da0a8a682
EDIT:
On top of that, if we replace the list comprehension with a loop, we get another 20% increase in performance:
from nltk.corpus import stopwords
stopwords_dict = {word: 1 for word in stopwords.words("english")}
text = 'hello bye the the hi'
new = ""
for word in text.split():
    if word not in stopwords_dict:
        new += word + " "  # keep a separator, otherwise the words run together
text = new.strip()
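The percentages above come from cProfile; a quick way to reproduce this kind of comparison yourself is timeit. A sketch with a stand-in stopword dict (actual numbers will vary by machine and Python version):

```python
import timeit

setup = '''
stopwords_dict = {"the": 1, "a": 1, "an": 1}
text = "hello bye the the hi " * 100
'''
comprehension = '" ".join([w for w in text.split() if w not in stopwords_dict])'
loop = '''
new = ""
for w in text.split():
    if w not in stopwords_dict:
        new += w + " "
new = new.strip()
'''
t_comp = timeit.timeit(comprehension, setup=setup, number=1000)
t_loop = timeit.timeit(loop, setup=setup, number=1000)
print(t_comp, t_loop)
```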