Add/remove custom stop words with spacy
PythonNlpStop WordsSpacyPython Problem Overview
What is the best way to add/remove stop words with spacy? I am using token.is_stop
function and would like to make some custom changes to the set. I was looking at the documentation but could not find anything regarding of stop words. Thanks!
Python Solutions
Solution 1 - Python
Using Spacy 2.0.11, you can update its stopwords set using one of the following:
To add a single stopword:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")
To add several stopwords at once:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words |= {"my_new_stopword1","my_new_stopword2",}
To remove a single stopword:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words.remove("whatever")
To remove several stopwords at once:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words -= {"whatever", "whenever"}
Note: To see the current set of stopwords, use:
print(nlp.Defaults.stop_words)
Update : It was noted in the comments that this fix only affects the current execution. To update the model, you can use the methods nlp.to_disk("/path")
and nlp.from_disk("/path")
(further described at https://spacy.io/usage/saving-loading).
Solution 2 - Python
You can edit them before processing your text like this (see this post):
>>> import spacy
>>> nlp = spacy.load("en")
>>> nlp.vocab["the"].is_stop = False
>>> nlp.vocab["definitelynotastopword"].is_stop = True
>>> sentence = nlp("the word is definitelynotastopword")
>>> sentence[0].is_stop
False
>>> sentence[3].is_stop
True
Note: This seems to work <=v1.8. For newer versions, see other answers.
Solution 3 - Python
For version 2.0 I used this:
from spacy.lang.en.stop_words import STOP_WORDS
print(STOP_WORDS) # <- set of Spacy's default stop words
STOP_WORDS.add("your_additional_stop_word_here")
for word in STOP_WORDS:
lexeme = nlp.vocab[word]
lexeme.is_stop = True
This loads all stop words into a set.
You can amend your stop words to STOP_WORDS
or use your own list in the first place.
Solution 4 - Python
For 2.0 use the following:
for word in nlp.Defaults.stop_words:
lex = nlp.vocab[word]
lex.is_stop = True
Solution 5 - Python
This collects the stop words too :)
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
Solution 6 - Python
In latest version following would remove the word out of the list:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords.remove('not')
Solution 7 - Python
For version 2.3.0 If you want to replace the entire list instead of adding or removing a few stop words, you can do this:
custom_stop_words = set(['the','and','a'])
# First override the stop words set for the language
cls = spacy.util.get_lang_class('en')
cls.Defaults.stop_words = custom_stop_words
# Now load your model
nlp = spacy.load('en_core_web_md')
The trick is to assign the stop word set for the language before loading the model. It also ensures that any upper/lower case variation of the stop words are considered stop words.