Keras Text Preprocessing - Saving Tokenizer object to file for scoring

Machine Learning, Neural Network, NLP, Deep Learning, Keras

Machine Learning Problem Overview


I've trained a sentiment classifier model using the Keras library, broadly following the steps below.

  1. Convert Text corpus into sequences using Tokenizer object/class
  2. Build a model using the model.fit() method
  3. Evaluate this model

For scoring with this model, I was able to save the model to a file and load it back from that file. However, I have not found a way to save the Tokenizer object to a file. Without this, I'll have to process the corpus every time I need to score even a single sentence. Is there a way around this?

Machine Learning Solutions


Solution 1 - Machine Learning

The most common way is to use either pickle or joblib. Here is an example of how to use pickle to save a Tokenizer:

import pickle

# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
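
For reuse at scoring time, the two steps above can be wrapped in small helpers. The following is a minimal sketch rather than a definitive implementation: `save_object`, `load_object`, and the plain dict standing in for a fitted Tokenizer are hypothetical names chosen for illustration, and a temporary directory is used so the example cleans up after itself.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted Tokenizer; any picklable object works here.
fitted_state = {"word_index": {"no": 1, "that": 2, "a": 3}}


def save_object(obj, path):
    """Serialize obj to path using the highest pickle protocol available."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_object(path):
    """Load a previously pickled object from path."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenizer.pickle")
    save_object(fitted_state, path)
    restored = load_object(path)

print(restored == fitted_state)
```

The same helpers should work for a real Tokenizer instance, since the answer above pickles it directly. Only unpickle files you trust, because loading a pickle can execute arbitrary code.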

Solution 2 - Machine Learning

The Tokenizer class has a function to save data in JSON format:

import io
import json

# to_json() returns a JSON string describing the fitted tokenizer
tokenizer_json = tokenizer.to_json()
with io.open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))

The data can be loaded using tokenizer_from_json function from keras_preprocessing.text:

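import json
from keras_preprocessing.text import tokenizer_from_json

with open('tokenizer.json', encoding='utf-8') as f:
    data = json.load(f)  # the JSON string produced by to_json()
    tokenizer = tokenizer_from_json(data)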
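
Because to_json() already returns a string, the json.dumps call in the saving snippet adds a second layer of JSON encoding, which json.load then removes on the way back. The sketch below illustrates that round trip next to a plain write/read of the same string. It uses only the standard library, and the `tokenizer_json` value is a hand-written stand-in for real to_json() output, not actual Keras data.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the string returned by tokenizer.to_json().
tokenizer_json = json.dumps({"class_name": "Tokenizer", "config": {"num_words": None}})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenizer.json")

    # Variant from the answer: json.dumps wraps the string in one more JSON layer...
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(tokenizer_json, ensure_ascii=False))
    with open(path, encoding="utf-8") as f:
        via_double_encoding = json.load(f)  # ...and json.load unwraps it again

    # Simpler variant: write and read the string unchanged.
    with open(path, "w", encoding="utf-8") as f:
        f.write(tokenizer_json)
    with open(path, encoding="utf-8") as f:
        via_plain_text = f.read()

print(via_double_encoding == via_plain_text == tokenizer_json)
```

Either variant hands tokenizer_from_json the same string. What matters is that the save and load sides agree on how many layers of encoding are used.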

Solution 3 - Machine Learning

The accepted answer clearly demonstrates how to save the tokenizer. The following is a comment on the general problem of scoring after fitting or saving. Suppose that a list texts consists of two lists, Train_text and Test_text, where the set of tokens in Test_text is a subset of the set of tokens in Train_text (an optimistic assumption). Then calling fit_on_texts(Train_text) gives different results for texts_to_sequences(Test_text) than first calling fit_on_texts(texts) and then texts_to_sequences(Test_text).

Concrete Example:

from keras.preprocessing.text import Tokenizer

docs = ["A heart that",
        "full up like",
        "a landfill",
        "no surprises",
        "and no alarms",
        "a job that slowly",
        "Bruises that",
        "You look so",
        "tired happy",
        "no alarms",
        "and no surprises"]
docs_train = docs[:9]
docs_test = docs[9:]  # hold out the last two documents
# EXPERIMENT 1: FIT TOKENIZER ONLY ON TRAIN
T_1 = Tokenizer()
T_1.fit_on_texts(docs_train)  # only train set
encoded_train_1 = T_1.texts_to_sequences(docs_train)
encoded_test_1 = T_1.texts_to_sequences(docs_test)
print("result for test 1:\n%s" % (encoded_test_1,))

# EXPERIMENT 2: FIT TOKENIZER ON BOTH TRAIN + TEST
T_2 = Tokenizer()
T_2.fit_on_texts(docs)  # both train and test set
encoded_train_2 = T_2.texts_to_sequences(docs_train)
encoded_test_2 = T_2.texts_to_sequences(docs_test)
print("result for test 2:\n%s" % (encoded_test_2,))

Results:

result for test 1:
[[3, 11], [10, 3, 9]]
result for test 2:
[[1, 6], [5, 1, 4]]

Of course, if the above optimistic assumption is not satisfied and the set of tokens in Test_text is disjoint from that of Train_text, then test 1 results in a list of empty lists [].
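
The reason the indices shift is that the tokenizer numbers words by frequency, so adding documents changes the ranking. The following pure-Python sketch approximates that behaviour under simplifying assumptions (lowercasing and whitespace splitting only, no punctuation filtering, no num_words or oov_token). `build_word_index` and `encode` are hypothetical helpers, not Keras functions.

```python
def build_word_index(texts):
    """Rank words by descending frequency; ties keep first-seen order."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so equally frequent words stay in insertion order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def encode(texts, word_index):
    """Map words to indices, silently dropping words missing from word_index."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


train = ["no surprises", "and no alarms"]
test = ["no alarms"]

index_train = build_word_index(train)        # fitted on train only
index_all = build_word_index(train + test)   # fitted on train + test

print(encode(test, index_train))             # [[1, 4]]
print(encode(test, index_all))               # [[1, 2]]
print(encode(["tired happy"], index_train))  # [[]]
```

In this toy data, "alarms" moves from index 4 to index 2 once the test sentence is included in fitting, which is the same effect as in the experiment above. This is why the tokenizer saved for scoring should be the one the model was trained with.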

Solution 4 - Machine Learning

I've created the issue https://github.com/keras-team/keras/issues/9289 in the Keras repo. Until the API is changed, the issue has a link to a gist with code that demonstrates how to save and restore a tokenizer without having the original documents the tokenizer was fit on. I prefer to store all my model information in a JSON file (because reasons, but mainly a mixed JS/Python environment), and this allows for that, even with sort_keys=True.
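
As a rough illustration of the idea (this is not the code from the gist), the word-to-index mapping can be stored as an ordinary JSON object alongside other model information. Because each word carries its index explicitly, reordering the keys with sort_keys=True does not change the mapping. The `word_index` dict and the `model_info` layout below are hypothetical stand-ins.

```python
import json

# Hypothetical stand-in for tokenizer.word_index after fitting.
word_index = {"no": 1, "that": 2, "a": 3, "surprises": 4}

# Hypothetical layout for a JSON file holding all model information.
model_info = {"model_name": "sentiment-v1", "word_index": word_index}

# sort_keys=True reorders keys alphabetically; the word -> index pairs are unchanged.
payload = json.dumps(model_info, sort_keys=True)

restored_index = json.loads(payload)["word_index"]
# Rebuild the reverse lookup from the restored mapping.
restored_index_word = {index: word for word, index in restored_index.items()}

print(restored_index == word_index)
```

How the restored mapping is attached to a tokenizer at scoring time depends on the Keras version in use, so that step is left out of this sketch.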

Solution 5 - Machine Learning

I found the following snippet, provided by @thusv89.

Save objects:

import pickle

with open('data_objects.pickle', 'wb') as handle:
    pickle.dump(
        {'input_tensor': input_tensor, 
         'target_tensor': target_tensor, 
         'inp_lang': inp_lang,
         'targ_lang': targ_lang,
        }, handle, protocol=pickle.HIGHEST_PROTOCOL)

Load objects:

with open('data_objects.pickle', 'rb') as f:  # same file name as used when saving
    data = pickle.load(f)
    input_tensor = data['input_tensor']
    target_tensor = data['target_tensor']
    inp_lang = data['inp_lang']
    targ_lang = data['targ_lang']

Solution 6 - Machine Learning

This is quite easy, because the Tokenizer class provides two functions for saving and loading:

save: Tokenizer.to_json()

load: keras.preprocessing.text.tokenizer_from_json

The to_json() method calls the get_config method, which handles this:

    json_word_counts = json.dumps(self.word_counts)
    json_word_docs = json.dumps(self.word_docs)
    json_index_docs = json.dumps(self.index_docs)
    json_word_index = json.dumps(self.word_index)
    json_index_word = json.dumps(self.index_word)

    return {
        'num_words': self.num_words,
        'filters': self.filters,
        'lower': self.lower,
        'split': self.split,
        'char_level': self.char_level,
        'oov_token': self.oov_token,
        'document_count': self.document_count,
        'word_counts': json_word_counts,
        'word_docs': json_word_docs,
        'index_docs': json_index_docs,
        'index_word': json_index_word,
        'word_index': json_word_index
    }
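
Note that in the config above the vocabulary fields are themselves JSON strings, so anyone reading the file by hand has to decode them a second time, and integer keys come back as strings. The sketch below shows this with the standard library only. The `config` dict is a small hand-written example in the same shape, not real Keras output.

```python
import json

# Hypothetical config in the shape shown above (values invented for illustration).
config = {
    "num_words": None,
    "word_index": json.dumps({"no": 1, "that": 2}),
    "index_word": json.dumps({1: "no", 2: "that"}),  # int keys become strings
}

# Each vocabulary field needs its own json.loads call.
word_index = json.loads(config["word_index"])
raw_index_word = json.loads(config["index_word"])

# JSON object keys are always strings, so convert them back to integers.
index_word = {int(key): word for key, word in raw_index_word.items()}

print(word_index)
print(index_word)
```

When loading through tokenizer_from_json this decoding is handled for you; the manual version is mainly relevant when the file is consumed outside Keras, for example from JavaScript.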

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Rajkumar Kaliyaperumal | View Question on Stackoverflow
Solution 1 - Machine Learning | Marcin Możejko | View Answer on Stackoverflow
Solution 2 - Machine Learning | Max | View Answer on Stackoverflow
Solution 3 - Machine Learning | Quetzalcoatl | View Answer on Stackoverflow
Solution 4 - Machine Learning | user9170 | View Answer on Stackoverflow
Solution 5 - Machine Learning | Arun | View Answer on Stackoverflow
Solution 6 - Machine Learning | chales sandy | View Answer on Stackoverflow