Save MinMaxScaler model in sklearn

Python Problem Overview


I'm using the MinMaxScaler model in sklearn to normalize the features of a model.

import numpy as np
training_set = np.random.rand(4, 4) * 10
training_set

       [[ 6.01144787,  0.59753007,  2.0014852 ,  3.45433657],
        [ 6.03041646,  5.15589559,  6.64992437,  2.63440202],
        [ 2.27733136,  9.29927394,  0.03718093,  7.7679183 ],
        [ 9.86934288,  7.59003904,  6.02363739,  2.78294206]]


from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(training_set)
scaler.transform(training_set)


   [[ 0.49184811,  0.        ,  0.29704831,  0.15972182],
    [ 0.4943466 ,  0.52384506,  1.        ,  0.        ],
    [ 0.        ,  1.        ,  0.        ,  1.        ],
    [ 1.        ,  0.80357559,  0.9052909 ,  0.02893534]]

Now I want to use the same scaler to normalize the test set:

   [[ 8.31263467,  7.99782295,  0.02031658,  9.43249727],
    [ 1.03761228,  9.53173021,  5.99539478,  4.81456067],
    [ 0.19715961,  5.97702519,  0.53347403,  5.58747666],
    [ 9.67505429,  2.76225253,  7.39944931,  8.46746594]]

But I don't want to call scaler.fit() on the training data every time. Is there a way to save the scaler and load it later from a different file?

Python Solutions


Solution 1 - Python

Update: sklearn.externals.joblib is deprecated and has been removed from recent scikit-learn releases. Install and use the standalone joblib package instead. Please see Engineero's answer below (Solution 3), which is otherwise identical to mine.

Original answer

Even better than pickle (which can create larger files than this method), you can use joblib, which sklearn used to bundle as sklearn.externals.joblib:

import joblib  # originally: from sklearn.externals import joblib

scaler_filename = "scaler.save"
joblib.dump(scaler, scaler_filename)

# And now to load...

scaler = joblib.load(scaler_filename)
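
As a minimal end-to-end sketch of the save/load round trip, the snippet below fits a scaler, dumps it with the standalone joblib package, reloads it, and checks that both objects transform new data identically. The temporary directory, the seeded random data, and the name loaded_scaler are illustrative assumptions, not part of the original answer.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data: seeded random arrays standing in for real features
rng = np.random.default_rng(0)
training_set = rng.random((4, 4)) * 10
test_set = rng.random((4, 4)) * 10

# Fit once on the training data
scaler = MinMaxScaler()
scaler.fit(training_set)

# Save to a temporary location (an assumption; use any path you like),
# then load the scaler back as a separate object
with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_filename = os.path.join(tmp_dir, "scaler.save")
    joblib.dump(scaler, scaler_filename)
    loaded_scaler = joblib.load(scaler_filename)

# The reloaded scaler should scale unseen data exactly like the original
same_output = np.allclose(
    scaler.transform(test_set), loaded_scaler.transform(test_set)
)
print(same_output)
```

In a real project the dump and the load would typically live in different scripts, with only the file path shared between them.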

Solution 2 - Python

So I'm actually not an expert with this but from a bit of research and a few helpful links, I think pickle and sklearn.externals.joblib are going to be your friends here.

The package pickle lets you save models or "dump" models to a file.

I think this link is also helpful. It talks about creating a persistence model. Something that you're going to want to try is:

# could use: import pickle... however let's do something else
import joblib  # formerly: from sklearn.externals import joblib

# this is more efficient than pickle for things like large numpy arrays
# ... which sklearn models often have.

# then just 'dump' your fitted object (here called clf) to a file
joblib.dump(clf, 'my_dope_model.pkl')

Here is where you can learn more about the sklearn externals.

Let me know if that doesn't help or I'm not understanding something about your model.

Note: sklearn.externals.joblib is deprecated. Install and use the standalone joblib package instead.

Solution 3 - Python

Just a note that sklearn.externals.joblib has been deprecated and is superseded by plain old joblib, which can be installed with pip install joblib:

import joblib
joblib.dump(my_scaler, 'scaler.gz')
my_scaler = joblib.load('scaler.gz')

Note that the file extension can be anything, but if it is one of ['.z', '.gz', '.bz2', '.xz', '.lzma'] then the corresponding compression protocol will be used. See the joblib documentation for joblib.dump() and joblib.load() for details.
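
To make the extension behaviour concrete, here is a small sketch that saves the same scaler twice: once relying on the '.gz' extension, and once requesting compression explicitly through the compress argument of joblib.dump(). The temporary directory, the random data, and the compression level 3 are assumptions chosen for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative scaler fitted on random data
my_scaler = MinMaxScaler().fit(np.random.rand(100, 4) * 10)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Compression is inferred from the '.gz' extension
    gz_path = os.path.join(tmp_dir, "scaler.gz")
    joblib.dump(my_scaler, gz_path)

    # Compression is requested explicitly, so the extension can be arbitrary
    explicit_path = os.path.join(tmp_dir, "scaler.save")
    joblib.dump(my_scaler, explicit_path, compress=("gzip", 3))

    # joblib.load() needs no extra arguments for compressed files
    from_gz = joblib.load(gz_path)
    from_explicit = joblib.load(explicit_path)

# Both reloaded scalers carry the same fitted parameters as the original
print(np.allclose(from_gz.data_min_, my_scaler.data_min_))
print(np.allclose(from_explicit.data_max_, my_scaler.data_max_))
```

For an object as small as a fitted MinMaxScaler, compression is unlikely to matter much; it becomes more relevant for models that store large arrays.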

Solution 4 - Python

You can use pickle to save the scaler:

import pickle
scalerfile = 'scaler.sav'
# the with-block closes the file once the scaler has been written
with open(scalerfile, 'wb') as f:
    pickle.dump(scaler, f)

Load it back:

import pickle
scalerfile = 'scaler.sav'
# the with-block closes the file once the scaler has been read
with open(scalerfile, 'rb') as f:
    scaler = pickle.load(f)
test_scaled_set = scaler.transform(test_set)
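
To illustrate what the reloaded scaler does with unseen data, here is a small sketch using the training and test arrays from the question. Because transform() reuses the minima and maxima learned from the training set, test values that lie outside the training range are mapped outside [0, 1]. The temporary file location and the names loaded_scaler and expected are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Training and test arrays copied from the question
training_set = np.array([
    [6.01144787, 0.59753007, 2.0014852, 3.45433657],
    [6.03041646, 5.15589559, 6.64992437, 2.63440202],
    [2.27733136, 9.29927394, 0.03718093, 7.7679183],
    [9.86934288, 7.59003904, 6.02363739, 2.78294206],
])
test_set = np.array([
    [8.31263467, 7.99782295, 0.02031658, 9.43249727],
    [1.03761228, 9.53173021, 5.99539478, 4.81456067],
    [0.19715961, 5.97702519, 0.53347403, 5.58747666],
    [9.67505429, 2.76225253, 7.39944931, 8.46746594],
])

scaler = MinMaxScaler().fit(training_set)

# Round trip through a pickle file in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    scalerfile = os.path.join(tmp_dir, "scaler.sav")
    with open(scalerfile, "wb") as f:
        pickle.dump(scaler, f)
    with open(scalerfile, "rb") as f:
        loaded_scaler = pickle.load(f)

test_scaled_set = loaded_scaler.transform(test_set)

# The same scaling written out by hand with the training min and max
train_min = training_set.min(axis=0)
train_max = training_set.max(axis=0)
expected = (test_set - train_min) / (train_max - train_min)

print(np.allclose(test_scaled_set, expected))
print(test_scaled_set.min() < 0, test_scaled_set.max() > 1)
```

Values below 0 or above 1 here are not a sign that loading went wrong; they only show that the test set contains values the training set never covered.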

Solution 5 - Python

A convenient way to do this is to put the scaler and the model in a single ML pipeline, so that both are fitted and saved together:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
import joblib  # formerly: from sklearn.externals import joblib

# YOUR_ML_MODEL is a placeholder for any scikit-learn estimator class
pipeline = make_pipeline(MinMaxScaler(), YOUR_ML_MODEL())

model = pipeline.fit(X_train, y_train)

Now you can save it to a file:

joblib.dump(model, 'filename.mod')

Later you can load it like this:

model = joblib.load('filename.mod')
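
A runnable sketch of the pipeline approach follows. LinearRegression stands in for the unspecified model in the answer, and the synthetic data, the coefficient vector, and the temporary directory are likewise assumptions made only so the example executes.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_train = rng.random((20, 4)) * 10
y_train = X_train @ np.array([1.0, -2.0, 0.5, 3.0])
X_test = rng.random((5, 4)) * 10

# The scaler is fitted as the first step of the pipeline
pipeline = make_pipeline(MinMaxScaler(), LinearRegression())
model = pipeline.fit(X_train, y_train)

# Save and reload the whole pipeline, scaler included
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "filename.mod")
    joblib.dump(model, path)
    loaded_model = joblib.load(path)

# The reloaded pipeline scales and predicts like the original
predictions_match = np.allclose(
    model.predict(X_test), loaded_model.predict(X_test)
)
print(predictions_match)

# make_pipeline names each step after its lowercased class name
fitted_scaler = loaded_model.named_steps["minmaxscaler"]
print(np.allclose(fitted_scaler.data_min_, X_train.min(axis=0)))
```

With this layout, raw test data can be passed straight to predict(), since the pipeline applies the stored scaling before the model sees it.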

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Luis Ramon Ramirez Rodriguez | View Question on Stackoverflow
Solution 1 - Python | Ivan Vegner | View Answer on Stackoverflow
Solution 2 - Python | jlarks32 | View Answer on Stackoverflow
Solution 3 - Python | Engineero | View Answer on Stackoverflow
Solution 4 - Python | Psidom | View Answer on Stackoverflow
Solution 5 - Python | PSN | View Answer on Stackoverflow