Using explicit (predefined) validation set for grid search with sklearn

Tags: python, validation, scikit-learn, cross-validation

Python Problem Overview


I have a dataset, which has previously been split into 3 sets: train, validation and test. These sets have to be used as given in order to compare the performance across different algorithms.

I would now like to optimize the parameters of my SVM using the validation set. However, I cannot find how to pass the validation set explicitly into sklearn.model_selection.GridSearchCV(). Below is some code I've previously used for K-fold cross-validation on the training set. For this problem, however, I need to use the validation set as given. How can I do that?

from sklearn import svm
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# (some code left out to simplify things; penalty_weights,
#  tuned_parameters and scorer are defined there)

skf = StratifiedKFold(n_splits=5, shuffle=True)
clf = GridSearchCV(svm.SVC(tol=0.005, cache_size=6000,
                           class_weight=penalty_weights),
                   param_grid=tuned_parameters,
                   n_jobs=2,
                   pre_dispatch="n_jobs",
                   cv=skf,
                   scoring=scorer)
clf.fit(X_train, y_train)

Python Solutions


Solution 1 - Python

Use PredefinedSplit:

from sklearn.model_selection import PredefinedSplit

ps = PredefinedSplit(test_fold=your_test_fold)

then set cv=ps in GridSearchCV. From the documentation:

> test_fold : array-like, shape (n_samples,)
>
> test_fold[i] gives the test set fold of sample i. A value of -1 indicates that the corresponding sample is not part of any test set folds, but will instead always be put into the training fold.

Also see here:

> when using a validation set, set the test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.
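
Putting those two quotes together, a minimal sketch (assuming X_train/y_train and X_val/y_val already exist as numpy arrays, and that estimator and param_grid are defined elsewhere):

import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# -1 = always in the training fold, 0 = in the single test (validation) fold
test_fold = np.concatenate([
    -np.ones(len(X_train), dtype=int),  # training samples
    np.zeros(len(X_val), dtype=int),    # validation samples
])
ps = PredefinedSplit(test_fold=test_fold)

clf = GridSearchCV(estimator, param_grid=param_grid, cv=ps)
clf.fit(np.concatenate([X_train, X_val]),
        np.concatenate([y_train, y_val]))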

Solution 2 - Python

Consider using the hypopt Python package (pip install hypopt), which I authored. It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out of the box and can be used with TensorFlow, PyTorch, Caffe2, etc. as well.

# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from sklearn.svm import SVR
from hypopt import GridSearch

param_grid = [
    {'C': [1, 10, 100], 'kernel': ['linear']},
    {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model=SVR(), param_grid=param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))

EDIT: I (think I) received -1's on this response because I'm suggesting a package that I authored. This is unfortunate, given that the package was created specifically to solve this type of problem.

Solution 3 - Python

# Import Libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import PredefinedSplit

# Split Data to Train and Validation
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8,
                                                  stratify=y, random_state=2020)

# Create a list where train data indices are -1 and validation data indices
# are 0 (assumes X is a pandas DataFrame, so that X.index is available)
split_index = [-1 if x in X_train.index else 0 for x in X.index]

# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold=split_index)

# Use PredefinedSplit in GridSearchCV
clf = GridSearchCV(estimator=estimator,
                   cv=pds,
                   param_grid=param_grid)

# Fit with all data
clf.fit(X, y)
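
If you want to confirm the split before fitting, PredefinedSplit yields exactly one (train, validation) index pair in this setup, so you can inspect it directly (a hypothetical continuation of the code above):

# PredefinedSplit yields a single train/validation index pair here
train_idx, val_idx = next(pds.split())
print(len(train_idx), "training rows,", len(val_idx), "validation rows")
assert len(val_idx) == len(X_val)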

Solution 4 - Python

To add to the original answer: when the train/validation/test split is not done with scikit-learn's train_test_split() function, i.e., the dataframes were already split manually beforehand and scaled/normalized (fitting the scaler on the training data only, so as to prevent leakage), the numpy arrays can simply be concatenated.

import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

from sklearn.model_selection import PredefinedSplit, GridSearchCV
split_index = [-1]*len(X_train) + [0]*len(X_val)
X = np.concatenate((X_train, X_val), axis=0)
y = np.concatenate((y_train, y_val), axis=0)
pds = PredefinedSplit(test_fold=split_index)

clf = GridSearchCV(estimator=estimator,
                   cv=pds,
                   param_grid=param_grid)

# Fit with all data
clf.fit(X, y)
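
After fitting, the usual GridSearchCV attributes apply. With a single predefined fold, best_score_ is simply the score on the validation set, and with the default refit=True the best estimator has been refit on the concatenated train + validation data, so it can be evaluated on the held-out test set directly (assuming y_test exists and X_test was scaled as above):

print(clf.best_params_)           # parameters that scored best on the validation set
print(clf.best_score_)            # score on the single validation fold
print(clf.score(X_test, y_test))  # best_estimator_ was refit on train + validation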

Solution 5 - Python

I wanted to provide some reproducible code that creates a validation split using the last 20% of observations.


from sklearn import datasets
from sklearn.model_selection import PredefinedSplit, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

# load data
housing = datasets.fetch_california_housing(as_frame=True)
df_train = housing.data
y = housing.target

param_grid = {"max_depth": [5, 6],
              'learning_rate': [0.03, 0.06],
              'subsample': [.5, .75]
             }

model = GradientBoostingRegressor()

# Create a single validation split
val_prop = .2
n_val_rows = round(len(df_train) * val_prop)
val_starting_index = len(df_train) - n_val_rows
cv = PredefinedSplit([-1 if i < val_starting_index else 0 for i in df_train.index])


# Use PredefinedSplit in GridSearchCV
results = GridSearchCV(estimator = model,
                       cv=cv,
                       param_grid=param_grid,
                       verbose=True,
                       n_jobs=-1)

# Fit with all data
results.fit(df_train, y)

print(results.best_params_)

Attributions

All content on this page is sourced from the original question and answers on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | pir | View Question on Stackoverflow |
| Solution 1 - Python | yangjie | View Answer on Stackoverflow |
| Solution 2 - Python | cgnorthcutt | View Answer on Stackoverflow |
| Solution 3 - Python | Vinubalan Krishnasamy | View Answer on Stackoverflow |
| Solution 4 - Python | arjunsiva | View Answer on Stackoverflow |
| Solution 5 - Python | KG in Chicago | View Answer on Stackoverflow |