Normalize data before or after split of training and testing data?


Machine Learning Problem Overview

I want to separate my data into training and test sets. Should I apply normalization to the data before or after the split? Does it make any difference when building a predictive model?

Machine Learning Solutions

Solution 1 - Machine Learning

You first need to split the data into training and test sets (a validation set can be useful too).

Don't forget that testing data points represent real-world data. Feature normalization (or data standardization) of the explanatory (or predictor) variables is a technique used to center and normalise the data by subtracting the mean and dividing by the variance. If you take the mean and variance of the whole dataset you'll be introducing future information into the training explanatory variables (i.e. the mean and variance).

Therefore, you should perform feature normalisation over the training data. Then perform normalisation on testing instances as well, but this time using the mean and variance of training explanatory variables. In this way, we can test and evaluate whether our model can generalize well to new, unseen data points.

For a more comprehensive treatment, see my article "Feature Scaling and Normalisation in a nutshell".

As an example, assuming we have the following data:

>>> import numpy as np
>>> X, y = np.arange(10).reshape((5, 2)), range(5)

where X represents our features:

>>> X
[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]

and y contains the corresponding labels:

>>> list(y)
[0, 1, 2, 3, 4]

Step 1: Create training/testing sets

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

>>> X_train
[[4 5]
 [0 1]
 [6 7]]
>>> X_test
[[2 3]
 [8 9]]
>>> y_train
[2, 0, 3]
>>> y_test
[1, 4]

Step 2: Normalise training data

>>> from sklearn import preprocessing
>>> normalizer = preprocessing.Normalizer()
>>> normalized_train_X = normalizer.fit_transform(X_train)
>>> normalized_train_X
array([[0.62469505, 0.78086881],
       [0.        , 1.        ],
       [0.65079137, 0.7592566 ]])

Step 3: Normalize testing data

>>> normalized_test_X = normalizer.transform(X_test)
>>> normalized_test_X
array([[0.5547002 , 0.83205029],
       [0.66436384, 0.74740932]])

Solution 2 - Machine Learning

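You can call fit on the training data and then use transform on both sets:

from sklearn import preprocessing

# Fit on the training data only, then reuse the fitted object for both sets.
normalizer = preprocessing.Normalizer().fit(xtrain)

xtrainnorm = normalizer.transform(xtrain)
xtestnorm = normalizer.transform(xtest)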

Solution 3 - Machine Learning

In the specific setting of a train/test split, we need to distinguish between two transformations:

  1. transformations that change the value of an observation (row) according to information about a feature (column) and
  2. transformations that change the value of an observation according to information about that observation alone.

Two common examples of (1) are mean-centering (subtracting the mean of the feature) or scaling to unit variance (dividing by the standard deviation). Subtracting the mean and dividing by the standard deviation is a common transformation. In sklearn, it is implemented in sklearn.preprocessing.StandardScaler. Importantly, this is not the same as Normalizer. See below for exhaustive detail.

An example of (2) is transforming a feature by taking the logarithm, or raising each value to a power (e.g. squaring).

Transformations of the first type are best applied to the training data, with the centering and scaling values retained and applied to the test data afterwards. This is because using information about the test set to train the model may bias model comparison metrics to be overly optimistic. This can result in over-fitting & selection of a bogus model.
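As a minimal sketch of a type (1) transformation, assuming scikit-learn's StandardScaler and a simple sequential split of the toy array used elsewhere on this page (the split and variable names are illustrative only), the centering and scaling values are learned from the training rows and then reused for the test rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data; the first 3 rows are treated as training data (an assumption for illustration).
X = np.arange(10, dtype=float).reshape((5, 2))
X_train, X_test = X[:3], X[3:]

# Learn the column means and standard deviations from the training rows only.
scaler = StandardScaler().fit(X_train)

# Apply the *training* statistics to both partitions.
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(scaler.mean_)               # column means of the training rows
print(X_train_std.mean(axis=0))   # approximately 0 for the training rows
print(X_test_std.mean(axis=0))    # generally not 0 for the test rows
```

The test rows are not centered at zero here, which is expected: they were shifted by the training means, not by their own.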

Transformations of the second type can be applied without regard to train/test splits, because the modified value of each observation depends only on the data about the observation itself, and not on any other data or observation(s).
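A small sketch of a type (2) transformation, using np.log1p as a stand-in for "taking the logarithm" (chosen here only because the toy data contains a zero), suggests that the order of splitting and transforming does not matter:

```python
import numpy as np

# Toy data with an illustrative sequential split.
X = np.arange(10, dtype=float).reshape((5, 2))
X_train, X_test = X[:3], X[3:]

# Transform everything first, then compare with transforming each part separately.
log_then_split = np.log1p(X)
split_then_log = np.vstack((np.log1p(X_train), np.log1p(X_test)))

# Each value depends only on itself, so the two routes should agree.
same = np.allclose(log_then_split, split_then_log)
print(same)
```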

This question has garnered some misleading answers. The rest of this answer is dedicated to showing how and why they are misleading.

The term "normalization" is ambiguous, and different authors and disciplines will use the term "normalization" in different ways. In the absence of a specific articulation of what "normalization" means, I think it's best to approach the question in the most general sense possible.

In this view, the question is not about sklearn.preprocessing.Normalizer specifically. Indeed, the Normalizer class is not mentioned in the question. For that matter, no software, programming language or library is mentioned, either. Moreover, even if the intent is to ask about Normalizer, the answers are still misleading because they mischaracterize what Normalizer does.

Even within the same library, the terminology can be inconsistent. For example, the PyTorch ecosystem provides both torchvision.transforms.Normalize and torch.nn.functional.normalize. One of these can be used to create output tensors with mean 0 and standard deviation 1, while the other creates outputs that have a norm of 1.

What the Normalizer Class Does

The Normalizer class is an example of (2) because it rescales each observation (row) individually so that the sum-of-squares is 1 for every row. (In the corner-case that a row has sum-of-squares equal to 0, no rescaling is done.) The first sentence of the documentation for the Normalizer says

> Normalize samples individually to unit norm.

This simple test code validates this understanding:

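import numpy as np
from sklearn import preprocessing

X = np.arange(10).reshape((5, 2))
normalizer = preprocessing.Normalizer()
normalized_all_X = normalizer.transform(X)
sum_of_squares = np.square(normalized_all_X).sum(1)
# Every row's sum-of-squares should be 1.
print(np.allclose(sum_of_squares, np.ones_like(sum_of_squares)))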

This prints True because the result is an array of 1s, as described in the documentation.

The Normalizer implements fit, transform and fit_transform methods even though some of these are just "pass-through" methods. This is so that there is a consistent interface across preprocessing methods, not because the methods' behaviors need to distinguish between different data partitions.
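The row-wise rescaling described above can also be written directly in NumPy. This is only a sketch of the default L2 behavior, and it assumes no row is all zeros (the manual division below would fail for such a row, whereas Normalizer leaves it unchanged):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.arange(10, dtype=float).reshape((5, 2))

# Euclidean norm of each row, kept as a column so it broadcasts across the row.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)

# Manual version: each row is divided by its own norm; no other row is involved.
manual = X / row_norms

# scikit-learn version with the default norm="l2".
via_sklearn = Normalizer().fit_transform(X)

matches = np.allclose(manual, via_sklearn)
print(matches)
```

Nothing in the manual version depends on a column statistic, which is why fitting on one partition or another cannot change the output.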

Misleading Presentation 1

The Normalizer class does not subtract the column means

Another answer writes:

> Don't forget that testing data points represent real-world data. Feature normalization (or data standardization) of the explanatory (or predictor) variables is a technique used to center and normalise the data by subtracting the mean and dividing by the variance.

Ok, so let's try this out. Using the code snippet from the answer, we have

X = np.arange(10).reshape((5, 2))

X_train = X[:3]
X_test = X[3:]

normalizer = preprocessing.Normalizer()
normalized_train_X = normalizer.fit_transform(X_train)
column_means_train_X = normalized_train_X.mean(0)

This is the value of column_means_train_X. It is not zero!

[0.39313175 0.87097303]

If the column means had been subtracted from the columns, then the centered column means would be 0.0. (This is simple to prove. The sum of n numbers x=[x1,x2,x3,...,xn] is S. The mean of those numbers is S / n. Then we have sum(x - S/n) = S - n * (S / n) = 0.)

We can write similar code to show that the columns have not been divided by the variance. (Neither have the columns been divided by the standard deviation, which would be the more usual choice).
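One possible version of that check, as a sketch reusing the same toy split: compare the Normalizer output against the columns divided by their variance and by their standard deviation (the names divided_by_var and divided_by_std are introduced here for illustration).

```python
import numpy as np
from sklearn import preprocessing

X = np.arange(10).reshape((5, 2))
X_train = X[:3]

normalized_train_X = preprocessing.Normalizer().fit_transform(X_train)

# What the training columns would look like if they had been divided by
# their variance or by their standard deviation (no centering, for simplicity).
divided_by_var = X_train / X_train.var(axis=0)
divided_by_std = X_train / X_train.std(axis=0)

matches_var = np.allclose(normalized_train_X, divided_by_var)
matches_std = np.allclose(normalized_train_X, divided_by_std)
print(matches_var, matches_std)

# If the columns had been scaled to unit variance, these would all be 1.
print(normalized_train_X.std(axis=0))
```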

Misleading Presentation 2

Applying the Normalizer class to the whole data set does not change the result.

> If you take the mean and variance of the whole dataset you'll be introducing future information into the training explanatory variables (i.e. the mean and variance).

This claim is true as far as it goes, but it has absolutely no bearing on the Normalizer class. Indeed, Giorgos Myrianthous's chosen example is actually immune to the effect that they are describing.

If the Normalizer class did involve the means of the features, then we would expect the normalized results to change depending on which of our data are included in the training set.

For example, the sample mean is a weighted sum of every observation in the sample. If we were computing column means and subtracting them, the results of applying this to all of the data would differ from applying it to only the training data subset. But we've already demonstrated that Normalizer doesn't subtract column means.
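To make the contrast concrete, here is a short NumPy sketch (not using Normalizer at all) of what would happen with column-mean centering on the same toy data: the centered training rows differ depending on whether the means come from the training rows or from all rows.

```python
import numpy as np

X = np.arange(10, dtype=float).reshape((5, 2))
X_train = X[:3]

# Column means computed from the training rows only versus from every row.
train_means = X_train.mean(axis=0)
all_means = X.mean(axis=0)

# Center the same training rows with each set of means.
centered_with_train = X_train - train_means
centered_with_all = X_train - all_means

same_result = np.allclose(centered_with_train, centered_with_all)
print(train_means, all_means, same_result)
```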

Furthermore, these tests show that applying Normalizer to all of the data or just some of the data makes no difference for the results.

If we apply this method separately, we have

[[0.         1.        ]
 [0.5547002  0.83205029]
 [0.62469505 0.78086881]]

[[0.65079137 0.7592566 ]
 [0.66436384 0.74740932]]

And if we apply it together, we have

[[0.         1.        ]
 [0.5547002  0.83205029]
 [0.62469505 0.78086881]
 [0.65079137 0.7592566 ]
 [0.66436384 0.74740932]]

where the only difference is that we have 2 arrays in the first case, due to partitioning. Let's just double-check that the combined arrays are the same:

normalized_train_X = normalizer.fit_transform(X_train)
normalized_test_X = normalizer.transform(X_test)
normalized_all_X = normalizer.transform(X)
assert np.allclose(np.vstack((normalized_train_X, normalized_test_X)), normalized_all_X)

No exception is raised; they're numerically identical.

But sklearn's transformers are sometimes stateful, so let's make a new object just to make sure this isn't some state-related behavior.

new_normalizer = preprocessing.Normalizer()
new_normalized_all_X = new_normalizer.fit_transform(X)
assert np.allclose(np.vstack((normalized_train_X, normalized_test_X)), new_normalized_all_X)

In the second case, we still have no exception raised.

We can conclude that for the Normalizer class, it makes no difference if the data are partitioned or not.

Solution 4 - Machine Learning

Ask yourself if your data will look different depending on whether you transform before or after your split. If you're doing a log2 transformation, the order doesn't matter because each value is transformed independently of the others. If you're scaling and centering your data, the order does matter because an outlier can drastically change the final distribution. You're allowing the test set to "spill over" and affect your training set, potentially causing overly optimistic performance measures.

For R users, the caret package is good at handling test/train splits. You can add the argument preProcess = c("scale", "center") to the train function and it will automatically apply any transformation from the training data onto the test data.
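For Python readers, a rough analogue (not part of the original answer) is a scikit-learn pipeline. The sketch below assumes a StandardScaler followed by a LinearRegression on the toy data from this page; the model choice and the sequential split are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data and an illustrative sequential split.
X = np.arange(10, dtype=float).reshape((5, 2))
y = np.arange(5, dtype=float)
X_train, X_test, y_train, y_test = X[:3], X[3:], y[:3], y[3:]

# fit() learns the scaling from the training rows only;
# predict() reuses those training statistics on the test rows.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# The scaler inside the pipeline holds the training-set means.
fitted_scaler = model.named_steps["standardscaler"]
print(fitted_scaler.mean_)
print(predictions)
```

Bundling the scaler with the model this way means the test rows never contribute to the scaling statistics, which is the same guarantee the caret argument is described as giving.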

TL;DR: if the data come out differently depending on whether you normalize before or after your split, split first and then normalize.


All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | hemant | View Question on Stackoverflow
Solution 1 - Machine Learning | Giorgos Myrianthous | View Answer on Stackoverflow
Solution 2 - Machine Learning | user3452134 | View Answer on Stackoverflow
Solution 3 - Machine Learning | Sycorax | View Answer on Stackoverflow
Solution 4 - Machine Learning | Jeff Bezos | View Answer on Stackoverflow