# Normalize data before or after split of training and testing data?

## Machine Learning Problem Overview

I want to separate my data into training and test sets. Should I apply normalization to the data before or after the split? Does it make any difference when building a predictive model?

## Machine Learning Solutions

## Solution 1 - Machine Learning

You first need to split the data into training and test sets (a validation set could be useful too).

Don't forget that testing data points represent real-world data. Feature normalization (or data standardization) of the explanatory (or predictor) variables is a technique used to center and normalise the data by subtracting the mean and dividing by the variance. If you take the mean and variance of the whole dataset you'll be introducing future information into the training explanatory variables (i.e. the mean and variance).

Therefore, you should perform feature normalisation over the training data. Then perform normalisation on testing instances as well, but this time using the mean and variance of training explanatory variables. In this way, we can test and evaluate whether our model can generalize well to new, unseen data points.

**For a more comprehensive read, you can read my article Feature Scaling and Normalisation in a nutshell**

As an example, assuming we have the following data:

```
>>> import numpy as np
>>>
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
```

where `X` represents our features:

```
>>> X
[[0 1]
[2 3]
[4 5]
[6 7]
[8 9]]
```

and `y` contains the corresponding labels:

```
>>> list(y)
[0, 1, 2, 3, 4]
```

**Step 1: Create training/testing sets**

```
>>> from sklearn.model_selection import train_test_split
>>>
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
>>> X_train
[[4 5]
[0 1]
[6 7]]
>>>
>>> X_test
[[2 3]
[8 9]]
>>>
>>> y_train
[2, 0, 3]
>>>
>>> y_test
[1, 4]
```

**Step 2: Normalise training data**

```
>>> from sklearn import preprocessing
>>>
>>> normalizer = preprocessing.Normalizer()
>>> normalized_train_X = normalizer.fit_transform(X_train)
>>> normalized_train_X
array([[0.62469505, 0.78086881],
[0. , 1. ],
[0.65079137, 0.7592566 ]])
```

**Step 3: Normalize testing data**

```
>>> normalized_test_X = normalizer.transform(X_test)
>>> normalized_test_X
array([[0.5547002 , 0.83205029],
[0.66436384, 0.74740932]])
```

## Solution 2 - Machine Learning

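You can call `fit` on the training data first and then `transform` both sets.

Fit on the training data:

```
from sklearn import preprocessing

normalizer = preprocessing.Normalizer().fit(xtrain)
```

Transform:

```
# Apply the same fitted object to both partitions
xtrainnorm = normalizer.transform(xtrain)
xtestnorm = normalizer.transform(xtest)
```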

## Solution 3 - Machine Learning

In the specific setting of a train/test split, we need to distinguish between two transformations:

1. transformations that change the value of an observation (row) according to information about a feature (column), and
2. transformations that change the value of an observation according to information *about that observation* **alone**.

Two common examples of (1) are mean-centering (subtracting the mean of the feature) or scaling to unit variance (dividing by the *standard deviation*). Subtracting the mean and dividing by the *standard deviation* is a common transformation. In `sklearn`, it is implemented in `sklearn.preprocessing.StandardScaler`. **Importantly, this is not the same as `Normalizer`.** See below for exhaustive detail.

An example of (2) is transforming a feature by taking the logarithm, or raising each value to a power (e.g. squaring).

Transformations of the first type are best applied to the training data, with the centering and scaling values retained and applied to the test data afterwards. This is because using information about the test set to train the model may bias model comparison metrics to be overly optimistic. This can result in over-fitting & selection of a bogus model.
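As a minimal sketch of this fit-on-train, apply-to-test pattern, assuming `StandardScaler` as the transformation of type (1) and a toy array with an arbitrary 3/2 split (both are illustrative choices, not part of the question):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative only): 5 observations, 2 features.
X = np.arange(10, dtype=float).reshape((5, 2))
X_train, X_test = X[:3], X[3:]

# Learn the centering and scaling values from the training rows only.
scaler = StandardScaler().fit(X_train)

# Reuse the retained training statistics on both partitions.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)                  # column means of X_train, not of X
print(X_train_scaled.mean(axis=0))   # approximately 0 in each column
print(X_test_scaled.mean(axis=0))    # generally not 0
```

The test rows are shifted and scaled by values they had no part in computing, which is what keeps the evaluation honest.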

Transformations of the second type can be applied without regard to train/test splits, because the modified value of each observation depends only on the data about the observation itself, and not on any other data or observation(s).
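A quick check of this claim, assuming a base-2 logarithm as the transformation of type (2) and some made-up positive values:

```python
import numpy as np

# Made-up, strictly positive feature values (illustrative only).
X = np.array([[1.0, 2.0], [4.0, 8.0], [16.0, 32.0], [64.0, 128.0]])
X_train, X_test = X[:3], X[3:]

# Transform each partition separately...
log_separately = np.vstack((np.log2(X_train), np.log2(X_test)))
# ...or transform everything at once, before splitting.
log_together = np.log2(X)

# An element-wise transformation gives the same values either way.
print(np.allclose(log_separately, log_together))  # True
```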

This question has garnered some misleading answers. The rest of this answer is dedicated to showing how and why they are misleading.

The term "normalization" is ambiguous, and different authors and disciplines will use the term "normalization" in different ways. In the absence of a specific articulation of what "normalization" means, I think it's best to approach the question in the most general sense possible.

In this view, the question is not about `sklearn.preprocessing.Normalizer` specifically. Indeed, the `Normalizer` class is not mentioned in the question. For that matter, no software, programming language or library is mentioned, either. Moreover, even if the intent is to ask about `Normalizer`, the answers are still misleading because they mischaracterize what `Normalizer` does.

Even within the same library, the terminology can be inconsistent. For example, PyTorch implements both `torchvision.transforms.Normalize` and `torch.nn.functional.normalize`. One of these can be used to create output tensors with mean 0 and standard deviation 1, while the other creates outputs that have a norm of 1.

##### What the `Normalizer` Class Does

The `Normalizer` class is an example of (2) because it rescales each observation (row) *individually* so that the sum-of-squares is 1 for every row. (In the corner-case that a row has sum-of-squares equal to 0, no rescaling is done.) The first sentence of the documentation for the `Normalizer` says

> Normalize samples individually to unit norm.

This simple test code validates this understanding:

```
import numpy as np
from sklearn import preprocessing

X = np.arange(10).reshape((5, 2))
normalizer = preprocessing.Normalizer()
normalized_all_X = normalizer.transform(X)
# Sum of squares of each row after normalization
sum_of_squares = np.square(normalized_all_X).sum(1)
print(np.allclose(sum_of_squares, np.ones_like(sum_of_squares)))
```

This prints `True` because the result is an array of 1s, as described in the documentation.

The normalizer implements `fit`, `transform` and `fit_transform` methods even though some of these are just "pass-through" methods. This is so that there is a consistent interface across preprocessing methods, **not** because the methods' behavior needs to distinguish between different data partitions.

##### Misleading Presentation 1

**The Normalizer class does not subtract the column means**

Another answer writes:

> Don't forget that testing data points represent real-world data. Feature normalization (or data standardization) of the explanatory (or predictor) variables is a technique used to center and normalise the data by subtracting the mean and dividing by the variance.

Ok, so let's try this out. Using the code snippet from the answer, we have

```
X = np.arange(10).reshape((5, 2))
X_train = X[:3]
X_test = X[3:]
normalizer = preprocessing.Normalizer()
normalized_train_X = normalizer.fit_transform(X_train)
column_means_train_X = normalized_train_X.mean(0)
```

This is the value of `column_means_train_X`. It is not zero!

```
[0.39313175 0.87097303]
```

If the column means had been subtracted from the columns, then the centered column means would be 0.0. (This is simple to prove. The sum of `n` numbers `x=[x1,x2,x3,...,xn]` is `S`. The mean of those numbers is `S / n`. Then we have `sum(x - S/n) = S - n * (S / n) = 0`.)

We can write similar code to show that the columns have not been divided by the *variance*. (Neither have the columns been divided by the *standard deviation*, which would be the more usual choice).
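One way such a check might look, as a sketch that reuses the same toy `X` and first-three-rows training subset and compares the `Normalizer` output against explicit column-wise scalings:

```python
import numpy as np
from sklearn import preprocessing

X = np.arange(10).reshape((5, 2))
X_train = X[:3]
normalized_train_X = preprocessing.Normalizer().fit_transform(X_train)

col_mean = X_train.mean(axis=0)
col_var = X_train.var(axis=0)
col_std = X_train.std(axis=0)

# Candidate column-wise scalings, with and without centering.
candidates = {
    "divided by variance": X_train / col_var,
    "divided by std": X_train / col_std,
    "centered, divided by variance": (X_train - col_mean) / col_var,
    "centered, divided by std": (X_train - col_mean) / col_std,
}

# None of these reproduces the Normalizer output.
for name, candidate in candidates.items():
    print(name, np.allclose(normalized_train_X, candidate))  # False each time
```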

##### Misleading Presentation 2

**Applying the Normalizer class to the whole data set does not change the result.**

> If you take the mean and variance of the whole dataset you'll be introducing future information into the training explanatory variables (i.e. the mean and variance).

This claim is true as far as it goes, but it has absolutely no bearing on the `Normalizer` class. Indeed, **Giorgos Myrianthous's chosen example is actually immune to the effect that they are describing.**

If the `Normalizer` class did involve the means of the features, then we would expect the normalized results to change depending on which of our data are included in the training set.

For example, the sample mean is a weighted sum of every observation in the sample. If we were computing column means and subtracting them, the results of applying this to all of the data would differ from applying it to only the training data subset. But we've already demonstrated that `Normalizer` doesn't subtract column means.

Furthermore, these tests show that applying `Normalizer` to all of the data or just some of the data makes no difference for the results.

If we apply this method separately, we have

```
[[0. 1. ]
[0.5547002 0.83205029]
[0.62469505 0.78086881]]
[[0.65079137 0.7592566 ]
[0.66436384 0.74740932]]
```

And if we apply it together, we have

```
[[0. 1. ]
[0.5547002 0.83205029]
[0.62469505 0.78086881]
[0.65079137 0.7592566 ]
[0.66436384 0.74740932]]
```

where the only difference is that we have 2 arrays in the first case, due to partitioning. Let's just double-check that the combined arrays are the same:

```
normalized_train_X = normalizer.fit_transform(X_train)
normalized_test_X = normalizer.transform(X_test)
normalized_all_X = normalizer.transform(X)
assert np.allclose(np.vstack((normalized_train_X, normalized_test_X)),normalized_all_X )
```

No exception is raised; they're numerically identical.

But sklearn's transformers are sometimes stateful, so let's make a new object just to make sure this isn't some state-related behavior.

```
new_normalizer = preprocessing.Normalizer()
new_normalized_all_X = new_normalizer.fit_transform(X)
assert np.allclose(np.vstack((normalized_train_X, normalized_test_X)),new_normalized_all_X )
```

In the second case, we still have no exception raised.

We can conclude that for the `Normalizer` class, it makes no difference if the data are partitioned or not.

## Solution 4 - Machine Learning

Ask yourself if your data will look different depending on whether you transform before or after your split. If you're doing a `log2` transformation, the order doesn't matter because each value is transformed independently of the others. If you're scaling and centering your data, the order does matter because an outlier can drastically change the final distribution. You're allowing the test set to "spill over" and affect your training set, potentially causing overly optimistic performance measures.
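To make the outlier effect concrete, here is a small sketch in Python, assuming `sklearn`'s `StandardScaler` for the centering and scaling and a made-up feature whose single outlier lands in the test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single feature; the last value is an outlier that ends up
# in the test set (illustrative only).
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
x_train, x_test = x[:4], x[4:]

# Scaling after the split: statistics come from the training rows only.
scaled_after_split = StandardScaler().fit(x_train).transform(x_train)

# Scaling before the split: the outlier shifts the mean and inflates the
# standard deviation that get applied to the training rows.
scaled_before_split = StandardScaler().fit(x).transform(x)[:4]

print(scaled_after_split.ravel())
print(scaled_before_split.ravel())
print(np.allclose(scaled_after_split, scaled_before_split))  # False
```

In the second case the four training values are squeezed into a narrow band purely because of a test observation.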

For `R` users, the `caret` package is good at handling test/train splits. You can add the argument `preProcess = c("scale", "center")` to the `train` function and it will automatically apply any transformation from the training data onto the test data.
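For Python users, a rough analogue (not a port of `caret`) is to put the scaler and the model into an `sklearn` pipeline. The sketch below assumes randomly generated toy data and an arbitrary choice of `LogisticRegression` as the model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Randomly generated toy data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# fit() learns the scaling values from X_train only; score() and
# predict() reuse those values on whatever data they are given.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Because the scaler lives inside the pipeline, the same object can be passed to cross-validation utilities and the scaling is refit on each training fold.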

Tl;dr - if the data looks different depending on whether you normalize before or after your split, split first and then normalize, using only values computed from the training data.