scikit-learn random state in splitting dataset

Tags: Python, Random, Machine Learning, Scikit Learn

Python Problem Overview


Can anyone tell me why we set the random state to zero when splitting the data into train and test sets?

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.30, random_state=0)

I have also seen cases like this, where the random state is set to 1:

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.30, random_state=1)

What is the consequence of this random state in cross validation as well?

Python Solutions


Solution 1 - Python

It doesn't matter whether random_state is 0, 1, or any other integer. What matters is that it is set to the same value every time if you want to validate your processing over multiple runs of the code. By the way, I have seen random_state=42 used in many official scikit-learn examples as well as elsewhere.

random_state, as the name suggests, is used to initialize the internal random number generator, which in your case decides how the data is split into train and test indices. The documentation states:

> If random_state is None or np.random, then a randomly-initialized RandomState object is returned.

> If random_state is an integer, then it is used to seed a new RandomState object.

> If random_state is a RandomState object, then it is passed through.

Fixing the seed lets you check and validate results when running the code multiple times. Setting random_state to a fixed value guarantees that the same sequence of random numbers is generated each time you run the code. Unless some other source of randomness is present in the process, the results will be the same on every run. This helps in verifying the output.
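As a rough illustration of the three cases quoted above, here is a minimal sketch using sklearn.utils.check_random_state, the helper those documentation lines describe. The variable names are only for illustration.

```python
import numpy as np
from sklearn.utils import check_random_state

# An integer seeds a brand-new RandomState object each time.
rng_a = check_random_state(0)
rng_b = check_random_state(0)

# Same seed, same sequence of pseudorandom numbers.
perm_a = rng_a.permutation(10)
perm_b = rng_b.permutation(10)
print(np.array_equal(perm_a, perm_b))  # True

# An existing RandomState instance is passed through unchanged.
rng_c = np.random.RandomState(1)
print(check_random_state(rng_c) is rng_c)  # True

# None falls back to NumPy's global RandomState, so the numbers drawn
# from it are not reproducible unless that global state has been seeded.
rng_d = check_random_state(None)
print(isinstance(rng_d, np.random.RandomState))  # True
```

The first check is the property that makes a fixed integer useful: two generators built from the same seed walk through the same sequence.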

Solution 2 - Python

If you don't specify random_state in the code, then every time you execute it a new random seed is used, and the train and test datasets will contain different values each time.

However, if you use a particular value for random_state (random_state = 1 or any other value), the result will be the same every time, i.e., the same values in the train and test datasets.

Solution 3 - Python

When random_state is set to an integer, train_test_split returns the same result on each execution.

When random_state is set to None, train_test_split returns different results on each execution.

See the example below:

from sklearn.model_selection import train_test_split

# Plain lists, so that the test sets print as lists
X_data = list(range(10))
y_data = list(range(10))

# Fixed seed: every iteration produces the same split
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X_data, y_data, test_size=0.3, random_state=0)  # zero or any other integer
    print(y_test)

print("*" * 30)

# No seed: the split generally changes from one iteration to the next
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X_data, y_data, test_size=0.3, random_state=None)
    print(y_test)

Output (the five lines after the asterisks vary from run to run):

[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
******************************
[4, 7, 6]
[4, 3, 7]
[8, 1, 4]
[9, 5, 8]
[6, 4, 5]

Solution 4 - Python

random_state controls how the data is shuffled before it is split, with a twist: for a particular value of random_state, the resulting order is always the same. Note that it is not a Boolean flag. Any integer from 0 upwards can be passed, and each value produces its own fixed order. For example, the order you get with random_state=0 stays the same: if you then run with random_state=5 and afterwards come back to random_state=0, you get the same order as before. The same holds for every other integer. However, random_state=None splits the data differently each time.
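A small sketch of the "go to another seed and come back" behaviour described here. The helper name held_out and the toy data are assumptions made up for this example.

```python
from sklearn.model_selection import train_test_split

data = list(range(10))

def held_out(seed):
    """Return the test portion of a 70/30 split for the given seed."""
    _, test_part = train_test_split(data, test_size=0.3, random_state=seed)
    return test_part

first = held_out(0)   # split with seed 0
other = held_out(5)   # switch to a different seed in between
again = held_out(0)   # come back to seed 0

print(first, other, again)
print(first == again)  # True: seed 0 always yields the same split
```

Each call builds its own generator from the seed, so calling with seed 5 in between has no effect on what seed 0 produces.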


Solution 5 - Python

If you don't specify random_state in your code, then every time you run (execute) it a new random seed is used, and the train and test datasets will contain different values each time.

However, if a fixed value is assigned, like random_state = 0 or 1 or 42, then no matter how many times you execute your code the result will be the same, i.e., the same values in the train and test datasets.

Solution 6 - Python

random_state is None by default, which means that every time you run your program you get different output, because the split between train and test varies from run to run.

random_state set to any int value means that every time you run your program you get the same output, because the split between train and test does not vary.

Solution 7 - Python

random_state is an integer that determines which of the possible train/test combinations is selected. When you set test_size to 1/4, there is a set of possible train/test combinations, and you can think of each one as corresponding to a state. Suppose you have a dataset [1,2,3,4] (the state numbers below are only illustrative; the actual mapping from seed to split depends on the random number generator):

Train     | Test | State
[1,2,3]   | [4]  | 0
[1,3,4]   | [2]  | 1
[4,2,3]   | [1]  | 2
[2,4,1]   | [3]  | 3

We need it because, while tuning the parameters of a model, the same state is used again and again, so the split does not interfere with the accuracy comparison.

The same idea applies to a random forest, but there the randomness concerns which samples and variables (features) each tree uses.
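To make the tuning point concrete, here is a minimal sketch that keeps both the split and the forest seeded while only max_depth changes. The synthetic dataset, the helper fit_and_score, and the chosen depths are assumptions for illustration, not part of the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, generated with a fixed seed so the example is repeatable.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# One fixed split, shared by every candidate setting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

def fit_and_score(max_depth):
    """Train a seeded forest and return its accuracy on the fixed test set."""
    model = RandomForestClassifier(
        n_estimators=20, max_depth=max_depth, random_state=0)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Only max_depth varies, so score differences come from that parameter.
for depth in (2, 4, 8):
    print(depth, fit_and_score(depth))

# Repeating the same setting reproduces the same score.
print(fit_and_score(4) == fit_and_score(4))  # True
```

Without the two random_state arguments, a change in score could come from a different split or a different forest rather than from the parameter being tuned.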

Solution 8 - Python

The random_state parameter is also used to make shuffling reproducible, for example the initial shuffling of the training data and the reshuffling after each epoch.
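The answer does not say which estimator it has in mind. As one possible illustration, SGDClassifier reshuffles the training data after each epoch when shuffle=True, and its random_state makes that shuffling repeatable. The dataset and the helper fit_coef below are assumptions for this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data with a fixed seed.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

def fit_coef(seed):
    """Fit an SGD classifier whose per-epoch shuffling is driven by the seed."""
    clf = SGDClassifier(shuffle=True, random_state=seed,
                        max_iter=1000, tol=1e-3)
    return clf.fit(X, y).coef_

# Same seed: same shuffling order in every epoch, hence identical weights.
print(np.array_equal(fit_coef(0), fit_coef(0)))  # True
```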

Solution 9 - Python

Across multiple executions of our model, random_state makes sure that the same data values end up in the training and testing datasets. It fixes the order of the data for train_test_split.

Solution 10 - Python

Let's say our dataset has one feature and 10 data points, X=[0,1,2,3,4,5,6,7,8,9], and 0.3 (30% as the test set) is specified as the test fraction. Then there are 10C3=120 different possible test sets. For a tabular explanation, see the picture at https://i.stack.imgur.com/FZm4a.png

Based on the random_state value specified, the system picks one of these combinations and assigns the train and test data accordingly.
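The count of 120 can be checked with the standard library. The sketch below is a toy model only: scikit-learn does not enumerate the combinations, it shuffles the indices with a seeded generator. The helper pick_test_set is a made-up name that mimics the idea of "seed selects one combination".

```python
import random
from itertools import combinations
from math import comb

X = list(range(10))
test_count = 3  # 30% of 10 data points

# Number of distinct ways to choose 3 test points out of 10.
print(comb(len(X), test_count))  # 120

# Enumerate every possible test set explicitly.
all_test_sets = list(combinations(X, test_count))
print(len(all_test_sets))  # 120

def pick_test_set(seed):
    """Toy stand-in: a seeded generator selects one of the 120 combinations."""
    rng = random.Random(seed)
    return rng.choice(all_test_sets)

# The same seed always selects the same combination.
print(pick_test_set(0) == pick_test_set(0))  # True
```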

Solution 11 - Python

In addition to what has already been said, different values of random_state may produce different results during the training phase.

Internally, the train_test_split() function uses a seed to pseudorandomly separate the data into two groups: the training set and the test set.

The split is pseudorandom because the same seed value always produces the same subdivision of the data. This is very useful for ensuring the reproducibility of experiments.

Unfortunately, using one seed rather than another can lead to quite different datasets, and can even change the measured performance of the chosen machine learning model that receives the training set as input.
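A quick way to see this effect for yourself is to repeat the split with several seeds and compare the test accuracy. This is only a sketch: the small synthetic dataset and the choice of LogisticRegression are assumptions for illustration, and the exact scores will depend on your data and library version.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A deliberately small dataset, where the choice of split matters more.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)

scores = {}
for seed in range(5):
    # Only the split seed changes between iterations.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression().fit(X_train, y_train)
    scores[seed] = model.score(X_test, y_test)

# One accuracy value per seed; compare them to see how much they spread.
for seed, score in scores.items():
    print(seed, round(score, 3))
```

If the scores differ noticeably between seeds, a single split is a noisy estimate, which is one reason to report results over several seeds or to use cross-validation.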

You can read the following article for more on this aspect: https://towardsdatascience.com/why-you-should-not-trust-the-train-test-split-function-47cb9d353ad2

The article also shows a practical example.

You can also find other considerations in this article: https://towardsdatascience.com/is-a-small-dataset-risky-b664b8569a21

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type         | Original Author
Question             | Shelly
Solution 1 - Python  | Vivek Kumar
Solution 2 - Python  | Rishi Bansal
Solution 3 - Python  | San
Solution 4 - Python  | Ganesh
Solution 5 - Python  | Farzana Khan
Solution 6 - Python  | user13140964
Solution 7 - Python  | Babrit Behera
Solution 8 - Python  | Debasish Bhol
Solution 9 - Python  | hari
Solution 10 - Python | srihitha
Solution 11 - Python | Angelica Lo Duca