How to split data into a training set and a test set randomly?

Python, File IO

Python Problem Overview


I have a large dataset and want to split it into a training set (50%) and a testing set (50%).

Say I have 100 examples stored in the input file, one example per line. I need to choose 50 lines as the training set and 50 lines as the testing set.

My idea is to first generate a random permutation of length 100 (values ranging from 1 to 100), then use the first 50 elements as the line numbers of the 50 training examples, and do the same for the testing set.

This is easy to achieve in Matlab:

fid=fopen(datafile);
C = textscan(fid, '%s','delimiter', '\n');
plist=randperm(100);
for i=1:50
    trainstring = C{plist(i)};
    fprintf(train_file,trainstring);
end
for i=51:100
    teststring = C{plist(i)};
    fprintf(test_file,teststring);
end

But how could I accomplish this in Python? I'm new to Python and don't know whether I can read the whole file into a list and then choose certain lines.
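
For reference, a fairly direct Python translation of the Matlab snippet above could look like the following sketch (the file names and the 100-line assumption are carried over from the question):

import random

# read all lines, like textscan with a '\n' delimiter
with open("datafile.txt") as f:
    lines = f.read().splitlines()

# random.sample(range(n), n) plays the role of randperm(n),
# except that the indices here are 0-based
plist = random.sample(range(100), 100)

with open("train_file.txt", "w") as train_file:
    for i in plist[:50]:
        train_file.write(lines[i] + "\n")

with open("test_file.txt", "w") as test_file:
    for i in plist[50:]:
        test_file.write(lines[i] + "\n")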

Python Solutions


Solution 1 - Python

This can be done similarly in Python using lists (note that the whole list is shuffled in place).

import random

# open in text mode; splitlines() avoids a trailing empty entry
with open("datafile.txt", "r") as f:
    data = f.read().splitlines()

# shuffle the whole list in place, then slice it in two
random.shuffle(data)

train_data = data[:50]
test_data = data[50:]
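
If you need the same split on every run, you can seed the generator once before shuffling; the seed value below is arbitrary:

random.seed(42)  # a fixed seed makes the shuffle, and hence the split, reproducible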

Solution 2 - Python

from sklearn.model_selection import train_test_split
import numpy

# read the lines and convert them to a NumPy array
with open("datafile.txt", "r") as f:
    data = f.read().splitlines()
data = numpy.array(data)

# test_size=0.5 sends half of the data to the test set
x_train, x_test = train_test_split(data, test_size=0.5)
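
As a side note, train_test_split shuffles by default and accepts a random_state parameter if you want a reproducible split:

x_train, x_test = train_test_split(data, test_size=0.5, random_state=42)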

Solution 3 - Python

To answer @desmond.carros's question, I modified the best answer as follows:

import random

# read the file and split each line on your preferred delimiter
# (the comma below is only a placeholder)
with open("datafile.txt", "r") as file:
    data = [line.split(",") for line in file]

random.shuffle(data)

train_data = data[:int((len(data)+1)*.80)]  # first 80% goes to the training set
test_data = data[int((len(data)+1)*.80):]   # remaining 20% goes to the test set

This code splits the entire dataset into 80% training and 20% test data.

Solution 4 - Python

You could also use NumPy. Assuming your data is stored in a numpy.ndarray:

import numpy as np
from random import sample

l = 100  # length of the data
f = 50   # number of elements you need
indices = sample(range(l), f)  # f unique random indices into the data

# `data` is assumed to be a numpy.ndarray of length l
train_data = data[indices]
test_data = np.delete(data, indices)

Solution 5 - Python

You can try this approach

import pandas
import sklearn.cross_validation

csv = pandas.read_csv('data.csv')
train, test = sklearn.cross_validation.train_test_split(csv, train_size=0.5)

UPDATE: train_test_split has since moved to model_selection, so the current way (as of scikit-learn 0.22.2) is:

import pandas
import sklearn.model_selection

csv = pandas.read_csv('data.csv')
train, test = sklearn.model_selection.train_test_split(csv, train_size=0.5)
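
Since the data already sits in a DataFrame here, a pandas-only alternative is to sample the training rows and drop them to obtain the test rows (a sketch, reusing the same data.csv):

train = csv.sample(frac=0.5, random_state=0)  # randomly pick 50% of the rows
test = csv.drop(train.index)                  # the remaining rows form the test set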

Solution 6 - Python

sklearn.cross_validation has been deprecated since version 0.18; you should use sklearn.model_selection instead, as shown below.

from sklearn.model_selection import train_test_split
import numpy

# read the lines and convert them to a NumPy array
with open("datafile.txt", "r") as f:
    data = f.read().splitlines()
data = numpy.array(data)

# test_size=0.5 sends half of the data to the test set
x_train, x_test = train_test_split(data, test_size=0.5)

Solution 7 - Python

The following produces more general k-fold cross-validation splits. Your 50-50 partitioning is achieved by setting k=2 below; all you would have to do is pick one of the two partitions produced. Note: I haven't tested the code, but I'm pretty sure it should work.

import random, math

def k_fold(myfile, myseed=11109, k=3):
    # Load data
    data = open(myfile).readlines()

    # Shuffle input; note that random.seed must be called, not assigned to
    random.seed(myseed)
    random.shuffle(data)

    # Compute partition size given input k
    len_part = int(math.ceil(len(data) / float(k)))

    # Create one partition per fold
    train = {}
    test = {}
    for ii in range(k):
        test[ii] = data[ii * len_part:ii * len_part + len_part]
        # everything outside the test fold becomes training data
        # (this membership test assumes the lines are unique)
        train[ii] = [jj for jj in data if jj not in test[ii]]

    return train, test
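
For the 50-50 split asked about in the question, a usage sketch could be (datafile.txt assumed, as above):

# k=2 yields two complementary halves; pick either fold
train, test = k_fold("datafile.txt", k=2)
train_data, test_data = train[0], test[0]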
 

Solution 8 - Python

A quick note on the answer from @subin sahayam:

import random

# read the file and split each line on your preferred delimiter
# (the comma below is only a placeholder)
with open("datafile.txt", "r") as file:
    data = [line.split(",") for line in file]

random.shuffle(data)

train_data = data[:int((len(data)+1)*.80)]  # first 80% goes to the training set
test_data = data[int(len(data)*.80+1):]     # remaining data goes to the test set

If your list size is an even number, you should not add the 1 in the line below. Instead, check the size of the list first and then decide whether you need to add the 1.

> test_data = data[int(len(data)*.80+1):]
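
A simple way to sidestep the off-by-one question entirely is to compute the split index once and use it for both slices, so the two parts always cover the whole list without overlap:

split = int(len(data) * .80)  # works for even and odd list sizes alike
train_data = data[:split]
test_data = data[split:]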

Solution 9 - Python

Well, first of all, there's no such thing as "arrays" in Python; Python uses lists, and that does make a difference. I suggest you use NumPy, which is a pretty good library for Python that adds a lot of Matlab-like functionality. You can get started with the guide NumPy for Matlab users.
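
For instance, Matlab's randperm from the question maps directly onto NumPy (a minimal sketch; note that the result is 0-based):

import numpy as np

plist = np.random.permutation(100)  # NumPy's counterpart to randperm(100)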

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Freya Ren | View Question on Stackoverflow
Solution 1 - Python | ijmarshall | View Answer on Stackoverflow
Solution 2 - Python | shubhranshu | View Answer on Stackoverflow
Solution 3 - Python | subin sahayam | View Answer on Stackoverflow
Solution 4 - Python | JLT | View Answer on Stackoverflow
Solution 5 - Python | Roman Gherta | View Answer on Stackoverflow
Solution 6 - Python | Andrew | View Answer on Stackoverflow
Solution 7 - Python | Lord Henry Wotton | View Answer on Stackoverflow
Solution 8 - Python | lee | View Answer on Stackoverflow
Solution 9 - Python | aehs29 | View Answer on Stackoverflow