How to split CSV files as per number of rows specified?

Tags: Linux, Unix, Csv, Split

Linux Problem Overview


I have a CSV file (around 10,000 rows, each with 300 columns) stored on a Linux server. I want to break this CSV file into 500 CSV files of 20 records each, with each file carrying the same CSV header as the original.

Is there any Linux command to help with this conversion?

Linux Solutions


Solution 1 - Linux

Use the Linux split command:

split -l 20 file.txt new    

This splits the file "file.txt" into files beginning with the name "new", each containing 20 lines of text.

Type man split at the Unix prompt for more information. However, you will first have to remove the header from file.txt (using the tail command, for example) and then add it back onto each of the split files.
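Putting those steps together, a minimal sketch for the question's 20-row chunks (the file name, output prefix, and .csv renaming are illustrative choices, not part of the original answer):

HEADER=$(head -1 file.txt)                # save the header row
tail -n +2 file.txt | split -l 20 - new   # split the data rows into 20-line chunks
for f in new*; do                         # re-attach the header to every chunk
    { echo "$HEADER"; cat "$f"; } > "$f.csv" && rm "$f"
done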

Solution 2 - Linux

I made this into a function. You can now call splitCsv <filename> [chunkSize]:

splitCsv() {
    HEADER=$(head -1 "$1")              # save the header row
    if [ -n "$2" ]; then
        CHUNK=$2
    else
        CHUNK=1000                      # default chunk size
    fi
    # split everything after the header into chunks of $CHUNK lines
    tail -n +2 "$1" | split -l "$CHUNK" - "${1}_split_"
    # prepend the saved header to each chunk
    for i in "${1}"_split_*; do
        sed -i -e "1i$HEADER" "$i"
    done
}

Found on: http://edmondscommerce.github.io/linux/linux-split-file-eg-csv-and-keep-header-row.html
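For the 20-record chunks asked about in the question, a hypothetical call (the file name is just an example) would be:

splitCsv file.csv 20

This produces file.csv_split_aa, file.csv_split_ab, and so on, each beginning with the original header row.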

Solution 3 - Linux

This should work!

file_name.csv = the file you want to split
10000 = the number of rows each split file will contain
file_part_ = the prefix for the split file names (-d gives numeric suffixes, so the output is file_part_00, file_part_01, file_part_02, and so on)

split -d -l 10000 file_name.csv file_part_
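For the 20-row chunks from the original question, the same pattern (file name assumed for illustration) would be:

split -d -l 20 file_name.csv file_part_

Note that plain split does not copy the header row into each chunk; combine it with the header handling shown in Solutions 1, 2, or 5 if you need that.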

Solution 4 - Linux

A one-liner that preserves the header row in each split file. This example gives you 999 lines of data plus one header row per file.

cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'

See https://stackoverflow.com/a/53062251/401226, where the comments on the answer discuss installing the correct version of parallel (on Ubuntu, use the dedicated parallel package, which is more recent than the version bundled with moreutils).
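For the 20-record chunks requested in the original question, the same pattern should work with a smaller block count (an untested adaptation, not part of the original answer):

cat bigFile.csv | parallel --header : --pipe -N20 'cat >file_{#}.csv'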

Solution 5 - Linux

This should do it for you; all your output files will end up named Part1 through Part500.

#!/bin/bash
FILENAME=10000.csv
HDR=$(head -1 "$FILENAME")                  # Pick up CSV header line to apply to each file
tail -n +2 "$FILENAME" | split -l 20 - xyz  # Split the data rows (header excluded) into chunks of 20 lines each
n=1
for f in xyz*                               # Go through all newly created chunks
do
   echo "$HDR" > Part${n}                   # Write out header to new file called "Part(n)"
   cat "$f" >> Part${n}                     # Add in the 20 lines from the "split" command
   rm "$f"                                  # Remove temporary file
   ((n++))                                  # Increment name of output part
done

Solution 6 - Linux

This question was asked many years ago, but for future readers I'd like to mention that the most convenient tool for this purpose is xsv, from https://github.com/BurntSushi/xsv.

The split sub-command does exactly what the original question asks for. The documentation says:

split - Split one CSV file into many CSV files of N chunks

Each of the split chunks retains the header row.
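A hypothetical invocation for the question's 20-record chunks (the output directory and file name are placeholders; check xsv split --help for the exact flags) would be:

xsv split --size 20 out/ file.csv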

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Pawan Mude | View Question on Stackoverflow
Solution 1 - Linux | James King | View Answer on Stackoverflow
Solution 2 - Linux | Martin Dinov | View Answer on Stackoverflow
Solution 3 - Linux | Soumyaansh | View Answer on Stackoverflow
Solution 4 - Linux | Tim Richardson | View Answer on Stackoverflow
Solution 5 - Linux | Mark Setchell | View Answer on Stackoverflow
Solution 6 - Linux | San | View Answer on Stackoverflow