How to split a file and keep the first line in each of the pieces?
Problem Overview
Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).

Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.

I am guessing some concoction of split and head will do the trick?
Linux Solutions
Solution 1 - Linux
This is robhruska's script cleaned up a bit:
tail -n +2 file.txt | split -l 4 - split_   # split everything after the header into 4-line pieces
for file in split_*
do
    head -n 1 file.txt > tmp_file           # write the header first
    cat "$file" >> tmp_file                 # then append this piece's contents
    mv -f tmp_file "$file"
done
I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.
If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard-coded one.
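For instance, a sketch of how the loop body might change with mktemp (just replacing the hard-coded name; everything else stays the same):

    tmp_file=$(mktemp)             # or tempfile, on systems that still ship it
    head -n 1 file.txt > "$tmp_file"
    cat "$file" >> "$tmp_file"
    mv -f "$tmp_file" "$file"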
Edit
Using GNU split it's possible to do this:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
Broken out for readability:
split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_
When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.

A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter would be to output to a fixed filename in a variable directory: > "$FILE/data.dat".
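A sketch of that variable-directory idea (the part_ prefix and the mkdir step are illustrative additions, not part of the answer above); each chunk lands in its own directory as data.dat:

split_filter () { mkdir -p "$FILE"; { head -n 1 file.txt; cat; } > "$FILE/data.dat"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - part_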
Solution 2 - Linux
This one-liner will split the big CSV into pieces of 999 records, preserving the header row at the top of each one (so 999 records + 1 header = 1000 lines per file):
cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'
Based on Ole Tange's answer. (Regarding Ole's answer: you can't use a line count with --pipepart.)

See the comments for some tips on installing parallel.
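For reference, GNU parallel is packaged for most systems; installs typically look something like this (package names can vary by distribution):

sudo apt-get install parallel    # Debian/Ubuntu
sudo yum install parallel        # RHEL/CentOS (EPEL may be needed)
brew install parallel            # macOS with Homebrew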
Solution 3 - Linux
You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):
tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'
Solution 4 - Linux
You can use [mg]awk:
awk 'NR==1{
    header=$0;                  # remember the header line
    count=1;
    print header > "x_" count;  # start the first slice with it
    next
}
!( (NR-1) % 100){               # every 100th line, start a new slice
    count++;
    print header > "x_" count;
}
{
    print $0 > "x_" count       # write the current line to the current slice
}' file
100 is the number of lines in each slice. It doesn't require temp files and can be put on a single line.
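For reference, here is the same program collapsed onto one line:

awk 'NR==1{header=$0; count=1; print header > "x_" count; next} !((NR-1)%100){count++; print header > "x_" count} {print $0 > "x_" count}' file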
Solution 5 - Linux
I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.
$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done
This assumes your input file is file.txt, that you're not using the prefix argument to split, and that you're working in a directory that doesn't have any other files matching split's default xa* output name format. Also, replace the '4' with your desired number of lines per piece.
Solution 6 - Linux
Use GNU Parallel:
parallel -a bigfile.csv --header : --pipepart 'cat > {#}'
If you need to run a command on each of the parts, then GNU Parallel can help do that, too:
parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}
If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal-sized parts):
parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
If you want to split into 10 MB blocks:
parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
Solution 7 - Linux
This is a more robust version of Dennis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around if the run were incomplete. So, let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyway.
trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT

tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat "$file" >> tmp_file
    mv -f tmp_file "$file"
done
Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyway (as some have already suggested), so go ahead and remove 'tmp_file' from the rm in the trap line. See the signal man page for more signals to catch.
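Putting those suggestions together, a minimal sketch might look like this (the exit code 13 and split_ prefix are kept from above; with mktemp, the temp name no longer needs to appear in the trap):

trap 'rm -f split_* ; exit 13' SIGINT SIGTERM SIGQUIT

tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    tmp_file=$(mktemp)                # unpredictable, safely created name
    head -n 1 file.txt > "$tmp_file"
    cat "$file" >> "$tmp_file"
    mv -f "$tmp_file" "$file"         # the temp file is consumed on success
done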
Solution 8 - Linux
I liked marco's awk version and adapted it into this simplified one-liner, where you can specify the split fraction as granularly as you want:
awk 'NR==1{print $0 > FILENAME ".split1"; print $0 > FILENAME ".split2";} NR>1{if (NR % 10 > 5) print $0 >> FILENAME ".split1"; else print $0 >> FILENAME ".split2"}' file
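For example, changing the threshold changes the ratio; this variant (same pattern, just a different, illustrative threshold) sends roughly two of every ten data rows to the first piece and the rest to the second:

awk 'NR==1{print $0 > FILENAME ".split1"; print $0 > FILENAME ".split2";} NR>1{if (NR % 10 > 7) print $0 >> FILENAME ".split1"; else print $0 >> FILENAME ".split2"}' file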
Solution 9 - Linux
I really liked Rob and Dennis' versions, so much so that I wanted to improve them.
Here's my version:
in_file=$1

# Get all lines except the first, split into 100,000-line chunks
awk '{if (NR!=1) {print}}' "$in_file" | split -d -a 5 -l 100000 - "${in_file}_"

for file in "${in_file}_"*
do
    tmp_file=$(mktemp "$in_file.XXXXXX")               # create a safer temp file
    head -n 1 "$in_file" | cat - "$file" > "$tmp_file" # prepend the header from the main file to the chunk
    mv -f "$tmp_file" "$file"                          # overwrite the headerless chunk with the header-bearing one
done
Differences:
- in_file is the file argument you want to split while maintaining headers
- Uses awk instead of tail due to awk having better performance
- Splits into 100,000-line files instead of 4-line files
- The split file names will be the input file name appended with an underscore and numbers (up to 99999 - from the "-d -a 5" split arguments)
- Uses mktemp to safely handle temporary files
- Uses a single head | cat line instead of two lines
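Usage might look like this, assuming the script is saved as, say, split_with_headers.sh (the script name is made up for illustration):

chmod +x split_with_headers.sh
./split_with_headers.sh bigfile.csv    # produces bigfile.csv_00000, bigfile.csv_00001, ...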
Solution 10 - Linux
Below is a 4-liner that can be used to split bigfile.csv into multiple smaller files while preserving the CSV header. It uses only standard utilities (head, split, find, grep, xargs, and sed), which are available on most *nix systems. It should also work on Windows if you install mingw-w64 / Git Bash.
csvheader=$(head -1 bigfile.csv)
split -d -l10000 bigfile.csv smallfile_
find . | grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00
Line by line explanation:
- Capture the header to a variable named csvheader
- Split the bigfile.csv into a number of smaller files with prefix smallfile_
- Find all smallfiles and insert the csvheader into the FIRST line using xargs and sed -i. Note that you need to use sed within "double quotes" in order to use variables.
- The first file named smallfile_00 will now have redundant headers on lines 1 and 2 (from the original data as well as from the sed header insert in step 3). We can remove the redundant header with sed -i '1d' command.
Solution 11 - Linux
Inspired by @Arkady's comment on a one-liner.

- MYFILE variable simply to reduce boilerplate
- split doesn't show file names, but the --additional-suffix option allows us to easily control what to expect
- removal of intermediate files via rm $part (assumes no other files with the same suffix)
MYFILE=mycsv.csv && for part in $(split -n4 --additional-suffix=foo $MYFILE; ls *foo); do cat <(head -n1 $MYFILE) $part > $MYFILE.$part; rm $part; done
Evidence:
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xaafoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xabfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040108 Jun 1 23:18 mycsv.csv.xacfoo
-rw-rw-r-- 1 ec2-user ec2-user 32040110 Jun 1 23:18 mycsv.csv.xadfoo
and of course head -2 *foo to see the header is added.
Solution 12 - Linux
A simple but maybe not as elegant way: cut off the header beforehand, split the file, and then rejoin the header to each piece with cat, or with whatever file is reading it in. So, something like:

- head -n 1 file.txt > header.txt
- tail -n +2 file.txt | split -l 100 -
- cat header.txt xaa

(split -l needs a line count; 100 is arbitrary here, and xaa is the first of split's default output names.)
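A sketch of doing the rejoin for every piece (the piece_ prefix and 100-line count are arbitrary choices, not from the original answer):

head -n 1 file.txt > header.txt
tail -n +2 file.txt | split -l 100 - piece_
for f in piece_*
do
    # the glob is expanded once, up front, so the new .csv files aren't re-processed
    cat header.txt "$f" > "$f.csv" && rm "$f"
done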