fastest way convert tab-delimited file to csv in linux

Linux Problem Overview

I have a tab-delimited file that has over 200 million lines. What's the fastest way in linux to convert this to a csv file? This file does have multiple lines of header information which I'll need to strip out down the road, but the number of lines of header is known. I have seen suggestions for sed and gawk, but I wonder if there is a "preferred" choice.

Just to clarify, there are no embedded tabs in this file.

Linux Solutions

Solution 1 - Linux

If you're worried about embedded commas then you'll need to use a slightly more intelligent method. Here's a Python script that takes TSV lines from stdin and writes CSV lines to stdout:

import sys
import csv

tabin = csv.reader(sys.stdin, dialect=csv.excel_tab)
commaout = csv.writer(sys.stdout, dialect=csv.excel)
for row in tabin:
  commaout.writerow(row)

Run it from a shell as follows:

python script.py < input.tsv > output.csv

Solution 2 - Linux

If all you need to do is translate all tab characters to comma characters, tr is probably the way to go.

The blank space here is a literal tab:

$ echo "hello   world" | tr "\\t" ","
hello,world

Of course, if you have embedded tabs inside string literals in the file, this will incorrectly translate those as well; but embedded literal tabs would be fairly uncommon.

Solution 3 - Linux

perl -lpe 's/"/""/g; s/^|$/"/g; s/\t/","/g' < input.tab > output.csv

Perl is generally faster at this sort of thing than the sed, awk, and Python.

Solution 4 - Linux

If you want to convert the whole tsv file into a csv file:
```
  $ cat data.tsv | tr "\\t" "," > data.csv
```

If you want to omit some fields:
```
  $ cat data.tsv | cut -f1,2,3 | tr "\\t" "," > data.csv
```
The above command will convert the data.tsv file to data.csv file containing only the first three fields.

Solution 5 - Linux

sed -e 's/"/\\"/g' -e 's/<tab>/","/g' -e 's/^/"/' -e 's/$/"/' infile > outfile

Damn the critics, quote everything, CSV doesn't care.

<tab> is the actual tab character. \t didn't work for me. In bash, use ^V to enter it.

Solution 6 - Linux

@ignacio-vazquez-abrams 's python solution is great! For people who are looking to parse delimiters other tab, the library actually allows you to set arbitrary delimiter. Here is my modified version to handle pipe-delimited files:

import sys
import csv

pipein = csv.reader(sys.stdin, delimiter='|')
commaout = csv.writer(sys.stdout, dialect=csv.excel)
for row in pipein:
  commaout.writerow(row)

Solution 7 - Linux

assuming you don't want to change header and assuming you don't have embedded tabs

# cat file
header  header  header
one     two     three

$ awk 'NR>1{$1=$1}1' OFS="," file
header  header  header
one,two,three

NR>1 skips the first header. you mentioned you know how many lines of header, so use the correct number for your own case. with this, you also do not need to call any other external commands. just one awk command does the job.

another way if you have blank columns and you care about that.

awk 'NR>1{gsub("\t",",")}1' file

using sed

sed '2,$y/\t/,/' file #skip 1 line header and translate (same as tr)

Solution 8 - Linux

You can also use xsv for this

xsv input -d '\t' input.tsv > output.csv

In my test on a 300MB tsv file, it was roughly 5x faster than the python solution (2.5s vs. 14s).

Solution 9 - Linux

the following awk oneliner supports quoting + quote-escaping

printf "flop\tflap\"" | awk -F '\t' '{ gsub(/"/,"\"\"\"",$i); for(i = 1; i <= NF; i++) { printf "\"%s\"",$i; if( i < NF ) printf "," }; printf "\n" }'

gives

"flop","flap""""

Solution 10 - Linux

right click file, click rename, delete the 't' and put a 'c'. I'm actually not joking, most csv parsers can handle tab delimiters. I had this issue now and for my purposes renaming worked just fine.

Solution 11 - Linux

I think it is better not to cat the file because it may create problem in the case of large file. The better way may be

$ tr ',' '\t' < csvfile.csv > tabdelimitedFile.txt

The command will get input from csvfile.csv and store the result as tab seperated in tabdelimitedFile.txt

Content Type	Original Author	Original Content on Stackoverflow
Question	andrewj	View Question on Stackoverflow
Solution 1 - Linux	Ignacio Vazquez-Abrams	View Answer on Stackoverflow
Solution 2 - Linux	Mark Rushakoff	View Answer on Stackoverflow
Solution 3 - Linux	pabs	View Answer on Stackoverflow
Solution 4 - Linux	Gopal Kumar	View Answer on Stackoverflow
Solution 5 - Linux	Will Hartung	View Answer on Stackoverflow
Solution 6 - Linux	jtlai	View Answer on Stackoverflow
Solution 7 - Linux	ghostdog74	View Answer on Stackoverflow
Solution 8 - Linux	mloughran	View Answer on Stackoverflow
Solution 9 - Linux	coderofsalvation	View Answer on Stackoverflow
Solution 10 - Linux	Neil	View Answer on Stackoverflow
Solution 11 - Linux	Mian Asbat Ahmad	View Answer on Stackoverflow