how to merge 200 csv files in Python

PythonCsvMergeConcatenation

Python Problem Overview


Guys, I here have 200 separate csv files named from SH (1) to SH (200). I want to merge them into a single csv file. How can I do it?

Python Solutions


Solution 1 - Python

As ghostdog74 said, but this time with headers:

fout=open("out.csv","a")
# first file:
for line in open("sh1.csv"):
    fout.write(line)
# now the rest:    
for num in range(2,201):
    f = open("sh"+str(num)+".csv")
    f.next() # skip the header
    for line in f:
         fout.write(line)
    f.close() # not really needed
fout.close()

Solution 2 - Python

Why can't you just sed 1d sh*.csv > merged.csv?

Sometimes you don't even have to use python!

Solution 3 - Python

Use accepted StackOverflow answer to create a list of csv files that you want to append and then run this code:

import pandas as pd
combined_csv = pd.concat( [ pd.read_csv(f) for f in filenames ] )

And if you want to export it to a single csv file, use this:

combined_csv.to_csv( "combined_csv.csv", index=False )

Solution 4 - Python

fout=open("out.csv","a")
for num in range(1,201):
    for line in open("sh"+str(num)+".csv"):
         fout.write(line)    
fout.close()

Solution 5 - Python

I'm just going to throw another code example into the basket:

from glob import glob

with open('singleDataFile.csv', 'a') as singleFile:
	for csvFile in glob('*.csv'):
		for line in open(csvFile, 'r'):
			singleFile.write(line)

Solution 6 - Python

It depends what you mean by "merging" -- do they have the same columns? Do they have headers? For example, if they all have the same columns, and no headers, simple concatenation is sufficient (open the destination file for writing, loop over the sources opening each for reading, use shutil.copyfileobj from the open-for-reading source into the open-for-writing destination, close the source, keep looping -- use the with statement to do the closing on your behalf). If they have the same columns, but also headers, you'll need a readline on each source file except the first, after you open it for reading before you copy it into the destination, to skip the headers line.

If the CSV files don't all have the same columns then you need to define in what sense you're "merging" them (like a SQL JOIN? or "horizontally" if they all have the same number of lines? etc, etc) -- it's hard for us to guess what you mean in that case.

Solution 7 - Python

Quite easy to combine all files in a directory and merge them

import glob
import csv
 

# Open result file
with open('output.txt','wb') as fout:
    wout = csv.writer(fout,delimiter=',') 
    interesting_files = glob.glob("*.csv") 
    h = True
    for filename in interesting_files: 
        print 'Processing',filename 
        # Open and process file
        with open(filename,'rb') as fin:
            if h:
                h = False
            else:
                fin.next()#skip header
            for line in csv.reader(fin,delimiter=','):
                wout.writerow(line)

Solution 8 - Python

A slight change to the code above as it does not actually work correctly.

It should be as follows...

from glob import glob

with open('main.csv', 'a') as singleFile:
    for csv in glob('*.csv'):
        if csv == 'main.csv':
            pass
        else:
            for line in open(csv, 'r'):
                singleFile.write(line)

Solution 9 - Python

If you are working on linux/mac you can do this.

from subprocess import call
script="cat *.csv>merge.csv"
call(script,shell=True)

Solution 10 - Python

If the merged CSV is going to be used in Python then just use http://docs.python.org/library/glob.html">`glob`</a> to get a list of the files to pass to http://docs.python.org/library/fileinput.html#fileinput.input">`fileinput.input()`</a> via the files argument, then use the http://docs.python.org/library/csv.html">`csv`</a> module to read it all in one go.

Solution 11 - Python

OR, you could just do

cat sh*.csv > merged.csv

Solution 12 - Python

Over the solution that made @Adders and later on improved by @varun, I implemented some little improvement too leave the whole merged CSV with only the main header:

from glob import glob

filename = 'main.csv'

with open(filename, 'a') as singleFile:
    first_csv = True
    for csv in glob('*.csv'):
        if csv == filename:
            pass
        else:
            header = True
            for line in open(csv, 'r'):
                if first_csv and header:
                    singleFile.write(line)
                    first_csv = False
                    header = False
                elif header:
                    header = False
                else:
                    singleFile.write(line)
    singleFile.close()

Best regards!!!

Solution 13 - Python

You can simply use the in-built csv library. This solution will work even if some of your CSV files have slightly different column names or headers, unlike the other top-voted answers.

import csv
import glob


filenames = [i for i in glob.glob("SH*.csv")]
header_keys = []
merged_rows = []

for filename in filenames:
    with open(filename) as f:
        reader = csv.DictReader(f)
        merged_rows.extend(list(reader))
        header_keys.extend([key for key in reader.fieldnames if key not in header_keys])

with open("combined.csv", "w") as f:
    w = csv.DictWriter(f, fieldnames=header_keys)
    w.writeheader()
    w.writerows(merged_rows)

The merged file will contain all possible columns (header_keys) that can be found in the files. Any absent columns in a file would be rendered as blank / empty (but preserving rest of the file's data).

Note:

  • This won't work if your CSV files have no headers. In that case you can still use the csv library, but instead of using DictReader & DictWriter, you'll have to work with the basic reader & writer.
  • This may run into issues when you are dealing with massive data since the entirety of the content is being store in memory (merged_rows list).

Solution 14 - Python

You could import csv then loop through all the CSV files reading them into a list. Then write the list back out to disk.

import csv

rows = []

for f in (file1, file2, ...):
    reader = csv.reader(open("f", "rb"))

    for row in reader:
        rows.append(row)

writer = csv.writer(open("some.csv", "wb"))
writer.writerows("\n".join(rows))

The above is not very robust as it has no error handling nor does it close any open files. This should work whether or not the the individual files have one or more rows of CSV data in them. Also I did not run this code, but it should give you an idea of what to do.

Solution 15 - Python

I modified what @wisty said to be worked with python 3.x, for those of you that have encoding problem, also I use os module to avoid of hard coding

import os 
def merge_all():
    dir = os.chdir('C:\python\data\\')
    fout = open("merged_files.csv", "ab")
    # first file:
    for line in open("file_1.csv",'rb'):
        fout.write(line)
    # now the rest:
    list = os.listdir(dir)
    number_files = len(list)
    for num in range(2, number_files):
        f = open("file_" + str(num) + ".csv", 'rb')
        f.__next__()  # skip the header
        for line in f:
            fout.write(line)
        f.close()  # not really needed
    fout.close()

Solution 16 - Python

Here is a script:

  • Concatenating csv files named SH1.csv to SH200.csv

  • Keeping the headers

import glob
import re

# Looking for filenames like 'SH1.csv' ... 'SH200.csv'
pattern = re.compile("^SH([1-9]|[1-9][0-9]|1[0-9][0-9]|200).csv$")
file_parts = [name for name in glob.glob('*.csv') if pattern.match(name)]

with open("file_merged.csv","wb") as file_merged:
    for (i, name) in enumerate(file_parts):
        with open(name, "rb") as file_part:
            if i != 0:
                next(file_part) # skip headers if not first file
            file_merged.write(file_part.read())

Solution 17 - Python

Updating wisty's answer for python3

fout=open("out.csv","a")
# first file:
for line in open("sh1.csv"):
    fout.write(line)
# now the rest:    
for num in range(2,201):
    f = open("sh"+str(num)+".csv")
    next(f) # skip the header
    for line in f:
         fout.write(line)
    f.close() # not really needed
fout.close()

Solution 18 - Python

Let's say you have 2 csv files like these:

csv1.csv:

id,name
1,Armin
2,Sven

csv2.csv:

id,place,year
1,Reykjavik,2017
2,Amsterdam,2018
3,Berlin,2019

and you want the result to be like this csv3.csv:

id,name,place,year
1,Armin,Reykjavik,2017
2,Sven,Amsterdam,2018
3,,Berlin,2019

Then you can use the following snippet to do that:

import csv
import pandas as pd

# the file names
f1 = "csv1.csv"
f2 = "csv2.csv"
out_f = "csv3.csv"

# read the files
df1 = pd.read_csv(f1)
df2 = pd.read_csv(f2)

# get the keys
keys1 = list(df1)
keys2 = list(df2)

# merge both files
for idx, row in df2.iterrows():
	data = df1[df1['id'] == row['id']]

	# if row with such id does not exist, add the whole row
	if data.empty:
		next_idx = len(df1)
		for key in keys2:
			df1.at[next_idx, key] = df2.at[idx, key]

	# if row with such id exists, add only the missing keys with their values
	else:
		i = int(data.index[0])
		for key in keys2:
			if key not in keys1:
				df1.at[i, key] = df2.at[idx, key]

# save the merged files
df1.to_csv(out_f, index=False, encoding='utf-8', quotechar="", quoting=csv.QUOTE_NONE)

With the help of a loop you can achieve the same result for multiple files as it is in your case (200 csv files).

Solution 19 - Python

If the files aren't numbered in order, take the hassle-free approach below: Python 3.6 on windows machine:

import pandas as pd
from glob import glob

interesting_files = glob("C:/temp/*.csv") # it grabs all the csv files from the directory you mention here

df_list = []
for filename in sorted(interesting_files):

df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)

# save the final file in same/different directory:
full_df.to_csv("C:/temp/merged_pandas.csv", index=False)

Solution 20 - Python

An easy-to-use function:

def csv_merge(destination_path, *source_paths):
'''
Merges all csv files on source_paths to destination_path.
:param destination_path: Path of a single csv file, doesn't need to exist
:param source_paths: Paths of csv files to be merged into, needs to exist
:return: None
'''
with open(destination_path,"a") as dest_file:
    with open(source_paths[0]) as src_file:
        for src_line in src_file.read():
            dest_file.write(src_line)
    source_paths.pop(0)
    for i in range(len(source_paths)):
        with open(source_paths[i]) as src_file:
            src_file.next()
            for src_line in src_file:
                 dest_file.write(src_line)

Solution 21 - Python

import pandas as pd
import os

df = pd.read_csv("e:\\data science\\kaggle assign\\monthly sales\\Pandas-Data-Science-Tasks-master\\SalesAnalysis\\Sales_Data\\Sales_April_2019.csv")
files = [file for file in  os.listdir("e:\\data science\\kaggle assign\\monthly sales\\Pandas-Data-Science-Tasks-master\\SalesAnalysis\\Sales_Data")
for file in files:
    print(file)

all_data = pd.DataFrame()
for file in files:
    df=pd.read_csv("e:\\data science\\kaggle assign\\monthly sales\\Pandas-Data-Science-Tasks-master\\SalesAnalysis\\Sales_Data\\"+file)
    all_data = pd.concat([all_data,df])
    all_data.head()

Solution 22 - Python

I have done it by implementing a function that expect output file and paths of the input files. The function copy the file content of the first file into the output file and then does the same for the rest of input files but without the header line.

def concat_files_with_header(output_file, *paths):
    for i, path in enumerate(paths):
        with open(path) as input_file:
            if i > 0:
                next(input_file)  # Skip header
            output_file.writelines(input_file)

Usage example of the function:

if __name__ == "__main__":
    paths = [f"sh{i}.csv" for i in range(1, 201)]
    with open("output.csv", "w") as output_file:
        concat_files_with_header(output_file, *paths)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionChuckView Question on Stackoverflow
Solution 1 - PythonwistyView Answer on Stackoverflow
Solution 2 - PythonblinsayView Answer on Stackoverflow
Solution 3 - PythonscottlittleView Answer on Stackoverflow
Solution 4 - Pythonghostdog74View Answer on Stackoverflow
Solution 5 - PythonNorfeldtView Answer on Stackoverflow
Solution 6 - PythonAlex MartelliView Answer on Stackoverflow
Solution 7 - PythonvarunView Answer on Stackoverflow
Solution 8 - PythonAddersView Answer on Stackoverflow
Solution 9 - PythonKondalarao VView Answer on Stackoverflow
Solution 10 - PythonIgnacio Vazquez-AbramsView Answer on Stackoverflow
Solution 11 - PythonNanashi No GombeView Answer on Stackoverflow
Solution 12 - PythonFabián Miranda MuñozView Answer on Stackoverflow
Solution 13 - Pythonshad0w_wa1k3rView Answer on Stackoverflow
Solution 14 - PythoncnobileView Answer on Stackoverflow
Solution 15 - PythonMaryam PashmiView Answer on Stackoverflow
Solution 16 - Pythonx0sView Answer on Stackoverflow
Solution 17 - Pythonishandutta2007View Answer on Stackoverflow
Solution 18 - Pythontsveti_ikoView Answer on Stackoverflow
Solution 19 - PythonAzadeh FeizpourView Answer on Stackoverflow
Solution 20 - PythonKen ShibataView Answer on Stackoverflow
Solution 21 - PythonSunit DeogamView Answer on Stackoverflow
Solution 22 - PythonDor MeiriView Answer on Stackoverflow