Merge PDF files

PythonPdfFile IoPypdf2Pypdf

Python Problem Overview


Is it possible, using Python, to merge separate PDF files?

Assuming so, I need to extend this a little further. I am hoping to loop through folders in a directory and repeat this procedure.

And I may be pushing my luck, but is it possible to exclude a page that is contained in each of the PDFs (my report generation always creates an extra blank page).

Python Solutions


Solution 1 - Python

You can use PyPdf2s PdfMerger class.

File Concatenation

You can simply concatenate files by using the append method.

from PyPDF2 import PdfFileMerger

pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf', 'file4.pdf']

merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(pdf)

merger.write("result.pdf")
merger.close()

You can pass file handles instead file paths if you want.

File Merging

If you want more fine grained control of merging there is a merge method of the PdfMerger, which allows you to specify an insertion point in the output file, meaning you can insert the pages anywhere in the file. The append method can be thought of as a merge where the insertion point is the end of the file.

e.g.

merger.merge(2, pdf)

Here we insert the whole pdf into the output but at page 2.

Page Ranges

If you wish to control which pages are appended from a particular file, you can use the pages keyword argument of append and merge, passing a tuple in the form (start, stop[, step]) (like the regular range function).

e.g.

merger.append(pdf, pages=(0, 3))    # first 3 pages
merger.append(pdf, pages=(0, 6, 2)) # pages 1,3, 5

If you specify an invalid range you will get an IndexError.

Note: also that to avoid files being left open, the PdfFileMergers close method should be called when the merged file has been written. This ensures all files are closed (input and output) in a timely manner. It's a shame that PdfFileMerger isn't implemented as a context manager, so we can use the with keyword, avoid the explicit close call and get some easy exception safety.

You might also want to look at the pdfcat script provided as part of pypdf2. You can potentially avoid the need to write code altogether.

The PyPdf2 github also includes some example code demonstrating merging.

PyMuPdf

Another library perhaps worth a look is PyMuPdf. Merging is equally simple.

From command line:

python -m fitz join -o result.pdf file1.pdf file2.pdf file3.pdf

and from code

import fitz

result = fitz.open()

for pdf in ['file1.pdf', 'file2.pdf', 'file3.pdf']:
    with fitz.open(pdf) as mfile:
        result.insertPDF(mfile)
    
result.save("result.pdf")

With plenty of options, detailed in the projects wiki.

Solution 2 - Python

Use Pypdf or its successor PyPDF2:

> A Pure-Python library built as a PDF toolkit. It is capable of:
> * splitting documents page by page,
> * merging documents page by page,

(and much more)

Here's a sample program that works with both versions.

#!/usr/bin/env python
import sys
try:
    from PyPDF2 import PdfFileReader, PdfFileWriter
except ImportError:
    from pyPdf import PdfFileReader, PdfFileWriter

def pdf_cat(input_files, output_stream):
    input_streams = []
    try:
        # First open all the files, then produce the output file, and
        # finally close the input files. This is necessary because
        # the data isn't read from the input files until the write
        # operation. Thanks to
        # https://stackoverflow.com/questions/6773631/problem-with-closing-python-pypdf-writing-getting-a-valueerror-i-o-operation/6773733#6773733
        for input_file in input_files:
            input_streams.append(open(input_file, 'rb'))
        writer = PdfFileWriter()
        for reader in map(PdfFileReader, input_streams):
            for n in range(reader.getNumPages()):
                writer.addPage(reader.getPage(n))
        writer.write(output_stream)
    finally:
        for f in input_streams:
            f.close()
        output_stream.close()

if __name__ == '__main__':
    if sys.platform == "win32":
        import os, msvcrt
        msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
    pdf_cat(sys.argv[1:], sys.stdout)

Solution 3 - Python

Merge all pdf files that are present in a dir

Put the pdf files in a dir. Launch the program. You get one pdf with all the pdfs merged.

import os
from PyPDF2 import PdfFileMerger

x = [a for a in os.listdir() if a.endswith(".pdf")]

merger = PdfFileMerger()

for pdf in x:
    merger.append(open(pdf, 'rb'))

with open("result.pdf", "wb") as fout:
    merger.write(fout)

How would I make the same code above today

from glob import glob
from PyPDF2 import PdfFileMerger



def pdf_merge():
	''' Merges all the pdf files in current directory '''
	merger = PdfFileMerger()
	allpdfs = [a for a in glob("*.pdf")]
	[merger.append(pdf) for pdf in allpdfs]
	with open("Merged_pdfs.pdf", "wb") as new_file:
	    merger.write(new_file)


if __name__ == "__main__":
	pdf_merge()

Solution 4 - Python

The pdfrw library can do this quite easily, assuming you don't need to preserve bookmarks and annotations, and your PDFs aren't encrypted. cat.py is an example concatenation script, and subset.py is an example page subsetting script.

The relevant part of the concatenation script -- assumes inputs is a list of input filenames, and outfn is an output file name:

from pdfrw import PdfReader, PdfWriter

writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)

As you can see from this, it would be pretty easy to leave out the last page, e.g. something like:

    writer.addpages(PdfReader(inpfn).pages[:-1])

Disclaimer: I am the primary pdfrw author.

Solution 5 - Python

Is it possible, using Python, to merge seperate PDF files?

Yes.

The following example merges all files in one folder to a single new PDF file:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from argparse import ArgumentParser
from glob import glob
from pyPdf import PdfFileReader, PdfFileWriter
import os

def merge(path, output_filename):
    output = PdfFileWriter()

    for pdffile in glob(path + os.sep + '*.pdf'):
        if pdffile == output_filename:
            continue
        print("Parse '%s'" % pdffile)
        document = PdfFileReader(open(pdffile, 'rb'))
        for i in range(document.getNumPages()):
            output.addPage(document.getPage(i))

    print("Start writing '%s'" % output_filename)
    with open(output_filename, "wb") as f:
        output.write(f)

if __name__ == "__main__":
    parser = ArgumentParser()

    # Add more options if you like
    parser.add_argument("-o", "--output",
                        dest="output_filename",
                        default="merged.pdf",
                        help="write merged PDF to FILE",
                        metavar="FILE")
    parser.add_argument("-p", "--path",
                        dest="path",
                        default=".",
                        help="path of source PDF files")

    args = parser.parse_args()
    merge(args.path, args.output_filename)

Solution 6 - Python

from PyPDF2 import PdfFileMerger
import webbrowser
import os
dir_path = os.path.dirname(os.path.realpath(__file__))

def list_files(directory, extension):
    return (f for f in os.listdir(directory) if f.endswith('.' + extension))

pdfs = list_files(dir_path, "pdf")

merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(open(pdf, 'rb'))

with open('result.pdf', 'wb') as fout:
    merger.write(fout)

webbrowser.open_new('file://'+ dir_path + '/result.pdf')

Git Repo: https://github.com/mahaguru24/Python_Merge_PDF.git

Solution 7 - Python

here, http://pieceofpy.com/2009/03/05/concatenating-pdf-with-python/, gives an solution.

similarly:

from pyPdf import PdfFileWriter, PdfFileReader

def append_pdf(input,output):
    [output.addPage(input.getPage(page_num)) for page_num in range(input.numPages)]

output = PdfFileWriter()

append_pdf(PdfFileReader(file("C:\\sample.pdf","rb")),output)
append_pdf(PdfFileReader(file("c:\\sample1.pdf","rb")),output)
append_pdf(PdfFileReader(file("c:\\sample2.pdf","rb")),output)
append_pdf(PdfFileReader(file("c:\\sample3.pdf","rb")),output)

    output.write(file("c:\\combined.pdf","wb"))

Solution 8 - Python

You can use pikepdf too (source code documentation).

Example code could be (taken from the documentation):

from glob import glob

from pikepdf import Pdf

pdf = Pdf.new()

for file in glob('*.pdf'):  # you can change this to browse directories recursively
    with Pdf.open(file) as src:
        pdf.pages.extend(src.pages)

pdf.save('merged.pdf')
pdf.close()

If you want to exclude pages, you might proceed another way, for instance copying pages to a new pdf (you can select which ones you do not copy, then, the pdf.pages object behaving like a list).

It is still actively maintained, which, as of february 2022, does not seem to be the case of PyPDF2 nor pdfrw.

I haven't benchmarked it, so I don't know if it is quicker or slower than other solutions.

One advantage over PyMuPDF, in my case, is that an official Ubuntu package is available (python3-pikepdf), what is practical to package my own software depending on it.

Solution 9 - Python

A slight variation using a dictionary for greater flexibility (e.g. sort, dedup):

import os
from PyPDF2 import PdfFileMerger
# use dict to sort by filepath or filename
file_dict = {}
for subdir, dirs, files in os.walk("<dir>"):
    for file in files:
        filepath = subdir + os.sep + file
        # you can have multiple endswith
        if filepath.endswith((".pdf", ".PDF")):
            file_dict[file] = filepath
# use strict = False to ignore PdfReadError: Illegal character error
merger = PdfFileMerger(strict=False)

for k, v in file_dict.items():
    print(k, v)
    merger.append(v)

merger.write("combined_result.pdf")

Solution 10 - Python

You can use PdfFileMerger from the PyPDF2 module.

For example, to merge multiple PDF files from a list of paths you can use the following function:

from PyPDF2 import PdfFileMerger

# pass the path of the output final file.pdf and the list of paths
def merge_pdf(out_path: str, extracted_files: list [str]):
    merger   = PdfFileMerger()
    
    for pdf in extracted_files:
        merger.append(pdf)

    merger.write(out_path)
    merger.close()

merge_pdf('./final.pdf', extracted_files)

And this function to get all the files recursively from a parent folder:

import os

# pass the path of the parent_folder
def fetch_all_files(parent_folder: str):
    target_files = []
    for path, subdirs, files in os.walk(parent_folder):
        for name in files:
            target_files.append(os.path.join(path, name))
    return target_files 

# get a list of all the paths of the pdf
extracted_files = fetch_all_files('./parent_folder')

Finally, you use the two functions declaring.a parent_folder_path that can contain multiple documents, and an output_pdf_path for the destination of the merged PDF:

# get a list of all the paths of the pdf
parent_folder_path = './parent_folder'
outup_pdf_path     = './final.pdf'

extracted_files = fetch_all_files(parent_folder_path)
merge_pdf(outup_pdf_path, extracted_files)

You can get the full code from here (Source): How to merge PDF documents using Python

Solution 11 - Python

I used pdf unite on the linux terminal by leveraging subprocess (assumes one.pdf and two.pdf exist on the directory) and the aim is to merge them to three.pdf

 import subprocess
 subprocess.call(['pdfunite one.pdf two.pdf three.pdf'],shell=True)

Solution 12 - Python

The answer from Giovanni G. PY in an easily usable way (at least for me):

import os
from PyPDF2 import PdfFileMerger

def merge_pdfs(export_dir, input_dir, folder):
    current_dir = os.path.join(input_dir, folder)
    pdfs = os.listdir(current_dir)
    
    merger = PdfFileMerger()
    for pdf in pdfs:
        merger.append(open(os.path.join(current_dir, pdf), 'rb'))

    with open(os.path.join(export_dir, folder + ".pdf"), "wb") as fout:
        merger.write(fout)

export_dir = r"E:\Output"
input_dir = r"E:\Input"
folders = os.listdir(input_dir)
[merge_pdfs(export_dir, input_dir, folder) for folder in folders];

Solution 13 - Python

Here's a time comparison for the most common answers for my specific use case: combining a list of 5 large single-page pdf files. I ran each test twice.

(Disclaimer: I ran this function within Flask, your mileage may vary)

TL;DR

pdfrw is the fastest library for combining pdfs out of the 3 I tested.

PyPDF2

start = time.time()
merger = PdfFileMerger()
for pdf in all_pdf_obj:
    merger.append(
        os.path.join(
            os.getcwd(), pdf.filename # full path
                )
            )
formatted_name = f'Summary_Invoice_{date.today()}.pdf'
merge_file = os.path.join(os.getcwd(), formatted_name)
merger.write(merge_file)
merger.close()
end = time.time()
print(end - start) #1 66.50084733963013 #2 68.2995400428772

PyMuPDF

start = time.time()
result = fitz.open()

for pdf in all_pdf_obj:
    with fitz.open(os.path.join(os.getcwd(), pdf.filename)) as mfile:
        result.insertPDF(mfile)
formatted_name = f'Summary_Invoice_{date.today()}.pdf'

result.save(formatted_name)
end = time.time()
print(end - start) #1 2.7166640758514404 #2 1.694727897644043

pdfrw

start = time.time()
result = fitz.open()

writer = PdfWriter()
for pdf in all_pdf_obj:
    writer.addpages(PdfReader(os.path.join(os.getcwd(), pdf.filename)).pages)

formatted_name = f'Summary_Invoice_{date.today()}.pdf'
writer.write(formatted_name)
end = time.time()
print(end - start) #1 0.6040127277374268 #2 0.9576816558837891

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionBtibert3View Question on Stackoverflow
Solution 1 - PythonPaul RooneyView Answer on Stackoverflow
Solution 2 - PythonGilles 'SO- stop being evil'View Answer on Stackoverflow
Solution 3 - PythonPythonProgrammiView Answer on Stackoverflow
Solution 4 - PythonPatrick MaupinView Answer on Stackoverflow
Solution 5 - PythonMartin ThomaView Answer on Stackoverflow
Solution 6 - Pythonguruprasad mulayView Answer on Stackoverflow
Solution 7 - PythonMark KView Answer on Stackoverflow
Solution 8 - PythonzezolloView Answer on Stackoverflow
Solution 9 - PythonOgaga UzohView Answer on Stackoverflow
Solution 10 - PythonDomenico RuggianoView Answer on Stackoverflow
Solution 11 - Pythonuser8291021View Answer on Stackoverflow
Solution 12 - PythonfaysouView Answer on Stackoverflow
Solution 13 - PythonpolymathView Answer on Stackoverflow