How to convert webpage into PDF by using Python

PythonHtmlPdfQprinter

Python Problem Overview


I was finding solution to print webpage into local file PDF, using Python. one of the good solution is to use Qt, found here, https://bharatikunal.wordpress.com/2010/01/.

It didn't work at the beginning as I had problem with the installation of PyQt4 because it gave error messages such as 'ImportError: No module named PyQt4.QtCore', and 'ImportError: No module named PyQt4.QtCore'.

It was because PyQt4's not installed properly. I used to have the libraries located at C:\Python27\Lib however it's not for PyQt4.

In fact, it simply needs to download from http://www.riverbankcomputing.com/software/pyqt/download (mind the correct Python version you are using), and install it to C:\Python27 (my case). That's it.

Now the scripts runs fine so I want to share it. for more options in using Qprinter, please refer to http://qt-project.org/doc/qt-4.8/qprinter.html#Orientation-enum.

Python Solutions


Solution 1 - Python

You also can use pdfkit:

Usage
import pdfkit
pdfkit.from_url('http://google.com', 'out.pdf')
Install

MacOS: brew install Caskroom/cask/wkhtmltopdf

Debian/Ubuntu: apt-get install wkhtmltopdf

Windows: choco install wkhtmltopdf

See official documentation for MacOS/Ubuntu/other OS: https://github.com/JazzCore/python-pdfkit/wiki/Installing-wkhtmltopdf

Solution 2 - Python

WeasyPrint

pip install weasyprint  # No longer supports Python 2.x.

python
>>> import weasyprint
>>> pdf = weasyprint.HTML('http://www.google.com').write_pdf()
>>> len(pdf)
92059
>>> open('google.pdf', 'wb').write(pdf)

Solution 3 - Python

thanks to below posts, and I am able to add on the webpage link address to be printed and present time on the PDF generated, no matter how many pages it has.

https://stackoverflow.com/questions/1180115/add-text-to-existing-pdf-using-python

https://github.com/disflux/django-mtr/blob/master/pdfgen/doc_overlay.py

To share the script as below:

import time
from pyPdf import PdfFileWriter, PdfFileReader
import StringIO
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from xhtml2pdf import pisa
import sys 
from PyQt4.QtCore import *
from PyQt4.QtGui import * 
from PyQt4.QtWebKit import * 

url = 'http://www.yahoo.com'
tem_pdf = "c:\\tem_pdf.pdf"
final_file = "c:\\younameit.pdf"

app = QApplication(sys.argv)
web = QWebView()
#Read the URL given
web.load(QUrl(url))
printer = QPrinter()
#setting format
printer.setPageSize(QPrinter.A4)
printer.setOrientation(QPrinter.Landscape)
printer.setOutputFormat(QPrinter.PdfFormat)
#export file as c:\tem_pdf.pdf
printer.setOutputFileName(tem_pdf)

def convertIt():
    web.print_(printer)
    QApplication.exit()

QObject.connect(web, SIGNAL("loadFinished(bool)"), convertIt)

app.exec_()
sys.exit

# Below is to add on the weblink as text and present date&time on PDF generated

outputPDF = PdfFileWriter()
packet = StringIO.StringIO()
# create a new PDF with Reportlab
can = canvas.Canvas(packet, pagesize=letter)
can.setFont("Helvetica", 9)
# Writting the new line
oknow = time.strftime("%a, %d %b %Y %H:%M")
can.drawString(5, 2, url)
can.drawString(605, 2, oknow)
can.save()

#move to the beginning of the StringIO buffer
packet.seek(0)
new_pdf = PdfFileReader(packet)
# read your existing PDF
existing_pdf = PdfFileReader(file(tem_pdf, "rb"))
pages = existing_pdf.getNumPages()
output = PdfFileWriter()
# add the "watermark" (which is the new pdf) on the existing page
for x in range(0,pages):
    page = existing_pdf.getPage(x)
    page.mergePage(new_pdf.getPage(0))
    output.addPage(page)
# finally, write "output" to a real file
outputStream = file(final_file, "wb")
output.write(outputStream)
outputStream.close()

print final_file, 'is ready.'

Solution 4 - Python

here is the one working fine:

import sys 
from PyQt4.QtCore import *
from PyQt4.QtGui import * 
from PyQt4.QtWebKit import * 

app = QApplication(sys.argv)
web = QWebView()
web.load(QUrl("http://www.yahoo.com"))
printer = QPrinter()
printer.setPageSize(QPrinter.A4)
printer.setOutputFormat(QPrinter.PdfFormat)
printer.setOutputFileName("fileOK.pdf")

def convertIt():
    web.print_(printer)
    print("Pdf generated")
    QApplication.exit()

QObject.connect(web, SIGNAL("loadFinished(bool)"), convertIt)
sys.exit(app.exec_())

Solution 5 - Python

Per this answer: How to convert webpage into PDF by using Python, the advice was to use pdfkit. You also have to install wkhtmltopdf.

If you have a local .html file, you then need to use this command:

pdfkit.from_file('test.html', 'out.pdf')

But this will throw an error if you haven't added the wkhtmltopdf executables to your system path. This was the part that tripped me up and I wanted to share.

On Windows, open your environment variables and add them to your System variables > Path like below. In my case, these .exe files were located here after I installed the wkhtmltopdf from an exe:

C:\Program Files\wkhtmltopdf\bin

environment variables

Solution 6 - Python

Here is a simple solution using QT. I found this as part of an answer to a different question on StackOverFlow. I tested it on Windows.

from PyQt4.QtGui import QTextDocument, QPrinter, QApplication

import sys
app = QApplication(sys.argv)

doc = QTextDocument()
location = "c://apython//Jim//html//notes.html"
html = open(location).read()
doc.setHtml(html)

printer = QPrinter()
printer.setOutputFileName("foo.pdf")
printer.setOutputFormat(QPrinter.PdfFormat)
printer.setPageSize(QPrinter.A4);
printer.setPageMargins (15,15,15,15,QPrinter.Millimeter);

doc.print_(printer)
print "done!"

Solution 7 - Python

I tried @NorthCat answer using pdfkit.

It required wkhtmltopdf to be installed. The install can be downloaded from here. https://wkhtmltopdf.org/downloads.html

Install the executable file. Then write a line to indicate where wkhtmltopdf is, like below. (referenced from https://stackoverflow.com/questions/27673870/cant-create-pdf-using-python-pdfkit-error-no-wkhtmltopdf-executable-found

import pdfkit


path_wkthmltopdf = "C:\\Folder\\where\\wkhtmltopdf.exe"
config = pdfkit.configuration(wkhtmltopdf = path_wkthmltopdf)

pdfkit.from_url("http://google.com", "out.pdf", configuration=config)

Solution 8 - Python

This solution worked for me using PyQt5 version 5.15.0

import sys
from PyQt5 import QtWidgets, QtWebEngineWidgets
from PyQt5.QtCore import QUrl
from PyQt5.QtGui import QPageLayout, QPageSize
from PyQt5.QtWidgets import QApplication

if __name__ == '__main__':
    app = QtWidgets.QApplication(sys.argv)
    loader = QtWebEngineWidgets.QWebEngineView()
    loader.setZoomFactor(1)
    layout = QPageLayout()
    layout.setPageSize(QPageSize(QPageSize.A4Extra))
    layout.setOrientation(QPageLayout.Portrait)
    loader.load(QUrl('https://stackoverflow.com/questions/23359083/how-to-convert-webpage-into-pdf-by-using-python'))
    loader.page().pdfPrintingFinished.connect(lambda *args: QApplication.exit())

    def emit_pdf(finished):
        loader.page().printToPdf("test.pdf", pageLayout=layout)

    loader.loadFinished.connect(emit_pdf)
    sys.exit(app.exec_())

Solution 9 - Python

If you use selenium and chromium, you do not need to manage cookies by you self, and you can generate pdf page from chromium's print as pdf. You can refer this project to realize it. https://github.com/maxvst/python-selenium-chrome-html-to-pdf-converter

modified base > https://github.com/maxvst/python-selenium-chrome-html-to-pdf-converter/blob/master/sample/html_to_pdf_converter.py

import sys
import json, base64


def send_devtools(driver, cmd, params={}):
    resource = "/session/%s/chromium/send_command_and_get_result" % driver.session_id
    url = driver.command_executor._url + resource
    body = json.dumps({'cmd': cmd, 'params': params})
    response = driver.command_executor._request('POST', url, body)
    return response.get('value')


def get_pdf_from_html(driver, url, print_options={}, output_file_path="example.pdf"):
    driver.get(url)

    calculated_print_options = {
        'landscape': False,
        'displayHeaderFooter': False,
        'printBackground': True,
        'preferCSSPageSize': True,
    }
    calculated_print_options.update(print_options)
    result = send_devtools(driver, "Page.printToPDF", calculated_print_options)
    data = base64.b64decode(result['data'])
    with open(output_file_path, "wb") as f:
        f.write(data)



# example
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://stackoverflow.com/questions/23359083/how-to-convert-webpage-into-pdf-by-using-python#"
webdriver_options = Options()
webdriver_options.add_argument("--no-sandbox")
webdriver_options.add_argument('--headless')
webdriver_options.add_argument('--disable-gpu')
driver = webdriver.Chrome(chromedriver, options=webdriver_options)
get_pdf_from_html(driver, url)
driver.quit()

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionMark KView Question on Stackoverflow
Solution 1 - PythonNorthCatView Answer on Stackoverflow
Solution 2 - PythonJohnMuddView Answer on Stackoverflow
Solution 3 - PythonMark KView Answer on Stackoverflow
Solution 4 - PythonMark KView Answer on Stackoverflow
Solution 5 - PythonJaradView Answer on Stackoverflow
Solution 6 - PythonJim PaulView Answer on Stackoverflow
Solution 7 - PythonMark KView Answer on Stackoverflow
Solution 8 - PythonY.khView Answer on Stackoverflow
Solution 9 - PythonYuanmeng XiaoView Answer on Stackoverflow