How to get pdf filename with Python requests?

Python, Pdf, Python Requests, Filenames

Python Problem Overview


I'm using the Python requests lib to get a PDF file from the web. This works fine, but I now also want the original filename. If I go to a PDF file in Firefox and click download it already has a filename defined to save the pdf. How do I get this filename?

For example:

import requests
r = requests.get('http://www.researchgate.net/profile/M_Gotic/publication/260197848_Mater_Sci_Eng_B47_%281997%29_33/links/0c9605301e48beda0f000000.pdf')
print(r.headers['content-type'])  # prints 'application/pdf'

I checked r.headers for anything interesting, but there's no filename in there. I was actually hoping for something like r.filename.

Does anybody know how I can get the filename of a downloaded PDF file with the requests library?

Python Solutions


Solution 1 - Python

The filename is specified in the Content-Disposition HTTP header. To extract the name you would do:

import re
d = r.headers['content-disposition']
fname = re.findall("filename=(.+)", d)[0]

The name is extracted from the header value with a regular expression (the re module).
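
One caveat with a pattern like filename=(.+) is that it keeps any quotes the server puts around the name, and anything that follows it in the header. As a hedged alternative, here is a minimal sketch that lets the standard library's email.message.Message parse the header value instead. The function name filename_from_content_disposition and the sample header strings are illustrative assumptions, not part of requests. With requests you would pass it r.headers.get('content-disposition').

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    if not header_value:
        # The header is optional, so the server may not send it at all.
        return None
    msg = Message()
    msg["Content-Disposition"] = header_value
    # get_filename() strips surrounding quotes and returns None
    # when the header carries no filename parameter.
    return msg.get_filename()


# Hypothetical header values, for illustration only.
print(filename_from_content_disposition('attachment; filename="report.pdf"'))
print(filename_from_content_disposition("inline"))
```

The first call prints report.pdf without the quotes, and the second prints None because there is no filename parameter to return.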

Solution 2 - Python

Building on some of the other answers, here's how I do it. If there isn't a Content-Disposition header, I take the filename from the download URL instead:

import re
import requests
from requests.exceptions import RequestException


url = 'http://www.example.com/downloads/sample.pdf'

try:
    with requests.get(url) as r:

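        # Prefer the name the server sends; fall back to the last URL segment.
        # re.search avoids an IndexError when the header has no filename part.
        match = re.search("filename=(.+)", r.headers.get("Content-Disposition", ""))
        fname = match.group(1) if match else url.split("/")[-1]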

        print(fname)
except RequestException as e:
    print(e)

There are arguably better ways of parsing the URL string, but for simplicity I didn't want to involve any more libraries.
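
For the URL fallback, one possible refinement stays within the standard library, so no extra dependency is needed. The sketch below uses urllib.parse and posixpath so that query strings, fragments, and percent-encoding do not end up in the name. The function name filename_from_url and the example URLs are assumptions made for illustration.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, percent-decoded."""
    # urlsplit separates the path from the query string and fragment,
    # so "?token=..." or "#page=2" never leaks into the filename.
    path = urlsplit(url).path
    # basename() returns an empty string when the path ends in a slash.
    return unquote(posixpath.basename(path))


print(filename_from_url("http://www.example.com/downloads/sample.pdf?token=abc"))
print(filename_from_url("http://www.example.com/files/my%20report.pdf"))
```

An empty result means the URL has no usable last segment, in which case the caller has to pick a default name itself.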

Solution 3 - Python

Apparently, for this particular resource it is in:

r.headers['content-disposition']

I don't know whether it is always the case, though.

Solution 4 - Python

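An easy Python 3 implementation to get the filename from Content-Disposition:

import requests
response = requests.get("<your-url>")  # substitute the URL of the file
# Raises AttributeError if the server sends no Content-Disposition header.
print(response.headers.get("Content-Disposition").split("filename=")[1])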

Solution 5 - Python

You can use werkzeug's parse_options_header to parse headers that carry options, such as Content-Disposition: https://werkzeug.palletsprojects.com/en/0.15.x/http/#werkzeug.http.parse_options_header

>>> from werkzeug.http import parse_options_header
>>> parse_options_header('text/html; charset=utf8')
('text/html', {'charset': 'utf8'})

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | kramer65 | View Question on Stackoverflow
Solution 1 - Python | Eugene V | View Answer on Stackoverflow
Solution 2 - Python | Nilpo | View Answer on Stackoverflow
Solution 3 - Python | Maksim Solovjov | View Answer on Stackoverflow
Solution 4 - Python | Akhilesh Joshi | View Answer on Stackoverflow
Solution 5 - Python | myildirim | View Answer on Stackoverflow