Use python requests to download CSV
Python Problem Overview
Here is my code:

import csv
import requests

with requests.Session() as s:
    s.post(url, data=payload)
    download = s.get('url that directly download a csv report')

This gives me access to the CSV file. I tried different methods to deal with the download:
This gives me the CSV file in one string:

print download.content
This prints the first row and then raises an error: _csv.Error: new-line character seen in unquoted field

cr = csv.reader(download, dialect=csv.excel_tab)
for row in cr:
    print row
This prints one letter per row and never prints the whole thing:

cr = csv.reader(download.content, dialect=csv.excel_tab)
for row in cr:
    print row
My question is: what is the most efficient way to read the CSV file in this situation, and how do I download it?

Thanks.
Python Solutions
Solution 1 - Python
This should help:
import csv
import requests

CSV_URL = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'

with requests.Session() as s:
    download = s.get(CSV_URL)

    decoded_content = download.content.decode('utf-8')

    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    my_list = list(cr)
    for row in my_list:
        print(row)
Output sample:
['street', 'city', 'zip', 'state', 'beds', 'baths', 'sq__ft', 'type', 'sale_date', 'price', 'latitude', 'longitude']
['3526 HIGH ST', 'SACRAMENTO', '95838', 'CA', '2', '1', '836', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '59222', '38.631913', '-121.434879']
['51 OMAHA CT', 'SACRAMENTO', '95823', 'CA', '3', '1', '1167', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '68212', '38.478902', '-121.431028']
['2796 BRANCH ST', 'SACRAMENTO', '95815', 'CA', '2', '1', '796', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '68880', '38.618305', '-121.443839']
['2805 JANETTE WAY', 'SACRAMENTO', '95815', 'CA', '2', '1', '852', 'Residential', 'Wed May 21 00:00:00 EDT 2008', '69307', '38.616835', '-121.439146']
[...]
Related question with answer: https://stackoverflow.com/a/33079644/295246
Edit: Other answers are useful if you need to download large files (i.e. with stream=True).
Solution 2 - Python
To simplify these answers, and to increase performance when downloading a large file, the code below may work a bit more efficiently.
import csv
from codecs import iterdecode
from contextlib import closing

import requests

url = "http://download-and-process-csv-efficiently/python.csv"

with closing(requests.get(url, stream=True)) as r:
    reader = csv.reader(iterdecode(r.iter_lines(), 'utf-8'),
                        delimiter=',', quotechar='"')
    for row in reader:
        print(row)
By setting stream=True in the GET request, and passing r.iter_lines() (decoded on the fly by iterdecode) to csv.reader(), we are handing csv.reader() a generator. This lets csv.reader() lazily iterate over each line in the response with for row in reader.
This avoids loading the entire file into memory before we start processing it, drastically reducing memory overhead for large files.
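As a usage sketch of the same streaming pattern (my example, not part of this answer; it reuses the Sacramento sample file and its 'type' column from Solution 1), you can aggregate rows on the fly without ever building a list:

import csv
from codecs import iterdecode
from contextlib import closing

import requests

CSV_URL = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'

with closing(requests.get(CSV_URL, stream=True)) as r:
    reader = csv.DictReader(iterdecode(r.iter_lines(), 'utf-8'))
    # Only one decoded row is held in memory at a time.
    residential = sum(1 for row in reader if row['type'] == 'Residential')

print(residential)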
Solution 3 - Python
I like the answers from The Aelfinn and aheld. I can improve them only by shortening a bit more, removing superfluous pieces, using a real data source, making it 2.x & 3.x-compatible, and maintaining the high level of memory efficiency seen elsewhere:
import csv
import requests

CSV_URL = 'http://web.cs.wpi.edu/~cs1004/a16/Resources/SacramentoRealEstateTransactions.csv'

with requests.get(CSV_URL, stream=True) as r:
    lines = (line.decode('utf-8') for line in r.iter_lines())
    for row in csv.reader(lines):
        print(row)
Too bad 3.x is less flexible CSV-wise, because the iterator must emit Unicode strings (while requests returns bytes), whereas the 2.x-only version, for row in csv.reader(r.iter_lines()):, is more Pythonic (shorter and easier to read). Anyhow, note that the 2.x/3.x solution above won't handle the situation described by the OP, where a NEWLINE is found unquoted in the data read.
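If you do hit that case, one option (a sketch of mine, not part of this answer, and it buffers the whole response in memory) is to hand the csv module the full decoded text through io.StringIO with newline='', so it applies its own newline handling, much like Solution 5 further down:

import csv
import io

import requests

CSV_URL = 'http://web.cs.wpi.edu/~cs1004/a16/Resources/SacramentoRealEstateTransactions.csv'

r = requests.get(CSV_URL)
# newline='' is what the csv docs recommend for file-like objects: the reader
# then handles '\r\n' endings and newlines inside quoted fields itself.
for row in csv.reader(io.StringIO(r.content.decode('utf-8'), newline='')):
    print(row)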
For the part of the OP's question regarding downloading (vs. processing) the actual CSV file, here's another script that does that, 2.x & 3.x-compatible, minimal, readable, and memory-efficient:
import os
import requests

CSV_URL = 'http://web.cs.wpi.edu/~cs1004/a16/Resources/SacramentoRealEstateTransactions.csv'

with open(os.path.split(CSV_URL)[1], 'wb') as f, \
        requests.get(CSV_URL, stream=True) as r:
    for line in r.iter_lines():
        f.write(line + '\n'.encode())
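If you only need a byte-for-byte copy of the file, a variation of mine (not part of this answer) uses iter_content() instead of iter_lines(), which preserves the server's original line endings exactly:

import os

import requests

CSV_URL = 'http://web.cs.wpi.edu/~cs1004/a16/Resources/SacramentoRealEstateTransactions.csv'

with open(os.path.split(CSV_URL)[1], 'wb') as f, \
        requests.get(CSV_URL, stream=True) as r:
    for chunk in r.iter_content(chunk_size=8192):
        # Raw chunks are written untouched, so the file on disk is
        # byte-identical to what the server sent.
        f.write(chunk)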
Solution 4 - Python
You can also use DictReader to iterate dictionaries of {'columnname': 'value', ...}:

import csv
import requests

response = requests.get('http://example.test/foo.csv')
# Decode the byte lines first so this also works on Python 3,
# where the csv module expects text rather than bytes.
lines = (line.decode('utf-8') for line in response.iter_lines())
reader = csv.DictReader(lines)
for record in reader:
    print(record)
Solution 5 - Python
I use this code (I use Python 3):
import csv
import io
import requests
url = "http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv"
r = requests.get(url)
r.encoding = 'utf-8' # useful if encoding is not sent (or not sent properly) by the server
csvio = io.StringIO(r.text, newline="")
data = []
for row in csv.DictReader(csvio):
    data.append(row)
Solution 6 - Python
From a little searching, I understand that the file should be opened in universal newline mode, which you cannot do directly with the response content (I guess).
To finish the task, you can either save the downloaded content to a temporary file, or process it in memory.
Save as file:
import requests
import csv
import os

temp_file_name = 'temp_csv.csv'
url = 'http://url.to/file.csv'
download = requests.get(url)

with open(temp_file_name, 'w') as temp_file:
    temp_file.write(download.content)

with open(temp_file_name, 'rU') as temp_file:
    csv_reader = csv.reader(temp_file, dialect=csv.excel_tab)
    for line in csv_reader:
        print line

# delete the temp file after processing
os.remove(temp_file_name)
In memory:
(To be updated)
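The original answer left the in-memory branch unfinished; purely as a hedged sketch of mine (Python 3, reusing the io.StringIO idea from Solution 5), it could look like this:

import csv
import io

import requests

url = 'http://url.to/file.csv'
download = requests.get(url)

# Wrap the decoded text in a file-like object; newline='' gives the csv
# module the universal-newline handling the temp file provided above.
csv_reader = csv.reader(io.StringIO(download.text, newline=''), dialect=csv.excel_tab)
for line in csv_reader:
    print(line)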
Solution 7 - Python
To convert the downloaded content to a Pandas DataFrame:

import pandas as pd
from io import StringIO

# download is the requests response from the accepted answer above
text = StringIO(download.content.decode('utf-8'))
df = pd.read_csv(text)
Solution 8 - Python
You can update the accepted answer with the iter_lines method of requests if the file is very large:

import csv
import requests

CSV_URL = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'

with requests.Session() as s:
    # stream=True so the body is not pulled into memory up front
    download = s.get(CSV_URL, stream=True)

    # iter_lines() yields bytes here, so decode each line before csv sees it
    line_iterator = (x.decode('utf-8') for x in download.iter_lines())
    cr = csv.reader(line_iterator, delimiter=',')
    for row in cr:
        print(row)
Solution 9 - Python
I used the solution below (unfortunately, the others did not work for me):
import pandas as pd
df = pd.read_csv('http://.../file.csv')
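If the report sits behind a login, as in the original question, a hedged sketch (hypothetical login_url and payload; not part of this answer) is to fetch it with the authenticated session and hand the text to pandas:

import io

import pandas as pd
import requests

login_url = 'http://.../login'      # hypothetical
report_url = 'http://.../file.csv'  # hypothetical
payload = {'username': '...', 'password': '...'}

with requests.Session() as s:
    s.post(login_url, data=payload)        # log in, as in the question
    r = s.get(report_url)
    df = pd.read_csv(io.StringIO(r.text))  # parse the downloaded CSV text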
Solution 10 - Python
The following approach worked well for me. I also did not need to use the csv.reader() or csv.writer() functions, which I feel makes the code cleaner. The code is compatible with Python 2 and Python 3.
from six.moves import urllib

DOWNLOAD_URL = "https://raw.githubusercontent.com/gjreda/gregreda.com/master/content/notebooks/data/city-of-chicago-salaries.csv"
DOWNLOAD_PATH = r"datasets\city-of-chicago-salaries.csv"

urllib.request.urlretrieve(DOWNLOAD_URL, DOWNLOAD_PATH)
Note: six is a package that helps in writing code that is compatible with both Python 2 and Python 3. For additional details regarding six, see: What does from six.moves import urllib do in Python?
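If you only target Python 3, a minimal sketch of the same download (my variation, reusing this answer's URL and a hypothetical local filename) needs nothing beyond the standard library:

from urllib.request import urlretrieve

DOWNLOAD_URL = "https://raw.githubusercontent.com/gjreda/gregreda.com/master/content/notebooks/data/city-of-chicago-salaries.csv"
DOWNLOAD_PATH = "city-of-chicago-salaries.csv"  # hypothetical local filename

# Saves the CSV straight to disk without going through requests or csv.
urlretrieve(DOWNLOAD_URL, DOWNLOAD_PATH)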
Solution 11 - Python
Python 3 supported code:

import csv
import codecs
from contextlib import closing

import requests

# PHISHTANK_URL is whatever CSV endpoint you are downloading from
with closing(requests.get(PHISHTANK_URL, stream=True)) as r:
    reader = csv.reader(codecs.iterdecode(r.iter_lines(), 'utf-8'),
                        delimiter=',', quotechar='"')
    for record in reader:
        print(record)
Solution 12 - Python
This worked nicely for me:

import requests
from csv import DictReader

f = requests.get('https://somedomain.com/file').content.decode('utf-8')
reader = DictReader(f.split('\n'))
csv_dict_list = list(reader)