How to import a text file on AWS S3 into pandas without writing to disk

Python · Pandas · Heroku · Amazon S3 · Boto3

Python Problem Overview


I have a text file saved on S3 which is a tab-delimited table. I want to load it into pandas, but I cannot save it to disk first because I am running on a Heroku server. Here is what I have so far.

import io
import boto3
import os
import pandas as pd

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket",Key="filename.txt")
file = response["Body"]


pd.read_csv(file, header=14, delimiter="\t", low_memory=False)

The error is:

OSError: Expected file path name or file-like object, got <class 'bytes'> type

How do I convert the response body into a format pandas will accept?

pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: initial_value must be str or None, not StreamingBody

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: 'StreamingBody' does not support the buffer interface

UPDATE - Using the following worked

file = response["Body"].read()

and

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
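
Putting the two pieces together, a minimal end-to-end sketch (the bucket name and key are the same placeholders as above):

import io
import boto3
import pandas as pd

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")

# .read() drains the StreamingBody into bytes, and BytesIO wraps those bytes
# into the file-like object pandas expects
body = response["Body"].read()
df = pd.read_csv(io.BytesIO(body), header=14, delimiter="\t", low_memory=False)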

Python Solutions


Solution 1 - Python

pandas uses boto for read_csv, so you should be able to:

import boto
data = pd.read_csv('s3://bucket....csv')

If you need boto3 because you are on Python 3.4+, you can:

import boto3
import io
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))

Since version 0.20.1, pandas uses s3fs; see the answer below.

Solution 2 - Python

Now pandas can handle S3 URLs. You could simply do:

import pandas as pd
import s3fs

df = pd.read_csv('s3://bucket-name/file.csv')

You need to install s3fs if you don't have it: pip install s3fs

Authentication

If your S3 bucket is private and requires authentication, you have two options:

1- Add access credentials to your ~/.aws/credentials config file

[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Or

2- Set the following environment variables with their proper values (a minimal sketch follows this list):

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_SESSION_TOKEN
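
For example, setting them from Python before the read (the credential values here are placeholders, just like in the question):

import os
import pandas as pd

# Placeholder credentials; in practice load them from a secrets store
os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"
# AWS_SESSION_TOKEN is only needed for temporary credentials
# os.environ["AWS_SESSION_TOKEN"] = "xxxxxxxx"

# s3fs picks the credentials up from the environment
df = pd.read_csv('s3://bucket-name/file.csv')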

Solution 3 - Python

This is now supported in latest pandas. See

http://pandas.pydata.org/pandas-docs/stable/io.html#reading-remote-files

eg.,

df = pd.read_csv('s3://pandas-test/tips.csv')
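
If the bucket is private, recent pandas versions (1.2+) can also take the credentials through a storage_options dict, which is forwarded to s3fs; a minimal sketch with placeholder credentials:

import pandas as pd

# storage_options is forwarded to s3fs/fsspec (pandas 1.2+); values are placeholders
df = pd.read_csv(
    's3://pandas-test/tips.csv',
    storage_options={
        "key": "xxxxxxxx",      # AWS access key id
        "secret": "xxxxxxxx",   # AWS secret access key
    },
)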

Solution 4 - Python

With s3fs it can be done as follows:

import s3fs
import pandas as pd
fs = s3fs.S3FileSystem(anon=False)

# CSV
with fs.open('mybucket/path/to/object/foo.csv') as f:
    df = pd.read_csv(f)

# Pickle
with fs.open('mybucket/path/to/object/foo.pkl') as f:
    df = pd.read_pickle(f)

Solution 5 - Python

For Python 3.6+, Amazon now has a really nice library for using pandas with their services, called awswrangler.

import awswrangler as wr
import boto3


# Boto3 session
session = boto3.session.Session(aws_access_key_id='XXXX', 
                                aws_secret_access_key='XXXX')

# awswrangler passes all pd.read_csv() function args forward
df = wr.s3.read_csv(path='s3://bucket/path/',
                    boto3_session=session,
                    skiprows=2,
                    sep=';',
                    decimal=',',
                    na_values=['--'])

To install awswrangler: pip install awswrangler

Solution 6 - Python

Since the files can be too large, it is not wise to load them into the dataframe all at once. Instead, read line by line and build the dataframe from each line. Yes, we could also pass a chunk size to read_csv, but then we would have to keep track of the number of rows read.

Hence, I came up with this approach:

import codecs
from io import StringIO

import pandas as pd

# These lines live inside a class; self.bucket is a boto3 Bucket resource,
# self.package_s3_key is the object key, and self.response / self.encoding
# come from earlier setup.
def create_file_object_for_streaming(self):
    print("creating file object for streaming")
    self.file_object = self.bucket.Object(key=self.package_s3_key)
    print("File object is: " + str(self.file_object))
    print("Object file created.")
    return self.file_object

# Decode the streaming body with the file's encoding and build a dataframe per line
for row in codecs.getreader(self.encoding)(self.response[u'Body']).readlines():
    row_string = StringIO(row)
    df = pd.read_csv(row_string, sep=",")

I also delete the df once the work is done: del df
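
As an alternative to the per-line loop above, the chunk size mentioned earlier can be used directly; a minimal sketch, assuming s3fs is installed and using a placeholder bucket, key, and process() function:

import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames, so only one
# chunk of rows is held in memory at a time
for chunk in pd.read_csv('s3://my_bucket/large_file.csv', chunksize=10000):
    process(chunk)  # process() is a hypothetical stand-in for your own logic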

Solution 7 - Python

For text files, you can use the code below; here, for example, with a pipe-delimited file:

import pandas as pd
import io
import boto3
s3_client = boto3.client('s3', use_ssl=False)
bucket = #
prefix = #
obj = s3_client.get_object(Bucket=bucket, Key=prefix + filename)
df = pd.read_fwf(io.BytesIO(obj['Body'].read()), encoding='unicode_escape',
                 delimiter='|', error_bad_lines=False, header=None, dtype=str)

Solution 8 - Python

An option is to convert the DataFrame to a dict string via df.to_dict() and then store that string in S3. Note this is only relevant if the CSV format is not a requirement and you just want to quickly put the dataframe in an S3 bucket and retrieve it again.

from boto.s3.connection import S3Connection
import pandas as pd
import yaml

conn = S3Connection()
mybucket = conn.get_bucket('mybucketName')
myKey = mybucket.get_key("myKeyName")

myKey.set_contents_from_string(str(df.to_dict()))

This will convert the df to a dict string and save that string in S3. You can later read it back in the same format:

df = pd.DataFrame(yaml.safe_load(myKey.get_contents_as_string()))

The other solutions are also good, but this is a little simpler. Yaml may not strictly be required, but you need something to parse the dict string. If the S3 file doesn't necessarily need to be a CSV, this can be a quick fix.
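
A closely related alternative (not from the original answer) is to let pandas handle the serialization with df.to_json(), which removes the yaml dependency; a minimal sketch reusing the same placeholder bucket and key names:

from io import StringIO
from boto.s3.connection import S3Connection
import pandas as pd

conn = S3Connection()
mybucket = conn.get_bucket('mybucketName')
myKey = mybucket.get_key("myKeyName")

# df.to_json() produces proper JSON, which pd.read_json() can parse directly
myKey.set_contents_from_string(df.to_json())
df = pd.read_json(StringIO(myKey.get_contents_as_string(encoding='utf-8')))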

Solution 9 - Python

import s3fs
import pandas as pd
s3 = s3fs.S3FileSystem(profile='<profile_name>')
pd.read_csv(s3.open(<s3_path>))

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | alpalalpal | View Question on Stackoverflow
Solution 1 - Python | Stefan | View Answer on Stackoverflow
Solution 2 - Python | Sam | View Answer on Stackoverflow
Solution 3 - Python | Raveen Beemsingh | View Answer on Stackoverflow
Solution 4 - Python | Dror | View Answer on Stackoverflow
Solution 5 - Python | Ricardo Mutti | View Answer on Stackoverflow
Solution 6 - Python | aviral sanjay | View Answer on Stackoverflow
Solution 7 - Python | Hari_pb | View Answer on Stackoverflow
Solution 8 - Python | billmanH | View Answer on Stackoverflow
Solution 9 - Python | Ze Tang | View Answer on Stackoverflow