Load S3 Data into AWS SageMaker Notebook
Tags: Python, Amazon Web Services, Amazon S3, Machine Learning, Amazon SageMaker
Python Problem Overview
I've just started to experiment with AWS SageMaker and would like to load data from an S3 bucket into a pandas DataFrame in my SageMaker Python Jupyter notebook for analysis.
I could use boto to grab the data from S3, but I'm wondering whether the SageMaker framework offers a more elegant way to do this in my Python code.
Thanks in advance for any advice.
Python Solutions
Solution 1 - Python
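import boto3
import pandas as pd
from sagemaker import get_execution_role

# ARN of the IAM role the notebook runs under; this role needs read access to the bucket
role = get_execution_role()

bucket = 'my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

# pandas reads s3:// URLs directly (this requires the s3fs package)
pd.read_csv(data_location)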
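For comparison, the boto3 route mentioned in the question might look roughly like the sketch below. The helper names `split_s3_uri` and `read_csv_with_boto3` are made up for illustration, and boto3 and pandas are imported inside the function only so that the URI helper can be used without them. The object body is read fully into memory before pandas parses it, which assumes the file is reasonably small.

```python
import io
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def read_csv_with_boto3(uri, s3_client=None):
    """Download one CSV object with boto3 and parse it with pandas."""
    # Imported here so split_s3_uri stays usable without these packages.
    import boto3
    import pandas as pd

    bucket, key = split_s3_uri(uri)
    client = s3_client or boto3.client("s3")
    obj = client.get_object(Bucket=bucket, Key=key)
    # Read the whole body into memory, then hand pandas a file-like object.
    return pd.read_csv(io.BytesIO(obj["Body"].read()))
```

In a notebook with access to the bucket, this would be called as `df = read_csv_with_boto3("s3://my-bucket/train.csv")`, where the bucket and key are placeholders.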
Solution 2 - Python
In the simplest case you don't need boto3, because you are only reading resources.
Then it's even simpler:
import pandas as pd
bucket='my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
pd.read_csv(data_location)
But, as Prateek stated, make sure to configure your SageMaker notebook instance to have access to S3. This is done at the configuration step, under Permissions > IAM role.
Solution 3 - Python
If you have a look here it seems you can specify this in the InputDataConfig. Search for "S3DataSource" (ref) in the document. The first hit is even in Python, on page 25/26.
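As a rough illustration of what such an entry can look like, here is a sketch of one InputDataConfig channel with an S3DataSource, in the shape used when creating a SageMaker training job. Note that this describes input for a training job rather than loading data into a DataFrame. The helper name `s3_channel`, the channel name, and the S3 prefix are assumptions chosen for the example.

```python
def s3_channel(name, s3_uri, content_type="text/csv"):
    """Build one InputDataConfig channel that points at an S3 prefix."""
    return {
        "ChannelName": name,
        "ContentType": content_type,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",  # treat S3Uri as a key prefix
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


# Hypothetical bucket and prefix; replace with your own.
input_data_config = [s3_channel("train", "s3://my-bucket/train/")]
```

A list like `input_data_config` would then be passed as the InputDataConfig argument of the training job request.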
Solution 4 - Python
You could also access your bucket like a file system using s3fs:
import s3fs
from IPython.display import display
from PIL import Image

fs = s3fs.S3FileSystem()

# List the first 5 entries under a prefix in a bucket you can access
fs.ls('s3://bucket-name/data/')[:5]

# Open an object directly, here an image, and show it in the notebook
with fs.open('s3://bucket-name/data/image.png') as f:
    display(Image.open(f))
Solution 5 - Python
Do make sure the Amazon SageMaker role has a policy attached to it that grants access to S3. This can be done in IAM.
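To make this concrete, a read-only policy for a single bucket might look something like the sketch below, built here as a Python dict and printed as JSON. The helper name `s3_read_policy` and the bucket name are placeholders, and the two actions shown are only an assumption about what a read-only notebook needs; your setup may require more or fewer permissions.

```python
import json


def s3_read_policy(bucket):
    """Return an IAM policy document allowing list and read on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


print(json.dumps(s3_read_policy("my-bucket"), indent=2))
```

The printed JSON can be pasted into the IAM policy editor and attached to the notebook's execution role.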
Solution 6 - Python
You can also use AWS Data Wrangler (https://github.com/awslabs/aws-data-wrangler):
import awswrangler as wr
df = wr.s3.read_csv(path="s3://...")
Solution 7 - Python
A similar answer, using an f-string:
import pandas as pd
bucket = 'your-bucket-name'
file = 'file.csv'
df = pd.read_csv(f"s3://{bucket}/{file}")
len(df)  # number of rows
Solution 8 - Python
This code sample imports a CSV file from S3. It was tested in a SageMaker notebook.
Use pip or conda to install s3fs:
!pip install s3fs
import boto3  # AWS Python SDK
import pandas as pd
from sagemaker import get_execution_role

my_bucket = ''  # declare bucket name
my_file = 'aa/bb.csv'  # declare file path

role = get_execution_role()
data_location = 's3://{}/{}'.format(my_bucket, my_file)
data = pd.read_csv(data_location)
data.head(2)
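Since pandas relies on s3fs to handle s3:// URLs, a quick check before reading can give a clearer message than a failed read. The following is a minimal sketch using only the standard library; the helper name `has_module` is made up for this example.

```python
import importlib.util


def has_module(name):
    """Return True if the top-level module `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


if not has_module("s3fs"):
    print("s3fs is not installed; run `pip install s3fs` before reading s3:// URLs")
```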