Can you upload to S3 using a stream rather than a local file?
Tags: Python, CSV, Amazon S3, Boto, Buffering

Python Problem Overview
I need to create a CSV and upload it to an S3 bucket. Since I'm creating the file on the fly, it would be better if I could write it directly to the S3 bucket as it is being created, rather than writing the whole file locally and then uploading it at the end.
Is there a way to do this? My project is in Python and I'm fairly new to the language. Here is what I tried so far:
import csv
import io
import boto
from boto.s3.key import Key
conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'
fieldnames = ['first_name', 'last_name']
writer = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
k.set_contents_from_stream(writer.writeheader())
I received this error: BotoClientError: s3 does not support chunked transfer
UPDATE: I found a way to write directly to S3, but I can't find a way to clear the buffer without actually deleting the lines I already wrote. So, for example:
conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'
testDict = [
    {"fieldA": "8", "fieldB": None, "fieldC": "888888888888"},
    {"fieldA": "9", "fieldB": None, "fieldC": "99999999999"},
]
f = io.StringIO()
fieldnames = ['fieldA', 'fieldB', 'fieldC']
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
k.set_contents_from_string(f.getvalue())
for row in testDict:
    writer.writerow(row)
    k.set_contents_from_string(f.getvalue())
f.close()
This writes 3 lines to the file; however, I'm unable to release memory in order to write a big file. If I add:
f.seek(0)
f.truncate(0)
to the loop, then only the last line of the file is written. Is there any way to release resources without deleting lines from the file?
Python Solutions
Solution 1 - Python
I did find a solution to my question, which I will post here in case anyone else is interested. I decided to do this in parts, as a multipart upload; you can't stream directly to S3. There is also a package available that changes your streaming file over to a multipart upload, which is what I used: Smart Open.
import smart_open
import io
import csv
testDict = [
    {"fieldA": "8", "fieldB": None, "fieldC": "888888888888"},
    {"fieldA": "9", "fieldB": None, "fieldC": "99999999999"},
]
fieldnames = ['fieldA', 'fieldB', 'fieldC']
f = io.StringIO()
with smart_open.smart_open('s3://dev-test/bar/foo.csv', 'wb') as fout:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    fout.write(f.getvalue())

    for row in testDict:
        f.seek(0)
        f.truncate(0)
        writer.writerow(row)
        fout.write(f.getvalue())

f.close()
Solution 2 - Python
According to the docs, it's possible:
s3.Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))
so we can use a StringIO in the ordinary way.
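For instance, a minimal sketch of that idea (the bucket name and key below are placeholders, and the CSV is built entirely in memory before a single put()):

import csv
import io

import boto3

s3 = boto3.resource('s3')

# Build the CSV in a text buffer.
f = io.StringIO()
writer = csv.DictWriter(f, fieldnames=['fieldA', 'fieldB', 'fieldC'])
writer.writeheader()
writer.writerow({'fieldA': '8', 'fieldB': None, 'fieldC': '888888888888'})

# put() expects bytes (or a binary file-like object), so encode the text buffer.
s3.Object('mybucket', 'hello.csv').put(Body=f.getvalue().encode('utf-8'))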
Update: the smart_open lib from @inquiring minds' answer is a better solution.
Solution 3 - Python
We were trying to upload file contents to S3 when they came through as an InMemoryUploadedFile object in a Django request. We ended up doing the following because we didn't want to save the file locally. Hope it helps:
@action(detail=False, methods=['post'])
def upload_document(self, request):
    document = request.data.get('image').file
    s3.upload_fileobj(document, BUCKET_NAME,
                      DESIRED_NAME_OF_FILE_IN_S3,
                      ExtraArgs={"ServerSideEncryption": "aws:kms"})
Solution 4 - Python
Here is a complete example using boto3 (the bucket name and key below are placeholders):
import io

import boto3

session = boto3.Session(
    aws_access_key_id="...",
    aws_secret_access_key="..."
)
s3 = session.resource("s3")

# Placeholder destination; replace with your own bucket and key.
bucket = "my-bucket"
keypath = "path/to/file.txt"

buff = io.BytesIO()
buff.write("test1\n".encode())
buff.write("test2\n".encode())

s3.Object(bucket, keypath).put(Body=buff.getvalue())
Solution 5 - Python
There's an interesting code solution mentioned in a GitHub smart_open issue (#82) that I've been meaning to try out. Copy-pasting here for posterity... it looks like boto3 is required:
import csv
import gzip
import io
import boto3

# my_data, bucket_name, and key are assumed to be defined elsewhere.
csv_data = io.StringIO()  # csv.writer needs a text buffer in Python 3
writer = csv.writer(csv_data)
writer.writerows(my_data)

gz_stream = io.BytesIO()
with gzip.GzipFile(fileobj=gz_stream, mode="w") as gz:
    gz.write(csv_data.getvalue().encode("utf-8"))
gz_stream.seek(0)

s3 = boto3.client('s3')
s3.upload_fileobj(gz_stream, bucket_name, key)
This specific example is streaming to a compressed S3 key/file, but it seems like the general approach -- using the boto3 S3 client's upload_fileobj() method in conjunction with a target stream, not a file -- should work.
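As a rough, uncompressed sketch of that general approach (untested; the bucket name, key, and sample rows below are placeholders):

import csv
import io

import boto3

rows = [["a", "1"], ["b", "2"]]  # placeholder data

# Write the CSV into a text buffer, then wrap the encoded bytes in a binary stream.
text_buf = io.StringIO()
csv.writer(text_buf).writerows(rows)
byte_stream = io.BytesIO(text_buf.getvalue().encode("utf-8"))

s3 = boto3.client("s3")
# upload_fileobj() reads from any binary file-like object; no local file is needed.
s3.upload_fileobj(byte_stream, "my-bucket", "data/rows.csv")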
Solution 6 - Python
There's a well-supported library for doing just this:
pip install s3fs
s3fs is really trivial to use:
import s3fs

s3 = s3fs.S3FileSystem(anon=False)

with s3.open('mybucket/new-file', 'wb') as f:
    f.write(2*2**20 * b'a')
    f.write(2*2**20 * b'a')
Incidentally, there's also something built into boto3 (backed by the AWS API) called MultiPartUpload. This isn't factored as a Python stream, which might be an advantage for some people. Instead, you can start an upload and send parts one at a time.
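A rough sketch of that lower-level route via the boto3 client calls (the bucket and key names are placeholders, and generate_chunks() is a hypothetical source of byte chunks of at least 5 MB, the S3 minimum for every part but the last):

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "big/file.csv"  # placeholder destination

# Start the multipart upload, then send each chunk as a numbered part.
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
for part_number, chunk in enumerate(generate_chunks(), start=1):  # generate_chunks() is hypothetical
    resp = s3.upload_part(
        Bucket=bucket,
        Key=key,
        PartNumber=part_number,
        UploadId=mpu["UploadId"],
        Body=chunk,
    )
    parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})

# Tell S3 which parts make up the finished object.
s3.complete_multipart_upload(
    Bucket=bucket,
    Key=key,
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)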
Solution 7 - Python
To write a string to an S3 object, use:
s3.Object('my_bucket', 'my_file.txt').put(Body='Hello there')
So convert the stream to a string and you're there.