Can you upload to S3 using a stream rather than a local file?

Python, CSV, Amazon S3, Boto, Buffering

Python Problem Overview


I need to create a CSV and upload it to an S3 bucket. Since I'm creating the file on the fly, it would be better if I could write it directly to the S3 bucket as it is being created, rather than writing the whole file locally and then uploading it at the end.

Is there a way to do this? My project is in Python and I'm fairly new to the language. Here is what I tried so far:

import csv
import io
import boto
from boto.s3.key import Key


conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'

fieldnames = ['first_name', 'last_name']
writer = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
k.set_contents_from_stream(writer.writeheader())

I received this error: BotoClientError: s3 does not support chunked transfer

UPDATE: I found a way to write directly to S3, but I can't find a way to clear the buffer without actually deleting the lines I already wrote. So, for example:

conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'

testDict = [{
    "fieldA": "8",
    "fieldB": None,
    "fieldC": "888888888888"},
    {
    "fieldA": "9",
    "fieldB": None,
    "fieldC": "99999999999"}]

f = io.StringIO()
fieldnames = ['fieldA', 'fieldB', 'fieldC']
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
k.set_contents_from_string(f.getvalue())

for row in testDict:
    writer.writerow(row)
    k.set_contents_from_string(f.getvalue())

f.close()

This writes 3 lines to the file; however, I'm unable to release memory in order to write a big file. If I add:

f.seek(0)
f.truncate(0)

to the loop, then only the last line of the file is written. Is there any way to release resources without deleting lines from the file?

Python Solutions


Solution 1 - Python

I did find a solution to my question, which I'll post here in case anyone else is interested: do the upload in parts as a multipart upload, since you can't stream directly to S3. There's also a package that turns your streaming writes into a multipart upload for you, which is what I used: Smart Open.

import smart_open
import io
import csv

testDict = [{
	"fieldA": "8",
	"fieldB": None,
	"fieldC": "888888888888"},
	{
	"fieldA": "9",
	"fieldB": None,
	"fieldC": "99999999999"}]

fieldnames = ['fieldA', 'fieldB', 'fieldC']
f = io.StringIO()
with smart_open.smart_open('s3://dev-test/bar/foo.csv', 'wb') as fout:
	writer = csv.DictWriter(f, fieldnames=fieldnames)
	writer.writeheader()
	fout.write(f.getvalue())

	for row in testDict:
		f.seek(0)
		f.truncate(0)
		writer.writerow(row)
		fout.write(f.getvalue())
	
f.close()
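
If you're on a newer version of smart_open that exposes smart_open.open(), the intermediate StringIO buffer can likely be dropped entirely by opening the S3 object in text mode and handing it straight to csv.DictWriter. A minimal sketch, assuming that newer API and the same (hypothetical) bucket and key:

import csv
import smart_open

testDict = [
    {"fieldA": "8", "fieldB": None, "fieldC": "888888888888"},
    {"fieldA": "9", "fieldB": None, "fieldC": "99999999999"},
]
fieldnames = ['fieldA', 'fieldB', 'fieldC']

# Open the S3 object in text mode; smart_open streams the writes to S3
# as a multipart upload behind the scenes.
with smart_open.open('s3://dev-test/bar/foo.csv', 'w') as fout:
    writer = csv.DictWriter(fout, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(testDict)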

Solution 2 - Python

According to the docs, it's possible:

s3.Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))

so we can use a StringIO buffer in the ordinary way.
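
A minimal sketch of that idea, assuming credentials are already configured and a hypothetical bucket name: build the CSV in an in-memory buffer and hand the encoded bytes to put().

import csv
import io

import boto3

s3 = boto3.resource('s3')

# Build the CSV entirely in memory.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['first_name', 'last_name'])
writer.writerow(['John', 'Doe'])

# put() takes bytes, so encode the buffer's contents before uploading.
s3.Object('mybucket', 'hello.csv').put(Body=buf.getvalue().encode('utf-8'))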

Update: smart_open lib from @inquiring minds answer is better solution

Solution 3 - Python

We were trying to upload file contents to S3 when it came through as an InMemoryUploadedFile object in a Django request. We ended up doing the following because we didn't want to save the file locally. Hope it helps:

@action(detail=False, methods=['post'])
def upload_document(self, request):
    # `s3` is a boto3 S3 client created elsewhere (e.g. boto3.client('s3'));
    # BUCKET_NAME and DESIRED_NAME_OF_FILE_IN_S3 are likewise defined elsewhere.
    document = request.data.get('image').file
    s3.upload_fileobj(document, BUCKET_NAME,
                      DESIRED_NAME_OF_FILE_IN_S3,
                      ExtraArgs={"ServerSideEncryption": "aws:kms"})

Solution 4 - Python

Here is a complete example using boto3:

import boto3
import io

session = boto3.Session(
    aws_access_key_id="...",
    aws_secret_access_key="..."
)

s3 = session.resource("s3")

buff = io.BytesIO()

buff.write("test1\n".encode())
buff.write("test2\n".encode())

# bucket and keypath are placeholders for your bucket name and object key.
bucket, keypath = 'my-bucket', 'path/to/test.txt'

s3.Object(bucket, keypath).put(Body=buff.getvalue())

Solution 5 - Python

There's an interesting code solution mentioned in a GitHub smart_open issue (#82) that I've been meaning to try out. Copy-pasting here for posterity... looks like boto3 is required:

import csv
import gzip
import io

import boto3

# Placeholder data and target location.
my_data = [["fieldA", "fieldB"], ["1", "2"]]
bucket_name, key = "my-bucket", "data.csv.gz"

# On Python 3, csv needs a text buffer; encode its contents when gzipping.
csv_data = io.StringIO()
writer = csv.writer(csv_data)
writer.writerows(my_data)

gz_stream = io.BytesIO()
with gzip.GzipFile(fileobj=gz_stream, mode="w") as gz:
    gz.write(csv_data.getvalue().encode("utf-8"))
gz_stream.seek(0)

s3 = boto3.client('s3')
s3.upload_fileobj(gz_stream, bucket_name, key)

This specific example is streaming to a compressed S3 key/file, but it seems like the general approach -- using the boto3 S3 client's upload_fileobj() method in conjunction with a target stream, not a file -- should work.
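
For the uncompressed case from the original question, a sketch of that general approach might look like the following (the bucket name and key are placeholders; on Python 3 the CSV is written to a text buffer and the encoded bytes wrapped in a stream before upload):

import csv
import io

import boto3

rows = [['fieldA', 'fieldB'], ['8', '888888888888']]  # example data

# Write the CSV to a text buffer, then wrap the encoded bytes in a stream.
text_buf = io.StringIO()
csv.writer(text_buf).writerows(rows)
byte_stream = io.BytesIO(text_buf.getvalue().encode('utf-8'))

s3 = boto3.client('s3')
s3.upload_fileobj(byte_stream, 'my-bucket', 'foo/foobar.csv')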

Solution 6 - Python

There's a well-supported library for doing just this:

pip install s3fs

s3fs is really trivial to use:

import s3fs

# Create the filesystem object (anon=False uses your AWS credentials).
s3 = s3fs.S3FileSystem(anon=False)

with s3.open('mybucket/new-file', 'wb') as f:
    f.write(2*2**20 * b'a')
    f.write(2*2**20 * b'a')

Incidentally, there's also something built into boto3 (backed by the AWS API) called MultiPartUpload.

This isn't factored as a Python stream, which might be an advantage for some people. Instead, you start an upload and send parts one at a time, as in the sketch below.
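
A minimal sketch of that lower-level API, assuming a hypothetical bucket and key (note that S3 requires every part except the last to be at least 5 MB):

import boto3

s3 = boto3.client('s3')
bucket, key = 'my-bucket', 'big-file.csv'  # placeholders

# Example parts: every part except the last must be at least 5 MB.
chunks = [b'a' * 5 * 1024 * 1024, b'final part\n']

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
for part_number, chunk in enumerate(chunks, start=1):
    resp = s3.upload_part(
        Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
        PartNumber=part_number, Body=chunk,
    )
    parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})

# Finalize the object once all parts have been sent.
s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
    MultipartUpload={'Parts': parts},
)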

Solution 7 - Python

To write a string to an S3 object, use:

s3.Object('my_bucket', 'my_file.txt').put(Body='Hello there')

So convert the stream to a string (for example with getvalue()) and you're there.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | inquiring minds | View Question on Stackoverflow
Solution 1 - Python | inquiring minds | View Answer on Stackoverflow
Solution 2 - Python | El Ruso | View Answer on Stackoverflow
Solution 3 - Python | Sean Saúl Astrakhan | View Answer on Stackoverflow
Solution 4 - Python | Scott | View Answer on Stackoverflow
Solution 5 - Python | Mass Dot Net | View Answer on Stackoverflow
Solution 6 - Python | Philip Couling | View Answer on Stackoverflow
Solution 7 - Python | Sam | View Answer on Stackoverflow