Etag definition changed in Amazon S3

Amazon S3 Problem Overview


I've used Amazon S3 a little bit for backups for some time. Usually, after I upload a file, I check that the MD5 sum matches to make sure I've made a good backup. S3 has an "ETag" header which used to give this sum.

However, when I uploaded a large file recently, the ETag no longer appears to be an MD5 sum. It has extra digits and a hyphen: "696df35ad1161afbeb6ea667e5dd5dab-2861". I've checked using the S3 Management Console and with Cyberduck.

I can't find any documentation about this change. Any pointers?

Amazon S3 Solutions


Solution 1 - Amazon S3

If a file is uploaded with multipart upload, you will always get this type of ETag. If you upload the whole file as a single object, you get a plain MD5 ETag, as before.

Bucket Explorer shows a normal ETag for multipart uploads up to 5 GB, but not beyond that.

AWS:

> The ETag for an object created using the multipart upload api will contain one or more non-hexadecimal characters and/or will consist of less than 32 or more than 32 hexadecimal digits.

Reference: https://forums.aws.amazon.com/thread.jspa?messageID=203510#203510
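
For example, you can spot (and split) such an ETag client-side like this (a minimal boto3 sketch; the bucket and key names are placeholders):

import boto3

s3 = boto3.client("s3")
etag = s3.head_object(Bucket="my-bucket", Key="my-file")["ETag"].strip('"')

if "-" in etag:
    # Multipart ETag: MD5 of the concatenated part MD5s, plus the part count
    digest, part_count = etag.split("-")
    print(f"multipart upload: {part_count} parts, hash-of-hashes {digest}")
else:
    print(f"single-part upload: ETag {etag} is the MD5 of the object")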

Solution 2 - Amazon S3

Amazon S3 calculates the ETag with a different algorithm (not the usual MD5 sum) when you upload a file using multipart upload.

This algorithm is detailed here : http://permalink.gmane.org/gmane.comp.file-systems.s3.s3tools/583

> "Calculate the MD5 hash for each uploaded part of the file, > concatenate the hashes into a single binary string and calculate the > MD5 hash of that result."

I developed a tool in bash to calculate it, s3md5: https://github.com/Teachnova/s3md5

For example, to calculate the ETag of a file foo.bin that was uploaded using multipart upload with a chunk size of 15 MB:

# s3md5 15 foo.bin

Now you can check the integrity of a very big file (bigger than 5 GB), because you can calculate the ETag of the local file and compare it with the S3 ETag.

Solution 3 - Amazon S3

Also in Python...

#!/usr/bin/env python3
import binascii
import hashlib
import os

# Max size in bytes before uploading in parts. 
AWS_UPLOAD_MAX_SIZE = 20 * 1024 * 1024
# Size of parts when uploading in parts
# note: as of 2022-01-27 the bitnami-minio container uses 5 MiB
AWS_UPLOAD_PART_SIZE = int(os.environ.get('AWS_UPLOAD_PART_SIZE', 5 * 1024 * 1024))

def md5sum(sourcePath):
    '''
    Function: md5sum
    Purpose: Get the md5 hash of a file stored in S3
    Returns: Returns the md5 hash that will match the ETag in S3    
    '''

    filesize = os.path.getsize(sourcePath)
    hash = hashlib.md5()

    if filesize > AWS_UPLOAD_MAX_SIZE:

        block_count = 0
        md5bytes = b""
        with open(sourcePath, "rb") as f:
            block = f.read(AWS_UPLOAD_PART_SIZE)
            while block:
                hash = hashlib.md5()
                hash.update(block)
                block = f.read(AWS_UPLOAD_PART_SIZE)
                md5bytes += binascii.unhexlify(hash.hexdigest())
                block_count += 1

        hash = hashlib.md5()
        hash.update(md5bytes)
        hexdigest = hash.hexdigest() + "-" + str(block_count)

    else:
        with open(sourcePath, "rb") as f:
            block = f.read(AWS_UPLOAD_PART_SIZE)
            while block:
                hash.update(block)
                block = f.read(AWS_UPLOAD_PART_SIZE)
        hexdigest = hash.hexdigest()
    return hexdigest
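
A quick usage sketch (assuming the thresholds above match how the object was actually uploaded; the file name is a placeholder):

if __name__ == "__main__":
    # Compare this against the ETag shown in the S3 console or by head-object
    print(md5sum("foo.bin"))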

Solution 4 - Amazon S3

Here is an example in Go:

import (
	"crypto/md5"
	"fmt"
	"io/ioutil"
)

func GetEtag(path string, partSizeMb int) string {
	partSize := partSizeMb * 1024 * 1024
	content, _ := ioutil.ReadFile(path)
	size := len(content)
	contentToHash := content
	parts := 0

	if size > partSize {
		pos := 0
		contentToHash = make([]byte, 0)
		for size > pos {
			endpos := pos + partSize
			if endpos >= size {
				endpos = size
			}
			hash := md5.Sum(content[pos:endpos])
			contentToHash = append(contentToHash, hash[:]...)
			pos += partSize
			parts += 1
		}
	}

	hash := md5.Sum(contentToHash)
	etag := fmt.Sprintf("%x", hash)
	if parts > 0 {
		etag += fmt.Sprintf("-%d", parts)
	}
	return etag
}

This is just an example; in real code you should handle the errors (e.g. the ignored error from ioutil.ReadFile).

Solution 5 - Amazon S3

Here's a PowerShell function to calculate the Amazon S3 ETag for a file:

$blocksize = (1024*1024*5)
$startblocks = (1024*1024*16)
function AmazonEtagHashForFile($filename) {
    $lines = 0
    [byte[]] $binHash = @()

    $md5 = [Security.Cryptography.HashAlgorithm]::Create("MD5")
    $reader = [System.IO.File]::Open($filename,"OPEN","READ")
    
    if ((Get-Item $filename).length -gt $startblocks) {
        $buf = new-object byte[] $blocksize
        while (($read_len = $reader.Read($buf,0,$buf.length)) -ne 0){
            $lines   += 1
            $binHash += $md5.ComputeHash($buf,0,$read_len)
        }
        $binHash=$md5.ComputeHash( $binHash )
    }
    else {
        $lines   = 1
        $binHash += $md5.ComputeHash($reader)
    }

    $reader.Close()
    
    $hash = [System.BitConverter]::ToString( $binHash )
    $hash = $hash.Replace("-","").ToLower()

    if ($lines -gt 1) {
        $hash = $hash + "-$lines"
    }

    return $hash
}

Solution 6 - Amazon S3

If you use multipart uploads, the "ETag" is not the MD5 sum of the data (see https://stackoverflow.com/questions/12186993/what-is-the-algorithm-to-compute-the-amazon-s3-etag-for-a-file-larger-than-5gb/19896823#19896823). You can identify this case by the ETag containing a dash, "-".

Now, the interesting question is how to get the actual MD5 sum of the data without downloading it. One easy way is to just "copy" the object onto itself; this requires no download:

s3cmd cp s3://bucket/key s3://bucket/key

This will cause S3 to recompute the MD5 sum and store it as the "ETag" of the freshly copied object. The copy operation runs entirely within S3, i.e., no object data travels between S3 and your machine, so it requires very little bandwidth! (Note: do not use s3cmd mv; that would delete your data.)

The underlying REST command is:

PUT /key HTTP/1.1
Host: bucket.s3.amazonaws.com
x-amz-copy-source: /bucket/key
x-amz-metadata-directive: COPY

Solution 7 - Amazon S3

Copying to S3 with aws s3 cp can use multipart uploads, and the resulting ETag will not be an MD5, as others have written.

To upload files without multipart, use the lower-level put-object command.

aws s3api put-object --bucket bucketname --key remote/file --body local/file
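
If you upload with boto3 instead of the CLI, the analogous low-level call is put_object, which always issues a single PUT, so the ETag stays a plain MD5 (a minimal sketch; bucket and key names are placeholders):

import boto3

s3 = boto3.client("s3")
with open("local/file", "rb") as f:
    # put_object never switches to multipart
    response = s3.put_object(Bucket="bucketname", Key="remote/file", Body=f)
print(response["ETag"])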

Solution 8 - Amazon S3

This AWS support page - How do I ensure data integrity of objects uploaded to or downloaded from Amazon S3? - describes a more reliable way to verify the integrity of your S3 backups.

First, determine the base64-encoded MD5 sum of the file you wish to upload:

$ md5_sum_base64="$( openssl md5 -binary my-file | base64 )"

Then use the s3api to upload the file:

$ aws s3api put-object --bucket my-bucket --key my-file --body my-file --content-md5 "$md5_sum_base64"

Note the use of the --content-md5 flag; the help for this flag states:

--content-md5  (string)  The  base64-encoded  128-bit MD5 digest of the part data.

This does not say much about why to use this flag, but we can find this information in the API documentation for PutObject:

> To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value.

Using this flag causes S3 to verify server-side that the file hash matches the specified value. If the hashes match, S3 will return the ETag:

{
    "ETag": "\"599393a2c526c680119d84155d90f1e5\""
}

The ETag value will usually be the hexadecimal md5sum (see this question for some scenarios where this may not be the case).

If the hash does not match the one you specified, you get an error:

A client error (InvalidDigest) occurred when calling the PutObject operation: The Content-MD5 you specified was invalid.

In addition to this, you can also add the file's md5sum to the object metadata as an additional check:

$ aws s3api put-object --bucket my-bucket --key my-file --body my-file --content-md5 "$md5_sum_base64" --metadata md5chksum="$md5_sum_base64"

After upload you can issue the head-object command to check the values.

$ aws s3api head-object --bucket my-bucket --key my-file
{
    "AcceptRanges": "bytes",
    "ContentType": "binary/octet-stream",
    "LastModified": "Thu, 31 Mar 2016 16:37:18 GMT",
    "ContentLength": 605,
    "ETag": "\"599393a2c526c680119d84155d90f1e5\"",
    "Metadata": {    
        "md5chksum": "WZOTosUmxoARnYQVXZDx5Q=="    
    }    
}
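
For comparison, roughly the same upload-and-verify flow can be done with boto3 (a sketch only; bucket and key names are placeholders):

import base64
import hashlib

import boto3

with open("my-file", "rb") as f:
    data = f.read()
md5_b64 = base64.b64encode(hashlib.md5(data).digest()).decode()

s3 = boto3.client("s3")
resp = s3.put_object(Bucket="my-bucket", Key="my-file", Body=data,
                     ContentMD5=md5_b64, Metadata={"md5chksum": md5_b64})
print(resp["ETag"])  # the quoted hex MD5 if the upload was accepted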

Here is a bash script that uses --content-md5, adds the metadata, and then verifies that the values returned by S3 match the local hashes:

#!/bin/bash

set -euf -o pipefail

# assumes you have aws cli, jq installed

# change these if required
tmp_dir="$HOME/tmp"
s3_dir="foo"
s3_bucket="stack-overflow-example"
aws_region="ap-southeast-2"
aws_profile="my-profile"

test_dir="$tmp_dir/s3-md5sum-test"
file_name="MailHog_linux_amd64"
test_file_url="https://github.com/mailhog/MailHog/releases/download/v1.0.0/MailHog_linux_amd64"
s3_key="$s3_dir/$file_name"
return_dir="$( pwd )"

cd "$tmp_dir" || exit
mkdir "$test_dir"
cd "$test_dir" || exit

wget "$test_file_url"

md5_sum_hex="$( md5sum $file_name | awk '{ print $1 }' )"
md5_sum_base64="$( openssl md5 -binary $file_name | base64 )"

echo "$file_name hex    = $md5_sum_hex"
echo "$file_name base64 = $md5_sum_base64"

echo "Uploading $file_name to s3://$s3_bucket/$s3_dir/$file_name"
aws \
--profile "$aws_profile" \
--region "$aws_region" \
s3api put-object \
--bucket "$s3_bucket" \
--key "$s3_key" \
--body "$file_name" \
--metadata md5chksum="$md5_sum_base64" \
--content-md5 "$md5_sum_base64"

echo "Verifying sums match"

s3_md5_sum_hex=$( aws --profile "$aws_profile"  --region "$aws_region" s3api head-object --bucket "$s3_bucket" --key "$s3_key" | jq -r '.ETag' | sed 's/"//'g )
s3_md5_sum_base64=$( aws --profile "$aws_profile"  --region "$aws_region" s3api head-object --bucket "$s3_bucket" --key "$s3_key" | jq -r '.Metadata.md5chksum' )

if [ "$md5_sum_hex" == "$s3_md5_sum_hex" ] && [ "$md5_sum_base64" == "$s3_md5_sum_base64" ]; then
    echo "checksums match"
else
    echo "something is wrong checksums do not match:"

    cat <<EOM | column -t -s ' '
$file_name file hex:    $md5_sum_hex    s3 hex:    $s3_md5_sum_hex
$file_name file base64: $md5_sum_base64 s3 base64: $s3_md5_sum_base64
EOM

fi

echo "Cleaning up"
cd "$return_dir"
rm -rf "$test_dir"
aws \
--profile "$aws_profile" \
--region "$aws_region" \
s3api delete-object \
--bucket "$s3_bucket" \
--key "$s3_key"

Solution 9 - Amazon S3

Here is a C# version:

    string etag = HashOf("file.txt",8);

Source code:

    private string HashOf(string filename,int chunkSizeInMb)
    {
        string returnMD5 = string.Empty;
        int chunkSize = chunkSizeInMb * 1024 * 1024;

        using (var crypto = new MD5CryptoServiceProvider())
        {
            int hashLength = crypto.HashSize/8;
            
            using (var stream = File.OpenRead(filename))
            {
                if (stream.Length > chunkSize)
                {
                    int chunkCount = (int)Math.Ceiling((double)stream.Length/(double)chunkSize);

                    byte[] hash = new byte[chunkCount*hashLength];
                    Stream hashStream = new MemoryStream(hash);
                    
                    long nByteLeftToRead = stream.Length;
                    while (nByteLeftToRead > 0)
                    {
                        int nByteCurrentRead = (int)Math.Min(nByteLeftToRead, chunkSize);
                        byte[] buffer = new byte[nByteCurrentRead];
                        nByteLeftToRead -= stream.Read(buffer, 0, nByteCurrentRead);
                        
                        byte[] tmpHash = crypto.ComputeHash(buffer);

                        hashStream.Write(tmpHash, 0, hashLength);

                    }

                    returnMD5 = BitConverter.ToString(crypto.ComputeHash(hash)).Replace("-", string.Empty).ToLower()+"-"+ chunkCount;
                }
                else {
                    returnMD5 = BitConverter.ToString(crypto.ComputeHash(stream)).Replace("-", string.Empty).ToLower();
                    
                }
                stream.Close();
            }
        }
        return returnMD5;
    }

Solution 10 - Amazon S3

To go one step beyond the OP's question: chances are these chunked ETags are making your life difficult when trying to compare them client-side.

If you are publishing your artifacts to S3 using the awscli commands (cp, sync, etc.), the default threshold at which multipart upload kicks in appears to be 10 MB. Recent awscli releases allow you to configure this threshold, so you can disable multipart and get an easy-to-use MD5 ETag:

aws configure set default.s3.multipart_threshold 64MB

Full documentation here: http://docs.aws.amazon.com/cli/latest/topic/s3-config.html

A consequence of this could be degraded upload performance (I honestly did not notice any). But the result is that all files smaller than your configured threshold will now have normal MD5 ETags, making them much easier to compare client-side.

This does require a somewhat recent awscli install. My previous version (1.2.9) did not support this option, so I had to upgrade to 1.10.x.

I was able to set my threshold up to 1024MB successfully.
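
If you upload with boto3 rather than the CLI, the equivalent knob is TransferConfig (a sketch under the same idea; the 64 MB value and the file/bucket names are just examples):

import boto3
from boto3.s3.transfer import TransferConfig

# Only switch to multipart for files larger than 64 MB
config = TransferConfig(multipart_threshold=64 * 1024 * 1024)
s3 = boto3.client("s3")
s3.upload_file("local/file", "my-bucket", "remote/file", Config=config)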

Solution 11 - Amazon S3

Based on answers here, I wrote a Python implementation which correctly calculates both multi-part and single-part file ETags.

import hashlib

def calculate_s3_etag(file_path, chunk_size=8 * 1024 * 1024):
    md5s = []

    with open(file_path, 'rb') as fp:
        while True:
            data = fp.read(chunk_size)
            if not data:
                break
            md5s.append(hashlib.md5(data))

    if len(md5s) == 1:
        return '"{}"'.format(md5s[0].hexdigest())

    digests = b''.join(m.digest() for m in md5s)
    digests_md5 = hashlib.md5(digests)
    return '"{}-{}"'.format(digests_md5.hexdigest(), len(md5s))

The default chunk_size of 8 MB is what the official aws cli tool uses, and it performs a multipart upload for 2+ chunks. The code should work under both Python 2 and 3.
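
For example (the file name is a placeholder; note the returned string includes the surrounding quotes, matching what head-object reports):

etag = calculate_s3_etag('foo.bin')
print(etag)  # e.g. '"696df35ad1161afbeb6ea667e5dd5dab-2861"'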

Solution 12 - Amazon S3

Improving on @Spedge's and @Rob's answers, here is a Python 3 MD5 function that takes a file-like object and does not rely on being able to get the file size with os.path.getsize.

import hashlib

# Function : md5sum
# Purpose : Get the md5 hash of a file stored in S3
# Returns : Returns the md5 hash that will match the ETag in S3
# https://github.com/boto/boto3/blob/0cc6042615fd44c6822bd5be5a4019d0901e5dd2/boto3/s3/transfer.py#L169
def md5sum(file_like,
           multipart_threshold=8 * 1024 * 1024,
           multipart_chunksize=8 * 1024 * 1024):
    md5hash = hashlib.md5()  # running hash of the whole file (single-part case)
    file_like.seek(0)
    filesize = 0
    block_count = 0
    md5string = b''
    for block in iter(lambda: file_like.read(multipart_chunksize), b''):
        md5hash.update(block)
        md5string += hashlib.md5(block).digest()
        filesize += len(block)
        block_count += 1

    if filesize > multipart_threshold:
        md5hash = hashlib.md5()
        md5hash.update(md5string)
        md5hash = md5hash.hexdigest() + "-" + str(block_count)
    else:
        md5hash = md5hash.hexdigest()

    file_like.seek(0)
    return md5hash
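
For example, with a local file opened in binary mode (a minimal sketch; the file name is a placeholder):

with open("foo.bin", "rb") as f:
    print(md5sum(f))  # compare against the ETag reported by S3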

Solution 13 - Amazon S3

Of course, multipart uploads are the common cause of this issue. In my case, I was serving static files through S3, and the ETag of a .js file was coming out different from the local file even though the content was the same.

It turned out that, even though the content was the same, the line endings were different. I fixed the line endings in my git repository, uploaded the changed files to S3, and it works fine now.

Solution 14 - Amazon S3

I built on r03's answer and have a standalone Go utility for this here: https://github.com/lambfrier/calc_s3_etag

Example usage:

$ dd if=/dev/zero bs=1M count=10 of=10M_file
$ calc_s3_etag 10M_file
669fdad9e309b552f1e9cf7b489c1f73-2
$ calc_s3_etag -chunksize=15 10M_file
9fbaeee0ccc66f9a8e3d3641dca37281-1

Solution 15 - Amazon S3

The Python example above works great, but when working with Bamboo, the part size is set to 5 MB, which is NON-STANDARD!! (s3cmd uses 15 MB.) Also adjusted to use 1024 when calculating bytes.

Revised to work for Bamboo artifact S3 repos.

import hashlib
import binascii
import os


# Max size in bytes before uploading in parts. 
AWS_UPLOAD_MAX_SIZE = 20 * 1024 * 1024
# Size of parts when uploading in parts
AWS_UPLOAD_PART_SIZE = 5 * 1024 * 1024

#
# Function : md5sum
# Purpose : Get the md5 hash of a file stored in S3
# Returns : Returns the md5 hash that will match the ETag in S3
def md5sum(sourcePath):

    filesize = os.path.getsize(sourcePath)
    hash = hashlib.md5()

    if filesize > AWS_UPLOAD_MAX_SIZE:

        block_count = 0
        md5string = b""
        with open(sourcePath, "rb") as f:
            for block in iter(lambda: f.read(AWS_UPLOAD_PART_SIZE), b""):
                hash = hashlib.md5()
                hash.update(block)
                md5string = md5string + binascii.unhexlify(hash.hexdigest())
                block_count += 1

        hash = hashlib.md5()
        hash.update(md5string)
        return hash.hexdigest() + "-" + str(block_count)

    else:
        with open(sourcePath, "rb") as f:
            for block in iter(lambda: f.read(AWS_UPLOAD_PART_SIZE), b""):
                hash.update(block)
        return hash.hexdigest()

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | jjh | View Question on Stackoverflow
Solution 1 - Amazon S3 | Tej Kiran | View Answer on Stackoverflow
Solution 2 - Amazon S3 | Antonio Espinosa | View Answer on Stackoverflow
Solution 3 - Amazon S3 | Spedge | View Answer on Stackoverflow
Solution 4 - Amazon S3 | roeland | View Answer on Stackoverflow
Solution 5 - Amazon S3 | seanyboy | View Answer on Stackoverflow
Solution 6 - Amazon S3 | hrr | View Answer on Stackoverflow
Solution 7 - Amazon S3 | Synesso | View Answer on Stackoverflow
Solution 8 - Amazon S3 | htaccess | View Answer on Stackoverflow
Solution 9 - Amazon S3 | Pitipong Guntawong | View Answer on Stackoverflow
Solution 10 - Amazon S3 | jdolan | View Answer on Stackoverflow
Solution 11 - Amazon S3 | hyperknot | View Answer on Stackoverflow
Solution 12 - Amazon S3 | Samuel Barbosa | View Answer on Stackoverflow
Solution 13 - Amazon S3 | Gaurav Toshniwal | View Answer on Stackoverflow
Solution 14 - Amazon S3 | lambfrier | View Answer on Stackoverflow
Solution 15 - Amazon S3 | Rob | View Answer on Stackoverflow