Introducing default data integrity protections for new objects in Amazon S3

At Amazon Web Services (AWS), the vast majority of new capabilities are driven by your direct feedback. Two years ago, Jeff announced additional checksum algorithms and the optional client-side computation of checksums to make sure the objects stored on Amazon S3 are exactly what you sent. You told us you love this extra verification because it gives you confidence the object stored is the one you sent. You also told us you would prefer to have this extra verification enabled automatically, freeing you from developing additional code.

Starting today, we’re updating the Amazon Simple Storage Service (Amazon S3) default behavior when you upload objects. To build upon its existing durability posture, Amazon S3 now automatically verifies that your data is correctly transmitted over the network from your applications to your S3 bucket.

Amazon S3 is designed for 99.999999999% data durability (that’s 11 nines). Amazon S3 has always verified the integrity of object uploads by calculating checksums when objects reach our servers, before they are written to multiple storage devices. Once your data is stored in Amazon S3, it continually monitors data durability over time with periodic integrity checks of data at rest. Amazon S3 also actively monitors the redundancy of your data to help verify that your objects can tolerate the concurrent failure of multiple storage devices.

But data can still face integrity risks as it traverses the public internet before reaching our servers. Issues such as faulty hardware on networks we don’t manage or client software bugs could potentially corrupt or drop data before Amazon S3 has a chance to validate it. Previously, you could extend the integrity protection by providing your own precomputed checksums with your PutObject or UploadPart requests. However, this requires configuring tools and applications to generate and track checksums, which can be complex to implement consistently across all your client applications uploading objects to Amazon S3.

The new default behavior builds upon existing data integrity protections without requiring any changes to your applications. Additionally, the new checksums are stored in the object’s metadata, making them accessible for integrity verification at any time.

Automatic client-side integrity protection
Amazon S3 now extends data integrity protection all the way to client-side applications by default. The latest versions of our AWS SDKs automatically calculate a cyclic redundancy check (CRC)-based checksum for each upload and send it to Amazon S3. Amazon S3 independently calculates a checksum on the server side and validates it against the provided value before durably storing the object and its checksum in the object’s metadata.

When your client application doesn’t send a CRC checksum (maybe it uses an old version of our SDK or you haven’t updated your application custom code yet), Amazon S3 computes a CRC-based checksum anyway and stores it in the object metadata for future reference. You can compare at a later stage the stored CRC with a CRC computed on your side and verify the network transmission was correct.

This new capability provides you with an automatic checksum calculation and validation for new uploads from the latest versions of the AWS SDKs, the AWS Command Line Interface (AWS CLI), and the AWS Management Console. You can also verify the checksum stored in the object’s metadata at any time. The new default data integrity protections use the existing CRC32 and CRC32C algorithms or the new CRC64NVME algorithm. Amazon S3 also provides developers with consistent full-object checksums across single-part and multipart uploads.

When uploading files in multiple parts, the SDKs calculate checksums for each part. Amazon S3 uses these checksums to verify the integrity of each part through the UploadPart API. Additionally, S3 validates the entire file’s size and checksum when you call the CompleteMultipartUpload API.

The CreateMultiPartUpload API introduces a new HTTP header, x-amz-checksum-type, which lets you specify the type of checksum to use. You can choose either a full object checksum (calculated by combining the checksums of all individual parts) or a composite checksum.

The full object checksum is stored with the object metadata for future reference. This new protection works seamlessly with server-side encryption. The consistent behavior across uploads, multipart uploads, downloads, and encryption modes simplifies client-side integrity checks. The ability to use full-object checksums to validate integrity and store them for use later can help you streamline your applications.

Let’s see it in action
To start using this additional integrity protection, update to the latest version of the AWS SDK or AWS CLI. No code changes are required to enable the new integrity protections.

Case 1: Amazon S3 now attaches a checksum to objects on the server side when objects are uploaded without a checksum

I wrote a simple Python script to upload and download content to and from an S3 bucket. I enabled maximum logging verbosity to see the actual HTTP headers sent to and from Amazon S3.

import boto3
import logging

BUCKET_NAME="aws-news-blog-20241111"
CONTENT='Hello World!'
OBJECT_NAME='test.txt'

# Enable debug logging for boto3 and botocore to stdout (this is verbose !!!)
logging.basicConfig(level=logging.DEBUG)

# create a s3 client
client = boto3.client('s3')

# put an object
client.put_object(Bucket=BUCKET_NAME, Key=OBJECT_NAME, Body=CONTENT)

# get the object 
response = client.get_object(Bucket=BUCKET_NAME, Key=OBJECT_NAME)
print(response['Body'].read().decode('utf-8'))

In the first step of this demo, I use an old AWS SDK for Python that doesn’t compute the CRC checksum on the client side. Despite this, I can observe that Amazon S3 now responds with a checksum it computed upon receiving the object.

S3 RESPONSE:
{
    ...
    "x-amz-checksum-crc64nvme": "AuUcyF784aU=",
    "x-amz-checksum-type": "FULL_OBJECT",
    ...
}

Case 2: Upload with manually pre-computed CRC64NVME checksum, a new checksum type

When I don’t have the option to use the latest version of the AWS SDK, or when I use my own code to upload objects to S3 buckets, I can compute the checksum and send it in the PutObject API request. Here is how I compute the checksum on my content before sending it to Amazon S3. To keep this code short, I use the checksums package available in the new AWS SDK for Python.

from awscrt import checksums
import base64

checksum = checksums.crc64nvme("Hello World!")
checksum_bytes = checksum.to_bytes(8, byteorder='big')  # CRC64 is 8 bytes
checksum_base64 = base64.b64encode(checksum_bytes)
print(checksum_base64)

And when I run it, I see the CRC64NVME checksum is the same as the one returned by Amazon S3 in the previous step.

$ python crc.py
b'AuUcyF784aU='

I can provide this checksum as part of the PutObject API call.

response = s3.put_object(
    Bucket=BUCKET_NAME,
    Key=OBJECT_NAME,
    Body=b'Hello World!',
    ChecksumAlgorithm='CRC64NVME', 
    ChecksumCRC64NVME=checksum_base64
)

Case 3: The new SDKs compute the checksum on the client-side

Now, I run the upload and download script again. This time, I use the latest version of the AWS SDK for Python. I observe that the SDK now sends the CRC headers in the request. The response also contains the checksum. I can easily compare the versions in the request and in the response to make sure the object received is the one I sent.

REQUEST:
{
    ...
    "x-amz-checksum-crc64nvme": "AuUcyF784aU=",
    "x-amz-checksum-type": "FULL_OBJECT",
    ... 
}

At any time, I can request the object checksum to verify the integrity of my local copy using the HeadObject or GetObject APIs.

 get_response = s3.get_object(
        Bucket=BUCKET_NAME,
        Key=OBJECT_NAME,
        ChecksumMode='ENABLED'
    )

The response object contains the checksum in the HTTPHeaders field.

{
...
    "x-amz-checksum-crc64nvme": "AuUcyF784aU=",
    "x-amz-checksum-type": "FULL_OBJECT",
...
}

Case 4: Multi-part uploads with new CRC-based whole-object checksum

When uploading large objects using the CreateMultipartUpload, UploadPart, and CompleteMultipartUpload APIs, the latest version of the SDK will automatically compute the checksums for you.

If you want to validate the integrity of your data by using a known content checksum, you can pre-compute the CRC-based whole-object checksum for multi-part uploads to simplify your client side tooling. When using full object checksums for multi-part uploads, you no longer have to keep track of part level checksums as you upload objects.


# precomputed CRC64NVME checksum for the full object
full_object_crc64_nvme_checksum = 'Naz0uXkYBPM='

# start multipart upload
create_response = s3.create_multipart_upload(
            Bucket=BUCKET_NAME,
            Key=OBJECT_NAME,
            ChecksumAlgorithm='CRC64NVME',
            ChecksumType='FULL_OBJECT'
        )
upload_id = create_response['UploadId']

# Upload parts
uploaded_parts = []

# part 1
data_part_1 = b'0' * (5 * 1024 * 1024) # minimum part size
upload_part_response_1 = s3.upload_part(
    Body=data_part_1,
    Bucket=BUCKET_NAME,
    Key=OBJECT_NAME,
    PartNumber=1,
    UploadId=upload_id,
    ChecksumAlgorithm='CRC64NVME'
)
uploaded_parts.append({'PartNumber': 1, 'ETag': upload_part_response_1['ETag']})

# part 2
data_part_2 = b'0' * (5 * 1024 * 1024)
upload_part_response_2 = s3.upload_part(
    Body=data_part_2,
    Bucket=BUCKET_NAME,
    Key=OBJECT_NAME,
    PartNumber=2,
    UploadId=upload_id,
    ChecksumAlgorithm='CRC64NVME'
)
uploaded_parts.append({'PartNumber': 2, 'ETag': upload_part_response_2['ETag']})

# Complete the multipart upload with the FULL_OBJECT CRC64NVME checksum to validate the integrity of your entire object. 
complete_response = s3.complete_multipart_upload(
            Bucket=BUCKET_NAME,
            Key=OBJECT_NAME,
            UploadId=upload_id,
            ChecksumCRC64NVME=full_object_crc64_nvme_checksum,
            ChecksumType='FULL_OBJECT',
            MultipartUpload={'Parts': uploaded_parts}
        )
print(complete_response)

Things to know
For your existing objects, the checksum will be added when you copy them. We updated the CopyObject API so you can choose the desired checksum algorithm for the destination object.

This new client-side checksum calculation is implemented in the latest version of the AWS SDKs. When you use an old SDK or custom code that doesn’t pre-compute checksums, Amazon S3 computes the checksum on all new objects it receives and stores it in the object’s metadata, even for multipart uploads.

Pricing and availability
This extended checksum computation and storage is available in all AWS Regions at no additional cost.

Update your AWS SDK and AWS CLI today to automatically benefit from this additional integrity protection for data in transit.

To learn more about data integrity protection on Amazon S3, visit Checking object integrity in the Amazon S3 User Guide.

— seb

from AWS News Blog https://ift.tt/qjyJ4Wa

Share this content:

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top