amazon web services - Python boto3: download files from S3 to local only if there are differences between the S3 files and the local ones


I have the following code that downloads files from S3 to local. However, I cannot figure out how to download only when the S3 files are different from, and newer than, the local ones. What is the best way to do this? Should it be based on modified time, ETags, MD5, or all of these?

import os

import boto3

BUCKET_NAME = 'testing'
# KEY (the prefix to list under) is assumed to be defined elsewhere

s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket=BUCKET_NAME, Prefix=KEY)

if 'Contents' in response:
    for obj in response['Contents']:
        file_key = obj['Key']
        file_name = os.path.basename(file_key)  # Get the file name from the key
        local_file_path = os.path.join('test_dir', file_name)
        # Download the file
        s3_client.download_file(BUCKET_NAME, file_key, local_file_path)
        print(f"Downloaded {file_name}")

  • It's up to you to define the logic depending on your needs. Also, note that the code shown will at most only gather the first 1000 objects from the bucket. – Anon Coward Commented yesterday
  • Be aware that aws s3 sync can be used to do this. – jarmod Commented 18 hours ago
  • Yup, I understand that the command line can do this automatically, but I am looking for a Python solution. – user1769197 Commented 10 hours ago

2 Answers


Based on the official documentation, list_objects_v2 returns a wealth of information about the objects stored in your bucket. The response contains elements such as the total number of keys returned and the common prefixes, but the most important is the Contents list, which holds data about each individual object.

Each entry in Contents has a field called LastModified of type datetime. You could use it to check whether a file has been updated, and so avoid comparing the actual content of the local file against the remote object (which I really don't recommend).
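For illustration, here is a minimal sketch of inspecting LastModified in the listing; the bucket name is taken from the question and the prefix is a placeholder:

import boto3

s3_client = boto3.client('s3')

# List objects under a prefix and print their metadata.
# 'testing' comes from the question; the prefix is a placeholder.
response = s3_client.list_objects_v2(Bucket='testing', Prefix='some/prefix/')

for obj in response.get('Contents', []):
    # LastModified is a timezone-aware UTC datetime; Size is in bytes.
    print(obj['Key'], obj['LastModified'], obj['Size'])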

I'd suggest keeping a (local) database of metadata about your S3 objects, containing elements such as Key and LastModified. Store your files locally, in a predefined folder, and make sure your files have a name that can be derived from the database information (for example, name them after the Key).

You won't even need to read the contents of the files to check whether a file was updated. Just query your database, query the S3 API using list_objects_v2, and compare the dates for each file. If they do not match, download the newer version of the file.

In this manner, you could also check for missing files inside your local repository. If there are any additional keys fetched from the API, you could easily see which of them don't exist in your database and retrieve them.
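Here is a minimal sketch of this approach, using a SQLite table as the local metadata store; the database path, table and column names, and the sync_prefix helper are illustrative choices, not part of boto3:

import os
import sqlite3

import boto3

DB_PATH = 'object_metadata.db'   # illustrative path for the local metadata store
LOCAL_DIR = 'test_dir'           # local folder for the downloaded files

conn = sqlite3.connect(DB_PATH)
conn.execute(
    "CREATE TABLE IF NOT EXISTS objects (key TEXT PRIMARY KEY, last_modified TEXT)"
)

s3_client = boto3.client('s3')

def sync_prefix(bucket, prefix):
    """Download objects whose LastModified differs from what the database recorded."""
    os.makedirs(LOCAL_DIR, exist_ok=True)
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            key = obj['Key']
            if key.endswith('/'):
                continue  # skip zero-byte "folder" placeholder objects
            # Store LastModified in ISO format so it can be compared as text
            remote_stamp = obj['LastModified'].isoformat()
            row = conn.execute(
                "SELECT last_modified FROM objects WHERE key = ?", (key,)
            ).fetchone()
            if row is not None and row[0] == remote_stamp:
                continue  # metadata unchanged, nothing to download

            s3_client.download_file(bucket, key, os.path.join(LOCAL_DIR, os.path.basename(key)))
            conn.execute(
                "INSERT OR REPLACE INTO objects (key, last_modified) VALUES (?, ?)",
                (key, remote_stamp),
            )
            conn.commit()
            print(f"Downloaded {key}")

sync_prefix('testing', 'some/prefix/')

This also makes it easy to spot keys that exist remotely but not in your database, as described above.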

P.S. A deleted answer suggested using the operating system's library functions to check for the file's last modified date. It's a great idea. But if you need performance, iterating over the metadata stored in a database could be faster than using operating system functions to read files from a directory.

If you're just concerned with downloading new files and files that have been modified remotely, you can check whether the file exists locally and, if it does, whether it has a different size or is older than the remote object. That should catch most cases where a remote file is added or changed.

import boto3
import os
from datetime import datetime, timezone

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')
# Use a paginator to handle cases with more than 1000 objects.
# BUCKET_NAME, KEY (the prefix), and TARGET_DIR are assumed to be defined as in the question.
for page in paginator.paginate(Bucket=BUCKET_NAME, Prefix=KEY):
    for obj in page.get('Contents', []):
        key = obj['Key']
        # Turn the key into a name, using '/' on S3 as the path delimiter locally
        local_name = os.path.join(*key[len(KEY):].split("/"))
        local_name = os.path.join(TARGET_DIR, local_name)

        changed = False
        if not os.path.isfile(local_name):
            # The file does not exist locally
            changed = True
        elif os.path.getsize(local_name) != obj['Size']:
            # The local file is a different size
            changed = True
        elif datetime.fromtimestamp(os.path.getmtime(local_name), tz=timezone.utc) < obj['LastModified']:
            # The local file is older than the remote object (both compared in UTC)
            changed = True

        if changed:
            # Make any directories needed to mirror the prefix of the object
            os.makedirs(os.path.dirname(local_name), exist_ok=True)
            # Download the file
            s3_client.download_file(BUCKET_NAME, key, local_name)
            print(f"Downloaded {local_name}")