amazon web services - Python boto3: download files from s3 to local only if there are differences between s3 files and local one
I have the following code that downloads files from S3 to local. However, I cannot figure out how to download a file only if the S3 copy is different from, and newer than, the local one. What is the best way to do this? Should it be based on modified time, ETags, MD5, or all of these?

```python
import boto3
import os

BUCKET_NAME = 'testing'
KEY = ''  # prefix to list under

s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket=BUCKET_NAME, Prefix=KEY)
if 'Contents' in response:
    for obj in response['Contents']:
        file_key = obj['Key']
        file_name = os.path.basename(file_key)  # Get the file name from the key
        local_file_path = os.path.join('test_dir', file_name)
        # Download the file
        s3_client.download_file(BUCKET_NAME, file_key, local_file_path)
        print(f"Downloaded {file_name}")
```
asked by user1769197; edited by John Rotenstein
2 Answers
Based on the official documentation, `list_objects_v2` returns ample information about the objects stored in your bucket. The response contains elements such as the total number of keys returned and the common prefixes, but the most important is the `Contents` list, which holds the data for each individual object in the bucket.

`Contents` has a field called `LastModified` of type `datetime`. You could use it to check whether a file has been updated, and so avoid comparing the actual content of the local object against the remote one (which I really don't recommend).

I'd suggest keeping a (local) database of metadata about your S3 objects, containing elements such as `Key` and `LastModified`. Store your files locally in a predefined folder, and make sure each file has a name that can be deduced from the database information (for example, name them after the `Key`).

You then won't need to read the contents of the files at all to check whether a file was updated. Just query your database, query the S3 API using `list_objects_v2`, and compare the dates for each file. If they don't match, download the newer version of the file.

In this manner, you can also check for files missing from your local repository: if the API returns any keys that don't exist in your database, you can easily identify and retrieve them.

P.S. A deleted answer suggested using the operating system's library functions to check a file's last-modified date. That's a great idea, but if you need performance, iterating over metadata stored in a database can be faster than making a filesystem call for every file in a directory.
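A minimal sketch of that metadata-database approach, using SQLite as the store. The table layout, database filename, and the `needs_download`/`sync` helpers are my own assumptions, not part of the question:

```python
import os
import sqlite3


def needs_download(stored_mtime, remote_mtime):
    """True when a key is absent from the metadata DB or its timestamp changed."""
    return stored_mtime is None or stored_mtime != remote_mtime


def sync(bucket, prefix, target_dir, db_path="s3_metadata.db"):
    """Download only objects whose LastModified differs from the recorded value."""
    import boto3  # imported here so the pure helper above has no AWS dependency

    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS objects (key TEXT PRIMARY KEY, last_modified TEXT)"
    )
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            remote_mtime = obj["LastModified"].isoformat()
            row = db.execute(
                "SELECT last_modified FROM objects WHERE key = ?", (key,)
            ).fetchone()
            if needs_download(row[0] if row else None, remote_mtime):
                local_path = os.path.join(target_dir, *key.split("/"))
                os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
                s3.download_file(bucket, key, local_path)
                # Record the new timestamp so the next run skips this object
                db.execute(
                    "INSERT OR REPLACE INTO objects (key, last_modified) VALUES (?, ?)",
                    (key, remote_mtime),
                )
    db.commit()
```

The comparison is deliberately an exact string match rather than "newer than": S3's `LastModified` is authoritative for the remote side, and any difference means the local copy was built from a different version.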
If you're just concerned with downloading new files and files that have been modified remotely, you can check whether the file exists locally, and if it does, whether the remote object has a different size or a newer timestamp. That should catch most cases where a remote file is added or changed.

```python
import boto3
import os
from datetime import datetime

BUCKET_NAME = 'testing'  # as in the question
KEY = ''                 # prefix to mirror
TARGET_DIR = 'test_dir'  # local destination

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')

# Use a paginator to handle cases with more than 1000 objects
for page in paginator.paginate(Bucket=BUCKET_NAME, Prefix=KEY):
    for obj in page.get('Contents', []):
        key = obj['Key']
        # Turn the key into a name, using '/' on S3 as the path delimiter locally
        local_name = os.path.join(*key[len(KEY):].split("/"))
        local_name = os.path.join(TARGET_DIR, local_name)
        changed = False
        if not os.path.isfile(local_name):
            # The file does not exist locally
            changed = True
        elif os.path.getsize(local_name) != obj['Size']:
            # The local file is a different size
            changed = True
        elif datetime.fromtimestamp(os.path.getmtime(local_name)) < obj['LastModified'].replace(tzinfo=None):
            # The local file is older than the remote file
            changed = True
        if changed:
            if not os.path.isdir(os.path.dirname(local_name)):
                # Need to make the directory to mirror the prefix of the object
                os.makedirs(os.path.dirname(local_name))
            # Download the file
            s3_client.download_file(BUCKET_NAME, key, local_name)
            print(f"Downloaded {local_name}")
```
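On the ETag/MD5 part of the question: for objects uploaded in a single part without SSE-KMS encryption, the ETag is the hex MD5 digest of the object's bytes, so you can compare it against a locally computed hash. Multipart-upload ETags contain a `-` and are not a plain MD5. A sketch of such a check (the helper name is my own):

```python
import hashlib


def etag_matches_local(local_path, etag):
    """Compare an S3 ETag to a local file's MD5.

    Only valid for single-part, non-SSE-KMS uploads; multipart ETags
    contain a '-' and cannot be compared as a plain MD5.
    """
    etag = etag.strip('"')  # the API returns the ETag wrapped in quotes
    if '-' in etag:
        raise ValueError("multipart ETag; cannot compare as plain MD5")
    md5 = hashlib.md5()
    with open(local_path, 'rb') as f:
        # Hash in 1 MiB chunks to avoid loading large files into memory
        for chunk in iter(lambda: f.read(1 << 20), b''):
            md5.update(chunk)
    return md5.hexdigest() == etag
```

This is the most reliable "has the content changed" test when it applies, but it reads every local file in full, which is why the size/timestamp checks above are usually tried first.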
`aws s3 sync` can be used to do this. – jarmod
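For completeness, a sketch of the `aws s3 sync` approach the comment suggests, using the bucket and directory names from the question. By default, sync skips files whose size matches and whose local copy is at least as new; `--exact-timestamps` tightens that to an exact timestamp match when sizes are equal:

```shell
# Mirror the bucket into the local directory, downloading only changed objects
aws s3 sync s3://testing/ ./test_dir --exact-timestamps
```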