
Boto3 vs AWS Wrangler: Simplifying S3 Operations with Python

by Antonello Benedetto | Jun 2023


A comparative analysis for AWS S3 development

Image by Hemerson Coelho on Unsplash

In this tutorial, we will delve into the world of AWS S3 development with Python by exploring and comparing two powerful libraries: boto3 and awswrangler.

If you’ve ever wondered

“What’s the best Python tool to interact with AWS S3 buckets?”

“How can I perform S3 operations in the most efficient way?”

then you’ve come to the right place.

Indeed, throughout this post, we will cover a number of common operations essential for working with AWS S3 buckets, among which:

  1. listing objects,
  2. checking object existence,
  3. downloading objects,
  4. uploading objects,
  5. deleting objects,
  6. writing objects,
  7. reading objects (the standard way or with SQL)

By comparing the two libraries, we will identify their similarities, differences, and optimal use cases for each operation. By the end, you will have a clear understanding of which library is better suited to specific S3 tasks.

Additionally, for those who read to the very bottom, we will also explore how to leverage boto3 and awswrangler to read data from S3 using friendly SQL queries.

So let’s dive in, discover the best tools for interacting with AWS S3, and learn how to perform these operations efficiently with Python using both libraries.

The package versions used in this tutorial are:

  • boto3==1.26.80
  • awswrangler==2.19.0

Also, three initial files including randomly generated account_balances data have been uploaded to an S3 bucket named coding-tutorials:

Although you should be aware that a number of ways exist to establish a connection to an S3 bucket, in this case we are going to use setup_default_session() from boto3:

# CONNECTING TO S3 BUCKET
import os
import io
import boto3
import awswrangler as wr
import pandas as pd

boto3.setup_default_session(aws_access_key_id='your_access_key',
                            aws_secret_access_key='your_secret_access_key')

bucket = 'coding-tutorials'

This method is convenient because, once the session has been set, it can be shared by both boto3 and awswrangler, meaning that we won’t need to pass any additional secrets down the line.
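As a side note, if hardcoding keys is not desirable, the same default session can also be built from a named profile; below is a minimal sketch, assuming a hypothetical profile called my-profile exists in ~/.aws/credentials (profile name and region are assumptions, adjust them to your own setup):

# ALTERNATIVE CONNECTION (sketch) - relies on a named profile instead of hardcoded keys
boto3.setup_default_session(profile_name='my-profile',   # assumed profile name
                            region_name='eu-west-1')     # assumed region

Once set, the rest of the code in this tutorial works unchanged, since both libraries read from the same default session.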

Now let’s compare boto3 and awswrangler while performing a number of common operations, and find out which is the best tool for the job.

The full notebook including the code that follows can be found in this GitHub folder.

# 1 Listing Objects

Listing objects is probably the first operation we should perform while exploring a new S3 bucket, and it is a simple way to check whether a session has been correctly set.

With boto3, objects can be listed using:

  • boto3.client('s3').list_objects()
  • boto3.resource('s3').Bucket().objects.all()

print('--BOTO3--')
# BOTO3 - Preferred Method
client = boto3.client('s3')

for obj in client.list_objects(Bucket=bucket)['Contents']:
    print('File Name:', obj['Key'], 'Size:', round(obj['Size'] / (1024*1024), 2), 'MB')

print('----')
# BOTO3 - Alternative Method
resource = boto3.resource('s3')

for obj in resource.Bucket(bucket).objects.all():
    print('File Name:', obj.key, 'Size:', round(obj.size / (1024*1024), 2), 'MB')

Although both the client and resource classes do a decent job, the client class should be preferred, as it is more elegant and provides plenty of easily accessible low-level metadata as a nested JSON (among which the object size).
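For instance, each entry of the Contents list also exposes fields such as LastModified, ETag and StorageClass; a minimal sketch of how they could be accessed, reusing the same client as above:

# BOTO3 - Extra per-object metadata available in the list_objects response (sketch)
for obj in client.list_objects(Bucket=bucket)['Contents']:
    print(obj['Key'],
          obj['LastModified'],   # datetime of the last write
          obj['ETag'],           # entity tag of the object
          obj['StorageClass'])   # e.g. STANDARD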

On the other hand, awswrangler only provides a single method to list objects:

  • wr.s3.list_objects()

Being a high-level method, this doesn’t return any low-level metadata about the object, so that to find the file size we also need to call:

  • wr.s3.size_objects()

print('--AWS_WRANGLER--')
# AWS WRANGLER

for obj in wr.s3.list_objects("s3://coding-tutorials/"):
    print('File Name:', obj.replace('s3://coding-tutorials/', ''))

print('----')
for obj, size in wr.s3.size_objects("s3://coding-tutorials/").items():
    print('File Name:', obj.replace('s3://coding-tutorials/', ''), 'Size:', round(size / (1024*1024), 2), 'MB')

The code above returns:

Comparison → Boto3 Wins

Despite awswrangler being more straightforward to use, boto3 wins the challenge when it comes to listing S3 objects. In fact, its low-level implementation means that much more object metadata can be retrieved using one of its classes. Such information is extremely useful when accessing an S3 bucket programmatically.

# 2 Checking Object Existence

The ability to check object existence is required when we need further operations to be triggered depending on whether an object is already available in S3 or not.

With boto3, such checks can be performed using:

  • boto3.client('s3').head_object()

object_key = 'account_balances_jan2023.parquet'

# BOTO3
print('--BOTO3--')
client = boto3.client('s3')
try:
    client.head_object(Bucket=bucket, Key=object_key)
    print(f"The object exists in the bucket {bucket}.")
except client.exceptions.ClientError:
    # head_object raises a generic ClientError (404) when the key is missing
    print(f"The object does not exist in the bucket {bucket}.")

Instead, awswrangler provides the dedicated method:

  • wr.s3.does_object_exist()

# AWS WRANGLER
print('--AWS_WRANGLER--')
# does_object_exist() returns a boolean, so a simple if/else is enough
if wr.s3.does_object_exist(f's3://{bucket}/{object_key}'):
    print(f"The object exists in the bucket {bucket}.")
else:
    print(f"The object does not exist in the bucket {bucket}.")

The code above returns:

Comparison → AWSWrangler Wins

Let’s admit it: the boto3 method name [head_object()] is not that intuitive.

Also, having a dedicated method is undoubtedly an advantage for awswrangler, which wins this round.

# 3 Downloading Objects

Downloading objects locally is extremely easy with both boto3 and awswrangler, using the following methods:

  • boto3.client('s3').download_file() or
  • wr.s3.download()

The only difference is that download_file() takes bucket, object_key and local_file as input variables, whereas download() only requires the S3 path and local_file:

object_key = 'account_balances_jan2023.parquet'

# BOTO3
client = boto3.client('s3')
client.download_file(bucket, object_key, 'tmp/account_balances_jan2023_v2.parquet')

# AWS WRANGLER
wr.s3.download(path=f's3://{bucket}/{object_key}', local_file='tmp/account_balances_jan2023_v3.parquet')

When the code is executed, both versions of the same object are indeed downloaded locally inside the tmp/ folder:

Comparison → Draw

We can consider both libraries to be equivalent as far as downloading files is concerned, so let’s call it a draw.

# 4 Uploading Objects

The same reasoning applies when uploading files from the local environment to S3. The methods that can be employed are:

  • boto3.client('s3').upload_file() or
  • wr.s3.upload()

object_key_1 = 'account_balances_apr2023.parquet'
object_key_2 = 'account_balances_may2023.parquet'

file_path_1 = os.path.dirname(os.path.realpath(object_key_1)) + '/' + object_key_1
file_path_2 = os.path.dirname(os.path.realpath(object_key_2)) + '/' + object_key_2

# BOTO3
client = boto3.client('s3')
client.upload_file(file_path_1, bucket, object_key_1)

# AWS WRANGLER
wr.s3.upload(local_file=file_path_2, path=f's3://{bucket}/{object_key_2}')

Executing the code uploads two new account_balances objects (for the months of April and May 2023) to the coding-tutorials bucket:

Comparison → Draw

This is another draw. So far, there is absolute parity between the two libraries!

# 5 Deleting Objects

Let’s now assume we wanted to delete the following objects:

# SINGLE OBJECT
object_key = 'account_balances_jan2023.parquet'

# MULTIPLE OBJECTS
object_keys = ['account_balances_jan2023.parquet',
               'account_balances_feb2023.parquet',
               'account_balances_mar2023.parquet']

boto3 allows us to delete objects one-by-one or in bulk, using the following methods:

  • boto3.client('s3').delete_object()
  • boto3.client('s3').delete_objects()

Both methods return a response including ResponseMetadata that can be used to verify whether objects have been deleted successfully or not. For instance:

  • while deleting a single object, an HTTPStatusCode of 204 indicates that the operation has been completed successfully (if the object is found in the S3 bucket);
  • while deleting multiple objects, a Deleted list is returned with the names of the successfully deleted items.
# BOTO3
print('--BOTO3--')
client = boto3.client('s3')

# Delete Single Object
response = client.delete_object(Bucket=bucket, Key=object_key)
deletion_date = response['ResponseMetadata']['HTTPHeaders']['date']

if response['ResponseMetadata']['HTTPStatusCode'] == 204:
    print(f'Object {object_key} deleted successfully on {deletion_date}.')
else:
    print('Object could not be deleted.')

# Delete Multiple Objects
objects = [{'Key': key} for key in object_keys]

response = client.delete_objects(Bucket=bucket, Delete={'Objects': objects})
deletion_date = response['ResponseMetadata']['HTTPHeaders']['date']

if len(object_keys) == len(response['Deleted']):
    print(f'All objects were deleted successfully on {deletion_date}')
else:
    print('Some objects could not be deleted.')

Alternatively, awswrangler provides a single method that can be used for both single and bulk deletions:

  • wr.s3.delete_objects()

Since object_keys can be passed directly to the method as a list comprehension, instead of being converted to a dictionary first like before, using this syntax is a real pleasure.

# AWS WRANGLER
print('--AWS_WRANGLER--')
# Delete Single Object
wr.s3.delete_objects(path=f's3://{bucket}/{object_key}')

# Delete Multiple Objects
try:
    wr.s3.delete_objects(path=[f's3://{bucket}/{key}' for key in object_keys])
    print('All objects deleted successfully.')
except Exception:
    print('Objects could not be deleted.')

Executing the code above deletes the objects in S3 and then returns:

Comparison → Boto3 Wins

This is a tricky one: awswrangler has a simpler syntax to use when deleting multiple objects, as we can simply pass the full list to the method.

However, boto3 returns a lot of information in the response, which makes for extremely useful logs when deleting objects programmatically.

Because in a production environment low-level metadata is better than almost no metadata, boto3 wins this challenge and now leads 2-1.

# 6 Writing Objects

When it comes to writing files to S3, boto3 doesn’t even provide an out-of-the-box method to perform such an operation.

For example, if we wanted to create a new parquet file using boto3, we would first have to persist the object on the local disk (using the to_parquet() method from pandas) and then upload it to S3 using the upload_fileobj() method.

Differently from upload_file() (explored at point 4), the upload_fileobj() method is a managed transfer which will perform a multipart upload in multiple threads, if necessary:

object_key_1 = 'account_balances_june2023.parquet'

# RUN THE GENERATOR.PY SCRIPT

df.to_parquet(object_key_1)

# BOTO3
client = boto3.client('s3')

# Upload the Parquet file to S3
with open(object_key_1, 'rb') as file:
    client.upload_fileobj(file, bucket, object_key_1)
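That said, the intermediate file on disk can be avoided even with boto3 by serialising the DataFrame into an in-memory buffer first; here is a minimal sketch (not part of the original notebook), assuming pandas is backed by a parquet engine such as pyarrow:

# BOTO3 - In-memory variant (sketch): write the parquet bytes to a BytesIO buffer
buffer = io.BytesIO()
df.to_parquet(buffer)   # pandas writes the parquet bytes into the buffer
buffer.seek(0)          # rewind before handing the stream to boto3

client.upload_fileobj(buffer, bucket, object_key_1)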

On the other hand, one of the main advantages of the awswrangler library (when working with pandas) is that it can be used to write objects directly to the S3 bucket (without saving them to the local disk), which is both elegant and efficient.

Moreover, awswrangler offers great flexibility, allowing users to:

  • apply specific compression algorithms like snappy, gzip and zstd;
  • append to or overwrite existing files via the mode parameter when dataset=True (see the sketch after the next code block);
  • specify one or more partition columns via the partition_cols parameter.
object_key_2 = 'account_balances_july2023.parquet'

# AWS WRANGLER
wr.s3.to_parquet(df=df,
                 path=f's3://{bucket}/{object_key_2}',
                 compression='gzip',
                 partition_cols=['COMPANY_CODE'],
                 dataset=True)

Once executed, the code above writes account_balances_june2023 as a single parquet file, and account_balances_july2023 as a folder with four files already partitioned by COMPANY_CODE:
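Regarding the mode parameter mentioned above, a minimal sketch of how subsequent runs could append to (or fully overwrite) the same dataset; the mode values shown are the ones accepted by awswrangler when dataset=True:

# AWS WRANGLER - mode parameter (sketch), only valid when dataset=True
wr.s3.to_parquet(df=df,
                 path=f's3://{bucket}/{object_key_2}',
                 dataset=True,
                 partition_cols=['COMPANY_CODE'],
                 mode='append')   # or 'overwrite' / 'overwrite_partitions'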

Comparison → AWSWrangler Wins

If working with pandas is an option, awswrangler offers a much more advanced set of operations when writing files to S3, particularly when compared to boto3, which in this case is not exactly the best tool for the job.

# 7.1 Reading Objects (Python)

A similar reasoning applies when trying to read objects from S3 using boto3: since this library doesn’t offer a built-in read method, the best option we have is to perform an API call (get_object()), read the Body of the response and then pass the parquet_object to pandas.

Note that the pd.read_parquet() method expects a file-like object as input, which is why we need to pass the content read from the parquet_object as a binary stream.

Indeed, by using io.BytesIO() we create a temporary file-like object in memory, avoiding the need to save the parquet file locally before reading it. This in turn improves performance, especially when working with large files:

object_key = 'account_balances_may2023.parquet'

# BOTO3
client = boto3.client('s3')

# Read the Parquet file
response = client.get_object(Bucket=bucket, Key=object_key)
parquet_object = response['Body'].read()

df = pd.read_parquet(io.BytesIO(parquet_object))
df.head()

As expected, awswrangler instead excels at reading objects from S3, returning a pandas df as output.

It supports a number of input formats like csv, json, parquet and, more recently, delta tables. Also, passing the chunked parameter allows us to read objects in a memory-friendly way (see the sketch at the end of this section):

# AWS WRANGLER
df = wr.s3.read_parquet(path=f's3://{bucket}/{object_key}')
df.head()

# wr.s3.read_csv()
# wr.s3.read_json()
# wr.s3.read_parquet_table()
# wr.s3.read_deltalake()

Executing the code above returns a pandas df with May data:

Comparison → AWSWrangler Wins

Yes, there are ways around the lack of proper methods in boto3. However, awswrangler is a library conceived to read S3 objects efficiently, hence it also wins this challenge.
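As a side note on the chunked parameter mentioned in this section, here is a minimal sketch of memory-friendly reading, where awswrangler yields a sequence of smaller DataFrames instead of a single large one (the chunk size of 100_000 rows is an arbitrary assumption):

# AWS WRANGLER - chunked reading (sketch)
# chunked=True yields one DataFrame per file; an integer caps the rows per chunk
for chunk in wr.s3.read_parquet(path=f's3://{bucket}/{object_key}', chunked=100_000):
    print(chunk.shape)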

# 7.2 Reading Objects (SQL)

Those who managed to read to this point deserve a bonus, and that bonus is reading objects from S3 using plain SQL.

Let’s suppose we wanted to fetch data from the account_balances_may2023.parquet object using the query below (which filters data by AS_OF_DATE):

object_key = 'account_balances_may2023.parquet'
query = """SELECT * FROM s3object s
           WHERE AS_OF_DATE > CAST('2023-05-13T' AS TIMESTAMP)"""

In boto3 this can be achieved via the select_object_content() method. Note how we should also specify the InputSerialization and OutputSerialization formats:

# BOTO3
client = boto3.client('s3')

resp = client.select_object_content(
    Bucket=bucket,
    Key=object_key,
    Expression=query,
    ExpressionType='SQL',
    InputSerialization={'Parquet': {}},
    OutputSerialization={'JSON': {}},
)

records = []

# Process the event stream returned in the response
for event in resp['Payload']:
    if 'Records' in event:
        records.append(event['Records']['Payload'].decode('utf-8'))

# Concatenate the JSON records into a single string
json_string = ''.join(records)

# Load the JSON records into a pandas DataFrame
df = pd.read_json(json_string, lines=True)

# Print the DataFrame
df.head()

If working with a pandas df is an option, awswrangler also offers a very convenient select_query() method that requires minimal code:

# AWS WRANGLER
df = wr.s3.select_query(
    sql=query,
    path=f's3://{bucket}/{object_key}',
    input_serialization='Parquet',
    input_serialization_params={}
)
df.head()

For both libraries, the returned df will look like this:

In this tutorial we explored 7 common operations that can be performed on S3 buckets and ran a comparative analysis between the boto3 and awswrangler libraries.

Both approaches allow us to interact with S3 buckets, however the main difference is that the boto3 client provides low-level access to AWS services, whereas awswrangler offers a simplified, more high-level interface for a number of data engineering tasks.

Overall, awswrangler is our winner with 3 points (checking object existence, writing objects, reading objects) vs 2 points scored by boto3 (listing objects, deleting objects). Both the upload/download categories were draws and didn’t assign points.

Despite the result above, the truth is that both libraries give their best when used interchangeably, to excel at the tasks they have been built for.
