Integrating data from Amazon S3
Abstract
Use data stored in an Amazon Web Services (AWS) Simple Storage Service (S3) bucket.
Preface
When working in a Jupyter Notebook, you may want to work with data stored in an Amazon Web Services (AWS) Simple Storage Service (S3) bucket. This section covers commands and procedures for working with data stored in Amazon S3.
Chapter 1. Prerequisites
- A Jupyter server running on Red Hat OpenShift Data Science.
- Access to an Amazon Web Services S3 bucket.
- The AWS Access Key ID and AWS Secret Access Key for your Amazon S3 account.
- A Jupyter notebook.
Chapter 2. Creating an Amazon S3 client using notebook cells
To interact with data in Amazon S3 buckets, you must create a local client to handle requests to that service.
Prerequisites
- Access to a Jupyter notebook server running on Red Hat OpenShift Data Science.
- Define values for the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables when you start your notebook server, using the values from your Amazon Web Services account under My Security Credentials.
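Once the notebook server is running, you can confirm that the variables actually reached the notebook environment before going further. A minimal sketch; the credentials_present helper is a name introduced here for illustration:

```python
import os

def credentials_present():
    """Return True when both AWS credential variables are set and non-empty."""
    return all(os.environ.get(v) for v in ('AWS_ACCESS_KEY_ID',
                                           'AWS_SECRET_ACCESS_KEY'))

# Assumption: the variables were defined when the notebook server started.
print('Credentials found' if credentials_present() else 'Credentials missing')
```

If this prints "Credentials missing", restart the notebook server with both variables defined before continuing.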
Procedure
In a new notebook cell, import the required libraries by adding the following:
import os
import boto3
from boto3 import session
In another new notebook cell, define the following to create your session and client.
Define your credentials.
key_id = os.environ.get('AWS_ACCESS_KEY_ID')
secret_key = os.environ.get('AWS_SECRET_ACCESS_KEY')
Define the client session.
session = boto3.session.Session(aws_access_key_id=key_id, aws_secret_access_key=secret_key)
Define the client connection.
s3_client = boto3.client('s3', aws_access_key_id=key_id, aws_secret_access_key=secret_key)
Verification
Create a new cell and run an Amazon S3 command such as the following:
s3_client.list_buckets()
A successful response includes an HTTPStatusCode of 200 and a list of Buckets similar to the following:
'Buckets': [{'Name': 'my-app-asdf3-image-registry-us-east-1-wbmlcvbasdfasdgvtsmkpt', 'CreationDate': datetime.datetime(2021, 4, 21, 6, 8, 52, tzinfo=tzlocal())}, {'Name': 'cf-templates-18rxasdfggawsvb-us-east-1', 'CreationDate': datetime.datetime(2021, 2, 15, 18, 35, 34, tzinfo=tzlocal())}]
Chapter 3. Listing available Amazon S3 buckets using notebook cells
You can check which buckets you have access to by listing the buckets available to your account.
Prerequisites
- Configure an Amazon S3 client in a previous cell in the notebook. See Creating an Amazon S3 client using notebook cells for more information.
Procedure
Create a new notebook cell and use the s3_client to list the available buckets:
s3_client.list_buckets()
You can make this list of buckets easier to read by printing only the name, rather than the full response, for example:
for bucket in s3_client.list_buckets()['Buckets']:
    print(bucket['Name'])
This returns output similar to the following:
my-app-asdf3-image-registry-us-east-1-wbmlcvbasdgasdgtkpt cf-templates-18rxuasgasgvb-us-east-1
Chapter 4. Listing files in available Amazon S3 buckets using notebook cells
You can check the files available in buckets you have access to by listing the objects in the bucket. Because buckets use object storage rather than a typical file system, object naming works differently from normal file naming. Objects in a bucket are always known by a key, which consists of the full path in the bucket plus the name of the file itself.
Prerequisites
- Configure an Amazon S3 client in a previous cell in the notebook. See Creating an Amazon S3 client using notebook cells for more information.
Procedure
Create a new notebook cell and list the objects in the bucket. For example:
bucket_name = 'std-user-bucket1'
s3_client.list_objects_v2(Bucket=bucket_name)
This returns several objects in the following format:
{'Key': 'docker/registry/v2/blobs/sha256/00/0080913dd3f10aadb34asfgsgsdgasdga072049c93606b98bec84adb259b424f/data', 'LastModified': datetime.datetime(2021, 4, 22, 1, 26, 1, tzinfo=tzlocal()), 'ETag': '"6e02fad2deassadfsf900a4bd7344ffe"', 'Size': 4052, 'StorageClass': 'STANDARD'}
You can make this list easier to read by printing only the key rather than the full response, for example:
bucket_name = 'std-user-bucket1'
for key in s3_client.list_objects_v2(Bucket=bucket_name)['Contents']:
    print(key['Key'])
This returns output similar to the following:
docker/registry/v2/blobs/sha256/00/0080913dd3f10aadb34asfgsgsdgasdga072049c93606b98bec84adb259b424f/data
You can also filter your query to list objects under a specific "path" or file name, for example:
bucket_name = 'std-user-bucket1'
for key in s3_client.list_objects_v2(Bucket=bucket_name, Prefix='<start_of_file_path>')['Contents']:
    print(key['Key'])
In the preceding example, replace <start_of_file_path> with your own value.
Chapter 5. Downloading files from available Amazon S3 buckets using notebook cells
You can download a file to your notebook server by using the download_file method.
Prerequisites
- Configure an Amazon S3 client in a previous cell in the notebook. See Creating an Amazon S3 client using notebook cells for more information.
Procedure
Define the following details in a notebook cell:
The bucket that the file is in. Replace <name_of_the_bucket> with your own value.
bucket_name = '<name_of_the_bucket>'
The name of the file to download. Replace <name_of_the_file_to_download> with your own value.
file_name = '<name_of_the_file_to_download>' # Full path from the bucket
The name that you want the file to have after it is downloaded. This can be a full path, a relative path, or just a new file name. Replace <name_of_the_file_when_downloaded> with your own value.
new_file_name = '<name_of_the_file_when_downloaded>'
Download the file, specifying the previous variables as arguments.
s3_client.download_file(bucket_name, file_name, new_file_name)
Note: If you want to retrieve a file as an object that you can then stream as a standard file using the read() method, refer to the Amazon Web Services get object command reference.
Chapter 6. Uploading files to available Amazon S3 buckets using notebook cells
You can upload files from your notebook server to an Amazon S3 bucket by using the upload_file method.
Prerequisites
- Configure an Amazon S3 client in a previous cell in the notebook. See Creating an Amazon S3 client using notebook cells for more information.
Procedure
Define the following details in a notebook cell:
The name of the file to upload. This must include the full local path to the file. Replace <name_of_the_file_to_upload> with your own value.
file_name = '<name_of_the_file_to_upload>'
The name of the bucket to upload the file to. Replace <name_of_the_bucket> with your own value.
bucket_name = '<name_of_the_bucket>'
The full key to use to save the file to the bucket. Replace <full_path_and_file_name> with your own value.
key = '<full_path_and_file_name>'
Upload the file, specifying the previous variables as arguments.
s3_client.upload_file(file_name, bucket_name, key)