Integrating data from Amazon S3
Use data stored in an Amazon Web Services (AWS) Simple Storage Service (S3) bucket
Abstract
Chapter 1. Providing feedback on Red Hat documentation
Let Red Hat know how we can make our documentation better. You can provide feedback directly from a documentation page by following the steps below.
- Make sure that you are logged in to the Customer Portal.
- Make sure that you are looking at the Multi-page HTML format of this document.
- Highlight the text that you want to provide feedback on. The Add Feedback prompt appears.
- Click Add Feedback.
- Enter your comments in the Feedback text box and click Submit.
Some ad blockers might impede your ability to provide feedback on Red Hat documentation. If you are using a web browser that has an ad blocker enabled and you are unable to leave feedback, consider disabling your ad blocker. For more information about how to disable your ad blocker, see the documentation for your web browser.
Red Hat automatically creates a tracking issue each time you submit feedback. Open the link that is displayed after you click Submit and start watching the issue, or add more comments to give us more information about the problem.
Thank you for taking the time to provide your feedback.
When working in a Jupyter Notebook, you may want to work with data stored in an Amazon Web Services (AWS) Simple Storage Service (S3) bucket. This section covers commands and procedures for working with data stored in Amazon S3.
Chapter 2. Prerequisites
- A Jupyter server running on Red Hat OpenShift Data Science.
- Access to an Amazon Web Services S3 bucket.
- Locate the AWS Access Key ID and AWS Secret Access Key for your Amazon S3 account.
- A Jupyter Notebook.
Chapter 3. Creating an Amazon S3 client using notebook cells
To interact with data in Amazon S3 buckets, you must create a local client to handle requests to that service.
Prerequisites
- Access to a Jupyter notebook server running on Red Hat OpenShift Data Science.
- Define values for the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables when you start your notebook server, using the values from your Amazon Web Services account under My Security Credentials.
Procedure
In a new notebook cell, import the required libraries by adding the following:
import os
import boto3
from boto3 import session
In another new notebook cell, define the following to create your session and client.
Define your credentials.
key_id = os.environ.get('AWS_ACCESS_KEY_ID')
secret_key = os.environ.get('AWS_SECRET_ACCESS_KEY')

Define the client session.
session = boto3.session.Session(aws_access_key_id=key_id, aws_secret_access_key=secret_key)
Define the client connection.
s3_client = boto3.client('s3', aws_access_key_id=key_id, aws_secret_access_key=secret_key)
Verification
Create a new cell and run an Amazon S3 command such as the following:
s3_client.list_buckets()
A successful response includes an HTTPStatusCode of 200 and a list of Buckets similar to the following:

'Buckets': [{'Name': 'my-app-asdf3-image-registry-us-east-1-wbmlcvbasdfasdgvtsmkpt',
'CreationDate': datetime.datetime(2021, 4, 21, 6, 8, 52, tzinfo=tzlocal())},
{'Name': 'cf-templates-18rxasdfggawsvb-us-east-1',
'CreationDate': datetime.datetime(2021, 2, 15, 18, 35, 34, tzinfo=tzlocal())}]
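If you want to verify the status in code rather than by reading the raw response, a minimal sketch could check the request metadata that boto3 includes in every response (is_ok is a hypothetical helper name, not part of boto3):

```python
# Hypothetical helper: boto3 responses carry request metadata,
# including the HTTP status code of the underlying API call.
def is_ok(response):
    return response.get('ResponseMetadata', {}).get('HTTPStatusCode') == 200

# Usage sketch with a dictionary shaped like an s3_client.list_buckets() response:
sample_response = {'ResponseMetadata': {'HTTPStatusCode': 200}, 'Buckets': []}
print(is_ok(sample_response))  # True
```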
Chapter 4. Listing available Amazon S3 buckets using notebook cells
You can check which buckets you have access to by listing the buckets available to your account.
Prerequisites
- Configure an Amazon S3 client in a previous cell in the notebook. See Creating an Amazon S3 client using notebook cells for more information.
Procedure
Create a new notebook cell and use the s3_client to list available buckets.

s3_client.list_buckets()
You can make this list of buckets easier to read by only printing the name, rather than the full response, for example:
for bucket in s3_client.list_buckets()['Buckets']:
    print(bucket['Name'])

This returns output similar to the following:
my-app-asdf3-image-registry-us-east-1-wbmlcvbasdgasdgtkpt cf-templates-18rxuasgasgvb-us-east-1
Chapter 5. Listing files in available Amazon S3 buckets using notebook cells
You can check the files available in buckets you have access to by listing the objects in the bucket. Because buckets use object storage rather than a typical file system, object naming works differently from normal file naming. Objects in a bucket are always known by a key, which consists of the full path in the bucket plus the name of the file itself.
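For example, a key such as data/processed/2021/results.csv names both the "path" and the file in a single string. Because keys always use forward slashes regardless of operating system, the standard posixpath module can split one into its prefix and file name, as this small sketch shows (the example key is illustrative):

```python
import posixpath

# An object key combines the full "path" in the bucket with the file name.
key = 'data/processed/2021/results.csv'

prefix, file_name = posixpath.split(key)
print(prefix)     # data/processed/2021
print(file_name)  # results.csv
```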
Prerequisites
- Configure an Amazon S3 client in a previous cell in the notebook. See Creating an Amazon S3 client using notebook cells for more information.
Procedure
Create a new notebook cell and list the objects in the bucket. For example:
bucket_name = 'std-user-bucket1'
s3_client.list_objects_v2(Bucket=bucket_name)
This returns several objects in the following format:
{'Key': 'docker/registry/v2/blobs/sha256/00/0080913dd3f10aadb34asfgsgsdgasdga072049c93606b98bec84adb259b424f/data',
'LastModified': datetime.datetime(2021, 4, 22, 1, 26, 1, tzinfo=tzlocal()),
'ETag': '"6e02fad2deassadfsf900a4bd7344ffe"',
'Size': 4052,
'StorageClass': 'STANDARD'}

You can make this list easier to read by printing only the key rather than the full response, for example:
bucket_name = 'std-user-bucket1'
for key in s3_client.list_objects_v2(Bucket=bucket_name)['Contents']:
    print(key['Key'])

This returns output similar to the following:
docker/registry/v2/blobs/sha256/00/0080913dd3f10aadb34asfgsgsdgasdga072049c93606b98bec84adb259b424f/data
You can also filter your query to list for a specific "path" or file name, for example:
bucket_name = 'std-user-bucket1'
for key in s3_client.list_objects_v2(Bucket=bucket_name, Prefix='<start_of_file_path>')['Contents']:
    print(key['Key'])

In the preceding example, replace <start_of_file_path> with your own value.
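Note that list_objects_v2 returns at most 1,000 objects per call. One way to retrieve every key under a prefix is to follow pagination with the client's get_paginator method, as in this sketch (list_all_keys is a hypothetical helper name, not a boto3 method; it assumes a client like the s3_client created earlier):

```python
def list_all_keys(client, bucket, prefix=''):
    """Collect every key under a prefix, following pagination."""
    keys = []
    # get_paginator handles the continuation tokens between pages.
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent when a page holds no objects.
        for obj in page.get('Contents', []):
            keys.append(obj['Key'])
    return keys
```

Called as list_all_keys(s3_client, bucket_name), this returns the complete list of keys even when the bucket holds more than 1,000 objects.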
Chapter 6. Downloading files from available Amazon S3 buckets using notebook cells
You can download a file to your notebook server using the download_file method.
Prerequisites
- Configure an Amazon S3 client in a previous cell in the notebook. See Creating an Amazon S3 client using notebook cells for more information.
Procedure
Define the following details in a notebook cell:
The bucket that the file is in. Replace <name_of_the_bucket> with your own value.

bucket_name = '<name_of_the_bucket>'

The name of the file to download. Replace <name_of_the_file_to_download> with your own value.

file_name = '<name_of_the_file_to_download>' # Full path from the bucket

The name that you want the file to have after it is downloaded. This can be a full path, a relative path, or just a new file name. Replace <name_of_the_file_when_downloaded> with your own value.

new_file_name = '<name_of_the_file_when_downloaded>'
Download the file, specifying the previous variables as arguments.
s3_client.download_file(bucket_name, file_name, new_file_name)
Note: If you want to retrieve a file as an object that you can then stream as a standard file using the read() method, see the Amazon Web Services GetObject command reference.
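A sketch of that approach, assuming a client like the s3_client created earlier (read_object is a hypothetical helper name): get_object returns a response whose Body entry is a streaming object that supports read().

```python
def read_object(client, bucket, key):
    # get_object returns the object with a streaming 'Body'
    # that can be read like a standard file.
    response = client.get_object(Bucket=bucket, Key=key)
    return response['Body'].read()
```

For example, read_object(s3_client, bucket_name, file_name) returns the object's contents as bytes without writing a file to the notebook server.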
Chapter 7. Uploading files to available Amazon S3 buckets using notebook cells
You can upload files from your notebook server to an Amazon S3 bucket by using the upload_file method.
Prerequisites
- Configure an Amazon S3 client in a previous cell in the notebook. See Creating an Amazon S3 client using notebook cells for more information.
Procedure
Define the following details in a notebook cell:
The name of the file to upload. This must include the full local path to the file. Replace <name_of_the_file_to_upload> with your own value.

file_name = '<name_of_the_file_to_upload>'

The name of the bucket to upload the file to. Replace <name_of_the_bucket> with your own value.

bucket_name = '<name_of_the_bucket>'

The full key to use to save the file to the bucket. Replace <full_path_and_file_name> with your own value.

key = '<full_path_and_file_name>'
Upload the file, specifying the previous variables as arguments.
s3_client.upload_file(file_name, bucket_name, key)
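When choosing the key, it can be convenient to derive it from the local file name. A minimal sketch (build_key is a hypothetical helper, not part of boto3), using posixpath because object keys always use forward slashes:

```python
import os
import posixpath

def build_key(prefix, local_path):
    # Place the local file's base name under a "directory" prefix in the bucket.
    return posixpath.join(prefix, os.path.basename(local_path))

print(build_key('uploads/2021', '/tmp/report.csv'))  # uploads/2021/report.csv
```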