Amazon S3

Integrating Amazon S3 as a Sink with GlassFlow

In this guide, you will learn how to integrate Amazon S3 cloud storage with GlassFlow using the GlassFlow Python SDK. You will build a custom integration and implement the sink logic in Python code to send the data to S3.

Prerequisites

Before you start, make sure you have the following:

A GlassFlow account. Sign up for a free account if you do not have one.
An AWS account.
The AWS CLI installed.
Python installed on your machine.
Pip installed to manage project packages.

Step 1: AWS Configuration

We use Boto3 (AWS SDK for Python) to store processed streaming data in AWS S3.

Create IAM user

Before using Boto3, you need to set up authentication credentials for your AWS account using either the IAM Console or the AWS CLI. You can either choose an existing IAM user or create a new one.

For instructions about how to create a user using the IAM Console, see Creating IAM users. Once the user has been created, see Managing access keys to learn how to create and retrieve the keys used to authenticate the user.

Copy both the generated aws_access_key_id and aws_secret_access_key. You will use them when you configure the local credentials file and when you set environment variables.

Configure credentials file

Use the AWS CLI to configure your credentials file by running the following command:

aws configure

Alternatively, you can create the credentials file yourself. By default, its location is ~/.aws/credentials on macOS and Linux. At a minimum, the credentials file should specify the access key and secret access key. In this example, the key and secret key for the account are specified in the default profile:

[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY

You may also want to add a default region to the AWS configuration file, which is located by default at ~/.aws/config:

[default]
region=us-west-2

Alternatively, you can pass a region name when creating clients and resources, for example a value read from an AWS_REGION environment variable.

You have now configured credentials for the default profile as well as a default region to use when creating connections. See Configuration for in-depth configuration sources and options.
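As a quick sanity check of the credentials file described above, the following minimal sketch uses only the Python standard library to confirm that a profile contains both keys. The helper name missing_credential_keys is hypothetical and is not part of Boto3 or the AWS CLI; it only assumes that the credentials file uses the INI layout shown earlier.

```python
import configparser

# The two entries the guide says a profile needs at a minimum.
REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def missing_credential_keys(path, profile="default"):
    """Return the required keys that are absent or empty in the given profile.

    Hypothetical helper for illustration; not part of Boto3 or the AWS CLI.
    """
    # Disable interpolation so special characters in secrets are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file simply leaves the parser empty
    if not parser.has_section(profile):
        return list(REQUIRED_KEYS)
    return [
        key
        for key in REQUIRED_KEYS
        if not parser.get(profile, key, fallback="").strip()
    ]
```

For example, calling missing_credential_keys with the path to ~/.aws/credentials should return an empty list once both keys are present in the default profile.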

Step 2: Create a GlassFlow Pipeline

Navigate to GlassFlow WebApp and log in with your credentials.
Click on Create New Pipeline.
Set the pipeline name to s3-data-pipeline.
Data Source: Select SDK or a built-in integration as the data source. If using a built-in integration, follow the prompts to enter the required details (e.g., the topic for Google Pub/Sub).
Transformation: Upload your transformation script (transform.py) to handle any data processing or enrichment, or choose the default Echo function.
Data Sink: Select SDK as the data sink. You will configure the sink in your code to send the data to a specific S3 bucket.
Confirm Pipeline Creation: Copy the new Pipeline ID and Access Token.
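If you upload your own transform.py rather than choosing the Echo function, a pass-through script could look like the sketch below. It assumes that GlassFlow calls a function named handler with the event and a logger; check the GlassFlow transformation documentation for the exact entry point before relying on it.

```python
def handler(data, log):
    """Return each event unchanged, mirroring an echo transformation.

    Assumption: GlassFlow invokes `handler(data, log)` for every event,
    where `data` is the event payload and `log` is a logger object.
    """
    return data
```

Any enrichment or filtering would replace the single return statement, as long as the function still returns the event you want forwarded to the sink.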

Step 3: Set Up a Custom Connector for S3

Clone the project

Start by cloning the glassflow-examples GitHub repository to your local machine.

git clone https://github.com/glassflow/glassflow-examples.git

Navigate to the project directory:

cd glassflow-examples/connectors/sink/aws-s3

Create a new virtual environment

Create a new virtual environment in the same folder and activate that environment:

python -m venv .venv && source .venv/bin/activate

Install libraries

Install the GlassFlow Python SDK, Boto3 (the AWS SDK for Python), and python-dotenv, which loads environment variables from a .env file, using pip:

pip install glassflow python-dotenv boto3

Step 4: Create Environment Configuration File

Add a .env file in the project directory with the following configuration variables:

PIPELINE_ID=your_pipeline_id
PIPELINE_ACCESS_TOKEN=your_pipeline_access_token
AWS_REGION=your_aws_region
AWS_ACCESS_KEY_ID=your_aws_access_key_id
AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
S3_BUCKET_NAME=your_s3_bucket_name

Replace placeholders with the actual values from your GlassFlow account and AWS credentials.
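A connector script can fail in confusing ways if one of these variables is left unset. The sketch below shows one way to check for that up front. The helper name find_missing_settings is hypothetical, and it assumes the values have already been loaded into the process environment, for instance by calling load_dotenv() from python-dotenv.

```python
import os

# Variable names taken from the .env file in this step.
REQUIRED_SETTINGS = (
    "PIPELINE_ID",
    "PIPELINE_ACCESS_TOKEN",
    "AWS_REGION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "S3_BUCKET_NAME",
)


def find_missing_settings(env=None):
    """Return the names of required settings that are unset or empty.

    Hypothetical helper for illustration. `env` defaults to os.environ,
    but any mapping can be passed in, which makes the check easy to test.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]
```

Calling find_missing_settings() at the top of a script and stopping when the list is non-empty gives a clear error before any network call is made.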

Step 5: Push Data to Amazon S3

Review the example Python script sink_connector.py, which pushes data to your S3 bucket, and customize it if needed.

https://github.com/glassflow/glassflow-examples/blob/main/connectors/sink/aws-s3/sink_connector.py

Run the sink_connector.py script to push data to your S3 bucket:

python sink_connector.py
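If you customize the sink logic, the core of it is writing each consumed event to the bucket as an object. The sketch below is not the repository's sink_connector.py; it is a minimal illustration that assumes one JSON object per event and a date-based key layout, both of which are choices made here for the example. The s3_client argument is expected to be a Boto3 S3 client, whose put_object method accepts the Bucket, Key, Body, and ContentType arguments used below.

```python
import json
from datetime import datetime, timezone


def build_object_key(received_at, prefix="events"):
    """Build a date-partitioned object key (an illustrative layout)."""
    return f"{prefix}/{received_at:%Y/%m/%d}/{received_at:%H%M%S%f}.json"


def upload_event(s3_client, bucket, event, received_at=None):
    """Serialize one event as JSON and store it as a single S3 object.

    `s3_client` is expected to behave like boto3.client("s3").
    Returns the key under which the event was stored.
    """
    received_at = received_at or datetime.now(timezone.utc)
    key = build_object_key(received_at)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(event).encode("utf-8"),
        ContentType="application/json",
    )
    return key
```

In a real script you would create the client with boto3.client("s3"), passing region_name if needed, and call upload_event for each event consumed from the GlassFlow pipeline, using the bucket name from S3_BUCKET_NAME. Taking the client as a parameter keeps the upload logic testable without AWS access.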

Summary

In this guide, you learned how to set up a data streaming pipeline with GlassFlow and integrate Amazon S3 through a custom connector. You configured the pipeline to use the SDK as its data sink, and you pushed sample data to S3 using a Python script.
