S3
Amazon Simple Storage Service (S3) is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface. Amazon S3 uses the same scalable storage infrastructure that Amazon.com uses to run its e-commerce network. Amazon S3 can store any type of object, which allows use cases like storage for Internet applications, backups, disaster recovery, data archives, data lakes for analytics, and hybrid cloud storage.
Bruin supports S3 via Ingestr assets, and you can use it to move data to and from your data warehouse.
Reading data from S3
In order to set up the S3 connection, you need to add a configuration item in the .bruin.yml file and in the asset file. You will need the access_key_id and secret_access_key. For details on how to obtain these credentials, refer to the AWS documentation.
Follow the steps below to correctly set up S3 as a data source and run ingestion.
Step 1: Add a connection to .bruin.yml file
To connect to S3, you need to add a configuration item to the connections section of the .bruin.yml file. This configuration must comply with the following schema:
connections:
  s3:
    - name: "my-s3"
      access_key_id: "AKI_123"
      secret_access_key: "L6L_123"
- access_key_id and secret_access_key: AWS credentials used for accessing the S3 bucket.
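The asset in the next step loads the data into Postgres, so your .bruin.yml also needs a destination connection alongside the S3 one. The sketch below shows how the two connections might sit side by side; the Postgres field names and values are illustrative, so check Bruin's Postgres connection documentation for the exact schema:

connections:
  s3:
    - name: "my-s3"
      access_key_id: "AKI_123"
      secret_access_key: "L6L_123"
  postgres:
    - name: "postgres"
      username: "postgres_user"   # illustrative value
      password: "XXXXXXXX"        # illustrative value
      host: "localhost"
      port: 5432
      database: "dev"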
Step 2: Create an asset file for data ingestion
To ingest data from S3, you need to create an asset configuration file. This file defines the data flow from the source to the destination. Create a YAML file (e.g., s3_ingestion.yml) inside the assets folder and add the following content:
name: public.s3
type: ingestr
connection: postgres
parameters:
  source_connection: my-s3
  source_table: 'mybucket/students/students_details.csv'
  destination: postgres
- name: The name of the asset.
- type: Specifies the type of the asset. Set this to ingestr to use the ingestr data pipeline.
- connection: The destination connection, which defines where the data should be stored. For example, postgres indicates that the ingested data will be stored in a Postgres database.
- source_connection: The name of the S3 connection defined in .bruin.yml.
- source_table: The bucket name and file path (or file glob), separated by a forward slash (/); see the glob example after this list.
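As a sketch of the glob form, assuming a bucket that holds several CSV files under the same prefix (the bucket and folder names here are hypothetical), source_table can point at all of them at once:

name: public.s3
type: ingestr
connection: postgres
parameters:
  source_connection: my-s3
  # hypothetical bucket/prefix; matches every CSV under students/
  source_table: 'mybucket/students/*.csv'
  destination: postgres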
Step 3: Run asset to ingest data
bruin run assets/s3_ingestion.yml
As a result of this command, Bruin will ingest data from the given S3 path into your Postgres database.
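If the asset is part of a larger pipeline, you can also run everything at once by pointing bruin run at the pipeline directory instead of a single asset file; assuming you are in the pipeline root, that would look like:

bruin run .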
Writing data to S3
Bruin also allows you to move data from any supported source to S3 using Ingestr assets. This is useful for exporting processed data, creating backups, or sharing data.
Follow the steps below to correctly set up S3 as a data destination and run data exports.
Step 1: Add a connection to .bruin.yml file
To write data to S3, you first need to configure an S3 connection in your .bruin.yml file. This connection will specify the destination bucket and credentials.
connections:
  s3:
    - name: "my-s3-destination"
      access_key_id: "YOUR_AWS_ACCESS_KEY_ID"
      secret_access_key: "YOUR_AWS_SECRET_ACCESS_KEY"
      bucket_name: "your-s3-bucket-name"
      path_to_file: "your/destination/prefix"
- name: A unique name for this S3 connection.
- access_key_id and secret_access_key: AWS credentials for accessing the S3 bucket.
- bucket_name: The name of the S3 bucket where data will be written.
- path_to_file: A base path or prefix within the bucket where files will be stored. Files specified in the asset will be relative to this path. For example, if path_to_file is exports/ and your asset writes report.csv, the full path will be exports/report.csv within the bucket.
Step 2: Create an asset file for data export
Next, create an asset configuration file (e.g., s3_export.asset.yml) in your assets folder. This file defines the data flow from your source (e.g., a database table) to S3.
name: public.results
type: ingestr
connection: my-s3-destination
parameters:
  source_connection: postgres
  source_table: 'public.students'
  destination: s3
- name: The name of the asset.
- type: Specifies the type of the asset. Set this to ingestr.
- connection: The name of the S3 connection (defined in .bruin.yml) to which data will be written. This is your destination connection.
- source_connection: The name of the Bruin connection for your source database.
- source_table: The fully qualified name of the table in your source database that you want to export.
- destination: Set to s3 to indicate S3 as the destination type.
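Each table you want to export gets its own asset file. As a sketch, assuming a second, hypothetical source table public.teachers in the same Postgres database, another asset reusing the same connections would look like this:

name: public.teachers
type: ingestr
connection: my-s3-destination
parameters:
  source_connection: postgres
  # hypothetical source table; replace with the table you want to export
  source_table: 'public.teachers'
  destination: s3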
Step 3: Run asset to export data
Finally, run the asset to export data from your database to S3:
bruin run assets/s3_export.asset.yml
As a result of this command, Bruin will execute the Ingestr pipeline, reading data from the specified source table and writing it to the designated S3 location. The output file will be in Parquet format.
OUTPUT
Analyzed the pipeline 'pg-to-s3' with 2 assets.
Pipeline: pg-to-s3 (.)
  No issues found
✓ Successfully validated 1 assets across 1 pipeline, all good.
Starting the pipeline execution...
Running: public.results
[public.results] >>
[public.results] >> Initiated the pipeline with the following:
[public.results] >> Source: postgresql / public.students
[public.results] >> Destination: s3 / public.results
[public.results] >> Incremental Strategy: replace
[public.results] >> Incremental Key: None
[public.results] >> Primary Key: None
[public.results] >>
[public.results] >> Starting the ingestion...
[public.results] >> --- Extract ---
[public.results] >> Resources: 0/1 (0.0%) | Time: 0.00s | Rate: 0.00/s
[public.results] >> Memory usage: 236.23 MB (54.80%) | CPU usage: 0.00%
[public.results] >>
[public.results] >> --- Extract ---
[public.results] >> Files: 0/1 (0.0%) | Time: 0.00s | Rate: 0.00/s
[public.results] >> Items: 0 | Time: 0.00s | Rate: 0.00/s
[public.results] >> Memory usage: 239.30 MB (54.90%) | CPU usage: 0.00%
[public.results] >>
[public.results] >> Normalize
[public.results] >> Jobs: 0/1 (0.0%) | Time: 2.48s | Rate: 0.00/s
[public.results] >> Memory usage: 263.02 MB (55.00%) | CPU usage: 0.00%
[public.results] >>
[public.results] >> Load
[public.results] >> Jobs: 1/1 (100.0%) | Time: 2.99s | Rate: 0.33/s
[public.results] >> Memory usage: 266.11 MB (55.00%) | CPU usage: 0.00%
[public.results] >>
[public.results] >> Successfully finished loading data from 'postgresql' to 's3' in 5.69 seconds
[public.results] >> Finished: public.results (5.69s)
Writing to an S3-compatible storage
Bruin supports writing data to any S3-compatible storage, such as MinIO, DigitalOcean Spaces, or Cloudflare R2.
To use an S3-compatible storage service, you need to configure the endpoint_url in your S3 connection settings within the .bruin.yml file. This URL should point to the API server of the S3-compatible storage service you are using.
For example, if you are using MinIO, your connection configuration might look like this:
connections:
  s3:
    - name: "my-minio-destination"
      access_key_id: "YOUR_MINIO_ACCESS_KEY"
      secret_access_key: "YOUR_MINIO_SECRET_KEY"
      bucket_name: "your-minio-bucket-name"
      path_to_file: "your/destination/prefix"
      endpoint_url: "http://your-minio-server:9000"
- endpoint_url: The API endpoint of your S3-compatible storage service.
NOTE
endpoint_url enables using an S3-compatible service, such as GCS or MinIO, as a destination.
The rest of the setup, including creating asset files and running the export, remains the same as described in the "Writing data to S3" section. By specifying the endpoint_url, Bruin will direct Ingestr to interact with your chosen S3-compatible provider instead of AWS S3.
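The value of endpoint_url depends entirely on the provider. The connections below are sketches with placeholder account, region, and bucket values; consult your provider's documentation for the exact endpoint:

connections:
  s3:
    # Cloudflare R2 (placeholder account ID)
    - name: "my-r2-destination"
      access_key_id: "YOUR_R2_ACCESS_KEY"
      secret_access_key: "YOUR_R2_SECRET_KEY"
      bucket_name: "your-r2-bucket"
      path_to_file: "exports/"
      endpoint_url: "https://<account_id>.r2.cloudflarestorage.com"
    # DigitalOcean Spaces (placeholder region)
    - name: "my-spaces-destination"
      access_key_id: "YOUR_SPACES_KEY"
      secret_access_key: "YOUR_SPACES_SECRET"
      bucket_name: "your-space-name"
      path_to_file: "exports/"
      endpoint_url: "https://nyc3.digitaloceanspaces.com"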
Controlling the layout
When writing data to S3 or an S3-compatible storage, Bruin allows you to control the naming and structure of the output files using the layout parameter in your S3 connection configuration within .bruin.yml. This parameter provides a way to customize the output path and filename based on variables like the table name and extension.
If the layout parameter is not specified, the output structure follows the default behavior of ingestr. Typically, ingestr creates a folder named after the source table and places the data file (e.g., a Parquet file) within it. For instance, exporting a public.users table to s3://my-bucket/exports/ would result in an output file path like s3://my-bucket/exports/public.users/<load_id>.<file_id>.parquet.
To customize this, you can add the layout field to your S3 connection. For example, to save the output as a Parquet file named directly after the table specified by your asset (for example, sales), use the following configuration:
connections:
  s3:
    - name: "my-s3-custom-layout"
      access_key_id: "YOUR_ACCESS_KEY_ID"
      secret_access_key: "YOUR_SECRET_ACCESS_KEY"
      bucket_name: "your-s3-bucket"
      path_to_file: "exports/"
      layout: "{table_name}.{ext}"
This will result in the output file being written to s3://your-s3-bucket/exports/sales.parquet.
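Note that a layout like {table_name}.{ext} produces a single, stable object key, so each run overwrites the previous file. If you would rather keep one file per run, a sketch using the same placeholders the default layout is described with ({load_id} and {file_id}) might look like the line below; treat the placeholder names as an assumption and verify them against the ingestr documentation:

layout: "{table_name}/{load_id}.{file_id}.{ext}"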
You can find a list of available variables here.