S3

Amazon Simple Storage Service (S3) is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface. Amazon S3 uses the same scalable storage infrastructure that Amazon.com uses to run its e-commerce network. Amazon S3 can store any type of object, which allows use cases like storage for Internet applications, backups, disaster recovery, data archives, data lakes for analytics, and hybrid cloud storage.

Bruin supports S3 via Ingestr assets, and you can use it to move data to and from your data warehouse.

Reading data from S3

To set up the S3 connection, you need to add a configuration item to the .bruin.yml file and to the asset file. You will need an access_key_id and a secret_access_key; for details on how to obtain these credentials, please refer here.

Follow the steps below to correctly set up S3 as a data source and run ingestion.

Step 1: Add a connection to .bruin.yml file

To connect to S3, you need to add a configuration item to the connections section of the .bruin.yml file. This configuration must comply with the following schema:

yaml
    connections:
      s3:
        - name: "my-s3"
          access_key_id: "AKI_123"
          secret_access_key: "L6L_123"
  • access_key_id and secret_access_key: Used for accessing the S3 bucket.

Step 2: Create an asset file for data ingestion

To ingest data from S3, you need to create an asset configuration file. This file defines the data flow from the source to the destination. Create a YAML file (e.g., s3_ingestion.yml) inside the assets folder and add the following content:

yaml
name: public.s3
type: ingestr
connection: postgres

parameters:
  source_connection: my-s3
  source_table: 'mybucket/students/students_details.csv'

  destination: postgres
  • name: The name of the asset.
  • type: Specifies the type of the asset. Set this to ingestr to use the ingestr data pipeline.
  • connection: This is the destination connection, which defines where the data should be stored. For example: postgres indicates that the ingested data will be stored in a Postgres database.
  • source_connection: The name of the S3 connection defined in .bruin.yml.
  • source_table: The bucket name and file path (or file glob) separated by a forward slash (/); see the example below this list.

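For example, to ingest every CSV file under the students/ prefix rather than a single object, the source_table can use a glob pattern. This is a minimal sketch reusing the bucket and connection names from the example above:

yaml
name: public.s3
type: ingestr
connection: postgres

parameters:
  source_connection: my-s3
  source_table: 'mybucket/students/*.csv'

  destination: postgres
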
Step 3: Run asset to ingest data

bash
bruin run assets/s3_ingestion.yml

As a result of this command, Bruin will ingest data from the given S3 table into your Postgres database.
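If you prefer to run the whole pipeline rather than a single asset, you can point bruin run at the pipeline path instead (this sketch assumes the pipeline lives in the current working directory):

bash
bruin run .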

Writing data to S3

Bruin also allows you to move data from any supported source to S3 using Ingestr assets. This is useful for exporting processed data, creating backups, or sharing data.

Follow the steps below to correctly set up S3 as a data destination and run data exports.

Step 1: Add a connection to .bruin.yml file

To write data to S3, you first need to configure an S3 connection in your .bruin.yml file. This connection will specify the destination bucket and credentials.

yaml
    connections:
      s3: 
        - name: "my-s3-destination" 
          access_key_id: "YOUR_AWS_ACCESS_KEY_ID"
          secret_access_key: "YOUR_AWS_SECRET_ACCESS_KEY"
          bucket_name: "your-s3-bucket-name"
          path_to_file: "your/destination/prefix"
  • name: A unique name for this S3 connection.
  • access_key_id and secret_access_key: AWS credentials for accessing the S3 bucket.
  • bucket_name: The name of the S3 bucket where data will be written.
  • path_to_file: A base path or prefix within the bucket where files will be stored. Files specified in the asset will be relative to this path. For example, if path_to_file is exports/ and your asset writes report.csv, the full path will be exports/report.csv within the bucket.

Step 2: Create an asset file for data export

Next, create an asset configuration file (e.g., s3_export.yml) in your assets folder. This file defines the data flow from your source (e.g., a database table) to S3.

yaml
name: public.results 
type: ingestr
connection: my-s3-destination

parameters:
  source_connection: postgres 
  source_table: 'public.students' 

  destination: s3
  • name: The name of the asset.
  • type: Specifies the type of the asset. Set this to ingestr.
  • connection: The name of the S3 connection (defined in .bruin.yml) to which data will be written. This is your destination connection.
  • source_connection: The name of the Bruin connection for your source database.
  • source_table: The fully qualified name of the table in your source database that you want to export.
  • destination: Set to s3 to indicate S3 as the destination type.

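The same pattern works for any source Bruin supports through Ingestr; only the source side of the asset changes. As a minimal sketch, exporting a table from a hypothetical DuckDB connection named my-duckdb (the connection and table names here are illustrative, not part of the example above) would look like this:

yaml
name: public.orders
type: ingestr
connection: my-s3-destination

parameters:
  source_connection: my-duckdb
  source_table: 'main.orders'

  destination: s3
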
Step 3: Run asset to export data

Finally, run the asset to export data from your database to S3:

bash
bruin run assets/s3_export.yml

As a result of this command, Bruin will execute the Ingestr pipeline, reading data from the specified source table and writing it to the designated S3 location. The format of the output file will be parquet.

OUTPUT

Analyzed the pipeline 'pg-to-s3' with 2 assets.

Pipeline: pg-to-s3 (.)
  No issues found

✓ Successfully validated 1 assets across 1 pipeline, all good.

Starting the pipeline execution...
Running: public.results
[public.results] >>
[public.results] >> Initiated the pipeline with the following:
[public.results] >> Source: postgresql / public.students
[public.results] >> Destination: s3 / public.results
[public.results] >> Incremental Strategy: replace
[public.results] >> Incremental Key: None
[public.results] >> Primary Key: None
[public.results] >>
[public.results] >> Starting the ingestion...
[public.results] >> --- Extract ---
[public.results] >> Resources: 0/1 (0.0%) | Time: 0.00s | Rate: 0.00/s
[public.results] >> Memory usage: 236.23 MB (54.80%) | CPU usage: 0.00%
[public.results] >>
[public.results] >> --- Extract ---
[public.results] >> Files: 0/1 (0.0%) | Time: 0.00s | Rate: 0.00/s
[public.results] >> Items: 0 | Time: 0.00s | Rate: 0.00/s
[public.results] >> Memory usage: 239.30 MB (54.90%) | CPU usage: 0.00%
[public.results] >>
[public.results] >> Normalize
[public.results] >> Jobs: 0/1 (0.0%) | Time: 2.48s | Rate: 0.00/s
[public.results] >> Memory usage: 263.02 MB (55.00%) | CPU usage: 0.00%
[public.results] >>
[public.results] >> Load
[public.results] >> Jobs: 1/1 (100.0%) | Time: 2.99s | Rate: 0.33/s
[public.results] >> Memory usage: 266.11 MB (55.00%) | CPU usage: 0.00%
[public.results] >>
[public.results] >> Successfully finished loading data from 'postgresql' to 's3' in 5.69 seconds
[public.results] >> Finished: public.results (5.69s)
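
Once the run finishes, you could verify that the file landed in the bucket, for example with the AWS CLI. This is an optional check; the bucket and prefix below match the connection sketch from Step 1, and the exact object key follows the default layout described in "Controlling the layout" below:

bash
aws s3 ls s3://your-s3-bucket-name/your/destination/prefix/public.results/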

Writing to an S3-compatible storage

Bruin supports writing data to any S3-compatible storage, such as MinIO, DigitalOcean Spaces, or Cloudflare R2.

To use an S3-compatible storage service, you need to configure the endpoint_url in your S3 connection settings within the .bruin.yml file. This URL should point to the API server of the S3-compatible storage service you are using.

For example, if you are using MinIO, your connection configuration might look like this:

yaml
    connections:
      s3:
        - name: "my-minio-destination"
          access_key_id: "YOUR_MINIO_ACCESS_KEY"
          secret_access_key: "YOUR_MINIO_SECRET_KEY"
          bucket_name: "your-minio-bucket-name"
          path_to_file: "your/destination/prefix"
          endpoint_url: "http://your-minio-server:9000"
  • endpoint_url: The API endpoint of your S3-compatible storage service.

NOTE

endpoint_url enables using an S3-compatible service, such as GCS or MinIO, as a destination.

The rest of the setup, including creating asset files and running the export, remains the same as described in the "Writing data to S3" section. By specifying the endpoint_url, Bruin will direct Ingestr to interact with your chosen S3-compatible provider instead of AWS S3.
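
For instance, an export asset that targets the MinIO connection above looks identical to a regular S3 export. Here is a minimal sketch reusing the public.students source from the earlier example:

yaml
name: public.results
type: ingestr
connection: my-minio-destination

parameters:
  source_connection: postgres
  source_table: 'public.students'

  destination: s3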

Controlling the layout

When writing data to S3 or an S3-compatible storage, Bruin allows you to control the naming and structure of the output files using the layout parameter in your S3 connection configuration within .bruin.yml. This parameter provides a way to customize the output path and filename based on variables like the table name and extension.

If the layout parameter is not specified, the output structure follows the default behavior of ingestr. Typically, ingestr creates a folder named after the source table and places the data file (e.g., a Parquet file) within it. For instance, exporting a public.users table to s3://my-bucket/exports/ would result in an output file path like s3://my-bucket/exports/public.users/<load_id>.<file_id>.parquet.

To customize this, you can add the layout field to your S3 connection.

For example, to save the output as a Parquet file named directly after the table specified by your asset (for example, sales), add the layout field as follows:

yaml
    connections:
      s3:
        - name: "my-s3-custom-layout"
          access_key_id: "YOUR_ACCESS_KEY_ID"
          secret_access_key: "YOUR_SECRET_ACCESS_KEY"
          bucket_name: "your-s3-bucket"
          path_to_file: "exports/" 
          layout: "{table_name}.{ext}"

This will result in the output file being written to:

  • s3://your-s3-bucket/exports/sales.parquet.

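Other combinations of the same variables are possible. For instance, to keep a per-table folder but drop the file_id from the filename, you could use a layout like the one below (a sketch built only from the variables shown in the default path above):

yaml
    connections:
      s3:
        - name: "my-s3-partitioned-layout"
          access_key_id: "YOUR_ACCESS_KEY_ID"
          secret_access_key: "YOUR_SECRET_ACCESS_KEY"
          bucket_name: "your-s3-bucket"
          path_to_file: "exports/"
          layout: "{table_name}/{load_id}.{ext}"

With the sales example above, this would write the output to s3://your-s3-bucket/exports/sales/<load_id>.parquet.
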
You can find a list of available variables here.