Google Cloud Storage

Google Cloud Storage (GCS) is an online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. The service combines the performance and scalability of Google's cloud with advanced security and sharing capabilities. It is an Infrastructure as a Service (IaaS), comparable to Amazon S3.

Bruin supports GCS as a source and a destination for Ingestr assets, and you can use it to move data to and from your data warehouse.

To set up a GCS connection, add a configuration item to the .bruin.yml file and reference that connection in the asset file. You will need either service_account_file or service_account_json. For details on how to obtain these credentials, please refer here.

Reading data from GCS

Follow the steps below to correctly set up GCS as a data source and run ingestion.

Step 1: Add a connection to .bruin.yml file

To connect to GCS, you need to add a configuration item to the connections section of the .bruin.yml file. This configuration must comply with the following schema:

yaml
    connections:
      gcs:
          # name of your connection
        - name: "my-gcs"
          # you can either specify a path to the service account file
          service_account_file: "path/to/file.json"
          # or you can specify the service account json directly
          service_account_json: |
            {
              "type": "service_account",
              ...
            }
  • service_account_file: The path to the service account JSON file
  • service_account_json: The service account JSON content itself
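Because only one of the two credential options is required, a minimal connection entry could look like the sketch below. The connection name and the file path are placeholders, not values Bruin expects.

```yaml
connections:
  gcs:
    # placeholder connection name; referenced later as source_connection
    - name: "my-gcs"
      # placeholder path to a service account key file on disk
      service_account_file: "path/to/file.json"
```

Use service_account_json instead if you prefer to embed the key content directly in .bruin.yml rather than pointing to a file.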

Step 2: Create an asset file for data ingestion

To ingest data from GCS, you need to create an asset configuration file. This file defines the data flow from the source to the destination. Create a YAML file (e.g., gcs_ingestion.yml) inside the assets folder and add the following content:

yaml
name: public.gcs
type: ingestr
connection: postgres

parameters:
  source_connection: my-gcs
  source_table: 'my-bucket/students_details.csv'

  destination: postgres
  • name: The name of the asset.
  • type: Specifies the type of the asset. Set this to ingestr to use the ingestr data pipeline.
  • connection: This is the destination connection, which defines where the data should be stored. For example: postgres indicates that the ingested data will be stored in a Postgres database.
  • source_connection: The name of the GCS connection defined in .bruin.yml.
  • source_table: The bucket name and the file path (or file glob), separated by a forward slash (/).
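Since source_table also accepts a file glob, an asset that reads several files at once might look like the sketch below. The asset name and the `*.csv` pattern are illustrative assumptions; the exact glob syntax supported is not specified here, so check it against your Bruin version.

```yaml
# hypothetical asset name, chosen for illustration
name: public.gcs_students
type: ingestr
connection: postgres

parameters:
  source_connection: my-gcs
  # assumed glob: all CSV files at the top level of my-bucket
  source_table: 'my-bucket/*.csv'

  destination: postgres
```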

Step 3: Run asset to ingest data

bruin run assets/gcs_ingestion.yml

As a result of this command, Bruin will ingest data from the given GCS bucket into your Postgres database.

Writing data to GCS

Follow the steps below to correctly set up GCS as a destination and run ingestion.

Step 1: Add a connection to .bruin.yml file

To connect to GCS, you need to add a configuration item to the connections section of the .bruin.yml file. This configuration must comply with the following schema:

yaml
    connections:
      gcs:
        - name: "gcs"
          # you can either specify a path to the service account file
          service_account_file: "path/to/file.json"
          # or you can specify the service account json directly
          service_account_json: |
            {
              "type": "service_account",
              ...
            }
          bucket_name: "my-org-bucket"
          path_to_file: "records"
          layout: "{table_name}.{ext}" #optional
  • service_account_file: The path to the service account JSON file
  • service_account_json: The service account JSON content itself
  • bucket_name: The name of the GCS bucket where data will be written.
  • path_to_file: A base path or prefix within the bucket where files will be stored. Files specified in the asset are written relative to this path.
  • layout: Layout template (optional, destination only). If you would like to create a parquet file with the same name as the source table (as opposed to a folder), set layout to {table_name}.{ext}. The list of available layout variables is available here.

Step 2: Create an asset file for data ingestion

To ingest data to GCS, you need to create an asset configuration file. This file defines the data flow from the source to the destination. Create a YAML file (e.g., stripe_gcs.yml) inside the assets folder and add the following content:

yaml
name: public.final
type: ingestr
connection: gcs

parameters:
  source_connection: stripe
  source_table: 'event'

  destination: gcs
  • name: The name of the asset.
  • type: Specifies the type of the asset. Set this to ingestr to use the ingestr data pipeline.
  • connection: This is the destination connection, which defines where the data should be stored. For example: gcs indicates that the ingested data will be stored in a GCS bucket.

Step 3: Run asset to ingest data

bruin run assets/stripe_gcs.yml

As a result of this command, Bruin will ingest data from the given Stripe source into your GCS bucket.