Quickstart
Make sure you have already installed the Bruin CLI on your device.
INFO
The steps here assume you have DuckDB installed. You can replace any reference to duckdb.sql with your own data warehouse if you'd like to run the examples elsewhere.
Create your first pipeline
Bruin includes a handy command called init; you can simply run it to create a new Bruin project.
bruin init default my-pipeline
This command will:
- Create a project named my-pipeline.
- Generate a folder called my-pipeline containing the following:
  - An assets folder
  - A .bruin.yml file (where you will add connections and credentials)
  - A pipeline.yml file to manage your pipeline
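The resulting layout looks roughly like this; the default template may also include a few example assets, so the exact contents can differ:

my-pipeline/
├── .bruin.yml      # connections and credentials
├── pipeline.yml    # pipeline-level configuration
└── assets/         # your SQL, Python, and YAML assets
    └── ...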
Once you have the project structure, you can run the whole pipeline:
bruin run
Adding a new asset
In a nutshell, an asset is anything that generates value with data. In practice, an asset can be a table in your data warehouse, a Python script, or a file in S3. Bruin represents assets in code, placed under the assets folder.
To create a new asset, simply add a new file under the assets folder. Let's start with an ingestr asset.
Creating an ingestr asset
Let's start by ingesting some data from an external source.
ingestr is an open-source CLI tool that allows ingesting data from any source into any destination.
You can create an ingestr asset with a file named assets/players.asset.yml:
name: dataset.players
type: ingestr
parameters:
  destination: duckdb
  source_connection: chess-default
  source_table: profiles
The configuration in the YAML file has a few bits:
- name: the name of the asset, also the table that will store the ingested data
- connection: the destination that will be used to load the data
- parameters: the configuration for ingestr
  - source_connection: the connection that will be used for the source data
  - source_table: the table that will be ingested
That's it: this asset will read data from the chess source and load it into your DuckDB database.
Setting up your .bruin.yml file
To ensure the asset works correctly, configure your environments and connections by editing your .bruin.yml file. This file specifies the environments and the connections your pipeline will use.
Modify the .bruin.yml file as follows:
default_environment: default
environments:
  default:
    connections:
      duckdb:
        - name: "duckdb-default"
          path: "duckdb.db"
      chess:
        - name: "chess-default"
          players:
            - "MagnusCarlsen"
            - "Hikaru"
You can run this asset either via the Bruin VS Code extension, or in the terminal:
bruin run assets/players.asset.yml
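To peek at the ingested data, you can query the DuckDB file directly. This assumes the DuckDB CLI is installed and that you run the command from the directory containing the duckdb.db file configured in .bruin.yml:

duckdb duckdb.db "SELECT count(*) FROM dataset.players"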
Creating a SQL asset
Let's create a new SQL asset with a file assets/player_stats.sql:
/* @bruin
name: dataset.player_stats
type: duckdb.sql
materialization:
  type: table
depends:
  - dataset.players
@bruin */

SELECT name, count(*) AS player_count
FROM dataset.players
GROUP BY 1
This asset has a few lines of configuration at the top:
- name: the name of the asset; it needs to be unique within a pipeline
- type: duckdb.sql means DuckDB SQL; Bruin supports many other types of assets
- materialization: take the query result and materialize it as a table
Bruin will take the result of the given query and create a dataset.player_stats table in DuckDB with it. You can also use the view materialization type instead of table to create a view.
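For example, switching the asset to a view only requires changing the materialization block in the header:

materialization:
  type: view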
INFO
Bruin supports many asset types, including BigQuery, Snowflake, Python, Redshift, Databricks, and more.
You can run this asset either via the Bruin VS Code extension, or in the terminal:
bruin run assets/player_stats.sql
Creating a Python asset
Similar to SQL, Bruin supports running Python natively as well.
You can create a Python asset with a file assets/my_python_asset.py:
"""@bruin
name: my_python_asset
@bruin"""
print('hello world')
- The section between """@bruin and @bruin""" is a comment marker; it contains the configuration for the asset.
- The rest of the body is a regular Python script.
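The configuration block can also carry other asset properties such as depends, so the script only runs after its upstream assets. Here is a minimal sketch with a hypothetical asset name:

"""@bruin
name: report.player_summary
depends:
  - dataset.players
@bruin"""

# hypothetical asset: runs only after dataset.players has been refreshed
print("summarizing dataset.players")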
You can run this asset either via the Bruin VS Code extension, or in the terminal:
bruin run assets/my_python_asset.py
At this point, you can also run the whole pipeline:
bruin run
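If you are not inside the project directory, bruin run also accepts a path argument, so you can point it at the pipeline folder explicitly (the path below is just the folder created earlier):

bruin run ./my-pipeline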
Data Quality Checks
Bruin supports data quality checks natively, as part of the asset definition. It includes a handful of data quality checks, and it also supports custom checks.
Let's add a few data quality checks to our table:
/* @bruin
name: dataset.player_stats
type: duckdb.sql
materialization:
  type: table
depends:
  - dataset.players

# you can define column metadata and quality checks
columns:
  - name: name
    type: string
    description: this column contains the player names
    checks:
      - name: not_null
      - name: unique
  - name: player_count
    type: int
    description: the number of players with the given name
    checks:
      - name: not_null
      - name: positive

# you can also define custom checks
custom_checks:
  - name: row count is greater than zero
    description: this check ensures that the table is not empty
    query: SELECT count(*) > 0 FROM dataset.player_stats
@bruin */

SELECT name, count(*) AS player_count
FROM dataset.players
GROUP BY 1
We have added a new columns section to the asset; it contains a list of all the columns and the checks applied to them.
Under the checks section, each column defines some quality checks:
- the name column is marked as not null and unique
- the player_count column is marked as not null and consisting of positive numbers
Under the custom_checks section, we have added a custom quality check that verifies the table is not empty.
You can refresh the asset and run the quality checks by simply running the asset:
bruin run assets/player_stats.sql
If you'd like to run only the checks, you can run:
bruin run --only checks assets/player_stats.sql
Next steps
You have created your first pipeline, ingested data from an external source, added a handful of assets, and run quality checks on them; now you are ready to dig deeper. Jump into the Concepts section to learn more about the underlying concepts Bruin uses for your data pipelines.