Glossary
Bruin focuses on enabling independent teams designing independent data products. These products ought to be built and developed independently, while all working towards a cohesive data strategy.
One of the most important aspects of this aligned approach to building data products is agreeing on the language. For instance, if you go to an e-commerce company and ask different individuals in different teams "what is a customer for you?", you will get different answers:
- a CRM manager might say: "a customer is someone who signed up to our email list.
- a backend engineer could say: "a customer is a row in the
customers
table in the backend. - a marketing manager could say: "a customer is everyone that successfully finished signing up to our platform"
- a product manager could say: "a customer is someone who purchased at least one item through us"
The list can go on further, but you get the idea: different teams have the same name for different concepts, and aligning on these concepts are a crucial part of building successful data products.
In order to align on different teams on building on a shared language, Bruin has a feature called "glossary".
INFO
Glossary is a beta feature, there may be unexpected behavior or mistakes while utilizing them in assets.
Entities & Attributes
Glossaries in Bruin support two primary concepts at the time of writing:
- Entity: a high-level business entity that is not necessarily tied to a data asset, e.g.
Customer
orOrder
- Attribute: the logical attributes of an entity, e.g.
ID
for aCustomer
, orAddress
for anOrder
.- Attributes have names, types and descriptions.
An entity can have zero or more attributes, while an attribute must always be within an entity.
INFO
Glossaries are primarily utilized for entities in its first version. In the future they will be used to incorporate further business concepts.
Defining a glossary
In order to define your glossary, you need to put a file called glossary.yml
at the root of the repo:
- The file
glossary.yml
must be at the root of the repo. - The file must be named
glossary.yml
orglossary.yaml
, nothing else.
Below is an example glossary.yml
file that defines 2 entities, a Customer
entity and an Address
entity:
# The `entities` key is used to define entities within the glossary, which can then be referred by different assets.
entities:
Customer:
description: Customer is an individual/business that has registered on our platform.
attributes:
ID:
type: integer
description: The unique identifier of the customer in our systems.
Email:
type: string
description: the e-mail address the customer used while registering on our website.
Language:
type: string
description: the language the customer picked during registration.
# You can define multi-line descriptions, and give further details or references for others.
Address:
description: |
An address represent a physical, real-world location that is used across various entities, such as customer or order.
These addresses can be anywhere in the world, there is no country/geography limitation.
The addresses are not validated beforehand, therefore the addresses are not guaranteed to be real.
attributes:
ID:
type: integer
description: The unique identifier of the address in our systems.
Street:
type: string
description: The given street name for the address, depending on the country it may have a different structure.
Country:
type: string
description: The country of the address, represents a real country.
CountryCode:
type: string
description: The ISO country code for the given country.
The file structure is flexible enough to allow conceptual attributes to be defined here. You can define unlimited number of entities and attributes.
Schema
The glossary.yml
file has a rather simple schema:
entities
: must be key-value pairs of string - an Entity objectEntity
object:description
: string, supports markdown descriptionsattributes
: a key-value map, where the key is the name of the attribute, the value is an object.type
: the data type of the attributedescription
: the markdown description of the given column
Take a look at the example above and modify it as per your needs.
Utilizing entities in assets via extends
One of the early uses of entities is addressing & documenting concepts that repeat in multiple places. For instance, let's say you have a "age" property that is used across 5 different assets, this means you would have to go and document each of these columns one by one inside the assets. Using "entities", you can instead refer them all to a single attribute using the extends
keyword.
name: raw.customers
columns:
- name: customer_id
extends: Customer.ID
Let's take a look at the extends
key here:
- The format it follows is
<entity>.<attribute>
, e.g.Customer.ID
refers to theID
attribute in theCustomer
entity. - In this example, the column
id
extends the attributeID
ofCustomer
, which means it will take the attribute definition as the default:- There's already a
name
defined,id
, that takes priority. - There's no
description
for the column, and the attribute has that, therefore take thedescription
from theCustomer.ID
attribute. - There's no
type
defined for the column, therefore take thetype
from theCustomer.ID
attribute too.
- There's already a
In the end, the resulting asset will behave as if it is defined this way:
name: raw.customers
columns:
- name: customer_id
description: The unique identifier of the customer in our systems. # this comes from the `ID` attribute of the `Customer` entity
type: integer # this comes from the `ID` attribute of the `Customer` entity
Thanks to the entities, you don't have to repeat definitions across assets.
Order of priority
Entities are used as defaults when there are no explicit definitions on an asset column. Entity attributes support the following fields:
name
: the name of the attributetype
: the type of the attributedescription
: the human-readable description of the attribute, ideally in relation to the business
Bruin will take all of the fields from the attribute, and combine them with the asset column:
- if the column definition already has a value for a field, use that.
- if not, and the entity attribute has that, use that.
- if neither has the value for the field, leave it empty.
This means, the following asset definition will still produce a valid asset:
name: raw.customers
columns:
- extends: Customer.ID # use `ID` as the column name, `integer` as the `type`, and `description` from the attribute
- extends: Customer.Email # use `Email` as the column name, `string` as the `type`, and `description` from the attribute
Bruin will parse the extends
references, and merge them with the corresponding attribute definitions.
WARNING
Explicit definition of a column will always take priority over entity attributes. Attributes are there to provide defaults, not to override explicit definitions on an asset level. Asset has higher priority than the glossary.