`data-diff` Command

The data-diff command compares data between two tables from the same or different data sources. It provides comprehensive schema comparison, statistical analysis, and data profiling to help identify differences between datasets across environments or sources.

This command is particularly useful for:

Comparing tables between development, staging, and production environments
Validating data migrations and ETL processes
Monitoring data drift between different data sources
Ensuring data consistency across different systems

Usage

bash

bruin data-diff [FLAGS] <table1> <table2>

Arguments:

table1: The first table to compare. Can be specified as connection:table or just table if using a default connection.
table2: The second table to compare. Can be specified as connection:table or just table if using a default connection.

Flags

Flag	Type	Default	Description
`--connection`, `-c`	str	-	Name of the default connection to use when connection is not specified in table arguments
`--tolerance`, `-t`	float	`0.001`	Tolerance percentage for considering values equal. Values with percentage difference below this threshold are considered equal
`--config-file`	str	`.bruin.yml`	The path to the .bruin.yml configuration file

Table Identifier Format

Tables can be specified in two formats:

With connection prefix: connection_name:table_name
- Example: prod_db:users, staging_bq:events
Without connection prefix: table_name
- Requires the --connection flag to specify the default connection
- Example: users (when using --connection prod_db)

What Gets Compared

The data-diff command performs a comprehensive comparison that includes:

Schema Comparison

Column names and types: Identifies missing, extra, and mismatched columns
Data type compatibility: Checks if different types are comparable (e.g., VARCHAR vs STRING)
Nullability constraints: Compares nullable/not-null settings
Uniqueness constraints: Compares unique/non-unique settings

Statistical Analysis

For each column that exists in both tables, the command provides detailed statistics based on the column's data type:

Numerical Columns

Row count and null count
Fill rate (percentage of non-null values)
Min, Max, Average, Sum values
Standard deviation

String Columns

Row count and null count
Fill rate (percentage of non-null values)
Distinct value count
Empty string count
Min, Max, and Average string lengths

Boolean Columns

Row count and null count
Fill rate (percentage of non-null values)
True and False counts

DateTime Columns

Row count and null count
Fill rate (percentage of non-null values)
Distinct value count
Earliest and Latest dates

JSON Columns

Row count and null count
Fill rate (percentage of non-null values)

Difference Calculation

Absolute differences: Raw numeric differences between values
Percentage differences: Relative changes as percentages
Tolerance handling: Values within the specified tolerance are considered equal
Color-coded output: Green for matches/small differences, red for significant differences

Output Format

The command generates several detailed tables:

Summary Table: High-level overview of row counts and schema differences
Column Types Comparison: Side-by-side comparison of column types
Schema Comparison: Detailed breakdown of matching vs different columns
Column Differences: Specific differences for columns that exist in both tables
Missing Columns: Columns that exist in one table but not the other
Statistical Comparison: Detailed statistics for each common column

Examples

Basic Usage

Compare two tables using explicit connection names:

bash

bruin data-diff prod_db:users staging_db:users

Compare tables using a default connection:

bash

bruin data-diff --connection my_db users_v1 users_v2

Cross-Environment Comparison

Compare the same table across different environments:

bash

bruin data-diff prod_bq:analytics.users dev_bq:analytics.users

Custom Tolerance

Use a higher tolerance for considering values equal (useful for floating-point comparisons):

bash

bruin data-diff --tolerance 0.1 prod_db:metrics staging_db:metrics

Custom Config File

Specify a different configuration file:

bash

bruin data-diff --config-file /path/to/custom/.bruin.yml prod:table1 staging:table2

Supported Data Platforms

The data-diff command includes specialized type mapping support for the following data platforms:

DuckDB - Full support for DuckDB data types including HUGEINT, UHUGEINT, and specialized time types
BigQuery - Native support for BigQuery types including INT64, FLOAT64, BIGNUMERIC, and BigQuery-specific formatting
PostgreSQL & AWS Redshift - Complete support for PostgreSQL types including SERIAL types, MONEY, network types (CIDR, INET), and JSONB
Snowflake - Full support for Snowflake types including NUMBER, VARIANT, and timezone-aware timestamp types

Type Mapping Features

Each supported platform includes intelligent type mapping that:

Normalizes data types across platforms (e.g., PostgreSQL INTEGER and BigQuery INT64 both map to numeric)
Handles parametrized types automatically (e.g., VARCHAR(255) and VARCHAR(100) are both treated as string)
Supports case-insensitive matching for type names
Maps platform-specific types to common categories for cross-platform comparisons

Cross-Platform Comparisons

When comparing tables between different platforms, the command intelligently maps data types to common categories:

Numeric types: All integer, float, decimal, and monetary types
String types: VARCHAR, CHAR, TEXT, and similar text types
Boolean types: BOOL, BOOLEAN, and logical types
DateTime types: DATE, TIME, TIMESTAMP, and interval types
Binary types: BLOB, BYTEA, BINARY, and similar binary types
JSON types: JSON, JSONB, VARIANT, and structured data types

Connection Requirements

To use the data-diff command, your connections must support table summarization. This means the connection type must implement the TableSummarizer interface, which includes the ability to:

Retrieve table schema information
Calculate statistical summaries for different column types
Handle various data types (numerical, string, boolean, datetime, JSON)

For optimal results, use one of the fully supported data platforms listed above. Other database connections may work but may have limited type mapping capabilities.

Use Cases

Data Migration Validation

bash

# Compare source and target after migration
bruin data-diff source_db:customer_data target_db:customer_data

Environment Consistency Checks

If you are comparing different environments, you can use the --schema-only flag to only compare the schema of the tables and not the data.

bash

# Ensure staging matches production structure
bruin data-diff --schema-only prod:important_table staging:important_table --tolerance 0.01

ETL Process Monitoring

bash

# Compare before and after transformation
bruin data-diff raw_data:events processed_data:events_cleaned

Data Quality Monitoring

bash

# Check for unexpected changes in data distribution
bruin data-diff yesterday_snapshot:metrics today_snapshot:metrics

Troubleshooting

Error: "connection type does not support table summarization"

The specified connection type doesn't support the required table analysis features.

If you'd like to see another platform supported, feel free to open an issue on our GitHub repository.

Error: "incorrect number of arguments"

The command requires exactly two table arguments.

Verify you've provided both table identifiers.

Error: "connection not specified for table"

Table identifier doesn't include a connection prefix and no default connection was set.

Either use the format connection:table or add the --connection flag.

Error: "failed to get connection"

The specified connection name doesn't exist in your configuration.

Check your .bruin.yml file for available connections.

Introduction

Features

Templates

VS Code Extension

Panels Overview

Side Panel

Jinja Templating

`data-diff` Command

Usage

Flags

Table Identifier Format

What Gets Compared

Schema Comparison

Statistical Analysis

Numerical Columns

String Columns

Boolean Columns

DateTime Columns

JSON Columns

Difference Calculation

Output Format

Examples

Basic Usage

Cross-Environment Comparison

Custom Tolerance

Custom Config File

Supported Data Platforms

Type Mapping Features

Cross-Platform Comparisons

Connection Requirements

Use Cases

Data Migration Validation

Environment Consistency Checks

ETL Process Monitoring

Data Quality Monitoring

Troubleshooting

Error: "connection type does not support table summarization"

Error: "incorrect number of arguments"

Error: "connection not specified for table"

Error: "failed to get connection"

Panels Overview

Side Panel

data-diff Command ​

Usage ​

Flags ​

Table Identifier Format ​

What Gets Compared ​

Schema Comparison ​

Statistical Analysis ​

Numerical Columns ​

String Columns ​

Boolean Columns ​

DateTime Columns ​

JSON Columns ​

Difference Calculation ​

Output Format ​

Examples ​

Basic Usage ​

Cross-Environment Comparison ​

Custom Tolerance ​

Custom Config File ​

Supported Data Platforms ​

Type Mapping Features ​

Cross-Platform Comparisons ​

Connection Requirements ​

Use Cases ​

Data Migration Validation ​

Environment Consistency Checks ​

ETL Process Monitoring ​

Data Quality Monitoring ​

Troubleshooting ​

Error: "connection type does not support table summarization" ​

Error: "incorrect number of arguments" ​

Error: "connection not specified for table" ​

Error: "failed to get connection" ​

`data-diff` Command

Usage

Flags

Table Identifier Format

What Gets Compared

Schema Comparison

Statistical Analysis

Numerical Columns

String Columns

Boolean Columns

DateTime Columns

JSON Columns

Difference Calculation

Output Format

Examples

Basic Usage

Cross-Environment Comparison

Custom Tolerance

Custom Config File

Supported Data Platforms

Type Mapping Features

Cross-Platform Comparisons

Connection Requirements

Use Cases

Data Migration Validation

Environment Consistency Checks

ETL Process Monitoring

Data Quality Monitoring

Troubleshooting

Error: "connection type does not support table summarization"

Error: "incorrect number of arguments"

Error: "connection not specified for table"

Error: "failed to get connection"