data-diff
Command
The data-diff
command compares data between two tables from the same or different data sources. It provides comprehensive schema comparison, statistical analysis, and data profiling to help identify differences between datasets across environments or sources.
This command is particularly useful for:
- Comparing tables between development, staging, and production environments
- Validating data migrations and ETL processes
- Monitoring data drift between different data sources
- Ensuring data consistency across different systems
Usage
bruin data-diff [FLAGS] <table1> <table2>
Arguments:
- table1: The first table to compare. Can be specified as
connection:table
or justtable
if using a default connection. - table2: The second table to compare. Can be specified as
connection:table
or justtable
if using a default connection.
Flags
Flag | Type | Default | Description |
---|---|---|---|
--connection , -c | str | - | Name of the default connection to use when connection is not specified in table arguments |
--tolerance , -t | float | 0.001 | Tolerance percentage for considering values equal. Values with percentage difference below this threshold are considered equal |
--config-file | str | .bruin.yml | The path to the .bruin.yml configuration file |
Table Identifier Format
Tables can be specified in two formats:
With connection prefix:
connection_name:table_name
- Example:
prod_db:users
,staging_bq:events
- Example:
Without connection prefix:
table_name
- Requires the
--connection
flag to specify the default connection - Example:
users
(when using--connection prod_db
)
- Requires the
What Gets Compared
The data-diff
command performs a comprehensive comparison that includes:
Schema Comparison
- Column names and types: Identifies missing, extra, and mismatched columns
- Data type compatibility: Checks if different types are comparable (e.g.,
VARCHAR
vsSTRING
) - Nullability constraints: Compares nullable/not-null settings
- Uniqueness constraints: Compares unique/non-unique settings
Statistical Analysis
For each column that exists in both tables, the command provides detailed statistics based on the column's data type:
Numerical Columns
- Row count and null count
- Fill rate (percentage of non-null values)
- Min, Max, Average, Sum values
- Standard deviation
String Columns
- Row count and null count
- Fill rate (percentage of non-null values)
- Distinct value count
- Empty string count
- Min, Max, and Average string lengths
Boolean Columns
- Row count and null count
- Fill rate (percentage of non-null values)
- True and False counts
DateTime Columns
- Row count and null count
- Fill rate (percentage of non-null values)
- Distinct value count
- Earliest and Latest dates
JSON Columns
- Row count and null count
- Fill rate (percentage of non-null values)
Difference Calculation
- Absolute differences: Raw numeric differences between values
- Percentage differences: Relative changes as percentages
- Tolerance handling: Values within the specified tolerance are considered equal
- Color-coded output: Green for matches/small differences, red for significant differences
Output Format
The command generates several detailed tables:
- Summary Table: High-level overview of row counts and schema differences
- Column Types Comparison: Side-by-side comparison of column types
- Schema Comparison: Detailed breakdown of matching vs different columns
- Column Differences: Specific differences for columns that exist in both tables
- Missing Columns: Columns that exist in one table but not the other
- Statistical Comparison: Detailed statistics for each common column
Examples
Basic Usage
Compare two tables using explicit connection names:
bruin data-diff prod_db:users staging_db:users
Compare tables using a default connection:
bruin data-diff --connection my_db users_v1 users_v2
Cross-Environment Comparison
Compare the same table across different environments:
bruin data-diff prod_bq:analytics.users dev_bq:analytics.users
Custom Tolerance
Use a higher tolerance for considering values equal (useful for floating-point comparisons):
bruin data-diff --tolerance 0.1 prod_db:metrics staging_db:metrics
Custom Config File
Specify a different configuration file:
bruin data-diff --config-file /path/to/custom/.bruin.yml prod:table1 staging:table2
Supported Data Platforms
The data-diff
command includes specialized type mapping support for the following data platforms:
- DuckDB - Full support for DuckDB data types including
HUGEINT
,UHUGEINT
, and specialized time types - BigQuery - Native support for BigQuery types including
INT64
,FLOAT64
,BIGNUMERIC
, and BigQuery-specific formatting - PostgreSQL & AWS Redshift - Complete support for PostgreSQL types including
SERIAL
types,MONEY
, network types (CIDR
,INET
), andJSONB
- Snowflake - Full support for Snowflake types including
NUMBER
,VARIANT
, and timezone-aware timestamp types
Type Mapping Features
Each supported platform includes intelligent type mapping that:
- Normalizes data types across platforms (e.g., PostgreSQL
INTEGER
and BigQueryINT64
both map tonumeric
) - Handles parametrized types automatically (e.g.,
VARCHAR(255)
andVARCHAR(100)
are both treated asstring
) - Supports case-insensitive matching for type names
- Maps platform-specific types to common categories for cross-platform comparisons
Cross-Platform Comparisons
When comparing tables between different platforms, the command intelligently maps data types to common categories:
- Numeric types: All integer, float, decimal, and monetary types
- String types: VARCHAR, CHAR, TEXT, and similar text types
- Boolean types: BOOL, BOOLEAN, and logical types
- DateTime types: DATE, TIME, TIMESTAMP, and interval types
- Binary types: BLOB, BYTEA, BINARY, and similar binary types
- JSON types: JSON, JSONB, VARIANT, and structured data types
Connection Requirements
To use the data-diff
command, your connections must support table summarization. This means the connection type must implement the TableSummarizer
interface, which includes the ability to:
- Retrieve table schema information
- Calculate statistical summaries for different column types
- Handle various data types (numerical, string, boolean, datetime, JSON)
For optimal results, use one of the fully supported data platforms listed above. Other database connections may work but may have limited type mapping capabilities.
Use Cases
Data Migration Validation
# Compare source and target after migration
bruin data-diff source_db:customer_data target_db:customer_data
Environment Consistency Checks
If you are comparing different environments, you can use the --schema-only
flag to only compare the schema of the tables and not the data.
# Ensure staging matches production structure
bruin data-diff --schema-only prod:important_table staging:important_table --tolerance 0.01
ETL Process Monitoring
# Compare before and after transformation
bruin data-diff raw_data:events processed_data:events_cleaned
Data Quality Monitoring
# Check for unexpected changes in data distribution
bruin data-diff yesterday_snapshot:metrics today_snapshot:metrics
Troubleshooting
Error: "connection type does not support table summarization"
The specified connection type doesn't support the required table analysis features.
If you'd like to see another platform supported, feel free to open an issue on our GitHub repository.
Error: "incorrect number of arguments"
The command requires exactly two table arguments.
Verify you've provided both table identifiers.
Error: "connection not specified for table"
Table identifier doesn't include a connection prefix and no default connection was set.
Either use the format connection:table
or add the --connection
flag.
Error: "failed to get connection"
The specified connection name doesn't exist in your configuration.
Check your .bruin.yml
file for available connections.