truthound compare¶
Compare two datasets and detect data drift. This command identifies distributional changes between a baseline and current dataset.
Synopsis¶
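A usage sketch reconstructed from the arguments and options documented below (the exact synopsis string is assumed, not taken from the tool's help output):

```bash
truthound compare [OPTIONS] BASELINE CURRENT
```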
Arguments¶
| Argument | Required | Description |
|---|---|---|
| baseline | Yes | Path to the baseline (reference) data file |
| current | Yes | Path to the current data file to compare |
Options¶
| Option | Short | Default | Description |
|---|---|---|---|
| --columns | -c | All | Comma-separated columns to compare |
| --method | -m | auto | Detection method (see Detection Methods) |
| --threshold | -t | Auto | Custom drift threshold |
| --format | -f | console | Output format (console, json) |
| --output | -o | None | Output file path |
| --strict | | false | Exit with code 1 if drift detected |
Available Methods: auto, ks, psi, chi2, js, kl, wasserstein, cvm, anderson, hellinger, bhattacharyya, tv, energy, mmd
Description¶
The compare command detects data drift by comparing statistical distributions:
- Distribution Comparison: Compares value distributions between datasets
- Statistical Tests: Applies appropriate tests based on data types
- Threshold Evaluation: Determines if changes are significant
- Column-by-Column Analysis: Reports drift for each column
Data drift can indicate:

- Data quality issues in the pipeline
- Changes in source systems
- Model performance degradation risk
- Schema or business logic changes
Examples¶
Basic Comparison¶
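Compare a current dataset against a baseline:

```bash
truthound compare baseline.csv current.csv
```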
Output:
```text
Drift Detection Report
======================

Baseline: baseline.csv (10,000 rows)
Current:  current.csv (12,000 rows)

Column Results
──────────────────────────────────────────────────────────────────
Column      Method   Score    Threshold   Drift Detected
──────────────────────────────────────────────────────────────────
age         KS       0.045    0.05        No
income      KS       0.156    0.05        Yes ⚠️
category    Chi2     0.023    0.05        No
region      Chi2     0.234    0.05        Yes ⚠️
score       PSI      0.089    0.10        No
──────────────────────────────────────────────────────────────────

Summary:
  Total Columns: 5
  Drift Detected: 2
  Status: DRIFT DETECTED
```
Specific Columns¶
Compare only selected columns:
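A sketch restricted to a few of the columns from the report above (--columns takes a comma-separated list):

```bash
truthound compare baseline.csv current.csv --columns age,income,score
```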
Specific Method¶
Force a specific detection method:
```bash
# Kolmogorov-Smirnov test
truthound compare baseline.csv current.csv --method ks

# Population Stability Index
truthound compare baseline.csv current.csv --method psi

# Chi-squared test (for categorical)
truthound compare baseline.csv current.csv --method chi2

# Jensen-Shannon divergence
truthound compare baseline.csv current.csv --method js

# Kullback-Leibler divergence
truthound compare baseline.csv current.csv --method kl

# Wasserstein (Earth Mover's) distance
truthound compare baseline.csv current.csv --method wasserstein

# Cramér-von Mises test
truthound compare baseline.csv current.csv --method cvm

# Anderson-Darling test
truthound compare baseline.csv current.csv --method anderson

# Hellinger distance
truthound compare baseline.csv current.csv --method hellinger

# Bhattacharyya distance
truthound compare baseline.csv current.csv --method bhattacharyya

# Total Variation distance
truthound compare baseline.csv current.csv --method tv

# Energy distance
truthound compare baseline.csv current.csv --method energy

# Maximum Mean Discrepancy (MMD)
truthound compare baseline.csv current.csv --method mmd
```
Custom Threshold¶
Set a custom drift threshold:
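A sketch with an illustrative threshold value; how the threshold is interpreted depends on the selected method:

```bash
truthound compare baseline.csv current.csv --threshold 0.1
```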
Strict Mode (CI/CD)¶
Exit with code 1 if drift is detected:
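Useful in CI pipelines where detected drift should fail the job:

```bash
truthound compare baseline.csv current.csv --strict
```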
JSON Output¶
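Write the report as JSON to a file using the documented --format and --output options (the file name is illustrative):

```bash
truthound compare baseline.csv current.csv --format json --output drift.json
```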
Contents of the output file:
```json
{
  "baseline": {
    "file": "baseline.csv",
    "rows": 10000
  },
  "current": {
    "file": "current.csv",
    "rows": 12000
  },
  "has_drift": true,
  "column_results": [
    {
      "column": "age",
      "method": "ks",
      "score": 0.045,
      "threshold": 0.05,
      "p_value": 0.23,
      "has_drift": false
    },
    {
      "column": "income",
      "method": "ks",
      "score": 0.156,
      "threshold": 0.05,
      "p_value": 0.001,
      "has_drift": true
    }
  ],
  "summary": {
    "total_columns": 5,
    "drift_columns": 2,
    "drift_ratio": 0.4
  }
}
```
Detection Methods¶
Automatic Selection (auto)¶
The default auto method selects the best test based on data type:
| Data Type | Method Selected |
|---|---|
| Continuous numeric | Kolmogorov-Smirnov (KS) |
| Categorical (few values) | Chi-squared |
| Categorical (many values) | Jensen-Shannon |
| Binary | Chi-squared |
Kolmogorov-Smirnov (ks)¶
Best for: Continuous numeric data
- Tests if two samples come from the same distribution
- Returns D-statistic (max distance between CDFs) and p-value
- Default threshold: 0.05 (D-statistic)
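Mathematical definition (standard form, assumed here rather than taken from the tool): D = max_x |F_P(x) - F_Q(x)|, the largest absolute difference between the two empirical CDFs.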
Population Stability Index (psi)¶
Best for: Model monitoring, production ML
- Measures population shift between bins
- Industry standard for model monitoring
- Interpretation:
- PSI < 0.1: No significant change
- PSI 0.1-0.25: Moderate change
- PSI > 0.25: Significant change
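Mathematical definition (standard form over binned proportions, assumed here): PSI = Σ(q_i - p_i) × ln(q_i / p_i), where p_i and q_i are the baseline and current proportions in bin i.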
Chi-Squared (chi2)¶
Best for: Categorical data
- Tests independence of categorical distributions
- Returns χ² statistic and p-value
- Requires sufficient sample size per category
Jensen-Shannon Divergence (js)¶
Best for: Any distribution, robust to small samples
- Symmetric measure of distribution similarity
- Range: 0 (identical) to 1 (completely different)
- Works for both numeric and categorical data
Kullback-Leibler Divergence (kl)¶
Best for: Measuring information loss between distributions
- Asymmetric measure (KL(P||Q) ≠ KL(Q||P))
- Numeric columns only
- Interpretation:
- KL < 0.1: Very similar
- KL 0.1-0.2: Moderate difference
- KL > 0.2: Significant difference
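Mathematical definition (standard form, assumed here): KL(P||Q) = Σ p_i × ln(p_i / q_i)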
Wasserstein Distance (wasserstein)¶
Best for: Intuitive "work" measurement between distributions
- Also known as Earth Mover's Distance
- Numeric columns only
- Normalized by baseline standard deviation
- Interpretable as physical distance
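Mathematical definition (standard one-dimensional form, assumed here): W₁(P,Q) = ∫|F_P(x) - F_Q(x)| dx, which per the note above is then normalized by the baseline standard deviation.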
Cramér-von Mises (cvm)¶
Best for: Detecting tail differences in distributions
- More sensitive to tail differences than KS
- Numeric columns only
- Returns statistic and p-value
- Default threshold: p-value < 0.05
Anderson-Darling (anderson)¶
Best for: Emphasizing distribution tail differences
- Most sensitive to tail deviations
- Numeric columns only
- Returns statistic and p-value
- Useful for detecting extreme value changes
Hellinger Distance (hellinger)¶
Best for: Comparing probability distributions with bounded metric
- Symmetric and bounded [0, 1]
- Satisfies triangle inequality (true metric)
- Works for both numeric and categorical data
- Range: 0 (identical) to 1 (completely different)
Mathematical definition: H(P,Q) = (1/√2) × √(Σ(√p_i - √q_i)²)
Bhattacharyya Distance (bhattacharyya)¶
Best for: Measuring distribution overlap and classification bounds
- Related to Bayes classification error
- Works for both numeric and categorical data
- Range: [0, +∞), where 0 means identical distributions
- Reports Bhattacharyya coefficient in details
Mathematical definition: D_B(P,Q) = -ln(Σ√(p_i × q_i))
Total Variation Distance (tv)¶
Best for: Measuring maximum probability difference
- Simple interpretation: "largest probability difference"
- Symmetric and bounded [0, 1]
- Satisfies triangle inequality
- Works for both numeric and categorical data
Mathematical definition: TV(P,Q) = (1/2) × Σ|p_i - q_i|
Energy Distance (energy)¶
Best for: Detecting location and scale differences
- True metric (satisfies triangle inequality)
- Characterizes distributions completely
- Numeric columns only
- Consistent statistical test
Mathematical definition: E(P,Q) = 2×E[|X-Y|] - E[|X-X'|] - E[|Y-Y'|]
Maximum Mean Discrepancy (mmd)¶
Best for: High-dimensional data and kernel-based comparison
- Non-parametric (no density estimation required)
- Works well in high dimensions
- Numeric columns only
- Supports different kernels (RBF, linear, polynomial)
Mathematical definition: MMD²(P,Q) = E[k(X,X')] + E[k(Y,Y')] - 2×E[k(X,Y)]
Method Comparison¶
| Method | Numeric | Categorical | Sensitivity | Use Case |
|---|---|---|---|---|
| KS | Excellent | Poor | High | Statistical testing |
| PSI | Good | Good | Medium | ML monitoring |
| Chi2 | Poor | Excellent | Medium | Category changes |
| JS | Good | Good | Low | General purpose |
| KL | Good | Poor | Medium | Information theory |
| Wasserstein | Excellent | Poor | Medium | Intuitive distance |
| CvM | Excellent | Poor | High | Tail sensitivity |
| Anderson | Excellent | Poor | Highest | Extreme values |
| Hellinger | Good | Good | Medium | Bounded metric |
| Bhattacharyya | Good | Good | Medium | Classification bounds |
| TV | Good | Good | Medium | Max probability diff |
| Energy | Excellent | Poor | High | Location/scale |
| MMD | Excellent | Poor | High | High-dimensional |
Use Cases¶
1. ML Model Monitoring¶
Detect feature drift in production:
```bash
# Compare training data to production
truthound compare training_data.csv production_data.csv --method psi --strict
```
2. Data Pipeline Validation¶
Validate daily data against baseline:
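A sketch with hypothetical file names, writing a JSON report for downstream tooling:

```bash
truthound compare baseline.csv daily_extract.csv --format json --output daily_drift.json
```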
3. A/B Test Validation¶
Ensure test groups are comparable:
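A sketch with hypothetical group files; comparable groups should report no drift:

```bash
truthound compare group_a.csv group_b.csv
```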
4. Schema Evolution Detection¶
Detect changes in data distribution:
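A sketch comparing two snapshots (hypothetical file names), using a method that handles both numeric and categorical columns:

```bash
truthound compare snapshot_previous.csv snapshot_latest.csv --method js
```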
CI/CD Integration¶
GitHub Actions¶
```yaml
- name: Check Data Drift
  run: |
    truthound compare baseline.csv production.csv --strict --format json -o drift.json

- name: Upload Drift Report
  if: failure()
  uses: actions/upload-artifact@v4
  with:
    name: drift-report
    path: drift.json
```
Scheduled Monitoring¶
```yaml
# Run daily drift check
on:
  schedule:
    - cron: '0 6 * * *'

jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      # (repository checkout and truthound installation steps omitted)
      - name: Download baseline
        run: aws s3 cp s3://bucket/baseline.parquet baseline.parquet
      - name: Download latest data
        run: aws s3 cp s3://bucket/latest.parquet latest.parquet
      - name: Check drift
        run: truthound compare baseline.parquet latest.parquet --strict
```
Exit Codes¶
| Code | Condition |
|---|---|
| 0 | Success (no drift, or drift found without --strict) |
| 1 | Drift detected with --strict flag |
| 2 | Usage error (invalid arguments) |
Related Commands¶
- check - Validate data quality
- profile - Generate data profile
- ml drift - Advanced ML-based drift detection