truthound ml drift¶
Detect data drift between two datasets using machine learning methods.
Synopsis¶
Arguments¶
| Argument | Required | Description |
|---|---|---|
baseline |
Yes | Path to the baseline (reference) data file (CSV, JSON, Parquet, NDJSON, JSONL) |
current |
Yes | Path to the current data file to compare (CSV, JSON, Parquet, NDJSON, JSONL) |
Options¶
| Option | Short | Default | Description |
|---|---|---|---|
--method |
-m |
feature |
Detection method (distribution, feature, multivariate) |
--threshold |
-t |
0.1 |
Drift threshold (0.0-1.0) |
--columns |
All | Columns to compare (comma-separated) | |
--output |
-o |
None | Output file path |
Description¶
The ml drift command detects changes in data distribution using ML methods:
- Compares statistical properties between datasets
- Detects significant distribution shifts
- Reports drift scores per column
- Identifies most affected features
Detection Methods¶
Distribution (distribution)¶
Compares individual column distributions.
- Tests: KS-test (numeric), Chi-squared (categorical)
- Best for: Feature-level drift detection
- Speed: Fast
- Interpretability: High
Feature (feature)¶
Statistical feature comparison with multiple tests.
- Tests: Mean/std shift, distribution tests, correlation changes
- Best for: ML feature monitoring
- Speed: Fast
- Interpretability: High
Multivariate (multivariate)¶
Detects drift in the joint distribution.
- Algorithm: Compares multivariate distributions
- Best for: Correlated features, complex drift patterns
- Speed: Slower
- Advantage: Catches drift that single-column tests miss
Examples¶
Basic Drift Detection¶
Output:
ML Drift Detection Report
=========================
Baseline: baseline.csv (10,000 rows)
Current: current.csv (12,000 rows)
Method: distribution
Threshold: 0.1
Overall Drift Score: 0.23 (DRIFT DETECTED)
Column Results
──────────────────────────────────────────────────────────────────
Column Score Threshold Drift Detected Details
──────────────────────────────────────────────────────────────────
age 0.05 0.10 No -
income 0.34 0.10 Yes ⚠️ Mean shift: +15%
category 0.08 0.10 No -
region 0.42 0.10 Yes ⚠️ New values: 3
score 0.12 0.10 Yes ⚠️ Std shift: +22%
──────────────────────────────────────────────────────────────────
Drift Summary:
Total Columns: 5
Drifted Columns: 3 (60%)
Status: DRIFT DETECTED
Recommendations:
- Review 'income' for mean shift (+15%)
- Check 'region' for new category values
- Investigate 'score' distribution change
Specific Columns¶
Compare only selected columns:
Custom Threshold¶
Adjust drift sensitivity:
# More sensitive (lower threshold)
truthound ml drift baseline.csv current.csv --threshold 0.05
# Less sensitive (higher threshold)
truthound ml drift baseline.csv current.csv --threshold 0.2
Multivariate Detection¶
Detect complex drift patterns:
Output:
ML Drift Detection Report
=========================
Method: multivariate
Multivariate Drift Analysis
──────────────────────────────────────────────────────────────────
Metric Value Threshold Status
──────────────────────────────────────────────────────────────────
Overall Drift Score 0.28 0.10 DRIFT DETECTED
Correlation Change 0.15 0.10 Changed
Covariance Shift 0.22 0.10 Shifted
Principal Component Drift 0.18 0.10 Drifted
──────────────────────────────────────────────────────────────────
Most Affected Feature Combinations:
1. income × tenure: correlation changed from 0.72 to 0.45
2. age × salary: distribution shift detected
3. region × category: joint distribution changed
JSON Output¶
Output file (drift_report.json):
{
"baseline": {
"file": "baseline.csv",
"rows": 10000
},
"current": {
"file": "current.csv",
"rows": 12000
},
"method": "distribution",
"threshold": 0.1,
"overall_drift_score": 0.23,
"has_drift": true,
"column_results": [
{
"column": "age",
"score": 0.05,
"has_drift": false,
"details": null
},
{
"column": "income",
"score": 0.34,
"has_drift": true,
"details": {
"type": "mean_shift",
"baseline_mean": 50000,
"current_mean": 57500,
"percent_change": 15.0
}
},
{
"column": "region",
"score": 0.42,
"has_drift": true,
"details": {
"type": "new_categories",
"new_values": ["west", "southwest", "northwest"],
"count": 3
}
}
],
"summary": {
"total_columns": 5,
"drifted_columns": 3,
"drift_ratio": 0.6
}
}
Method Comparison¶
| Method | Speed | Complexity | Correlated Features | Best For |
|---|---|---|---|---|
| distribution | Fast | Low | No | Quick checks |
| feature | Fast | Medium | No | ML monitoring |
| multivariate | Slow | High | Yes | Complex drift |
Comparison: ml drift vs compare¶
| Feature | ml drift |
compare |
|---|---|---|
| Focus | ML-oriented drift detection | Statistical testing |
| Methods | distribution, feature, multivariate | ks, psi, chi2, js |
| Multivariate | Yes | No |
| Speed | Varies | Fast |
| Best for | ML pipelines | General use |
Use Cases¶
1. ML Model Monitoring¶
# Check if production data has drifted from training
truthound ml drift training_data.csv production_data.csv --method multivariate --threshold 0.1
2. Feature Store Monitoring¶
# Monitor feature drift daily
truthound ml drift features_yesterday.parquet features_today.parquet --method feature
3. A/B Test Validation¶
# Ensure test groups are comparable
truthound ml drift control_group.csv treatment_group.csv --columns demographics
4. CI/CD Pipeline¶
# GitHub Actions
- name: Check Feature Drift
run: |
truthound ml drift baseline.csv current.csv --method multivariate --threshold 0.15 --output drift.json
# Parse result and fail if drift detected
python -c "
import json
with open('drift.json') as f:
result = json.load(f)
if result['has_drift']:
print(f'Drift detected! Score: {result[\"overall_drift_score\"]}')
for col in result['column_results']:
if col['has_drift']:
print(f' - {col[\"column\"]}: {col[\"score\"]}')
exit(1)
"
5. Retraining Trigger¶
# Check drift and trigger retraining if needed
if truthound ml drift train.csv prod.csv --method multivariate --threshold 0.2; then
echo "No significant drift"
else
echo "Drift detected, triggering retraining"
python retrain_model.py
fi
Exit Codes¶
| Code | Condition |
|---|---|
| 0 | Success |
| 1 | Error (invalid arguments, file not found, or other error) |
Note: Drift detection results are reported in the output, but do not affect the exit code. Use
--output result.jsonto save JSON output and parse thehas_driftfield for CI/CD decisions.
Related Commands¶
compare- Statistical drift detectionml anomaly- Anomaly detectionml learn-rules- Learn validation rules