ML Commands
Machine learning-based data quality detection commands.
Overview
| Command | Description | Primary Use Case |
|---------|-------------|------------------|
| anomaly | Detect anomalies in data | Outlier detection |
| drift | Detect data drift | Model monitoring |
| learn-rules | Learn validation rules from data | Rule automation |
What is ML-based Detection?
ML-based detection uses statistical and machine learning algorithms to:
- Detect anomalies: Find outliers and unusual patterns
- Detect drift: Identify distribution changes between datasets
- Learn rules: Automatically generate validation rules from data
Workflow
graph LR
A[Data] --> B{Task?}
B -->|Outliers| C[ml anomaly]
B -->|Changes| D[ml drift]
B -->|Rules| E[ml learn-rules]
C --> F[Anomaly Report]
D --> G[Drift Report]
E --> H[Validation Rules]
Quick Examples
Anomaly Detection
# Detect outliers using Isolation Forest
truthound ml anomaly data.csv --method isolation_forest
# Detect outliers in specific columns
truthound ml anomaly data.csv --columns age,salary --method zscore
Drift Detection
# Detect distribution drift
truthound ml drift baseline.csv current.csv --method distribution
# Detect multivariate drift
truthound ml drift train.csv production.csv --method multivariate
Rule Learning
# Learn validation rules from data
truthound ml learn-rules data.csv -o rules.json --strictness medium
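To make the idea of learning rules from data concrete, here is a minimal Python sketch that derives per-column rules from a handful of records. It is an illustration only: the `learn_rules` function, the `max_categories` parameter, and the shape of the resulting dictionary are hypothetical and are not the schema truthound writes to `rules.json`.

```python
def learn_rules(rows, max_categories=5):
    """Derive simple per-column rules from a list of dict records.

    Hypothetical illustration; not truthound's algorithm or output format.
    """
    rules = {}
    columns = {key for row in rows for key in row}
    for col in sorted(columns):
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        # A column is nullable if at least one record lacks a value.
        rule = {"nullable": len(present) < len(values)}
        is_numeric = present and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        is_text = present and all(isinstance(v, str) for v in present)
        if is_numeric:
            # Numeric columns get an observed range.
            rule["min"] = min(present)
            rule["max"] = max(present)
        elif is_text:
            distinct = set(present)
            # Treat a text column as categorical only if values repeat.
            if len(distinct) <= max_categories and len(distinct) < len(present):
                rule["allowed"] = sorted(distinct)
        rules[col] = rule
    return rules


rows = [
    {"age": 31, "country": "DE", "email": "a@example.com"},
    {"age": 45, "country": "FR", "email": None},
    {"age": 28, "country": "DE", "email": "c@example.com"},
]
rules = learn_rules(rows)
```

In this sketch `age` receives a range rule, `country` an allowed-values rule, and `email` only a nullability rule. A stricter setting would plausibly tighten such bounds, though how `--strictness` maps onto rules is not specified here.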
Detection Methods
Anomaly Detection Methods
| Method | Description | Best For |
|--------|-------------|----------|
| zscore | Z-score based detection | Normal distributions |
| iqr | Interquartile range | Robust to outliers |
| mad | Median absolute deviation | Skewed distributions |
| isolation_forest | ML-based isolation | Complex patterns |
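The three statistical methods in the table can be sketched in a few lines of standard-library Python. This is a textbook illustration, not truthound's implementation: the function names and the default thresholds (3.0, 1.5, 3.5) are assumptions chosen for the example.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds `threshold`."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []
    # 0.6745 rescales the MAD so it is comparable to a standard deviation
    # when the data are normally distributed.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


ages = [31, 29, 35, 33, 30, 32, 34, 28, 36, 120]
z_hits = zscore_outliers(ages)
iqr_hits = iqr_outliers(ages)
mad_hits = mad_outliers(ages)
```

On this small sample the z-score check misses the value 120, because that single extreme value inflates the mean and standard deviation it is measured against. The quartile-based and median-based checks both flag it, which is why they are the more robust choices for skewed or contaminated data.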
Drift Detection Methods
| Method | Description | Best For |
|--------|-------------|----------|
| distribution | Per-column distribution comparison | Feature drift |
| feature | Feature-wise statistical tests | ML features |
| multivariate | Multi-dimensional drift detection | Correlated features |
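As an illustration of per-column distribution comparison, the sketch below computes a two-sample Kolmogorov-Smirnov statistic for each column and reports the columns whose statistic exceeds a cut-off. Which test truthound actually runs is not stated here, so the choice of KS, the function names, and the `threshold=0.5` cut-off are all assumptions made for the example.

```python
from bisect import bisect_right


def ks_statistic(baseline, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def drifted_columns(baseline, current, threshold=0.5):
    """Return shared columns whose KS statistic exceeds the (hypothetical) threshold."""
    return sorted(
        col
        for col in baseline
        if col in current and ks_statistic(baseline[col], current[col]) > threshold
    )


baseline = {"age": [25, 30, 35, 40, 45], "salary": [40, 50, 60, 70, 80]}
current = {"age": [26, 31, 34, 41, 44], "salary": [90, 100, 110, 120, 130]}
drifted = drifted_columns(baseline, current)
```

Here `age` barely moves between the two datasets while `salary` shifts completely, so only `salary` is reported. A per-column check like this cannot see changes in the relationship between columns, which is the gap a multivariate method is meant to cover.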
Use Cases
1. Data Quality Monitoring
# Scheduled anomaly check
truthound ml anomaly daily_data.csv --method isolation_forest --format json -o anomalies.json
2. ML Model Monitoring
# Check for feature drift before retraining
truthound ml drift training_data.csv production_data.csv --method multivariate --threshold 0.05
3. Automated Rule Generation
# Bootstrap validation rules from reference data
truthound ml learn-rules reference_data.csv -o rules.json --strictness strict
4. CI/CD Integration
# GitHub Actions
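- name: Check for Data Drift
  run: |
    # Test the exit code inside the `if`: run steps use `bash -e` by default,
    # so a separate `$?` check after a failing command would never be reached.
    if ! truthound ml drift baseline.csv current.csv --threshold 0.1; then
      echo "Data drift detected!"
      exit 1
    fi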
| Method | Speed | Memory | Scalability |
|--------|-------|--------|-------------|
| zscore | Fast | Low | Excellent |
| iqr | Fast | Low | Excellent |
| mad | Fast | Low | Excellent |
| isolation_forest | Medium | Medium | Good |
| distribution drift | Fast | Low | Excellent |
| multivariate drift | Slow | High | Limited |
For large datasets, consider:
# Use sampling for large files
truthound ml anomaly large_data.parquet --method isolation_forest --sample 100000
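How `--sample` selects rows is not described here. One common way to draw a fixed-size uniform sample from a file too large to hold in memory is reservoir sampling, sketched below; the `reservoir_sample` function and its `seed` parameter are hypothetical and only illustrate the general technique.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Draw up to k rows uniformly at random from an iterable in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1),
            # replacing a uniformly chosen existing entry.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


picked = reservoir_sample(range(1_000_000), k=1000)
small = reservoir_sample(range(5), k=1000)
```

Memory use stays proportional to the sample size rather than the input size, which is what makes sampling attractive for the slower, memory-hungry methods in the table above.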