ML Commands
Machine learning-based data quality detection commands.
Overview
| Command | Description | Primary Use Case |
|---------|-------------|------------------|
| anomaly | Detect anomalies in data | Outlier detection |
| drift | Detect data drift | Model monitoring |
| learn-rules | Learn validation rules from data | Rule automation |
What is ML-based Detection?
ML-based detection uses statistical and machine learning algorithms to:
- Detect anomalies: Find outliers and unusual patterns
- Detect drift: Identify distribution changes between datasets
- Learn rules: Automatically generate validation rules from data
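To make the idea of learning rules from data concrete, here is a minimal sketch that infers simple per-column rules from a list of row dictionaries. It is an illustration only: the function name `learn_rules`, the rule keys (`nullable`, `min`, `max`, `allowed`), and the cut-off of five distinct values are all assumptions, and the format truthound actually writes to `rules.json` may differ.

```python
def learn_rules(rows):
    """Infer simple per-column rules from a list of row dicts (hypothetical rule format)."""
    rules = {}
    columns = {key for row in rows for key in row}
    for col in sorted(columns):
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        # A column is nullable if at least one row has no value for it.
        rule = {"nullable": len(present) < len(values)}
        is_numeric = bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        if is_numeric:
            # Numeric columns get an observed-range rule.
            rule["min"], rule["max"] = min(present), max(present)
        elif present and len(set(present)) <= 5:
            # Low-cardinality columns get an allowed-values rule.
            rule["allowed"] = sorted(set(present))
        rules[col] = rule
    return rules
```

Rules learned this way only describe the sample they were derived from, so a representative reference dataset matters more than the inference logic itself.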
Workflow
graph LR
A[Data] --> B{Task?}
B -->|Outliers| C[ml anomaly]
B -->|Changes| D[ml drift]
B -->|Rules| E[ml learn-rules]
C --> F[Anomaly Report]
D --> G[Drift Report]
E --> H[Validation Rules]
Quick Examples
Anomaly Detection
# Detect outliers using Isolation Forest
truthound ml anomaly data.csv --method isolation_forest
# Detect outliers in specific columns
truthound ml anomaly data.csv --columns age,salary --method zscore
Drift Detection
# Detect distribution drift
truthound ml drift baseline.csv current.csv --method distribution
# Detect multivariate drift
truthound ml drift train.csv production.csv --method multivariate
Rule Learning
# Learn validation rules from data
truthound ml learn-rules data.csv -o rules.json --strictness medium
Detection Methods
Anomaly Detection Methods
| Method | Description | Best For |
|--------|-------------|----------|
| zscore | Z-score based detection | Normal distributions |
| iqr | Interquartile range | Robust to outliers |
| mad | Median absolute deviation | Skewed distributions |
| isolation_forest | ML-based isolation | Complex patterns |
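The three statistical methods in the table can be sketched in a few lines of standard-library Python. This is a textbook illustration, not truthound's implementation: the function names and the default thresholds (3.0, 1.5, 3.5) are assumptions chosen because they are common conventions.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def mad_outliers(values, threshold=3.5):
    """Flag values whose median-based modified z-score exceeds `threshold`."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []
    # 0.6745 rescales the MAD so it is comparable to a standard deviation.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]
```

On a small sample, a single extreme value inflates the mean and standard deviation, so the z-score check can miss a point that the median-based IQR and MAD checks catch. That is one reason to prefer `iqr` or `mad` when the data is not close to normal.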
Drift Detection Methods
| Method | Description | Best For |
|--------|-------------|----------|
| distribution | Per-column distribution comparison | Feature drift |
| feature | Feature-wise statistical tests | ML features |
| multivariate | Multi-dimensional drift detection | Correlated features |
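As an illustration of per-column distribution comparison, the sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical CDFs) for each shared column. The source does not say which statistical test truthound uses or how it interprets `--threshold`, so the choice of test, the names `ks_statistic` and `drifted_columns`, and the threshold semantics are all assumptions.

```python
from bisect import bisect_right

def ks_statistic(baseline, current):
    """Return the maximum gap between the two empirical CDFs (0.0 to 1.0)."""
    a, b = sorted(baseline), sorted(current)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def drifted_columns(baseline, current, threshold=0.1):
    """Return {column: statistic} for shared columns whose statistic exceeds `threshold`."""
    scores = {
        name: ks_statistic(baseline[name], current[name])
        for name in baseline.keys() & current.keys()
    }
    return {name: score for name, score in scores.items() if score > threshold}
```

A per-column check like this treats each feature independently, so it cannot see a change in how features relate to each other. That gap is what a multivariate method is meant to cover.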
Use Cases
1. Data Quality Monitoring
# Scheduled anomaly check
truthound ml anomaly daily_data.csv --method isolation_forest --format json -o anomalies.json
2. ML Model Monitoring
# Check for feature drift before retraining
truthound ml drift training_data.csv production_data.csv --method multivariate --threshold 0.05
3. Automated Rule Generation
# Bootstrap validation rules from reference data
truthound ml learn-rules reference_data.csv -o rules.json --strictness strict
4. CI/CD Integration
# GitHub Actions
- name: Check for Data Drift
  run: |
    # Test the command directly: the default bash -e shell would stop the step before a separate $? check.
    if ! truthound ml drift baseline.csv current.csv --threshold 0.1; then
      echo "Data drift detected!"
      exit 1
    fi
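The same gate can be driven from a Python script, for example in a pipeline that is not shell-based. This sketch assumes, as the workflow step above does, that `truthound ml drift` exits with a non-zero status when drift is detected. The helper name `drift_gate` and its injectable `run` parameter are hypothetical and exist only to keep the example testable.

```python
import subprocess

def drift_gate(baseline, current, threshold=0.1, run=subprocess.run):
    """Run the drift check and return True when the command exits with status 0."""
    cmd = [
        "truthound", "ml", "drift",
        baseline, current,
        "--threshold", str(threshold),
    ]
    result = run(cmd)
    # Assumption: a non-zero exit status signals detected drift (or a failed run).
    return result.returncode == 0
```

A caller would typically print a message and call `sys.exit(1)` when `drift_gate("baseline.csv", "current.csv")` returns `False`, which mirrors the shell step.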
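Performance
| Method | Speed | Memory | Scalability |
|--------|-------|--------|-------------|
| zscore | Fast | Low | Excellent |
| iqr | Fast | Low | Excellent |
| mad | Fast | Low | Excellent |
| isolation_forest | Medium | Medium | Good |
| distribution drift | Fast | Low | Excellent |
| multivariate drift | Slow | High | Limited |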
For large datasets, consider:
# Use sampling for large files
truthound ml anomaly large_data.parquet --method isolation_forest --sample 100000
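The source does not describe how `--sample` selects rows. As one way to picture fixed-size sampling over a file too large to hold in memory, here is a sketch of reservoir sampling (Algorithm R), which draws a uniform random sample in a single pass. The function name `reservoir_sample` and its `seed` parameter are assumptions for illustration.

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(row)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample
```

Sampling trades completeness for speed: rare anomalies may not appear in the sample, so a larger sample size is the safer choice when outliers are expected to be sparse.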