truthound ml learn-rules
Learn validation rules from data using machine learning analysis.
Synopsis
truthound ml learn-rules <file> [OPTIONS]
Arguments
| Argument | Required | Description |
|----------|----------|-------------|
| file | Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) |
Options
| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| --output | -o | learned_rules.json | Output file path |
| --strictness | -s | medium | Rule strictness (loose, medium, strict) |
| --min-confidence | | 0.9 | Minimum confidence threshold (0.0-1.0) |
| --max-rules | | 100 | Maximum number of rules to generate |
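When the command is launched from a script, it can help to assemble the argument list in one place and reject values outside the documented ranges before anything runs. The sketch below is a hypothetical helper (`build_learn_rules_command` is not part of truthound); it uses only the flags listed in the table above.

```python
from typing import List, Optional

STRICTNESS_LEVELS = ("loose", "medium", "strict")


def build_learn_rules_command(
    data_file: str,
    output: Optional[str] = None,
    strictness: Optional[str] = None,
    min_confidence: Optional[float] = None,
    max_rules: Optional[int] = None,
) -> List[str]:
    """Build an argv list for `truthound ml learn-rules` (hypothetical helper)."""
    # Check the two constraints stated in the options table.
    if strictness is not None and strictness not in STRICTNESS_LEVELS:
        raise ValueError(f"strictness must be one of {STRICTNESS_LEVELS}")
    if min_confidence is not None and not (0.0 <= min_confidence <= 1.0):
        raise ValueError("min_confidence must be between 0.0 and 1.0")

    argv = ["truthound", "ml", "learn-rules", data_file]
    # Options left as None are omitted so the CLI applies its own defaults.
    if output is not None:
        argv += ["--output", output]
    if strictness is not None:
        argv += ["--strictness", strictness]
    if min_confidence is not None:
        argv += ["--min-confidence", str(min_confidence)]
    if max_rules is not None:
        argv += ["--max-rules", str(max_rules)]
    return argv


if __name__ == "__main__":
    print(build_learn_rules_command("data.csv", strictness="strict", max_rules=50))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with unusual file names.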
Description
The ml learn-rules command automatically generates validation rules from data:
- Analyzes data patterns and distributions
- Infers constraints and relationships
- Generates validation rules with confidence scores
- Writes the rules to a JSON file (learned_rules.json by default)
Learned Rule Types
| Rule Type | Description | Example |
|-----------|-------------|---------|
| not_null | Column should not have nulls | email has 0% nulls |
| unique | Column should be unique | id has 100% unique values |
| range | Numeric bounds | age between 0 and 120 |
| pattern | String format | email matches email pattern |
| allowed_values | Categorical values | status in [active, inactive] |
| dtype | Data type | price is Float64 |
| correlation | Column relationships | total = quantity * price |
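To make the idea of a learned rule with a confidence score concrete, here is a minimal sketch of how three of these rule types could be inferred from rows of data. It is a naive heuristic written for illustration, not truthound's algorithm; the function name `infer_simple_rules`, the `max_categories` cutoff, and the way confidence is computed are all assumptions.

```python
from typing import Any, Dict, List


def infer_simple_rules(
    rows: List[Dict[str, Any]],
    min_confidence: float = 0.9,
    max_categories: int = 5,
) -> List[Dict[str, Any]]:
    """Infer not_null, unique, and allowed_values rules with a naive heuristic."""
    if not rows:
        return []
    rules: List[Dict[str, Any]] = []
    total = len(rows)
    for column in rows[0]:
        values = [row.get(column) for row in rows]
        present = [v for v in values if v is not None]
        distinct = set(present)

        # not_null: confidence is the share of rows that carry a value.
        non_null_ratio = len(present) / total
        if non_null_ratio >= min_confidence:
            rules.append({"type": "not_null", "column": column, "confidence": non_null_ratio})

        if not present:
            continue

        # unique: confidence is the share of non-null values that are distinct.
        unique_ratio = len(distinct) / len(present)
        if unique_ratio >= min_confidence:
            rules.append({"type": "unique", "column": column, "confidence": unique_ratio})
        elif len(distinct) <= max_categories:
            # allowed_values: a small set of repeated values looks categorical.
            rules.append({
                "type": "allowed_values",
                "column": column,
                "confidence": 1.0,  # every observed value is in the set by construction
                "parameters": {"values": sorted(distinct, key=str)},
            })
    return rules


if __name__ == "__main__":
    sample = [
        {"id": 1, "status": "active"},
        {"id": 2, "status": "inactive"},
        {"id": 3, "status": "active"},
    ]
    for rule in infer_simple_rules(sample):
        print(rule)
```

The range, pattern, dtype, and correlation rule types need more analysis than fits in a short example and are left out here.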
Examples
Basic Rule Learning
truthound ml learn-rules data.csv
Output:
Learning Validation Rules
=========================
File: data.csv
Rows: 10,000
Columns: 8
Analyzing patterns...
Generating rules...
Learned Rules: 15
Rules by Category
─────────────────────────────────────────────────────────
Category          Count    Confidence Range
─────────────────────────────────────────────────────────
completeness      3        0.95 - 1.00
uniqueness        2        0.99 - 1.00
range             4        0.92 - 0.98
format            3        0.94 - 0.99
allowed_values    3        0.97 - 1.00
─────────────────────────────────────────────────────────
Top Rules (by confidence):
1. [1.00] id: unique
2. [1.00] created_at: not_null
3. [0.99] email: pattern(email)
4. [0.98] age: range(0, 120)
5. [0.97] status: allowed_values([active, inactive, pending])
Output: learned_rules.json
Custom Output Path
truthound ml learn-rules data.csv -o validation_rules.json
Strictness Levels
# Loose: Fewer rules, higher tolerance
truthound ml learn-rules data.csv --strictness loose
# Medium (default): Balanced rules
truthound ml learn-rules data.csv --strictness medium
# Strict: More rules, tighter constraints
truthound ml learn-rules data.csv --strictness strict
Confidence Threshold
# Only high-confidence rules (>= 95%)
truthound ml learn-rules data.csv --min-confidence 0.95
# Include lower-confidence rules
truthound ml learn-rules data.csv --min-confidence 0.8
Limit Number of Rules
# Generate at most 50 rules
truthound ml learn-rules data.csv --max-rules 50
# Generate comprehensive ruleset
truthound ml learn-rules data.csv --max-rules 200
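The same two limits can also be applied after the fact to a rule list that has already been loaded. The sketch below assumes each rule is a dictionary with a numeric `confidence` field, as in the JSON output. Keeping the highest-confidence rules when truncating is an assumption of this example, not documented behaviour of `--max-rules`, and `select_rules` is a hypothetical name.

```python
from typing import Any, Dict, List


def select_rules(
    rules: List[Dict[str, Any]],
    min_confidence: float = 0.9,
    max_rules: int = 100,
) -> List[Dict[str, Any]]:
    """Drop rules below the threshold, then keep the most confident ones."""
    kept = [rule for rule in rules if rule["confidence"] >= min_confidence]
    # Stable sort: rules with equal confidence keep their original order.
    kept.sort(key=lambda rule: rule["confidence"], reverse=True)
    return kept[:max_rules]


if __name__ == "__main__":
    candidates = [
        {"id": "rule_001", "confidence": 1.0},
        {"id": "rule_002", "confidence": 0.85},
        {"id": "rule_003", "confidence": 0.98},
    ]
    for rule in select_rules(candidates, min_confidence=0.9, max_rules=2):
        print(rule["id"], rule["confidence"])
```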
JSON Output (default)
{
  "source_file": "data.csv",
  "generated_at": "2024-01-15T10:30:00Z",
  "strictness": "medium",
  "min_confidence": 0.9,
  "summary": {
    "total_rules": 15,
    "avg_confidence": 0.96
  },
  "rules": [
    {
      "id": "rule_001",
      "type": "not_null",
      "column": "id",
      "confidence": 1.0,
      "severity": "critical",
      "evidence": {
        "null_count": 0,
        "null_ratio": 0.0
      }
    },
    {
      "id": "rule_002",
      "type": "unique",
      "column": "id",
      "confidence": 1.0,
      "severity": "critical",
      "evidence": {
        "unique_count": 10000,
        "unique_ratio": 1.0
      }
    },
    {
      "id": "rule_003",
      "type": "range",
      "column": "age",
      "confidence": 0.98,
      "severity": "high",
      "parameters": {
        "min_value": 0,
        "max_value": 120
      },
      "evidence": {
        "actual_min": 18,
        "actual_max": 85,
        "buffer_applied": true
      }
    },
    {
      "id": "rule_004",
      "type": "pattern",
      "column": "email",
      "confidence": 0.99,
      "severity": "high",
      "parameters": {
        "pattern": "email"
      },
      "evidence": {
        "match_ratio": 0.99,
        "sample_matches": ["john@example.com", "jane@test.org"]
      }
    },
    {
      "id": "rule_005",
      "type": "allowed_values",
      "column": "status",
      "confidence": 0.97,
      "severity": "medium",
      "parameters": {
        "values": ["active", "inactive", "pending"]
      },
      "evidence": {
        "observed_values": ["active", "inactive", "pending"],
        "value_counts": {
          "active": 6000,
          "inactive": 3500,
          "pending": 500
        }
      }
    }
  ]
}
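Because the output is plain JSON, it can be consumed outside truthound as well. The sketch below applies a parsed rules document to a list of row dictionaries, relying only on the fields shown in the sample above (`id`, `type`, `column`, `parameters`). It is a minimal illustration rather than a replacement for `truthound check`; the function name `check_rows` is hypothetical, and the pattern, dtype, and correlation rule types are skipped.

```python
import json
from typing import Any, Dict, List


def check_rows(rules_doc: Dict[str, Any], rows: List[Dict[str, Any]]) -> List[str]:
    """Return one message per rule that at least one row violates."""
    violations: List[str] = []
    for rule in rules_doc["rules"]:
        kind, column = rule["type"], rule["column"]
        params = rule.get("parameters", {})
        values = [row.get(column) for row in rows]
        if kind == "not_null":
            bad = sum(1 for v in values if v is None)
        elif kind == "unique":
            present = [v for v in values if v is not None]
            bad = len(present) - len(set(present))  # number of repeated values
        elif kind == "range":
            low, high = params["min_value"], params["max_value"]
            bad = sum(1 for v in values if v is not None and not (low <= v <= high))
        elif kind == "allowed_values":
            allowed = set(params["values"])
            bad = sum(1 for v in values if v is not None and v not in allowed)
        else:
            continue  # rule types not handled in this sketch
        if bad:
            violations.append(f"{rule['id']}: {kind} on {column} failed for {bad} row(s)")
    return violations


if __name__ == "__main__":
    doc = json.loads(
        '{"rules": [{"id": "rule_003", "type": "range", "column": "age",'
        ' "parameters": {"min_value": 0, "max_value": 120}}]}'
    )
    print(check_rows(doc, [{"age": 42}, {"age": 130}]))
```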
Strictness Levels
| Level | Description | Rule Generation |
|-------|-------------|-----------------|
| loose | Permissive rules | Wider ranges, fewer constraints |
| medium | Balanced rules | Reasonable buffers applied |
| strict | Tight rules | Close to observed data |
Example: Range Rule by Strictness
For data with age values 18-85:
| Strictness | Generated Range | Buffer |
|------------|-----------------|--------|
| loose | 0-150 | ±100% |
| medium | 0-120 | ±40% |
| strict | 15-90 | ±10% |
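The general idea of a buffered range can be sketched as widening the observed bounds by a fraction of the observed span. The code below is an assumption about how such a buffer might work, not truthound's actual formula: it does not reproduce the values in the table above, which appear to be rounded or derived differently. The function name `buffered_range` and the `floor` parameter (used here to keep a non-negative quantity such as age from going below zero) are hypothetical.

```python
from typing import Tuple


def buffered_range(
    observed_min: float,
    observed_max: float,
    buffer_fraction: float,
    floor: float = 0.0,
) -> Tuple[float, float]:
    """Widen [observed_min, observed_max] by buffer_fraction of its span."""
    span = observed_max - observed_min
    # The lower bound is clamped so it never drops below `floor`.
    low = max(floor, observed_min - span * buffer_fraction)
    high = observed_max + span * buffer_fraction
    return low, high


if __name__ == "__main__":
    # Buffer fractions taken from the table's Buffer column.
    for name, fraction in [("loose", 1.0), ("medium", 0.4), ("strict", 0.1)]:
        low, high = buffered_range(18, 85, fraction)
        print(f"{name}: {low:.1f} to {high:.1f}")
```

For the observed range 18-85 the span is 67, so a 10% buffer moves each bound by about 6.7 and a 100% buffer pushes the lower bound below zero, where the clamp takes over.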
Use Cases
1. Bootstrap Validation
# Generate rules from reference data
truthound ml learn-rules reference_data.csv -o rules.json --strictness medium
# Use rules for validation
truthound check new_data.csv --rules rules.json
2. Schema Discovery
# Discover schema from unknown data
truthound ml learn-rules unknown_data.csv -o schema_rules.json --strictness loose
3. Continuous Rule Refinement
# Learn rules from production data periodically
truthound ml learn-rules weekly_data.csv -o rules_$(date +%Y%m%d).json --min-confidence 0.95
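Dated rule files are most useful when successive runs are compared. The sketch below reports which rules were added, removed, or changed between two parsed rule documents. It assumes the JSON layout shown under JSON Output and treats a (type, column) pair as a rule's identity, which is an assumption of this example; `diff_rules` is a hypothetical helper.

```python
from typing import Any, Dict, List, Tuple

RuleKey = Tuple[str, str]


def diff_rules(old_doc: Dict[str, Any], new_doc: Dict[str, Any]) -> Dict[str, List[RuleKey]]:
    """Compare two rule documents by (type, column) and by parameters."""

    def index(doc: Dict[str, Any]) -> Dict[RuleKey, Dict[str, Any]]:
        # Assumes at most one rule per (type, column) pair.
        return {(rule["type"], rule["column"]): rule for rule in doc["rules"]}

    old, new = index(old_doc), index(new_doc)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(
            key
            for key in old.keys() & new.keys()
            if old[key].get("parameters") != new[key].get("parameters")
        ),
    }


if __name__ == "__main__":
    last_week = {"rules": [{"type": "range", "column": "age",
                            "parameters": {"min_value": 0, "max_value": 120}}]}
    this_week = {"rules": [{"type": "range", "column": "age",
                            "parameters": {"min_value": 0, "max_value": 110}}]}
    print(diff_rules(last_week, this_week))
```

In practice the two documents would be loaded with `json.load` from the dated files produced by the command above.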
4. CI/CD Integration
# GitHub Actions
- name: Learn and Validate
  run: |
    # Learn rules from baseline
    truthound ml learn-rules baseline.csv -o rules.json --strictness medium
    # Validate new data against learned rules
    truthound check new_data.csv --rules rules.json --strict
5. Documentation Generation
# Generate rules with high confidence for documentation
truthound ml learn-rules production_data.csv -o data_contract.json --strictness strict --min-confidence 0.98
Comparison: ml learn-rules vs generate-suite
| Feature | ml learn-rules | generate-suite |
|---------|----------------|----------------|
| Input | Data file | Profile file |
| Approach | ML-based learning | Profile-based generation |
| Speed | Slower (analyzes data) | Faster (uses profile) |
| Output | JSON | YAML, JSON, Python, TOML |
| Customization | strictness, confidence | Many presets, categories |
Exit Codes
| Code | Condition |
|------|-----------|
| 0 | Success |
| 1 | Error (invalid arguments, file not found, or other error) |
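A calling script can branch on these exit codes. The sketch below wraps `subprocess.run` from the Python standard library and treats 0 as success and anything else as failure, matching the table above. The helper name `run_and_report` is hypothetical, and the example invocation assumes `truthound` is installed and on the PATH.

```python
import subprocess
import sys
from typing import List


def run_and_report(argv: List[str]) -> bool:
    """Run a command and return True only if it exits with code 0."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True)
    except FileNotFoundError:
        # The executable itself is missing (for example, truthound is not installed).
        print(f"executable not found: {argv[0]}", file=sys.stderr)
        return False
    if result.returncode == 0:
        return True
    print(
        f"command failed with exit code {result.returncode}: {result.stderr.strip()}",
        file=sys.stderr,
    )
    return False


if __name__ == "__main__":
    ok = run_and_report(["truthound", "ml", "learn-rules", "data.csv"])
    sys.exit(0 if ok else 1)
```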
See Also