truthound ml anomaly¶

Detect anomalies in data using statistical and machine learning methods.

Synopsis¶

truthound ml anomaly <file> [OPTIONS]

Arguments¶

Argument	Required	Description
`file`	Yes	Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL)

Options¶

Option	Short	Default	Description
`--method`	`-m`	`zscore`	Detection method (zscore, iqr, mad, isolation_forest)
`--contamination`	`-c`	`0.1`	Expected anomaly ratio (0.0-0.5)
`--columns`		All numeric	Columns to analyze (comma-separated)
`--sample`	`-s`	All rows	Sample size for large datasets
`--output`	`-o`	None	Output file path
`--format`	`-f`	`console`	Output format (console, json)

Description¶

The ml anomaly command detects outliers and unusual data points:

Analyzes numeric columns for anomalies
Applies selected detection method
Reports anomalous rows with scores
Provides column-level statistics

Detection Methods¶

Z-Score (`zscore`)¶

Detects values that deviate significantly from the mean.

Formula: z = (x - μ) / σ
Threshold: Typically |z| > 3
Best for: Normally distributed data
Limitation: The mean and standard deviation are themselves distorted by outliers, so extreme values can mask one another

truthound ml anomaly data.csv --method zscore
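
The z-score rule above can be sketched in a few lines of Python. This is a minimal illustration of the formula only, not truthound's implementation: the function name `zscore_outliers`, the use of the population standard deviation, and the sample data are all assumptions.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return (index, z) pairs whose |z| exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std; an assumption of this sketch
    if std == 0:
        return []  # all values identical, so nothing deviates
    flagged = []
    for i, x in enumerate(values):
        z = (x - mean) / std
        if abs(z) > threshold:
            flagged.append((i, z))
    return flagged

# Hypothetical data: 20 plausible ages plus one data-entry error.
ages = [30 + (i % 5) for i in range(20)] + [120]
print(zscore_outliers(ages))

# Masking: in a tiny sample the extreme value inflates the std so much
# that its own z-score stays below the threshold.
print(zscore_outliers([1, 2, 3, 4, 1000]))
```

The second call illustrates the limitation listed above: with only five points, the value 1000 inflates the standard deviation enough that nothing is flagged.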

Interquartile Range (`iqr`)¶

Detects values outside the IQR fence.

Formula: outlier if x < Q1 - 1.5*IQR or x > Q3 + 1.5*IQR
IQR: Q3 - Q1 (interquartile range)
Best for: Skewed distributions
Advantage: Robust to extreme outliers

truthound ml anomaly data.csv --method iqr
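
As a rough sketch of the fence rule above, the following Python uses the standard library's `statistics.quantiles`. The helper name `iqr_outliers` and the "inclusive" quartile convention are assumptions; truthound may compute quartiles differently, which can shift the fences slightly.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (index, value) pairs outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Quartile convention is an assumption of this sketch.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(i, x) for i, x in enumerate(values) if x < lower or x > upper]

# Hypothetical small sample with one extreme value.
print(iqr_outliers([1, 2, 3, 4, 1000]))
```

Because the quartiles depend only on the ordering of the middle of the data, the extreme value does not move the fences, which is why the method is described as robust.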

Median Absolute Deviation (`mad`)¶

Detects values using robust median-based statistics.

Formula: MAD = median(|x - median(x)|)
Threshold: |x - median| / MAD > 3.5
Best for: Data with extreme outliers
Advantage: Most robust to outliers

truthound ml anomaly data.csv --method mad
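
The MAD rule can be sketched as follows. The code applies the formula exactly as written above; some references additionally scale the score by a constant (0.6745) before comparing it with 3.5, and whether truthound does so is not stated here. The name `mad_outliers` is hypothetical.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Return (index, score) pairs where |x - median| / MAD > threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    if mad == 0:
        return []  # more than half the values equal the median; score undefined
    flagged = []
    for i, x in enumerate(values):
        score = abs(x - med) / mad
        if score > threshold:
            flagged.append((i, score))
    return flagged

# Hypothetical small sample with one extreme value.
print(mad_outliers([1, 2, 3, 4, 1000]))
```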

Isolation Forest (`isolation_forest`)¶

Machine learning-based anomaly detection.

Algorithm: Ensemble of random trees; anomalies are isolated in fewer random splits than normal points
Advantage: Detects complex, multivariate anomalies
Best for: High-dimensional data, complex patterns
Parameter: contamination controls sensitivity

truthound ml anomaly data.csv --method isolation_forest --contamination 0.05
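
The isolation idea can be illustrated with a deliberately simplified, standard-library-only sketch. It is not the full Isolation Forest algorithm (there is no subsampling and no score normalisation) and it is not truthound's code; the function names, the depth limit, and the data are assumptions. It only shows that a point far from the rest is separated after fewer random splits.

```python
import random

def path_length(point, rows, rng, depth=0, max_depth=12):
    """Number of random axis-aligned splits needed to isolate `point`."""
    if len(rows) <= 1 or depth >= max_depth:
        return depth
    # Only dimensions that still have some spread can be split.
    splittable = [d for d in range(len(point))
                  if min(r[d] for r in rows) != max(r[d] for r in rows)]
    if not splittable:
        return depth  # remaining rows are indistinguishable
    dim = rng.choice(splittable)
    lo = min(r[dim] for r in rows)
    hi = max(r[dim] for r in rows)
    split = rng.uniform(lo, hi)
    # Keep only the rows that fall on the same side of the split as `point`.
    same_side = [r for r in rows if (r[dim] < split) == (point[dim] < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)

def mean_path_length(point, rows, n_trees=200, seed=0):
    """Average isolation depth over many random split sequences."""
    rng = random.Random(seed)
    return sum(path_length(point, rows, rng) for _ in range(n_trees)) / n_trees

# Hypothetical data: a 10x10 grid of normal points plus one distant point.
outlier = (10.0, 10.0)
inlier = (0.5, 0.5)
rows = [(i % 10 / 10, i // 10 / 10) for i in range(100)] + [outlier]

print("outlier:", mean_path_length(outlier, rows))
print("inlier: ", mean_path_length(inlier, rows))
```

A shorter average path means the point is easier to isolate and therefore more anomalous. Production implementations convert this path length into a normalised score.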

Examples¶

Basic Anomaly Detection¶

truthound ml anomaly data.csv

Output:

Anomaly Detection Results (zscore)
==================================================
Total points: 10000
Anomalies found: 847
Anomaly ratio: 8.47%
Threshold used: 3.0000

Top anomalies:
  Index 4521: score=0.9800, confidence=98.00%
  Index 1234: score=0.9500, confidence=95.00%
  Index 7890: score=0.9200, confidence=92.00%
  ...

Specific Columns¶

Analyze only selected columns:

truthound ml anomaly data.csv --columns age,salary,score

Custom Contamination¶

Adjust expected anomaly ratio:

# Expect 5% anomalies (more selective)
truthound ml anomaly data.csv --contamination 0.05

# Expect 20% anomalies (more inclusive)
truthound ml anomaly data.csv --contamination 0.2
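
One common way to read a contamination value is as the share of highest-scoring points to flag. The sketch below shows that interpretation only. How each truthound method actually applies `--contamination` is not specified here, so treat the helper `flag_top_fraction` and its rounding behaviour as assumptions.

```python
def flag_top_fraction(scores, contamination=0.1):
    """Return the indices of the highest-scoring `contamination` share of points."""
    if not 0.0 <= contamination <= 0.5:
        raise ValueError("contamination must be between 0.0 and 0.5")
    n_flag = round(len(scores) * contamination)
    # Rank indices from highest to lowest score, then keep the top n_flag.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_flag])

# Hypothetical anomaly scores for 20 rows (higher means more anomalous).
scores = [i / 100 for i in range(20)]
print(flag_top_fraction(scores, contamination=0.05))  # more selective
print(flag_top_fraction(scores, contamination=0.2))   # more inclusive
```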

Statistical Methods¶

# Z-score for normal distributions
truthound ml anomaly data.csv --method zscore

# IQR for skewed distributions
truthound ml anomaly data.csv --method iqr

# MAD for robust detection
truthound ml anomaly data.csv --method mad

JSON Output¶

truthound ml anomaly data.csv --method isolation_forest --format json -o anomalies.json

Output file (anomalies.json):

{
  "file": "data.csv",
  "row_count": 10000,
  "method": "isolation_forest",
  "contamination": 0.1,
  "summary": {
    "total_anomalies": 847,
    "anomaly_ratio": 0.0847
  },
  "column_statistics": [
    {
      "column": "age",
      "anomaly_count": 234,
      "anomaly_ratio": 0.0234,
      "max_score": 0.95
    },
    {
      "column": "salary",
      "anomaly_count": 456,
      "anomaly_ratio": 0.0456,
      "max_score": 0.98
    }
  ],
  "anomalies": [
    {
      "row_index": 4521,
      "score": 0.98,
      "anomalous_columns": ["salary"],
      "values": {"salary": 999999}
    },
    {
      "row_index": 1234,
      "score": 0.95,
      "anomalous_columns": ["age"],
      "values": {"age": -5}
    }
  ]
}
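
A report like this can be post-processed with a short script. The sketch below relies only on the field names visible in the sample output above; `rank_anomalies` is a hypothetical helper, and the embedded JSON is a trimmed copy of the sample rather than real output.

```python
import json

def rank_anomalies(report):
    """Return (anomaly_ratio, rows) with rows sorted by descending score."""
    ratio = report["summary"]["anomaly_ratio"]
    rows = sorted(report["anomalies"], key=lambda a: a["score"], reverse=True)
    return ratio, [(a["row_index"], a["score"], a["anomalous_columns"]) for a in rows]

# Trimmed copy of the sample report; in practice, load anomalies.json
# with json.load() instead.
sample = json.loads("""
{
  "summary": {"total_anomalies": 847, "anomaly_ratio": 0.0847},
  "anomalies": [
    {"row_index": 1234, "score": 0.95, "anomalous_columns": ["age"]},
    {"row_index": 4521, "score": 0.98, "anomalous_columns": ["salary"]}
  ]
}
""")

ratio, rows = rank_anomalies(sample)
print(f"anomaly ratio: {ratio:.2%}")
for row_index, score, columns in rows:
    print(f"row {row_index}: score={score:.2f} columns={', '.join(columns)}")
```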

Method Comparison¶

Method	Speed	Robustness	Multivariate	Best For
zscore	Fast	Low	No	Normal distributions
iqr	Fast	Medium	No	Skewed data
mad	Fast	High	No	Data with outliers
isolation_forest	Medium	High	Yes	Complex patterns

Use Cases¶

1. Data Quality Check¶

# Find data entry errors
truthound ml anomaly customer_data.csv --method mad --format json -o errors.json

2. Fraud Detection¶

# Detect suspicious transactions
truthound ml anomaly transactions.csv --method isolation_forest --contamination 0.01

3. Sensor Data Monitoring¶

# Find sensor malfunctions
truthound ml anomaly sensor_readings.csv --columns temperature,pressure --method zscore

4. CI/CD Pipeline¶

# GitHub Actions
- name: Check for Data Anomalies
  run: |
    truthound ml anomaly data.csv --method isolation_forest --format json -o anomalies.json
    # Check if anomaly ratio is too high
    python -c "
    import json, sys
    with open('anomalies.json') as f:
        data = json.load(f)
    # Fail the step when more than 15% of rows are flagged
    if data['summary']['anomaly_ratio'] > 0.15:
        print('Too many anomalies detected!')
        sys.exit(1)
    "

Performance¶

For large datasets, use the --sample option to process a random subset:

# Sample 100,000 rows for faster processing
truthound ml anomaly large_data.parquet --method isolation_forest --sample 100000

# Isolation Forest may be slow for very large files
# Consider using statistical methods for speed
truthound ml anomaly large_data.parquet --method iqr

# Combine sampling with any method
truthound ml anomaly large_data.parquet --method mad --sample 50000

Sampling uses a fixed seed (42) for reproducibility.
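
To illustrate why a fixed seed makes runs repeatable, here is a minimal sketch using Python's `random` module. Only the seed value 42 comes from this page; the helper `sample_row_indices` and the sampling routine itself are assumptions, and truthound's actual sampler may select different rows.

```python
import random

def sample_row_indices(n_rows, sample_size, seed=42):
    """Pick `sample_size` distinct row indices, identically on every run."""
    if sample_size >= n_rows:
        return list(range(n_rows))  # nothing to sample; use all rows
    # A dedicated Random instance keeps the result independent of global state.
    return sorted(random.Random(seed).sample(range(n_rows), sample_size))

first = sample_row_indices(1_000_000, 5)
second = sample_row_indices(1_000_000, 5)
print(first == second)  # the same seed selects the same rows
```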

Exit Codes¶

Code	Condition
0	Success
1	Error (invalid arguments, file not found, or other error)

Related Commands¶

ml drift - Detect data drift
ml learn-rules - Learn validation rules
check - Rule-based validation