Statistical Methods¶
This document provides a comprehensive reference for the statistical methods employed in Truthound for drift detection, anomaly detection, and distributional analysis.
Table of Contents¶
- Overview
- Drift Detection Methods
- Anomaly Detection Methods
- Distribution Analysis
- Statistical Thresholds
- Method Selection Guide
- Mathematical Foundations
- References
1. Overview¶
Truthound implements a suite of statistical methods for data quality validation:
| Category | Methods | Primary Use Case |
|---|---|---|
| Drift Detection | 14 methods (auto, ks, psi, chi2, js, kl, wasserstein, cvm, anderson, hellinger, bhattacharyya, tv, energy, mmd) | Distribution comparison between datasets |
| Anomaly Detection | 13+ methods (z-score, IQR, MAD, Isolation Forest, LOF, Mahalanobis, and more; see Section 3) | Outlier and anomaly identification |
| Distribution Analysis | Normality tests, descriptive statistics, entropy | Statistical characterization |
All methods are optimized for Polars LazyFrame execution, enabling efficient processing of large-scale datasets.
2. Drift Detection Methods¶
2.1 Kolmogorov-Smirnov Test (KS)¶
The Kolmogorov-Smirnov test measures the maximum distance between two empirical cumulative distribution functions.
Mathematical Definition:
D = supₓ |F₁(x) − F₂(x)|
Where F₁ and F₂ are the empirical CDFs of the two samples.
Usage:
# KS test requires numeric columns
drift = th.compare(baseline, current, method="ks", columns=["age", "salary", "score"])
Note: KS test only works with numeric columns. For mixed data types, use `method="auto"`.
Characteristics:
| Aspect | Description |
|---|---|
| Column Type | Numeric only |
| Best For | Continuous numeric distributions |
| Sensitivity | Shape and location differences |
| Output | D-statistic (0-1), p-value |
| Threshold | p-value < 0.05 indicates drift |
Interpretation:
| p-value | Interpretation |
|---|---|
| p < 0.05 | Significant difference (reject null hypothesis) |
| p >= 0.05 | No significant difference |
2.2 Population Stability Index (PSI)¶
PSI quantifies how much a variable's distribution has shifted between two samples. It is widely used in credit scoring and model monitoring.
Mathematical Definition:
PSI = Σᵢ (Pᵢ − Qᵢ) × ln(Pᵢ / Qᵢ)
Where Pᵢ and Qᵢ are the proportions in bin i for the baseline and current distributions.
Usage:
# PSI requires numeric columns
drift = th.compare(baseline, current, method="psi", columns=["age", "salary", "score"])
Note: PSI only works with numeric columns. For mixed data types, use `method="auto"`.
Interpretation:
| PSI Value | Interpretation | Action |
|---|---|---|
| < 0.1 | No significant shift | None required |
| 0.1 - 0.25 | Moderate shift | Monitor closely |
| > 0.25 | Significant shift | Investigation required |
Characteristics:
- Numeric columns only
- Automatic decile binning
- Smoothing applied to prevent division by zero (see the sketch below)
- Industry standard for model monitoring
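For intuition, here is a minimal NumPy sketch of the PSI computation described above (decile edges taken from the baseline, plus a small smoothing constant). It illustrates the formula only and is not Truthound's internal implementation; the function name and sample data are invented for the example.
import numpy as np

def psi(baseline_values: np.ndarray, current_values: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    # Interior decile edges computed on the baseline sample
    edges = np.quantile(baseline_values, np.linspace(0, 1, bins + 1))[1:-1]
    # Bin both samples with the same edges; out-of-range values fall into the end bins
    p = np.bincount(np.searchsorted(edges, baseline_values), minlength=bins) / len(baseline_values) + eps
    q = np.bincount(np.searchsorted(edges, current_values), minlength=bins) / len(current_values) + eps
    # PSI = sum over bins of (P_i - Q_i) * ln(P_i / Q_i)
    return float(np.sum((p - q) * np.log(p / q)))

psi_value = psi(np.random.normal(0, 1, 10_000), np.random.normal(0.2, 1, 10_000))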
2.3 Chi-Square Test¶
The chi-square test assesses independence between observed and expected categorical frequencies.
Mathematical Definition:
χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ
Where Oᵢ is the observed frequency and Eᵢ is the expected frequency in category i.
Usage:
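Following the same call pattern as the other drift methods (the column names below are placeholders):
# Chi-square test is intended for categorical columns
drift = th.compare(baseline, current, method="chi2", columns=["category", "region"])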
Characteristics:
| Aspect | Description |
|---|---|
| Best For | Categorical variables |
| Output | χ²-statistic, p-value |
| Degrees of Freedom | k - 1 (where k is number of categories) |
2.4 Jensen-Shannon Divergence (JS)¶
Jensen-Shannon divergence is a symmetrized and smoothed version of KL divergence.
Mathematical Definition:
JS(P||Q) = 0.5 × KL(P||M) + 0.5 × KL(Q||M)
Where M = 0.5 × (P + Q).
Usage:
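Following the same call pattern as the other drift methods; JS accepts both numeric and categorical columns, so no column filter is required:
# JS divergence works with numeric and categorical columns
drift = th.compare(baseline, current, method="js")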
Interpretation:
| JS Value | Interpretation |
|---|---|
| JS ≈ 0 | Distributions are identical |
| JS < 0.1 | Very similar distributions |
| 0.1 <= JS < 0.3 | Moderate difference |
| JS >= 0.3 | Significant difference |
Properties:
- Symmetric: JS(P||Q) = JS(Q||P)
- Bounded: 0 <= JS <= 1 (when using log base 2)
- Metric: the square root of JS is a valid distance metric
2.5 Kullback-Leibler Divergence (KL)¶
KL divergence measures information loss when approximating one distribution with another.
Mathematical Definition:
KL(P||Q) = Σₓ P(x) × log(P(x) / Q(x))
Where P and Q are the two distributions being compared.
Usage:
# KL divergence requires numeric columns
drift = th.compare(baseline, current, method="kl", columns=["age", "salary", "score"])
Note: KL divergence only works with numeric columns. For categorical data or a symmetric divergence, use `method="js"` (Jensen-Shannon).
Interpretation:
| KL Value | Interpretation |
|---|---|
| KL ≈ 0 | Distributions are identical |
| KL < 0.1 | Very similar distributions |
| 0.1 <= KL < 0.2 | Moderate difference |
| KL >= 0.2 | Significant difference |
Properties:
- Asymmetric: KL(P||Q) ≠ KL(Q||P)
- Non-negative: KL >= 0, with KL = 0 iff P = Q
- Unbounded: can be infinite if Q(x) = 0 where P(x) > 0
- Numeric columns only
2.6 Wasserstein Distance (Earth Mover's Distance)¶
Wasserstein distance measures the minimum "work" required to transform one distribution into another.
Mathematical Definition (1D case):
W₁(P, Q) = ∫ |F_P(x) − F_Q(x)| dx
Where F_P and F_Q are the cumulative distribution functions.
Usage:
# Wasserstein distance requires numeric columns
drift = th.compare(baseline, current, method="wasserstein", columns=["age", "salary", "score"])
Note: Wasserstein distance only works with numeric columns. The statistic is normalized by baseline standard deviation for comparability.
Interpretation:
| Normalized Wasserstein | Interpretation |
|---|---|
| W < 0.05 | Very similar distributions |
| 0.05 <= W < 0.1 | Minor shift |
| 0.1 <= W < 0.2 | Moderate shift |
| W >= 0.2 | Significant shift |
Characteristics:
- Numeric columns only
- Metric (satisfies triangle inequality)
- Meaningful even when distributions have non-overlapping support
- Interpretable as "work" needed to move probability mass
- Normalized by baseline standard deviation for scale independence (see the sketch below)
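As a rough illustration of the normalization, the raw 1D distance can be computed with SciPy and scaled by the baseline standard deviation. This mirrors the idea described above, not necessarily Truthound's exact computation; the sample arrays are invented.
import numpy as np
from scipy.stats import wasserstein_distance

baseline_values = np.random.normal(50, 10, 5_000)
current_values = np.random.normal(55, 10, 5_000)

# 1D Wasserstein distance, scaled by the baseline standard deviation for comparability
w_raw = wasserstein_distance(baseline_values, current_values)
w_normalized = w_raw / baseline_values.std()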
2.7 Cramér-von Mises Test¶
The Cramér-von Mises test is an alternative to KS that integrates the squared difference between the two empirical CDFs rather than taking the maximum.
Mathematical Definition:
ω² = ∫ [F₁(x) − F₂(x)]² dF(x)
Where F₁ and F₂ are the empirical CDFs of the two samples and F is the pooled empirical CDF.
Usage:
# Cramér-von Mises test requires numeric columns
drift = th.compare(baseline, current, method="cvm", columns=["age", "salary", "score"])
Note: Cramér-von Mises test only works with numeric columns and requires at least 2 samples in each dataset.
Characteristics:
| Aspect | Description |
|---|---|
| Column Type | Numeric only |
| Best For | Detecting differences in entire distribution shape |
| Sensitivity | More sensitive to tail differences than KS |
| Output | ω² statistic, p-value |
| Threshold | p-value < 0.05 indicates drift |
2.8 Anderson-Darling Test¶
The Anderson-Darling test gives more weight to the tails of the distribution.
Mathematical Definition:
A² = n ∫ [Fₙ(x) − F(x)]² / [F(x)(1 − F(x))] dF(x)
The weighting term F(x)(1 − F(x)) in the denominator emphasizes deviations in the tails; in the two-sample case, F is the pooled empirical CDF.
Usage:
# Anderson-Darling test requires numeric columns
drift = th.compare(baseline, current, method="anderson", columns=["age", "salary", "score"])
Note: Anderson-Darling test only works with numeric columns and requires at least 2 samples in each dataset.
Characteristics:
| Aspect | Description |
|---|---|
| Column Type | Numeric only |
| Best For | Detecting differences in distribution tails |
| Sensitivity | More sensitive to tail deviations than KS or CvM |
| Output | A² statistic, p-value |
| Threshold | p-value < 0.05 indicates drift |
Interpretation (based on critical values):
| p-value | Interpretation |
|---|---|
| p > 0.25 | No significant difference |
| 0.05 < p <= 0.25 | Weak evidence of difference |
| 0.01 < p <= 0.05 | Moderate evidence of difference |
| p <= 0.01 | Strong evidence of difference |
2.9 Hellinger Distance¶
Hellinger distance measures the similarity between two probability distributions with desirable metric properties.
Mathematical Definition:
H(P, Q) = (1/√2) × √( Σᵢ (√pᵢ − √qᵢ)² )
Where pᵢ and qᵢ are probabilities for category/bin i.
Usage:
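Following the same call pattern as the other drift methods; Hellinger accepts both numeric and categorical columns:
# Hellinger distance works with numeric and categorical columns
drift = th.compare(baseline, current, method="hellinger")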
Interpretation:
| Hellinger Value | Interpretation |
|---|---|
| H = 0 | Distributions are identical |
| H < 0.1 | Very similar distributions |
| 0.1 <= H < 0.2 | Moderate difference |
| H >= 0.2 | Significant difference |
| H = 1 | Distributions have no overlap |
Characteristics:
| Aspect | Description |
|---|---|
| Column Type | Numeric and Categorical |
| Range | Bounded [0, 1] |
| Symmetry | Symmetric: H(P,Q) = H(Q,P) |
| Metric | True metric (satisfies triangle inequality) |
| Relationship | H(P,Q) = √(1 - BC(P,Q)) where BC is Bhattacharyya coefficient |
2.10 Bhattacharyya Distance¶
Bhattacharyya distance measures the overlap between two probability distributions, with connections to classification error bounds.
Mathematical Definition:
D_B(P, Q) = −ln( BC(P, Q) ),  where  BC(P, Q) = Σᵢ √(pᵢ × qᵢ)
BC is the Bhattacharyya coefficient, which measures the overlap between the two distributions.
Usage:
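Following the same call pattern as the other drift methods:
# Bhattacharyya distance works with numeric and categorical columns
drift = th.compare(baseline, current, method="bhattacharyya")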
Interpretation:
| Bhattacharyya Distance | Interpretation |
|---|---|
| D_B ≈ 0 | Distributions are identical |
| D_B < 0.1 | Very similar distributions |
| 0.1 <= D_B < 0.2 | Moderate difference |
| D_B >= 0.2 | Significant difference |
Characteristics:
| Aspect | Description |
|---|---|
| Column Type | Numeric and Categorical |
| Range | [0, +∞) |
| BC Coefficient | Bounded [0, 1] (reported in details) |
| Application | Related to Bayes classification error |
| Relationship | Related to Hellinger: H² = 1 - BC |
2.11 Total Variation Distance¶
Total Variation (TV) distance measures the maximum difference in probability between two distributions.
Mathematical Definition:
TV(P, Q) = (1/2) × Σₓ |P(x) − Q(x)| = sup_A |P(A) − Q(A)|
Usage:
drift = th.compare(baseline, current, method="tv")
# or
drift = th.compare(baseline, current, method="total_variation")
Interpretation:
| TV Value | Interpretation |
|---|---|
| TV = 0 | Distributions are identical |
| TV < 0.1 | Very similar distributions |
| 0.1 <= TV < 0.2 | Moderate difference |
| TV >= 0.2 | Significant difference |
| TV = 1 | Distributions have completely disjoint support |
Characteristics:
| Aspect | Description |
|---|---|
| Column Type | Numeric and Categorical |
| Range | Bounded [0, 1] |
| Symmetry | Symmetric: TV(P,Q) = TV(Q,P) |
| Metric | True metric (satisfies triangle inequality) |
| Interpretation | "Largest possible probability difference for any event" |
Relationship with Hellinger:
H²(P, Q) <= TV(P, Q) <= √2 × H(P, Q)
2.12 Energy Distance¶
Energy distance is a statistical distance that characterizes the equality of distributions and has desirable metric properties.
Mathematical Definition:
E(P, Q) = 2 × E‖X − Y‖ − E‖X − X′‖ − E‖Y − Y′‖
Where X, X′ ~ P and Y, Y′ ~ Q are independent samples.
Usage:
# Energy distance requires numeric columns
drift = th.compare(baseline, current, method="energy", columns=["age", "salary"])
Note: Energy distance only works with numeric columns.
Interpretation:
| Normalized Energy | Interpretation |
|---|---|
| E ≈ 0 | Distributions are identical |
| E < 0.1 | Very similar distributions |
| 0.1 <= E < 0.2 | Moderate difference |
| E >= 0.2 | Significant difference |
Characteristics:
| Aspect | Description |
|---|---|
| Column Type | Numeric only |
| Range | [0, +∞), normalized by pooled std |
| Metric | True metric (satisfies triangle inequality) |
| Consistency | E(P,Q) = 0 if and only if P = Q |
| Computational | O(n²) for exact, can subsample for efficiency |
2.13 Maximum Mean Discrepancy (MMD)¶
Maximum Mean Discrepancy is a kernel-based distance measure that compares distributions in a reproducing kernel Hilbert space (RKHS).
Mathematical Definition:
MMD²(P, Q) = E[k(X, X′)] − 2 × E[k(X, Y)] + E[k(Y, Y′)]
Where X, X′ ~ P, Y, Y′ ~ Q, and k is a kernel function (default: Gaussian RBF kernel).
Usage:
# MMD requires numeric columns
drift = th.compare(baseline, current, method="mmd", columns=["feature1", "feature2"])
Note: MMD only works with numeric columns.
Kernel Options (configurable via API):
| Kernel | Formula | Best For |
|---|---|---|
| RBF (default) | k(x,y) = exp(-γ‖x-y‖²) | General purpose |
| Linear | k(x,y) = x·y | Linear differences |
| Polynomial | k(x,y) = (1 + x·y)² | Non-linear patterns |
Interpretation:
| MMD Value | Interpretation |
|---|---|
| MMD ≈ 0 | Distributions are identical (in RKHS) |
| MMD < 0.1 | Very similar distributions |
| 0.1 <= MMD < 0.2 | Moderate difference |
| MMD >= 0.2 | Significant difference |
Characteristics:
| Aspect | Description |
|---|---|
| Column Type | Numeric only |
| Range | [0, +∞) |
| Non-parametric | No density estimation required |
| High-dimensional | Works well where density estimation fails |
| Bandwidth | Auto-selected via median heuristic or custom |
| Computational | O(n²), can subsample for efficiency |
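For intuition, a compact NumPy sketch of the (biased) MMD² estimate with an RBF kernel and the median heuristic for the bandwidth. This illustrates the formula only and is not Truthound's implementation; the function name and sample data are invented for the example.
import numpy as np

def rbf_mmd2(x: np.ndarray, y: np.ndarray) -> float:
    # Pairwise squared Euclidean distances over the pooled sample
    xy = np.concatenate([x, y])
    d2 = ((xy[:, None, :] - xy[None, :, :]) ** 2).sum(axis=-1)
    # Median heuristic: set the RBF bandwidth from the median pairwise squared distance
    gamma = 1.0 / (np.median(d2[d2 > 0]) + 1e-12)
    k = np.exp(-gamma * d2)
    n = len(x)
    kxx, kyy, kxy = k[:n, :n], k[n:, n:], k[:n, n:]
    # Biased MMD² estimate: E[k(X,X')] − 2·E[k(X,Y)] + E[k(Y,Y')]
    return float(kxx.mean() - 2 * kxy.mean() + kyy.mean())

mmd2 = rbf_mmd2(np.random.normal(0, 1, (500, 2)), np.random.normal(0.5, 1, (500, 2)))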
Currently Available Methods Summary¶
| Method | th.compare() | ML API | Column Type |
|---|---|---|---|
| `auto` | ✅ | - | Any (auto-select) |
| `ks` | ✅ | - | Numeric only |
| `psi` | ✅ | ✅ | Numeric only |
| `chi2` | ✅ | - | Categorical |
| `js` | ✅ | ✅ (jensen_shannon) | Any |
| `kl` | ✅ | - | Numeric only |
| `wasserstein` | ✅ | ✅ | Numeric only |
| `cvm` | ✅ | - | Numeric only |
| `anderson` | ✅ | - | Numeric only |
| `hellinger` | ✅ | - | Any |
| `bhattacharyya` | ✅ | - | Any |
| `tv` | ✅ | - | Any |
| `energy` | ✅ | - | Numeric only |
| `mmd` | ✅ | - | Numeric only |
3. Anomaly Detection Methods¶
3.1 Z-Score Method¶
Z-score identifies outliers based on standard deviations from the mean.
Mathematical Definition:
z = (x − μ) / σ
Where μ is the mean and σ is the standard deviation of the column.
Usage:
from truthound.ml import ZScoreAnomalyDetector
detector = ZScoreAnomalyDetector(threshold=3.0)
detector.fit(df)
result = detector.detect(df)
Characteristics:
| Aspect | Description |
|---|---|
| Best For | Normally distributed data |
| Threshold | Typically \|z\| > 3.0 |
| Assumption | Data is approximately Gaussian |
3.2 Interquartile Range (IQR)¶
IQR method uses quartiles to define outlier boundaries.
Mathematical Definition:
IQR = Q3 − Q1
Lower bound = Q1 − k × IQR,  Upper bound = Q3 + k × IQR
Observations outside [Lower bound, Upper bound] are flagged as outliers.
Usage:
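This document does not show a dedicated Truthound class for the IQR rule, so the sketch below applies the bounds directly with Polars; the column name and data are placeholders.
import polars as pl

# Placeholder data; "value" is an invented column name
df = pl.DataFrame({"value": [1.0, 2.0, 2.5, 3.0, 2.2, 95.0]})

k = 1.5  # 1.5 = standard outliers, 3.0 = extreme outliers
q1, q3 = df.select(
    pl.col("value").quantile(0.25).alias("q1"),
    pl.col("value").quantile(0.75).alias("q3"),
).row(0)
iqr = q3 - q1
outliers = df.filter((pl.col("value") < q1 - k * iqr) | (pl.col("value") > q3 + k * iqr))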
Parameters:
| k Value | Description |
|---|---|
| 1.5 | Standard outliers |
| 3.0 | Extreme outliers |
Characteristics:
- Distribution-free and resistant to extreme outliers
- Works best for symmetric or approximately symmetric distributions
3.3 Modified Z-Score (MAD)¶
Modified Z-score uses Median Absolute Deviation for robustness.
Mathematical Definition:
Mᵢ = 0.6745 × (xᵢ − median(x)) / MAD
Where MAD = median(|xᵢ − median(x)|) is the median absolute deviation.
Usage:
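A dedicated Truthound class for the modified z-score is not shown in this document; the NumPy sketch below illustrates the computation directly (sample data invented):
import numpy as np

x = np.array([12.0, 13.1, 12.8, 13.0, 12.9, 30.0])
med = np.median(x)
mad = np.median(np.abs(x - med))
# Modified z-scores; |M| > 3.5 is the usual outlier cutoff
modified_z = 0.6745 * (x - med) / mad
outliers = x[np.abs(modified_z) > 3.5]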
Characteristics:
- Highly resistant to outliers
- Threshold: |M| > 3.5 typically
3.4 Isolation Forest¶
Isolation Forest isolates anomalies by random recursive partitioning.
Principle: Anomalies are easier to isolate and require fewer splits.
Anomaly Score:
s(x, n) = 2^(−E(h(x)) / c(n))
Where:
- h(x) is the path length for observation x
- E(h(x)) is the expected path length
- c(n) is the average path length for n samples
Usage:
from truthound.ml import IsolationForestDetector
detector = IsolationForestDetector(
contamination=0.1,
n_estimators=100
)
detector.fit(df)
result = detector.detect(df)
Interpretation:
| Score | Interpretation |
|---|---|
| s ≈ 1 | Likely anomaly |
| s ≈ 0.5 | No distinct anomaly |
| s << 0.5 | Likely normal |
Characteristics:
- Linear time complexity O(n)
- Effective for high-dimensional data
3.5 Local Outlier Factor (LOF)¶
LOF identifies anomalies based on local density deviation.
Algorithm:
1. Compute the k-nearest neighbors for each point
2. Calculate the local reachability density (LRD) of each point
3. Compare the LRD of a point to the LRDs of its neighbors
Formula:
LOFₖ(x) = [ Σ_{y ∈ Nₖ(x)} LRDₖ(y) / LRDₖ(x) ] / |Nₖ(x)|
Where Nₖ(x) is the set of k-nearest neighbors of x.
Usage:
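A dedicated Truthound LOF class is not shown in this document; outside Truthound, scikit-learn's LocalOutlierFactor illustrates the same computation (sample data invented):
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.normal(0, 1, (500, 2))
X[:5] += 6  # inject a few off-cluster points

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                 # -1 = outlier, 1 = inlier
lof_scores = -lof.negative_outlier_factor_  # higher values = more anomalous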
Interpretation:
| LOF Value | Interpretation |
|---|---|
| LOF ≈ 1 | Normal (similar density to neighbors) |
| LOF > 1 | Lower density than neighbors (potential outlier) |
| LOF >> 1 | Significant outlier |
3.6 Mahalanobis Distance¶
Mahalanobis distance accounts for correlations between variables.
Mathematical Definition:
D(x) = √( (x − μ)ᵀ Σ⁻¹ (x − μ) )
Where:
- x is the observation vector
- μ is the mean vector
- Σ is the covariance matrix
Usage:
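A dedicated Truthound class is not shown here; the NumPy/SciPy sketch below computes the distances directly and flags points against a chi-square cutoff (sample data invented):
import numpy as np
from scipy.stats import chi2

X = np.random.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=1_000)
mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
# Squared Mahalanobis distance for every row
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
# Flag points beyond the 99th percentile of chi-square with p = number of features
outliers = X[d2 > chi2.ppf(0.99, df=X.shape[1])]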
Characteristics:
- Scale-invariant
- Accounts for correlations between variables
- D² follows a chi-square distribution with p degrees of freedom (for approximately multivariate normal data with p features)
3.7 Additional Anomaly Methods¶
| Method | Approach | Best For |
|---|---|---|
| `grubbs` | Grubbs' test | Single outliers in univariate data |
| `esd` | Generalized ESD | Multiple outliers in univariate data |
| `dbscan` | Density-based clustering | Arbitrary-shaped clusters |
| `svm` | One-Class SVM | Non-linear boundaries |
| `autoencoder` | Reconstruction error | High-dimensional, complex patterns |
| `percentile` | Percentile bounds | Simple threshold-based detection |
| `tukey` | Tukey fences | Robust statistical bounds |
3.8 Ensemble Methods¶
from truthound.ml import EnsembleAnomalyDetector
ensemble = EnsembleAnomalyDetector(
detectors=[zscore_detector, iqr_detector, iso_detector],
voting_strategy="majority" # or "unanimous", "any"
)
4. Distribution Analysis¶
4.1 Normality Tests¶
| Test | Method | Best For |
|---|---|---|
| Shapiro-Wilk | W statistic | Small samples (n < 5000) |
| D'Agostino-Pearson | Skewness + kurtosis | Medium samples |
| Kolmogorov-Smirnov | CDF comparison | Large samples |
| Anderson-Darling | Weighted CDF | General use |
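The Truthound entry point for these tests is not shown in this document; for reference, all four are available in SciPy:
import numpy as np
from scipy import stats

x = np.random.normal(0, 1, 500)

w_stat, p_shapiro = stats.shapiro(x)                                # Shapiro-Wilk
k2_stat, p_dagostino = stats.normaltest(x)                          # D'Agostino-Pearson
ks_stat, p_ks = stats.kstest(x, "norm", args=(x.mean(), x.std()))   # KS against a fitted normal
ad_result = stats.anderson(x, dist="norm")                          # Anderson-Darling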
4.2 Descriptive Statistics¶
profile = th.profile(df)
# Available statistics per column:
# - count, null_count, null_ratio
# - mean, std, variance
# - min, max, range
# - q25, median (q50), q75
# - skewness, kurtosis
# - unique_count, unique_ratio
4.3 Entropy and Information¶
Shannon Entropy:
H(X) = −Σᵢ p(xᵢ) × log₂ p(xᵢ)
Where p(xᵢ) is the observed proportion of value xᵢ.
Usage:
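The exact Truthound call for entropy is not shown in this document; as an illustration, Shannon entropy of a categorical column can be computed from its observed proportions (data and column name invented):
import numpy as np
import polars as pl

df = pl.DataFrame({"category": ["a", "a", "b", "c", "a", "b"]})

# Shannon entropy (in bits) from observed category proportions
_, counts = np.unique(df["category"].to_numpy(), return_counts=True)
p = counts / counts.sum()
entropy = float(-(p * np.log2(p)).sum())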
5. Statistical Thresholds¶
5.1 Default Thresholds¶
| Method | Threshold | Interpretation |
|---|---|---|
| Z-Score | 3.0 | > 3 standard deviations |
| IQR | 1.5 | Outside 1.5 × IQR |
| MAD | 3.5 | > 3.5 modified z-scores |
| LOF | 1.5 | LOF score > 1.5 |
| Isolation Forest | 0.1 | Top 10% anomaly scores |
| KS Test | 0.05 | p-value threshold |
| PSI | 0.25 | Significant drift threshold |
| Chi-Square | 0.05 | p-value threshold |
5.2 Threshold Configuration¶
# Drift detection with custom thresholds
drift = th.compare(
baseline, current,
method="psi",
threshold=0.1 # More sensitive threshold
)
# Anomaly detection with custom thresholds
from truthound.ml import ZScoreAnomalyDetector
detector = ZScoreAnomalyDetector(threshold=2.5) # More sensitive
6. Method Selection Guide¶
6.1 By Data Type¶
| Data Type | Drift Method | Anomaly Method |
|---|---|---|
| Continuous | KS, Wasserstein, Energy | Z-Score, IQR, Isolation Forest |
| Categorical | Chi-Square, JS, Hellinger, TV | Mode deviation, category frequency |
| Ordinal | KS, Wasserstein | IQR, percentile |
| High-dimensional | MMD, Energy | Isolation Forest, Autoencoder |
| Time Series | KS with windows | LOF, ARIMA residuals |
| Probability Distributions | Hellinger, Bhattacharyya, TV | - |
6.2 By Sample Size¶
| Sample Size | Recommended Methods |
|---|---|
| n < 100 | Exact tests, bootstrap |
| 100 < n < 10,000 | KS, Chi-Square, Z-Score |
| n > 10,000 | PSI, JS, Isolation Forest |
| n > 1,000,000 | Sampled methods, streaming |
6.3 By Sensitivity Requirements¶
| Requirement | Methods |
|---|---|
| High sensitivity | Anderson-Darling, MAD (low threshold), Energy |
| Balanced | KS, PSI, IQR, Hellinger, TV |
| Low false positives | Mahalanobis, Ensemble voting, Bhattacharyya |
| True metric needed | Hellinger, TV, Energy, MMD |
6.4 Decision Tree¶
Is data continuous?
├─ Yes
│ ├─ Normally distributed? → Z-Score for anomaly, KS for drift
│ └─ Skewed/Unknown? → IQR for anomaly, PSI for drift
└─ No (Categorical)
├─ Few categories (< 20)? → Chi-Square, mode deviation
└─ Many categories? → Frequency analysis, JS divergence
Is data high-dimensional?
├─ Yes (> 10 features) → Isolation Forest, MMD
└─ No → Standard univariate methods
Are there existing outliers?
├─ Yes → MAD, IQR (robust methods)
└─ No → Z-Score, Mahalanobis
7. Mathematical Foundations¶
7.1 Empirical Distribution Functions¶
The empirical CDF is defined as:
F̂ₙ(x) = (1/n) × Σᵢ 1(xᵢ ≤ x)
Where 1(·) is the indicator function and n is the sample size.
This forms the basis for KS, CvM, and Anderson-Darling tests.
7.2 Information Theory Basics¶
Entropy measures uncertainty:
H(P) = −Σₓ P(x) × log P(x)
Relative Entropy (KL Divergence) measures distribution difference:
KL(P||Q) = Σₓ P(x) × log(P(x) / Q(x))
7.3 Hypothesis Testing Framework¶
Statistical tests follow the framework:
- Null Hypothesis (H₀): No difference between distributions
- Alternative Hypothesis (H₁): Distributions differ
- Test Statistic: Computed from data
- p-value: Probability of observing test statistic under H₀
- Decision: Reject H₀ if p-value < α (typically 0.05)
7.4 Multiple Testing Correction¶
When testing multiple columns:
| Method | Formula | Use Case |
|---|---|---|
| Bonferroni | α' = α / n | Conservative, independent tests |
| Holm | Sequential adjustment | Less conservative |
| Benjamini-Hochberg | FDR control | Many tests, some false positives acceptable |
# Truthound applies Benjamini-Hochberg by default for multiple columns
drift = th.compare(baseline, current, correction="bh")
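For reference, the Benjamini-Hochberg step-up procedure can be sketched as follows. This is a generic illustration of the adjustment, not Truthound's internal code; the function name and example p-values are invented.
import numpy as np

def benjamini_hochberg(p_values: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    # Sort p-values ascending and compare p_(i) against (i / m) * alpha
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = (np.arange(1, m + 1) / m) * alpha
    passed = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = np.max(np.where(passed)[0])   # largest rank i with p_(i) <= (i/m)*alpha
        rejected[order[: cutoff + 1]] = True   # reject every hypothesis up to that rank
    return rejected

drifted_columns = benjamini_hochberg(np.array([0.001, 0.012, 0.04, 0.2, 0.6]))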
8. References¶
- Kolmogorov, A. N. (1933). "Sulla determinazione empirica di una legge di distribuzione"
- Smirnov, N. (1948). "Table for Estimating the Goodness of Fit of Empirical Distributions"
- Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). "Isolation Forest"
- Breunig, M. M., et al. (2000). "LOF: Identifying Density-Based Local Outliers"
- Iglewicz, B., & Hoaglin, D. C. (1993). "How to Detect and Handle Outliers"
- Kullback, S., & Leibler, R. A. (1951). "On Information and Sufficiency"
- Lin, J. (1991). "Divergence measures based on the Shannon entropy"
- Mahalanobis, P. C. (1936). "On the generalized distance in statistics"
- Vaserstein, L. N. (1969). "Markov processes over denumerable products of spaces"
- Tukey, J. W. (1977). "Exploratory Data Analysis"
- Pearson, K. (1900). "On the criterion that a given system of deviations..."
- Hellinger, E. (1909). "Neue Begründung der Theorie quadratischer Formen..."
- Bhattacharyya, A. (1943). "On a measure of divergence between two statistical populations"
- Gretton, A., et al. (2012). "A Kernel Two-Sample Test" (MMD)
- Székely, G. J., & Rizzo, M. L. (2004). "Testing for equal distributions in high dimension"
See Also¶
- Validators Reference — All validator implementations
- Advanced Features — ML module documentation
- API Reference — Complete API documentation