Model Monitoring¶
The Model Monitoring module provides lifecycle monitoring for machine learning models deployed in production. It integrates with the truthound.ml.monitoring framework to deliver performance tracking, drift detection, quality metrics computation, and intelligent alerting.
Overview¶
Machine learning models require continuous monitoring to ensure that predictive performance holds over time. Unlike traditional software systems, ML models can degrade silently as a result of data drift, concept drift, or changes in the underlying data distribution. This module addresses these challenges by providing:
- Performance Metrics Collection: Systematic tracking of latency, throughput, and error rate measurements
- Drift Detection: Application of statistical methods for the identification of distribution changes
- Quality Metrics: Quantitative assessment of classification and regression model performance
- Intelligent Alerting: Implementation of threshold-based, statistical, and trend-based alert rule evaluation
Theoretical Foundation¶
Statistical Characterization of Data Drift¶
Data drift occurs when the statistical properties of input data change over time. This module employs several statistical tests from the truthound framework:
| Method | Mathematical Basis | Interpretation |
|---|---|---|
| PSI (Population Stability Index) | \(PSI = \sum_{i} (A_i - E_i) \times \ln(A_i / E_i)\) | <0.1 stable, 0.1-0.25 slight drift, >0.25 significant |
| KS (Kolmogorov-Smirnov) | \(D_n = \sup_x \lvert F_n(x) - F(x) \rvert\) | Larger statistic indicates greater distributional difference; significance assessed via p-value |
| JS (Jensen-Shannon) | \(JS(P \| Q) = \frac{1}{2}KL(P \| M) + \frac{1}{2}KL(Q \| M)\) | Bounded [0,1], symmetric divergence |
| Wasserstein | \(W_p(P, Q) = \left(\inf_{\gamma \in \Gamma(P,Q)} \int \|x-y\|^p d\gamma(x,y)\right)^{1/p}\) | Earth Mover's Distance |
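For reference, the PSI formula above can be computed directly with NumPy. The sketch below is illustrative and independent of truthound's internal implementation; the bin count and epsilon guard are assumptions.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """Compute PSI between a reference (expected) sample and a current (actual) sample."""
    # Bin edges are derived from the reference distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # Convert counts to proportions; eps avoids log(0) and division by zero.
    e_frac = e_counts / max(e_counts.sum(), 1) + eps
    a_frac = a_counts / max(a_counts.sum(), 1) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)
current = rng.normal(0.5, 1.0, 10_000)  # mean shift in production data
# Compare the result against the 0.1 / 0.25 interpretation bands above.
print(population_stability_index(reference, current))
```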
Concept Drift Detection Methodologies¶
Concept drift is characterized by a temporal change in the relationship between input features and the target variable. The module incorporates the following detection methods:
- DDM (Drift Detection Method): Error rate is monitored against warning and drift thresholds
- ADWIN (Adaptive Windowing): Window size is automatically adjusted based on change detection outcomes
- Page-Hinkley: A cumulative sum test is applied for the detection of gradual distributional changes
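As a concrete illustration of the Page-Hinkley variant above, the following simplified sketch monitors a stream of error-rate observations for an upward shift in the mean; the delta and threshold parameters are illustrative assumptions, not truthound defaults.

```python
class PageHinkley:
    """Simplified Page-Hinkley test for detecting an upward shift in a stream's mean."""

    def __init__(self, delta=0.005, threshold=50.0):
        self.delta = delta          # tolerance for small fluctuations
        self.threshold = threshold  # lambda: alarm when cumulative deviation exceeds this
        self.mean = 0.0
        self.n = 0
        self.cumulative = 0.0
        self.minimum = 0.0

    def update(self, value: float) -> bool:
        """Feed one observation (e.g. a per-batch error rate); return True on drift."""
        self.n += 1
        self.mean += (value - self.mean) / self.n
        self.cumulative += value - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return (self.cumulative - self.minimum) > self.threshold
```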
Quality Metrics Definitions¶
For classification models, the following metrics are computed:

- Accuracy: \(\frac{TP + TN}{TP + TN + FP + FN}\)
- Precision: \(\frac{TP}{TP + FP}\)
- Recall: \(\frac{TP}{TP + FN}\)
- F1 Score: \(2 \times \frac{Precision \times Recall}{Precision + Recall}\)

For regression models, the following metrics are computed:

- MAE: \(\frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\)
- MSE: \(\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\)
- RMSE: \(\sqrt{MSE}\)
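These definitions can be reproduced with scikit-learn and NumPy as a quick cross-check; the helper names below are illustrative and not part of the module's API.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def classification_quality(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }

def regression_quality(y_true, y_pred):
    """MAE, MSE, and RMSE for continuous targets."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = float(np.mean(np.abs(y_true - y_pred)))
    mse = float(np.mean((y_true - y_pred) ** 2))
    return {"mae": mae, "mse": mse, "rmse": float(np.sqrt(mse))}
```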
Model Monitoring Interface¶
Aggregate Statistics Dashboard¶
The interface presents aggregate model monitoring metrics as summarized below:
| Metric | Description |
|---|---|
| Total Models | Count of registered models |
| Active Models | Models currently in production |
| Degraded Models | Models exhibiting performance degradation |
| Predictions (24h) | Total predictions across all models |
| Active Alerts | Unresolved model-related alerts |
| Models with Drift | Models where input/output drift detected |
| Average Latency | Mean inference latency across models |
Model Registration and Version Management¶
Registration of a New Model¶
- Click Register Model
- Complete the registration form across three tabs:
Basic Information Tab¶
| Field | Description | Required |
|---|---|---|
| Model Name | Unique identifier for the model | Yes |
| Version | Semantic version (e.g., 1.0.0) | Yes |
| Description | Model purpose and documentation | No |
| Metadata | Custom key-value pairs | No |
Configuration Tab¶
The configuration parameters correspond directly to the truthound MonitorConfig specification:
Feature Toggles
| Setting | Description | Default |
|---|---|---|
| Enable Performance Metrics | Track latency, throughput, error rates | Enabled |
| Enable Drift Detection | Monitor distribution changes using th.compare() | Enabled |
| Enable Quality Metrics | Track accuracy, precision, recall, F1 (requires actual values) | Enabled |
Drift Detection Parameters (when enabled)
| Setting | Description | Default |
|---|---|---|
| Drift Method | Statistical method for drift detection | Auto |
| Drift Threshold | Score threshold for triggering alerts | 10% |
Available drift methods are enumerated below:

- Auto: The optimal method is automatically selected based on column type
- PSI: Population Stability Index (recommended for tabular data)
- KS: Kolmogorov-Smirnov test (distribution comparison)
- JS: Jensen-Shannon divergence (symmetric, bounded)
- Wasserstein: Earth Mover's Distance (geometry-aware)
- Chi-squared: Applicable to categorical features
- KL: Kullback-Leibler divergence
- Hellinger: Bounded distance metric
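For intuition about how two of these methods behave on the same data, the following sketch uses standard SciPy implementations rather than truthound's internals; the bin count for the Jensen-Shannon comparison is an assumption.

```python
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, 5_000)
current = rng.normal(0.5, 1.2, 5_000)

# Kolmogorov-Smirnov: D statistic plus p-value for the two-sample comparison.
ks_stat, p_value = ks_2samp(reference, current)

# Jensen-Shannon: compare binned histograms; SciPy returns the JS *distance*
# (square root of the divergence), so square it if the divergence is needed.
edges = np.histogram_bin_edges(np.concatenate([reference, current]), bins=20)
p, _ = np.histogram(reference, bins=edges)
q, _ = np.histogram(current, bins=edges)
js_distance = jensenshannon(p, q, base=2)

print(f"KS D={ks_stat:.3f} (p={p_value:.3g}), JS distance={js_distance:.3f}")
```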
Collection Parameters
| Setting | Description | Default | Range |
|---|---|---|---|
| Batch Size | Metrics collection batch size | 100 | 1-10,000 |
| Retention Hours | Data retention period | 24 | 1-720 |
| Collection Interval | Metric collection frequency (seconds) | 60 | 1-3,600 |
| Alert Evaluation Interval | Rule evaluation frequency (seconds) | 30 | 1-3,600 |
Alerts Tab¶
The tab displays the default alert rules that are instantiated automatically upon model registration:

- High Latency Alert (P95 > 500ms)
- Drift Detection Alert (exceeds configured threshold)
- Error Rate Alert (> 5%)
Supported Model Types¶
| Type | Description | Key Metrics |
|---|---|---|
| Classification | Categorical prediction | Accuracy, precision, recall, F1 |
| Regression | Continuous value prediction | MAE, MSE, RMSE |
| Ranking | Ordered list generation | NDCG, MAP, MRR |
Metrics Inspection Tab¶
Examination of Model Metrics¶
- Select a model from the dropdown
- Choose time range (1h, 6h, 24h, 7d)
- Review metrics display and time-series charts
Performance Metrics¶
| Metric | Description | Applicable To |
|---|---|---|
| Accuracy | Correct predictions / total predictions | Classification |
| Precision | True positives / predicted positives | Classification |
| Recall | True positives / actual positives | Classification |
| F1 Score | Harmonic mean of precision and recall | Classification |
| MAE | Mean absolute error | Regression |
| MSE | Mean squared error | Regression |
| RMSE | Root mean squared error | Regression |
Operational Metrics¶
| Metric | Description |
|---|---|
| Latency (p50) | Median inference time |
| Latency (p95) | 95th percentile inference time |
| Latency (p99) | 99th percentile inference time |
| Predictions Count | Total predictions in period |
| Error Rate | Percentage of failed predictions |
| Throughput | Predictions per second |
Drift Detection¶
Application of truthound th.compare()¶
The drift detection capability is implemented through truthound's th.compare() function, which is employed to identify distribution changes between reference and current datasets.
Operational Workflow:

1. Select the reference data source (baseline distribution)
2. Select the current data source (production distribution)
3. Choose a drift detection method
4. Review the per-column drift scores
Interpretation Guidelines:
| PSI Score | Interpretation | Action |
|---|---|---|
| < 0.10 | No significant drift | Continue monitoring |
| 0.10 - 0.25 | Slight drift | Investigate root cause |
| > 0.25 | Significant drift | Consider model retraining |
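As a hedged sketch, the workflow above can also be driven through the detect-drift endpoint listed in the API reference. The base URL, model identifier, request fields, and response shape below are assumptions; consult the API schema for exact names.

```python
import requests

BASE_URL = "https://dashboard.example.com"  # placeholder deployment URL
model_id = "fraud-model-01"                 # placeholder model identifier

# Field names are illustrative; check the API schema for the exact payload.
payload = {
    "reference_source": "training_baseline_2024q4",
    "current_source": "production_last_24h",
    "method": "psi",
}
resp = requests.post(
    f"{BASE_URL}/model-monitoring/models/{model_id}/detect-drift",
    json=payload,
    timeout=30,
)
resp.raise_for_status()

# The response key below is an assumption; adjust to the actual schema.
for column, score in resp.json().get("column_scores", {}).items():
    print(column, score)
```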
Drift Alert Generation Mechanism¶
When the drift score exceeds the configured threshold, the following sequence is initiated:

1. The system creates an alert with severity determined by the score magnitude
2. The alert identifies the drifted columns and their individual scores
3. The model status may transition to "Degraded" if the score exceeds 0.3
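A simplified sketch of this decision logic is shown below; the 0.3 Degraded cutoff follows the description above, while the intermediate severity bands are illustrative assumptions.

```python
def evaluate_drift(score: float, threshold: float = 0.1):
    """Map a drift score to an alert severity and a model status transition."""
    if score <= threshold:
        return None, "active"  # below the configured threshold: no alert
    # Severity bands above the threshold are assumptions for illustration.
    if score > 0.3:
        return "critical", "degraded"  # score > 0.3 also marks the model Degraded
    if score > 0.25:
        return "high", "active"
    return "medium", "active"
```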
Quality Metrics Assessment¶
Computation of Quality Metrics¶
Quality metrics are derived from predictions for which associated actual (ground truth) values have been recorded.
For Classification Models:

- Binary versus multi-class classification is detected automatically
- Accuracy is computed for all classification types
- Precision, recall, and F1 are computed for binary classification

For Regression Models, the following are computed:

- MAE (Mean Absolute Error)
- MSE (Mean Squared Error)
- RMSE (Root Mean Squared Error)
Recording Predictions with Ground Truth Values¶
To enable quality metrics computation, predictions must be recorded with the actual field:
POST /model-monitoring/models/{id}/predictions
{
"features": {"amount": 150.0, "merchant_type": "online"},
"prediction": 0.85,
"actual": 1,
"latency_ms": 5.2
}
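The same request issued from Python; the base URL and model identifier are placeholders.

```python
import requests

BASE_URL = "https://dashboard.example.com"  # placeholder deployment URL
model_id = "fraud-model-01"                 # placeholder model identifier

prediction = {
    "features": {"amount": 150.0, "merchant_type": "online"},
    "prediction": 0.85,
    "actual": 1,          # ground truth enables quality metrics for this prediction
    "latency_ms": 5.2,
}
resp = requests.post(
    f"{BASE_URL}/model-monitoring/models/{model_id}/predictions",
    json=prediction,
    timeout=10,
)
resp.raise_for_status()
```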
Alert Rule Configuration¶
Classification of Alert Rule Types¶
The module supports three distinct rule types, each corresponding to components within truthound's alerting framework:
Threshold-Based Rules¶
Threshold-based alerting is configured through the following parameters:

- Metric Name: Target metric to be monitored
- Threshold: Trigger value
- Comparison: gt, lt, gte, lte, eq
- Duration: Time period over which the condition must persist
Statistical Rules (Anomaly Detection Rules)¶
Anomaly-based alerting is performed using statistical methods:

- Window Size: Sample size utilized for statistical computation
- Std Devs: Number of standard deviations defining the threshold boundary
- An alert is triggered when the metric value exceeds the expected statistical range
Trend-Based Rules¶
Trend-based alerting is designed for the detection of gradual changes:

- Direction: "increasing" or "decreasing"
- Slope Threshold: Minimum rate of change required for activation
- Lookback Minutes: Time window employed for trend calculation
- Linear regression is utilized to detect degradation trends
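A minimal sketch of the trend evaluation, fitting a least-squares line to recent metric samples; the sampling cadence and data layout are assumptions.

```python
import numpy as np

def trend_triggers(values, timestamps_min, direction="decreasing", slope_threshold=0.01):
    """Fit a line to (time, metric) samples and check the configured trend condition.

    values         -- metric samples inside the lookback window
    timestamps_min -- sample times in minutes (same length as values)
    """
    slope, _intercept = np.polyfit(
        np.asarray(timestamps_min, float), np.asarray(values, float), deg=1
    )
    if direction == "decreasing":
        return slope < -slope_threshold
    return slope > slope_threshold

# Accuracy sampled once per minute and trending downward triggers the rule:
print(trend_triggers([0.95, 0.93, 0.90, 0.88], [0, 1, 2, 3]))  # True
```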
Exemplary Rule Configurations¶
| Rule | Type | Metric | Condition | Severity |
|---|---|---|---|---|
| Low Accuracy | threshold | accuracy | < 0.85 | High |
| High Latency | threshold | latency_p95 | > 500ms | Medium |
| Error Spike | statistical | error_rate | > 3 std devs | Critical |
| Drift Detected | threshold | drift_score | > 0.1 | High |
| Degrading Performance | trend | accuracy | decreasing, slope > 0.01 | Warning |
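For illustration, the Low Accuracy rule from the table above could be created through the rules endpoint listed in the API reference. The payload keys below are assumptions; consult the API schema for exact field names.

```python
import requests

BASE_URL = "https://dashboard.example.com"  # placeholder deployment URL

# Payload keys are illustrative; check the API schema for exact field names.
low_accuracy_rule = {
    "model_id": "fraud-model-01",
    "name": "Low Accuracy",
    "rule_type": "threshold",
    "metric_name": "accuracy",
    "comparison": "lt",
    "threshold": 0.85,
    "duration_seconds": 300,
    "severity": "high",
}
resp = requests.post(f"{BASE_URL}/model-monitoring/rules", json=low_accuracy_rule, timeout=10)
resp.raise_for_status()
```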
Alert Handler Configuration Tab¶
Supported Handler Types¶
The module supports handlers that correspond to truthound's alert handler framework:
| Handler | truthound Mapping | Use Case |
|---|---|---|
| Slack | SlackAlertHandler | Team notifications |
| Webhook | WebhookAlertHandler | External integrations |
| Email | - | Stakeholder notifications |
| PagerDuty | PagerDutyAlertHandler | On-call escalation |
Handler Configuration Parameters¶
Slack Handler¶
| Parameter | Description |
|---|---|
| Webhook URL | Slack incoming webhook URL |
| Channel | Target channel (optional override) |
| Mention | Users/groups to mention |
Webhook Handler¶
| Parameter | Description |
|---|---|
| URL | Webhook endpoint |
| Method | HTTP method (POST, PUT) |
| Headers | Custom HTTP headers |
PagerDuty Handler¶
| Parameter | Description |
|---|---|
| Routing Key | PagerDuty integration key |
| Severity Mapping | Map alert severity to PagerDuty severity |
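As a sketch, a Slack handler could be created and then verified through the handlers endpoints listed in the API reference; the payload keys and response field below are assumptions.

```python
import requests

BASE_URL = "https://dashboard.example.com"  # placeholder deployment URL

# Payload keys are illustrative; check the API schema for exact field names.
slack_handler = {
    "handler_type": "slack",
    "webhook_url": "https://hooks.slack.com/services/T000/B000/XXXX",
    "channel": "#ml-alerts",
    "mention": "@oncall-ml",
}
created = requests.post(f"{BASE_URL}/model-monitoring/handlers", json=slack_handler, timeout=10)
created.raise_for_status()
handler_id = created.json()["id"]  # response field name is an assumption

# The test endpoint sends a sample alert through the newly created handler.
requests.post(f"{BASE_URL}/model-monitoring/handlers/{handler_id}/test", timeout=10)
```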
Model Lifecycle Management¶
Status Transition Model¶
| Status | Color | Description | Automatic Transition |
|---|---|---|---|
| Active | Green | Operating within parameters | - |
| Paused | Gray | Monitoring suspended | Manual |
| Degraded | Yellow | Performance below threshold | When drift_score > 0.3 |
| Error | Red | Experiencing errors | On repeated failures |
Health Score Computation¶
The health score (0-100) is computed as a weighted composite of the following factors:

- Drift score contribution (weighted)
- Error rate contribution
- Latency threshold violations
- Active alert count
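A sketch of one possible weighting scheme is shown below; the factors mirror the list above, while the weights and caps are illustrative assumptions rather than the module's actual coefficients.

```python
def health_score(drift_score, error_rate, latency_violations, active_alerts):
    """Weighted composite health score in [0, 100] (illustrative weights)."""
    penalty = (
        40 * min(drift_score / 0.3, 1.0)          # drift contribution, capped at 0.3
        + 30 * min(error_rate / 0.05, 1.0)        # error-rate contribution, capped at 5%
        + 20 * min(latency_violations / 10, 1.0)  # latency threshold violations
        + 10 * min(active_alerts / 5, 1.0)        # open alert count
    )
    return round(100 - penalty, 1)

print(health_score(drift_score=0.12, error_rate=0.01, latency_violations=2, active_alerts=1))  # 72.0
```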
Integration with the truthound Framework¶
Component Mapping¶
| Dashboard Feature | truthound Component |
|---|---|
| Model Config | MonitorConfig |
| Performance Metrics | PerformanceCollector |
| Drift Detection | th.compare(), DriftCollector |
| Quality Metrics | QualityCollector |
| Threshold Rules | ThresholdRule |
| Statistical Rules | AnomalyRule |
| Trend Rules | TrendRule |
| Slack Alerts | SlackAlertHandler |
| Webhook Alerts | WebhookAlertHandler |
| PagerDuty Alerts | PagerDutyAlertHandler |
Drift Detection Methods Reference¶
| Method | Type | Best For | Notes |
|---|---|---|---|
| auto | - | General use | Selects optimal method per column |
| psi | Binned | Tabular data | Industry standard |
| ks | Distribution | Numeric columns | Sensitive to shape |
| js | Divergence | All types | Symmetric, bounded [0,1] |
| wasserstein | Distance | Numeric columns | Geometry-aware |
| chi2 | Statistical | Categorical | Chi-squared test |
| kl | Divergence | All types | Information-theoretic |
| cvm | Statistical | Numeric | Sensitive to tails |
| anderson | Statistical | Numeric | Most sensitive to tails |
| hellinger | Distance | All types | Bounded [0,1] |
| energy | Distance | Numeric | Location/scale sensitive |
| mmd | Kernel | High-dimensional | Maximum Mean Discrepancy |
API Reference¶
Model Management Endpoints¶
| Endpoint | Method | Description |
|---|---|---|
| /model-monitoring/models | GET | List registered models |
| /model-monitoring/models | POST | Register a new model |
| /model-monitoring/models/{id} | GET | Retrieve model details |
| /model-monitoring/models/{id} | PUT | Update model configuration |
| /model-monitoring/models/{id} | DELETE | Delete a model |
| /model-monitoring/models/{id}/pause | POST | Pause monitoring |
| /model-monitoring/models/{id}/resume | POST | Resume monitoring |
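As an illustration, the registration form fields map naturally onto a JSON payload for the POST endpoint above; the JSON keys are assumptions and may differ from the actual schema.

```python
import requests

BASE_URL = "https://dashboard.example.com"  # placeholder deployment URL

# JSON keys mirror the registration form fields; exact names may differ.
new_model = {
    "name": "fraud-model",
    "version": "1.0.0",
    "description": "Gradient-boosted fraud classifier",
    "metadata": {"team": "risk", "framework": "xgboost"},
    "model_type": "classification",
    "config": {
        "enable_performance_metrics": True,
        "enable_drift_detection": True,
        "drift_method": "psi",
        "drift_threshold": 0.1,
    },
}
resp = requests.post(f"{BASE_URL}/model-monitoring/models", json=new_model, timeout=10)
resp.raise_for_status()
print(resp.json())
```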
Metrics and Analysis Endpoints¶
| Endpoint | Method | Description |
|---|---|---|
| /model-monitoring/models/{id}/metrics | GET | Retrieve performance metrics |
| /model-monitoring/models/{id}/quality-metrics | GET | Retrieve quality metrics |
| /model-monitoring/models/{id}/detect-drift | POST | Run drift detection |
| /model-monitoring/models/{id}/predictions | POST | Record prediction |
Alert and Rule Management Endpoints¶
| Endpoint | Method | Description |
|---|---|---|
| /model-monitoring/alerts | GET | List model alerts |
| /model-monitoring/alerts/{id}/acknowledge | POST | Acknowledge alert |
| /model-monitoring/alerts/{id}/resolve | POST | Resolve alert |
| /model-monitoring/rules | GET | List alert rules |
| /model-monitoring/rules | POST | Create alert rule |
| /model-monitoring/rules/{id} | PUT | Update alert rule |
| /model-monitoring/rules/{id} | DELETE | Delete alert rule |
| /model-monitoring/handlers | GET | List alert handlers |
| /model-monitoring/handlers | POST | Create alert handler |
| /model-monitoring/handlers/{id}/test | POST | Test alert handler |
Dashboard Endpoints¶
| Endpoint | Method | Description |
|---|---|---|
| /model-monitoring/overview | GET | Retrieve monitoring overview |
| /model-monitoring/models/{id}/dashboard | GET | Model-specific dashboard |
Recommended Operational Practices¶
Monitoring Strategy¶
- Establish a baseline: Reference metrics should be established prior to production deployment
- Configure appropriate thresholds: Thresholds should be determined based on business requirements and historical data analysis
- Enable drift detection: Drift detection is considered essential for identifying silent model degradation
- Implement alerting: Alert handlers should be configured to ensure timely notification of operational issues
Drift Detection¶
- Select an appropriate method: PSI is recommended for general use; KS is preferred when distributional sensitivity is required
- Define reasonable thresholds: It is advisable to begin with conservative thresholds (0.1) and adjust based on observed drift patterns
- Monitor at the per-column level: Individual features contributing to drift should be identified
- Correlate with performance metrics: It should be noted that not all drift impacts model performance with equal magnitude
Alert Configuration¶
- Prioritize critical metrics: Alerting should be focused on metrics that directly impact business outcomes
- Mitigate alert fatigue: Thresholds should be calibrated to minimize false positive rates
- Employ trend-based rules: Trend rules are recommended for detecting gradual degradation before it reaches a critical state
- Configure escalation pathways: Critical alerts should be routed to the appropriate operational channels
Diagnostic and Troubleshooting Procedures¶
This section is reserved for the documentation of common diagnostic procedures, known failure modes, and their corresponding resolution strategies. Practitioners are advised to consult the truthound ML Module Documentation for framework-level troubleshooting guidance.
References¶
- truthound ML Module Documentation: .truthound_docs/advanced/ml-anomaly.md
- Statistical Drift Detection Methods: Population Stability Index (PSI), Kolmogorov-Smirnov Test
- Concept Drift Detection: Gama, J., et al. (2014). A survey on concept drift adaptation