Validations¶
The Validations module constitutes the primary mechanism for executing data quality checks and inspecting their results. It encompasses the full lifecycle of a validation run -- from configuration and execution through to result analysis, versioning, and historical trend monitoring.
Overview¶
Data validation is defined as the process of systematically verifying that a dataset conforms to a prescribed set of quality rules. Truthound Dashboard supports 289+ built-in validators spanning schema integrity, completeness, uniqueness, distribution, string patterns, datetime formats, aggregate statistics, cross-table relationships, geospatial coordinates, drift detection, anomaly detection, and privacy compliance.
The validation system is composed of three interrelated subsystems:
| Subsystem | Purpose | Page |
|---|---|---|
| Validation Execution Framework | Run validators against a data source and inspect results | Source Detail |
| Validation History and Longitudinal Analysis | Track pass/fail trends over time with statistical analysis | History |
| Version Management and Snapshot Comparison | Create and compare snapshots of validation results | Version History |
Validation Execution Framework¶
Running a Validation¶
Validation is initiated from the Source Detail page. The execution workflow is structured as follows:
- Select Validators: The Validator Selector is employed to designate which validators are to be executed
- Configure Parameters: Validator-specific parameters (thresholds, patterns, column selections) are adjusted as required
- Execute: The validation process is launched by selecting "Configure & Run"
- Review Results: Detected issues are examined, grouped by severity classification
Validator Configuration¶
The Validator Selector exposes a comprehensive interface through which validators may be selected and configured:
| Feature | Description |
|---|---|
| Preset Templates | Quick-start configurations: All Validators, Quick Check, Schema Only, Data Quality |
| Category Filtering | Filter by 14 categories (schema, completeness, uniqueness, distribution, etc.) |
| Search | Find validators by name, description, or tag |
| Parameter Configuration | Type-specific input forms (text, number, select, boolean, column list) |
| Severity Override | Override the default severity level for any validator |
| Column Autocomplete | Schema-aware column selection with autocomplete |
Execution Parameters¶
The validation engine accommodates several execution-level parameters that govern how validators are applied. Parameters introduced through the Truthound core engine enhancement phases (PHASE 1–5) are denoted accordingly.
| Parameter | Description | Default | Phase |
|---|---|---|---|
| validators | List of validators to execute | All enabled | — |
| validator_config | Per-validator parameter overrides | None | — |
| min_severity | Minimum severity threshold for reporting | low | — |
| parallel | Enable parallel execution across validators | false | — |
| max_workers | Number of parallel worker threads | CPU count | — |
| pushdown | Enable SQL query pushdown for database sources | false | — |
| schema | Schema file path or Schema object for constraint validation | Auto-detected | — |
| auto_schema | Automatically learn schema before validation | false | — |
| result_format | Progressive detail level controlling result enrichment depth | summary | PHASE 1 |
| include_unexpected_rows | Include failure row DataFrame in SUMMARY or higher results | false | PHASE 1 |
| max_unexpected_rows | Maximum number of failure rows to return (1–10,000) | 1000 | PHASE 1 |
| catch_exceptions | Enable exception isolation mode; errors are captured in the report rather than aborting execution | true | PHASE 5 |
| max_retries | Maximum retry attempts for transient errors with exponential backoff (0–10) | 3 | PHASE 5 |
Result Format System (PHASE 1)¶
The result_format parameter implements a four-level progressive disclosure model that governs the granularity of validation output. This system was introduced to enable practitioners to balance the trade-off between diagnostic richness and computational overhead.
| Level | Description | Enrichment Phases |
|---|---|---|
| BOOLEAN_ONLY | Pass/fail determination only; bypasses all enrichment phases | Phase 1 (aggregates) |
| BASIC | Adds failure counts and sample values | Phases 1–2 (aggregates + samples) |
| SUMMARY | Adds value frequency distributions (default) | Phases 1–3 (aggregates + samples + value counts) |
| COMPLETE | Adds full failure row DataFrames and debug queries | Phases 1–4 (all enrichment phases) |
The selection of result_format directly influences both the computational cost of validation and the volume of data transmitted through the API response payload. For high-frequency automated validations, BOOLEAN_ONLY or BASIC are recommended to minimize latency.
Exception Isolation and Auto-Retry (PHASE 5)¶
The exception isolation subsystem provides fault-tolerant validation execution through a multi-tier recovery mechanism:
- Exception Capture: When
catch_exceptions=True, validator errors are caught and recorded asExceptionInfoobjects within the validation report, rather than propagating as unhandled exceptions that would abort the entire validation session. - Automatic Retry: Transient errors (timeout, connection failure) are automatically retried up to
max_retriestimes with exponential backoff. Permanent errors (configuration, data type mismatches) are not retried. - Three-Tier Fallback (Expression Batch Executor): When a batched expression set fails, the executor falls back to per-validator execution; if that also fails, it falls back to per-expression execution, thereby maximizing the number of successful validations in a single run.
- Failure Classification: Each exception is classified into one of four categories —
transient,permanent,configuration, ordata— enabling targeted remediation.
Validation Results¶
Upon completion, the validation result is comprised of the following fields. Fields introduced through the core engine enhancement phases are annotated with their corresponding phase identifiers.
| Field | Description | Phase |
|---|---|---|
| passed | Boolean indicating overall pass/fail status | — |
| total_issues | Total number of issues detected | — |
| has_critical | Whether any critical-severity issues were found | — |
| has_high | Whether any high-severity issues were found | — |
| issues | Detailed list of individual issues | — |
| execution_time | Duration of the validation run | — |
| validators_run | Number of validators that were executed | — |
| statistics | Aggregate validation statistics including success rate, issues by severity/column/validator, and most problematic columns | PHASE 2 |
| validator_execution_summary | Summary of validator execution states (executed, skipped, failed) with skip reasons | PHASE 4 |
| exception_summary | Aggregate exception statistics including retry counts, recovery rates, and circuit breaker trip counts | PHASE 5 |
Structured Validation Statistics (PHASE 2)¶
The statistics object provides a comprehensive analytical decomposition of the validation session, enabling multi-dimensional analysis of data quality:
| Statistic | Description |
|---|---|
| total_validations | Total number of individual validator executions |
| successful_validations | Number of validators that completed without issues |
| unsuccessful_validations | Number of validators that detected quality issues |
| success_percent | Percentage of successful validations |
| issues_by_severity | Distribution of issues across severity levels |
| issues_by_column | Distribution of issues across affected columns |
| issues_by_validator | Distribution of issues across validator types |
| most_problematic_columns | Ranked list of columns exhibiting the highest issue density |
Validator Execution Summary (PHASE 4)¶
When the Truthound core engine's Directed Acyclic Graph (DAG) execution is active, validators may be conditionally skipped based on dependency relationships. The validator_execution_summary provides transparency into this process:
| Field | Description |
|---|---|
| total_validators | Total number of validators in the execution plan |
| executed | Number of validators that completed execution |
| skipped | Number of validators that were bypassed due to dependency failures |
| failed | Number of validators that encountered runtime errors |
| skipped_details | Per-validator skip reasons (e.g., "Schema validation failed for column X") |
This mechanism ensures that downstream validators are not executed against data that has already been determined to violate prerequisite constraints, thereby reducing spurious issue reports and improving the signal-to-noise ratio of validation output.
Exception Summary (PHASE 5)¶
The exception_summary provides an aggregate view of system-level errors encountered during the validation session:
| Field | Description |
|---|---|
| total_exceptions | Total number of exceptions encountered |
| retried_count | Number of validations that were retried |
| recovered_count | Number of validations that succeeded after retry |
| permanent_failures | Number of validations that failed permanently |
| exceptions_by_type | Distribution of exceptions by type (e.g., TimeoutError, ConnectionError) |
| circuit_breaker_trips | Number of times the circuit breaker was triggered |
| skipped_by_dependency | Number of validators skipped due to dependency failure |
Issue Detail¶
Each detected issue is characterized by the following attributes. The enhanced issue model (PHASE 2 and PHASE 5) provides substantially richer diagnostic information.
| Attribute | Description | Phase |
|---|---|---|
| Column | Name of the affected column (if applicable) | — |
| Validator | Name of the validator that detected the issue | — |
| Severity | critical, high, medium, or low | — |
| Value | The specific value or statistic that triggered the issue | — |
| Description | Human-readable explanation of the issue | — |
| Row Count | Number of rows affected (for row-level issues) | — |
| validator_name | Canonical name of the validator class that generated this issue | PHASE 2 |
| success | Boolean indicating whether this specific validation passed | PHASE 2 |
| result | Structured ValidationDetail object containing quantitative metrics |
PHASE 2 |
| exception_info | Exception metadata for system-error issues (type, message, retry count, failure category) | PHASE 5 |
ValidationDetail Object (PHASE 2)¶
The result field on each issue contains a ValidationDetail object whose fields are progressively populated according to the selected result_format level:
| Field | Populated At | Description |
|---|---|---|
| element_count | BOOLEAN_ONLY+ | Total number of rows evaluated |
| missing_count | BOOLEAN_ONLY+ | Number of null/missing values |
| unexpected_count | BOOLEAN_ONLY+ | Number of rows failing the validation predicate |
| unexpected_percent | BOOLEAN_ONLY+ | Failure rate relative to total row count |
| unexpected_percent_nonmissing | BOOLEAN_ONLY+ | Failure rate excluding null rows |
| observed_value | BASIC+ | The specific observed metric value |
| partial_unexpected_list | BASIC+ | Sample of failing values |
| partial_unexpected_counts | SUMMARY+ | Value-frequency pairs for failing values |
| unexpected_rows | COMPLETE | Full DataFrame of failing rows (serialized as dict list) |
| debug_query | COMPLETE | Reproducible query for failure row extraction |
Severity Classification¶
Issues are classified into four severity levels, each corresponding to a distinct degree of data quality impact:
| Severity | Description | Visual |
|---|---|---|
| Critical | Data is fundamentally corrupt or unusable | Red |
| High | Significant quality issues that require attention | Orange |
| Medium | Moderate issues that may affect downstream processes | Yellow |
| Low | Minor issues or informational observations | Blue |
Validation History and Longitudinal Analysis¶
The History page facilitates longitudinal analysis of validation results for a given data source.
Temporal Controls¶
| Control | Options | Purpose |
|---|---|---|
| Period Selector | Last 7, 30, or 90 days | Define the time window for analysis |
| Granularity Selector | Hourly, daily, or weekly | Control the aggregation granularity of trend charts |
Summary Statistics¶
Four summary cards are presented, each displaying key metrics for the selected period:
| Metric | Description |
|---|---|
| Total Runs | Number of validation executions in the period |
| Success Rate | Percentage of validations that passed |
| Failure Rate | Percentage of validations that failed |
| Trend Direction | Whether quality is improving (↑) or declining (↓) |
Trend Visualization¶
The History page incorporates two chart types for the purposes of temporal analysis:
- Pass/Fail Rate Chart: A line chart displaying the validation success and failure rates over time at the selected granularity
- Issue Frequency Chart: A bar chart illustrating the frequency of different issue types, thereby enabling identification of recurring quality problems
Recent Validations Table¶
A paginated table enumerates individual validation runs, with each entry comprising the following fields:
| Column | Description |
|---|---|
| Timestamp | When the validation was executed |
| Status | Pass or fail |
| Issue Count | Number of issues detected |
| Duration | Execution time |
| Validators | Number of validators executed |
Version Management and Snapshot Comparison¶
The Version History page enables the creation and comparative analysis of validation result snapshots.
Version Creation¶
Validation results may be persisted as versioned snapshots for subsequent reference. This operation is initiated from the Validation Results page via the "Create Version" button. Each version captures the following attributes:
| Attribute | Description |
|---|---|
| Version Number | Auto-incremented sequential identifier |
| Timestamp | Creation time of the snapshot |
| Strategy | Versioning strategy (incremental, semantic, timestamp, gitlike) |
| Metadata | Validation configuration and execution context |
Versioning Strategies¶
| Strategy | Description | Example |
|---|---|---|
| Incremental | Simple sequential numbering | v1, v2, v3 |
| Semantic | Major.minor.patch version scheme | v1.0.0, v1.0.1, v1.1.0 |
| Timestamp | ISO 8601 timestamp-based identifiers | 2024-01-15T10:30:00 |
| Git-like | Short hash-based identifiers | a1b2c3d |
Version Comparison¶
The comparison feature permits side-by-side analysis of two validation versions:
- Two versions are selected from the timeline
- The comparison view highlights differences in:
- Overall pass/fail status
- Issue counts by severity
- Specific issues that appeared or disappeared
- Changes in validation configuration
Version Timeline¶
A visual timeline component renders all versions in chronological order, with version cards displaying:
- Version number and strategy badge
- Creation timestamp
- A concise summary of the validation state
User-Defined Validation Rules¶
The Rules page exposes a YAML-based interface for the definition of user-specified validation rules.
Rule Structure¶
Rules are expressed in YAML format and are organized into two sections:
columns:
email:
- not_null
- unique
- pattern: "^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$"
age:
- not_null
- min: 0
- max: 150
table:
- row_count_min: 100
- no_duplicate_rows
Rule Management¶
| Operation | Description |
|---|---|
| Create Rule | Define a new rule with name, description, and YAML definition |
| Edit Rule | Modify an existing rule's definition |
| Delete Rule | Remove a rule from the source |
| Activate Rule | Set a rule as the active validation rule for the source |
Cross-Module Integration Points¶
Schedule Integration¶
Validation configurations may be attached to Schedules for automated periodic execution. The schedule persists the complete validator configuration, thereby ensuring consistent execution across successive runs.
Notification Integration¶
Validation failures are capable of triggering Notifications through configured channels. Notification rules support filtering by severity, source, and validator type.
Alert Integration¶
Validation results are propagated into the Unified Alert system, where they may be correlated with drift, anomaly, and model monitoring alerts.
Recommended Operational Practices¶
| Practice | Recommendation |
|---|---|
| Start with Quick Check | The Quick Check preset is recommended for initial quality assessment |
| Customize Gradually | Additional validators should be progressively enabled as the data is better understood |
| Set Appropriate Severity | Default severity levels should be overridden to reflect business criticality |
| Version Important Results | Version snapshots should be created before and after major data changes |
| Monitor Trends | The History page should be utilized to identify quality regression patterns |
| Automate with Schedules | Scheduled validations are recommended for production data sources |
| Enable Parallel Execution | Parallel mode should be employed for large datasets with many validators |
API Reference¶
Validation Endpoints¶
| Endpoint | Method | Description |
|---|---|---|
/sources/{id}/validate |
POST | Run validation on a data source |
/validations |
GET | List validation results |
/validations/{id} |
GET | Get validation result details |
/validations/{id}/issues |
GET | Get issues for a validation |
History Endpoints¶
| Endpoint | Method | Description |
|---|---|---|
/history/sources/{id} |
GET | Get validation history for a source |
/history/sources/{id}/trends |
GET | Get trend data for a source |
Version Endpoints¶
| Endpoint | Method | Description |
|---|---|---|
/versioning/sources/{id} |
GET | List versions for a source |
/versioning/sources/{id} |
POST | Create a new version snapshot |
/versioning/sources/{id}/compare |
GET | Compare two versions |
/versioning/{version_id} |
GET | Get version details |
Rules Endpoints¶
| Endpoint | Method | Description |
|---|---|---|
/sources/{id}/rules |
GET | List rules for a source |
/sources/{id}/rules |
POST | Create a new rule |
/rules/{id} |
PUT | Update a rule |
/rules/{id} |
DELETE | Delete a rule |
/rules/{id}/activate |
POST | Set as active rule |
Truthound Core Engine Integration¶
The validation subsystem operates as a thin orchestration layer over the Truthound core engine (v1.3.0), which underwent a five-phase enhancement programme. The dashboard's backend adapter, converter, schemas, and frontend types were systematically extended in parallel to maintain full feature parity with each core engine phase. The following table summarises the integration scope:
| Dashboard Task | Core Phase | Integration Scope | Key Affected Files |
|---|---|---|---|
| Task 1 | PHASE 1 — Result Format | Parameter propagation through all layers | schemas/validation.py, truthound_adapter.py, services.py, validations.ts |
| Task 2 | PHASE 2 — Structured Results | Schema extension for ValidationDetail, ReportStatistics |
converters/truthound.py, truthound_adapter.py, schemas/validation.py, validations.ts |
| Task 3 | PHASE 3 — Metric Deduplication | No dashboard changes required (internal optimisation) | — |
| Task 4 | PHASE 4 — DAG Activation | Execution metadata (skipped validators, dependency graph) | truthound_adapter.py, schemas/validation.py, validations.ts |
| Task 5 | PHASE 5 — Exception Isolation | Exception transparency (capture, retry, circuit breaker) | converters/truthound.py, truthound_adapter.py, schemas/validation.py, validations.ts, services.py |
Backward Compatibility¶
All integration changes adhere to a strict backward compatibility protocol:
- Optional Fields: Every field introduced through the enhancement phases is declared as
Optionalwith aNonedefault value, ensuring that pre-existing database records and API consumers remain functional. - Defensive Access: The dashboard employs
getattr(issue, "field", default)patterns to safely access fields that may be absent in older Truthound versions. - Pydantic Extra Ignore:
model_config = ConfigDict(extra="ignore")is applied to all schema classes, ensuring that unknown fields from future Truthound versions are silently discarded rather than raising validation errors.
Data Flow Architecture¶
The validation data flow through the dashboard follows a six-stage pipeline:
Frontend (React/TypeScript)
└─ ValidationRunOptions → POST /validations/sources/{id}/validate
↓
Backend (FastAPI)
└─ api/validations.py → services.py → truthound_adapter.py
↓
└─ truthound_adapter.check() → th.check(**kwargs)
↓
└─ TruthoundResultConverter._convert_check_result() → CheckResult
↓
└─ CheckResult.to_dict() → validation.result_json (SQLite JSON column)
↓
└─ ValidationResponse.from_model() → JSON response → Frontend
The TruthoundResultConverter serves as the centralised translation boundary between the Truthound core engine's domain objects and the dashboard's Pydantic response models, ensuring that all five enhancement phases are uniformly represented in the API output.
Glossary¶
| Term | Definition |
|---|---|
| Validation Run | A single execution of one or more validators against a data source |
| Issue | A specific quality problem detected by a validator |
| Severity | Classification of an issue's impact (critical, high, medium, low) |
| Version Snapshot | A preserved copy of validation results for future reference |
| Pushdown | Optimization that executes validation logic directly on the database server |
| Validator Config | Per-validator parameter overrides applied during execution |
| Result Format | Progressive disclosure level controlling the granularity of validation output (PHASE 1) |
| ValidationDetail | Structured object containing quantitative validation metrics per issue (PHASE 2) |
| ReportStatistics | Aggregate statistical decomposition of a validation session (PHASE 2) |
| DAG Execution | Directed Acyclic Graph-based validator scheduling with dependency-driven conditional execution (PHASE 4) |
| Exception Isolation | Fault-tolerant execution mode that captures and classifies runtime errors without aborting the session (PHASE 5) |
| Circuit Breaker | Protective mechanism that suspends validator execution upon detecting repeated failures (PHASE 5) |