
Architecture Overview

This document provides a comprehensive overview of Truthound's internal architecture, design principles, and system structure.


Table of Contents

  1. Design Philosophy
  2. System Architecture
  3. Core Components
  4. Data Flow
  5. Validator Framework
  6. Execution Model
  7. Extension Points
  8. Phase Overview
  9. Performance Architecture
  10. Testing Architecture

1. Design Philosophy

Core Principles

Principle           Description
------------------  ------------------------------------------------------------------------
Zero Configuration  Immediate usability with sensible defaults; no boilerplate required
Performance First   Polars LazyFrame architecture for efficient memory usage and computation
Type Safety         Strong typing throughout with comprehensive runtime validation
Extensibility       Modular architecture supporting custom validators, sources, and reporters
Composability       Components designed for combination and reuse
Observability       Rich output formats and detailed diagnostics

Architectural Constraints

  1. Polars Native: All core operations implemented using Polars for consistent performance
  2. Lazy Evaluation: Deferred computation until results are required
  3. Immutability: Data structures are immutable where possible
  4. Protocol-Based: Components interact through well-defined protocols
  5. Fail-Fast: Validation errors raised immediately with clear context

2. System Architecture

High-Level Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                              User Interface                                   │
│  ┌──────────────────────────────────┐  ┌─────────────────────────────────┐  │
│  │         Python API               │  │             CLI                 │  │
│  │  th.check() th.scan() th.compare │  │  truthound check data.csv       │  │
│  └─────────────────┬────────────────┘  └───────────────┬─────────────────┘  │
└────────────────────┼───────────────────────────────────┼────────────────────┘
                     │                                   │
                     └─────────────────┬─────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                              Input Layer                                      │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                        Data Source Factory                           │    │
│  │  Polars │ Pandas │ Spark │ SQL │ BigQuery │ Snowflake │ Files      │    │
│  └────────────────────────────────┬────────────────────────────────────┘    │
└───────────────────────────────────┼─────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                              Core Engine                                      │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │   Schema    │  │  Validator  │  │    Drift    │  │     PII     │        │
│  │  Inference  │  │   Engine    │  │   Engine    │  │   Scanner   │        │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘        │
│         │                │                │                │                 │
│         └────────────────┼────────────────┼────────────────┘                 │
│                          ▼                ▼                                   │
│  ┌───────────────────────────────────────────────────────────────────┐      │
│  │                    Polars LazyFrame Processing                     │      │
│  └───────────────────────────────────────────────────────────────────┘      │
└────────────────────────────────────────────────────────────────────────┬────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                              Output Layer                                     │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                          Reporter Factory                            │    │
│  │     Console │ JSON │ HTML │ Markdown │ JUnit │ Stores (S3/GCS/DB)  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘

Module Structure

src/truthound/
├── __init__.py              # Public API exports
├── cli.py                   # CLI interface (Typer)
├── types.py                 # Type definitions
├── core/                    # Core validation logic
├── validators/              # 289 validator implementations across 28 categories
│   ├── base.py              # Validator base classes, ExpressionBatchExecutor, _validate_safe
│   ├── metrics.py           # VE-3: MetricKey, SharedMetricStore, CommonMetrics
│   ├── resilience_bridge.py # VE-5: ValidationResiliencePolicy (circuit breaker + retry)
│   ├── schema/              # Schema validators (10)
│   ├── completeness/        # Completeness validators (5)
│   ├── uniqueness/          # Uniqueness validators (6)
│   ├── distribution/        # Distribution validators (7)
│   ├── string/              # String validators (9)
│   ├── datetime/            # Datetime validators (6)
│   ├── aggregate/           # Aggregate validators (5)
│   ├── multi_column/        # Multi-column validators (5)
│   ├── query/               # Query validators (6)
│   ├── table/               # Table validators (6)
│   ├── geospatial/          # Geospatial validators (5)
│   ├── drift/               # Drift validators (5)
│   ├── anomaly/             # Anomaly validators (4)
│   ├── privacy/             # Privacy validators (5)
│   ├── business_rule/       # Business validators (3)
│   ├── localization/        # Localization validators (4)
│   ├── ml_feature/          # ML feature validators (5)
│   ├── profiling/           # Profiling validators (4)
│   ├── referential/         # Referential validators (5)
│   ├── timeseries/          # Time series validators (6)
│   ├── cross_table/         # Cross-table validators (2)
│   ├── streaming/           # Streaming validators (5)
│   ├── memory/              # Memory validators (4)
│   ├── optimization/        # Optimization validators (6)
│   ├── i18n/                # i18n validators (10)
│   ├── timeout/             # Timeout validators (4)
│   ├── security/            # Security validators (2)
│   └── sdk/                 # SDK validators (4)
├── datasources/             # Data source adapters
│   ├── base.py              # DataSource protocol
│   ├── polars_source.py     # Polars adapter
│   ├── pandas_source.py     # Pandas adapter
│   ├── spark_source.py      # Spark adapter
│   └── sql/                 # SQL adapters (SQLite, PostgreSQL, BigQuery, etc.)
├── execution/               # Execution engines
├── profiler/                # Auto-profiling system
├── checkpoint/              # Checkpoint & CI/CD system
├── stores/                  # Result storage backends
├── reporters/               # Output formatters
├── datadocs/                # HTML report generation
├── plugins/                 # Plugin architecture
├── ml/                      # ML module (anomaly, drift, rule learning)
├── lineage/                 # Data lineage tracking
└── realtime/                # Streaming validation

3. Core Components

3.1 DataSource

The DataSource abstraction provides a unified interface for accessing data from various backends.

from typing import Protocol

import polars as pl

from truthound.datasources import BaseDataSource

class DataSource(Protocol):
    """Protocol for data source implementations."""

    @property
    def schema(self) -> dict[str, ColumnType]:
        """Return column name to type mapping."""
        ...

    def to_lazyframe(self) -> pl.LazyFrame:
        """Convert to Polars LazyFrame for processing."""
        ...

    def get_execution_engine(self) -> ExecutionEngine:
        """Return execution engine for this source."""
        ...

    def needs_sampling(self) -> bool:
        """Check if data exceeds size limits."""
        ...

    def sample(self, n: int) -> DataSource:
        """Return sampled data source."""
        ...

3.2 ExecutionEngine

The ExecutionEngine handles actual validation operations with backend-specific optimizations.

from typing import Protocol

from truthound.execution import ExecutionEngine

class ExecutionEngine(Protocol):
    """Protocol for execution engine implementations."""

    def count_rows(self) -> int:
        """Return total row count."""
        ...

    def count_nulls(self, column: str) -> int:
        """Return null count for column."""
        ...

    def count_distinct(self, column: str) -> int:
        """Return distinct value count."""
        ...

    def get_stats(self, column: str) -> dict:
        """Return statistical summary for column."""
        ...

    def count_matching(self, condition: str) -> int:
        """Return count matching condition."""
        ...

3.3 Validator

All validators inherit from the Validator base class and implement the validation protocol.

from abc import ABC, abstractmethod

import polars as pl

from truthound.validators.base import Validator

class Validator(ABC):
    """Base class for all validators."""

    name: str                    # Unique validator name
    category: str                # Validator category
    description: str             # Human-readable description

    @abstractmethod
    def validate(self, lf: pl.LazyFrame) -> list[ValidationIssue]:
        """Execute validation and return issues."""
        ...

    @abstractmethod
    def get_config(self) -> dict:
        """Return validator configuration."""
        ...

3.4 ValidationIssue

@dataclass
class ValidationIssue:
    """Represents a single data quality issue found during validation."""

    # Core fields (always populated)
    column: str
    issue_type: str
    count: int
    severity: Severity

    # Legacy detail fields (backward compatible)
    details: str | None = None
    expected: Any | None = None
    actual: Any | None = None
    sample_values: list[Any] | None = None

    # VE-2: Structured validation result
    result: ValidationDetail | None = None     # GX-style structured detail
    validator_name: str | None = None
    success: bool = False

    # VE-5: Exception context
    exception_info: ExceptionInfo | None = None

3.5 ValidationDetail (VE-2)

Structured detail for a single validation result, modeled after the result field of ExpectationValidationResult in Great Expectations (GX).

@dataclass
class ValidationDetail:
    """Structured detail with 4-phase progressive enrichment."""

    # BOOLEAN_ONLY (always populated)
    element_count: int = 0
    missing_count: int = 0

    # BASIC and above
    observed_value: Any = None
    unexpected_count: int = 0
    unexpected_percent: float = 0.0
    partial_unexpected_list: list[Any] | None = None

    # SUMMARY and above
    partial_unexpected_counts: list[dict[str, Any]] | None = None
    partial_unexpected_index_list: list[int] | None = None

    # COMPLETE only
    unexpected_list: list[Any] | None = None
    unexpected_rows: pl.DataFrame | None = None
    debug_query: str | None = None
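
The level comments in ValidationDetail suggest that each result format adds fields on top of the previous one. A minimal sketch of that progressive enrichment, using plain dictionaries instead of the real ValidationDetail class. The ResultFormat member values, the build_detail helper, and the partial_limit parameter are assumptions made for illustration and are not taken from Truthound's API:

```python
from collections import Counter
from enum import IntEnum
from typing import Any


class ResultFormat(IntEnum):
    """Hypothetical ordering of the four detail levels."""
    BOOLEAN_ONLY = 0
    BASIC = 1
    SUMMARY = 2
    COMPLETE = 3


def build_detail(
    values: list[Any],
    unexpected: list[Any],
    level: ResultFormat,
    partial_limit: int = 3,  # assumed cap on "partial" lists
) -> dict[str, Any]:
    """Populate only the fields that the requested level calls for."""
    # BOOLEAN_ONLY: counts that are always available
    detail: dict[str, Any] = {
        "element_count": len(values),
        "missing_count": sum(v is None for v in values),
    }
    if level >= ResultFormat.BASIC:
        detail["unexpected_count"] = len(unexpected)
        detail["unexpected_percent"] = (
            100.0 * len(unexpected) / len(values) if values else 0.0
        )
        detail["partial_unexpected_list"] = unexpected[:partial_limit]
    if level >= ResultFormat.SUMMARY:
        counts = Counter(unexpected).most_common(partial_limit)
        detail["partial_unexpected_counts"] = [
            {"value": value, "count": count} for value, count in counts
        ]
    if level >= ResultFormat.COMPLETE:
        detail["unexpected_list"] = list(unexpected)
    return detail
```

Under these assumptions, a BASIC request carries the counts and a truncated sample, while the full unexpected_list is produced only at COMPLETE, which keeps the cheaper levels from materializing every failing value.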

3.6 ExceptionInfo (VE-5)

Rich exception context with automatic classification.

@dataclass
class ExceptionInfo:
    """Detailed exception information for validation failures."""

    raised_exception: bool = False
    exception_type: str | None = None
    exception_message: str | None = None
    exception_traceback: str | None = None
    retry_count: int = 0
    max_retries: int = 0
    is_retryable: bool = False
    validator_name: str | None = None
    failure_category: str = "unknown"  # transient | permanent | configuration | data
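
The failure_category field names four categories but the mapping from exception types to categories is not shown here. The sketch below is one plausible mapping, written as an assumption for illustration: the classify_exception helper and the specific exception types chosen for each category are hypothetical, and Truthound's actual classification rules may differ.

```python
def classify_exception(exc: BaseException) -> tuple[str, bool]:
    """Return (failure_category, is_retryable) for an exception.

    The type-to-category mapping is an illustrative assumption.
    """
    # Transient: the same call may succeed if repeated later
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return "transient", True
    # Configuration: the validator was set up with bad arguments
    if isinstance(exc, (TypeError, AttributeError)):
        return "configuration", False
    # Data: the input itself is the problem (missing column, bad value)
    if isinstance(exc, (KeyError, ValueError)):
        return "data", False
    # Anything else is treated as permanent and not retried
    return "permanent", False
```

Only the transient category is marked retryable in this sketch, which matches the intent of pairing is_retryable with retry_count and max_retries in ExceptionInfo.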

3.7 Reporter

Reporters transform validation results into various output formats.

from truthound.reporters.base import ValidationReporter

class ValidationReporter(Protocol[C]):
    """Protocol for reporter implementations."""

    name: str                    # Reporter name
    file_extension: str          # Output file extension

    def render(self, data: ValidationResult) -> str:
        """Render result to string format."""
        ...

    def save(self, data: ValidationResult, path: Path) -> None:
        """Save result to file."""
        ...

3.8 Store

Stores persist validation results and expectations.

from truthound.stores import get_store, ValidationResult

# Create store instance
store = get_store("filesystem", base_path=".truthound/results")

# Save result
result = ValidationResult.from_report(report, "customers.csv")
run_id = store.save(result)

# Retrieve result
result = store.get(run_id)

Available Backends:

Backend     Package               Description
----------  --------------------  -----------------------------
filesystem  (built-in)            Local filesystem storage
memory      (built-in)            In-memory storage for testing
s3          boto3                 AWS S3 storage
gcs         google-cloud-storage  Google Cloud Storage
database    sqlalchemy            SQL database storage

4. Data Flow

Validation Flow

Input                    Processing                     Output
─────                    ──────────                     ──────

Data Source         ┌─────────────────┐
(CSV, Parquet,  ───►│  Input Adapter  │
 DataFrame, SQL)    └────────┬────────┘
                    ┌─────────────────┐
                    │  LazyFrame      │  (Polars lazy evaluation)
                    └────────┬────────┘
                    ┌────────┼────────┐
                    ▼        ▼        ▼
              ┌──────────┬──────────┬──────────┐
              │ Schema   │ Pattern  │ Statist- │
              │ Valid-   │ Valid-   │ ical     │  (Parallel execution)
              │ ators    │ ators    │ Valid.   │
              └────┬─────┴────┬─────┴────┬─────┘
                   │          │          │
                   └──────────┼──────────┘
                    ┌─────────────────┐
                    │ Issue Collector │
                    └────────┬────────┘
                    ┌─────────────────┐            ┌──────────┐
                    │ ValidationResult│───────────►│ Reporter │───► Output
                    └─────────────────┘            └──────────┘

Drift Detection Flow

Baseline Data           Current Data
     │                       │
     ▼                       ▼
┌─────────────┐        ┌─────────────┐
│  LazyFrame  │        │  LazyFrame  │
└──────┬──────┘        └──────┬──────┘
       │                      │
       └──────────┬───────────┘
         ┌─────────────────┐
         │ Column Sampling │  (Optional)
         └────────┬────────┘
         ┌────────┼────────┐
         ▼        ▼        ▼
   ┌──────────┬──────────┬──────────┐
   │  KS Test │ PSI      │ Chi-Sq   │  (Method selection)
   └────┬─────┴────┬─────┴────┬─────┘
        │          │          │
        └──────────┼──────────┘
          ┌─────────────────┐
          │  Drift Report   │
          └─────────────────┘

5. Validator Framework

Validator Categories

Validators are organized into 28 categories based on their validation focus:

Category       Count  Focus
-------------  -----  --------------------------------------
Schema         15     Column structure, types, relationships
Completeness   12     Null detection, required fields
Uniqueness     17     Duplicates, primary keys
Distribution   15     Range, outliers, statistics
String         19     Patterns, formats, encoding
Datetime       10     Format, range, sequence
Aggregate      8      Statistical constraints
Cross-table    5      Multi-table relationships
Multi-column   21     Column comparisons
Query          20     Expression-based validation
Table          18     Metadata, freshness
Geospatial     13     Coordinates, boundaries
Drift          14     Distribution changes
Anomaly        18     Outlier detection
Privacy        16     PII detection, GDPR/CCPA
Business Rule  8      Business rules (Luhn, IBAN)
Localization   9      Regional formats
ML Feature     5      Feature quality
Profiling      7      Data characteristics
Referential    14     Foreign key integrity
Time Series    14     Temporal patterns
Streaming      12     Stream validation
Memory         8      Memory-efficient validation
Optimization   15     DAG execution, profiling
SDK            80     Custom validator development
Security       3      ReDoS protection, SQL injection
i18n           3      Internationalized error messages
Timeout        -      Distributed timeout handling

Validator Registration

Validators are automatically registered using decorators:

from truthound.validators.base import register_validator

@register_validator("null_check")
class NullCheckValidator(Validator):
    """Check for null values in columns."""

    name = "null_check"
    category = "completeness"

    def validate(self, lf: pl.LazyFrame) -> list[ValidationIssue]:
        # Implementation
        ...

Validator Discovery

from truthound.validators import get_validator, list_validators

# Get specific validator
validator = get_validator("null_check")

# List all validators
all_validators = list_validators()

# List by category
completeness_validators = list_validators(category="completeness")
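
To make the registration and discovery calls concrete, here is a minimal sketch of a decorator-based registry. It reuses the names register_validator, get_validator, and list_validators from the examples above, but the bodies, the _REGISTRY dictionary, the duplicate-name check, and the choice to return classes and sorted names are assumptions for illustration, not Truthound's implementation:

```python
from typing import Callable, Optional

# Hypothetical module-level registry: validator name -> validator class
_REGISTRY: dict[str, type] = {}


def register_validator(name: str) -> Callable[[type], type]:
    """Class decorator that records the class under the given name."""
    def decorator(cls: type) -> type:
        if name in _REGISTRY:
            raise ValueError(f"validator {name!r} is already registered")
        _REGISTRY[name] = cls
        return cls  # the class itself is returned unchanged
    return decorator


def get_validator(name: str) -> type:
    """Look up a registered validator class by name."""
    return _REGISTRY[name]


def list_validators(category: Optional[str] = None) -> list[str]:
    """Return registered names, optionally filtered by class category."""
    return sorted(
        name
        for name, cls in _REGISTRY.items()
        if category is None or getattr(cls, "category", None) == category
    )


@register_validator("null_check")
class NullCheckValidator:
    category = "completeness"


@register_validator("range_check")
class RangeCheckValidator:
    category = "distribution"
```

Because the decorator runs at class-definition time, importing a module that defines validators is enough to make them discoverable; no separate registration call is needed.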

6. Execution Model

Lazy Evaluation

Truthound leverages Polars' lazy evaluation for efficient processing:

  1. Plan Construction: Validation operations build a query plan
  2. Optimization: Polars optimizes the plan (predicate pushdown, projection)
  3. Execution: Plan executed only when results are collected

import polars as pl

# scan_csv() only builds a query plan; no data is read yet
lf = pl.scan_csv("large_file.csv")

# validate() adds its expressions to the plan and collects internally,
# so the optimized plan executes during this call
issues = validator.validate(lf)

Parallel Execution

Multiple validators can execute concurrently:

from concurrent.futures import ThreadPoolExecutor

def run_validators(lf: pl.LazyFrame, validators: list[Validator]) -> list[ValidationIssue]:
    issues = []
    with ThreadPoolExecutor() as executor:
        futures = [executor.submit(v.validate, lf) for v in validators]
        for future in futures:
            issues.extend(future.result())
    return issues

Metric Deduplication (VE-3)

The SharedMetricStore eliminates redundant computations across validators. When multiple validators require the same metric (e.g., null_count on column email), the metric is computed once and shared.

┌────────────────────────────────────────────────────────────────────────┐
│                     Metric Deduplication Pipeline                       │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  NullValidator      NotNullValidator     CompletenessRatioValidator    │
│       │                    │                        │                   │
│       ▼                    ▼                        ▼                   │
│  get_required_metrics() → MetricKey(null_count, email)                 │
│  get_required_metrics() → MetricKey(null_count, email)  ← DUPLICATE   │
│  get_required_metrics() → MetricKey(row_count, None)                   │
│                                                                         │
│  ┌─────────────────────────────────────────────────────────┐           │
│  │  SharedMetricStore._precompute_shared_metrics()          │           │
│  │  Deduplicates → 2 unique MetricKeys                      │           │
│  │  Single lf.select([null_count_expr, row_count_expr])     │           │
│  │        .collect()                                         │           │
│  └────────────────────────────┬────────────────────────────┘           │
│                                │                                        │
│                     store.get(key) → cached value                      │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

Key components (validators/metrics.py):

Component          Description
-----------------  ---------------------------------------------------------------------------
MetricKey          Frozen dataclass: (metric_name, column, kwargs_hash)
SharedMetricStore  RLock-based thread-safe cache with get_or_compute()
CommonMetrics      11 standard metrics (row_count, null_count, n_unique, mean, std, min, max,
                   sum, quantile, median, non_null_count)
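
The deduplication idea can be sketched with a frozen dataclass as the cache key and a lock-guarded dictionary as the store. The class and method names follow the component list above, but the field defaults, the compute_calls counter, and the decision to hold the lock while computing are assumptions for illustration rather than Truthound's actual code:

```python
import threading
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class MetricKey:
    """Hashable cache key; frozen so it can be used in a dict."""
    metric_name: str
    column: Optional[str] = None
    kwargs_hash: int = 0


class SharedMetricStore:
    """Thread-safe cache that computes each metric at most once."""

    def __init__(self) -> None:
        self._lock = threading.RLock()
        self._values: dict[MetricKey, Any] = {}
        self.compute_calls = 0  # illustrative counter, not a real attribute

    def get_or_compute(self, key: MetricKey, compute: Callable[[], Any]) -> Any:
        # Holding the lock during compute() serializes callers, which
        # guarantees a single computation per key in this simple sketch.
        with self._lock:
            if key not in self._values:
                self._values[key] = compute()
                self.compute_calls += 1
            return self._values[key]


# Two validators asking for the same metric share one computation
emails = [None, "a@example.com", None]
store = SharedMetricStore()
key = MetricKey("null_count", "email")
first = store.get_or_compute(key, lambda: sum(v is None for v in emails))
second = store.get_or_compute(key, lambda: sum(v is None for v in emails))
```

Two MetricKey instances built from the same arguments compare equal and hash identically, so validators do not need to share key objects to hit the same cache entry.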

Conditional Execution via Dependency DAG (VE-4)

Validators declare dependencies and skip conditions. The DAG executor evaluates these at runtime:

class MyValidator(Validator):
    dependencies = {"schema_check"}  # Must run after schema_check

    def get_skip_conditions(self) -> list[SkipCondition]:
        return [
            SkipCondition(depends_on="schema_check", skip_when="failed"),
            SkipCondition(depends_on="null_check", skip_when="critical"),
        ]

Priority hierarchy: Schema (10–30) → Completeness (50) → Uniqueness (60) → Distribution (70–80) → Referential (90).
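
A rough sketch of dependency-ordered execution with skip conditions follows. Truthound is described as grouping validators into priority-based levels; this sketch substitutes Python's standard graphlib.TopologicalSorter for that mechanism, and the run_with_skips function, the tuple form of a skip rule, and the status strings are all hypothetical:

```python
from graphlib import TopologicalSorter
from typing import Callable

# A skip rule is (depends_on, skip_when): skip if that validator had that status
SkipRule = tuple[str, str]


def run_with_skips(
    dependencies: dict[str, set[str]],
    skip_rules: dict[str, list[SkipRule]],
    run: Callable[[str], str],
) -> dict[str, str]:
    """Run validators in dependency order, skipping when a rule matches.

    dependencies maps each validator name to the names that must run first.
    run(name) executes one validator and returns its status string.
    """
    statuses: dict[str, str] = {}
    for name in TopologicalSorter(dependencies).static_order():
        rules = skip_rules.get(name, [])
        if any(statuses.get(dep) == when for dep, when in rules):
            statuses[name] = "skipped"  # run() is never called for this one
        else:
            statuses[name] = run(name)
    return statuses
```

For example, if schema_check returns "failed" and null_check carries the rule ("schema_check", "failed"), null_check is recorded as "skipped" while validators without a matching rule still run.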

Exception Isolation with 3-Tier Fallback (VE-5)

The ExpressionBatchExecutor implements progressive fallback to maximize partial result collection:

Tier 1: Batch all validators → single collect()
        │ failure
Tier 2: Per-validator execution → individual collect() per validator
        │ failure
Tier 3: Per-expression execution → individual collect() per expression
        (partial_failure_mode: collect | skip | raise)
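
The three tiers can be simulated in plain Python to show the control flow. In this sketch each zero-argument callable stands in for a Polars expression and the collect helper stands in for a single collect() call that fails as a whole if any expression fails. The function names, the result shapes, and the behaviour shown for Tier 3 (the "collect" partial-failure mode) are assumptions for illustration:

```python
from typing import Callable

Expr = Callable[[], int]  # stand-in for a Polars expression


def collect(exprs: list[Expr]) -> list[int]:
    """Stand-in for one collect(): any failing expression aborts the call."""
    return [expr() for expr in exprs]


def run_with_fallback(
    validators: dict[str, list[Expr]],
) -> tuple[dict[str, list[int]], list[tuple[str, int, str]]]:
    """Return (results per validator, errors as (validator, index, type))."""
    errors: list[tuple[str, int, str]] = []

    # Tier 1: every expression of every validator in a single batch
    try:
        flat = iter(collect([e for exprs in validators.values() for e in exprs]))
        return {n: [next(flat) for _ in exprs] for n, exprs in validators.items()}, errors
    except Exception:
        pass  # fall through to per-validator execution

    results: dict[str, list[int]] = {}
    for name, exprs in validators.items():
        try:
            # Tier 2: one batch per validator
            results[name] = collect(exprs)
        except Exception:
            # Tier 3: one expression at a time, keeping what succeeds
            partial: list[int] = []
            for index, expr in enumerate(exprs):
                try:
                    partial.append(expr())
                except Exception as exc:
                    errors.append((name, index, type(exc).__name__))
            results[name] = partial
    return results, errors
```

The cost of a failure grows with each tier because more separate collections are issued, so the fallback only pays that price for the validators that actually fail.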

_validate_safe() wraps each execution with exponential backoff retry (0.1s → 0.2s → 0.4s, capped at 5s). The ExceptionInfo dataclass classifies exceptions into four categories: transient, permanent, configuration, data.
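
The retry schedule quoted above (0.1s, 0.2s, 0.4s, capped at 5s) corresponds to doubling a base delay on each attempt. A minimal sketch of such a retry wrapper follows; backoff_delays and call_with_retry are hypothetical helpers, and the default retryable exception types and the injectable sleep function are assumptions, not the signature of _validate_safe():

```python
import time
from typing import Any, Callable


def backoff_delays(
    max_retries: int, base: float = 0.1, factor: float = 2.0, cap: float = 5.0
) -> list[float]:
    """Delay before each retry: base * factor**attempt, never above cap."""
    return [min(base * factor**attempt, cap) for attempt in range(max_retries)]


def call_with_retry(
    fn: Callable[[], Any],
    max_retries: int = 3,
    is_retryable: Callable[[Exception], bool] = lambda exc: isinstance(
        exc, (TimeoutError, ConnectionError)
    ),
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fn, retrying retryable failures with exponential backoff."""
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on a non-retryable error
            if attempt == max_retries or not is_retryable(exc):
                raise
            sleep(delays[attempt])
```

Passing a recording function as sleep makes the schedule easy to check in tests without actually waiting.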

Memory Management

from truthound.datasources.base import DataSourceConfig

config = DataSourceConfig(
    max_rows=10_000_000,      # Maximum rows before sampling
    max_memory_mb=4096,        # Memory threshold
    sample_size=100_000,       # Default sample size
    sample_seed=42,            # Reproducible sampling
)
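
How these thresholds might drive the sampling decision can be sketched as follows. SamplingConfig is a local stand-in that mirrors the four fields above; the rule that exceeding either limit triggers sampling, and the helper functions themselves, are assumptions for illustration rather than the behaviour of DataSourceConfig or needs_sampling():

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    """Hypothetical stand-in mirroring the DataSourceConfig fields above."""
    max_rows: int = 10_000_000
    max_memory_mb: int = 4096
    sample_size: int = 100_000
    sample_seed: int = 42


def needs_sampling(row_count: int, estimated_mb: float, config: SamplingConfig) -> bool:
    """Assumed rule: sample when either the row or memory limit is exceeded."""
    return row_count > config.max_rows or estimated_mb > config.max_memory_mb


def sample_row_indices(row_count: int, config: SamplingConfig) -> list[int]:
    """Pick row indices reproducibly by seeding a private random generator."""
    n = min(config.sample_size, row_count)
    rng = random.Random(config.sample_seed)
    return sorted(rng.sample(range(row_count), n))
```

Seeding a private random.Random instance rather than the global generator keeps the sample reproducible across runs without affecting other code that uses the random module.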

7. Extension Points

7.1 Custom Validators

from truthound.validators.base import Validator, register_validator

@register_validator("custom_check")
class CustomValidator(Validator):
    name = "custom_check"
    category = "custom"

    def __init__(self, column: str, threshold: float):
        self.column = column
        self.threshold = threshold

    def validate(self, lf: pl.LazyFrame) -> list[ValidationIssue]:
        # Custom validation logic
        ...

7.2 Custom Data Sources

from truthound.datasources import BaseDataSource, register_source

@register_source("custom")
class CustomDataSource(BaseDataSource):
    source_type = "custom"

    def to_lazyframe(self) -> pl.LazyFrame:
        # Convert custom format to LazyFrame
        ...

7.3 Custom Reporters

from truthound.reporters import ValidationReporter, register_reporter

@register_reporter("xml")
class XMLReporter(ValidationReporter):
    name = "xml"
    file_extension = ".xml"

    def render(self, data: ValidationResult) -> str:
        # Render to XML format
        ...

7.4 Plugin System

The plugin architecture enables external extensions:

from truthound.plugins import ValidatorPlugin, register_plugin

@register_plugin
class MyValidatorPlugin(ValidatorPlugin):
    def get_validators(self) -> list[type]:
        return [MyCustomValidator1, MyCustomValidator2]

See Plugin Architecture for comprehensive plugin documentation.


8. Phase Overview

Truthound's development follows a phased approach:

Phase     Status    Description
--------  --------  ----------------------------------------------------------------------
Phase 1   Complete  Core validation engine with LazyFrame architecture
Phase 2   Complete  Advanced validators (aggregate, cross-table, drift, anomaly, privacy)
Phase 3   Complete  Extensibility (referential, time series, business, localization, ML)
Phase 4   Complete  Storage backends and reporters infrastructure
Phase 5   Complete  Multi-data source support (BigQuery, Snowflake, Databricks, etc.)
Phase 6   Complete  Checkpoint orchestration and CI/CD integration
Phase 7   Complete  Auto-profiling and rule generation
Phase 8   Complete  Data Docs (HTML report generation)
Phase 9   Complete  Plugin architecture
Phase 10  Complete  Advanced features (ML, Lineage, Realtime)
VE 1-5    Complete  Validation Engine Enhancement (GX-inspired architecture)

Validation Engine Enhancement (VE) Phases

Phase  Feature                      Key Components                                                        Tests
-----  ---------------------------  --------------------------------------------------------------------  -----
VE-1   Result Format System         ResultFormat enum (4 levels), ResultFormatConfig, 4-phase enrichment  84
VE-2   Structured Results           ValidationDetail, ValidationIssue.result/validator_name/success       57
VE-3   Metric Deduplication         MetricKey, SharedMetricStore, CommonMetrics (11 metrics)              58
VE-4   Dependency DAG Activation    SkipCondition, should_skip(), priority-based level grouping           64
VE-5   Exception Isolation & Retry  ExceptionInfo, 3-tier fallback, ValidationResiliencePolicy            53

Feature Distribution

┌────────────────────────────────────────────────────────────────────────────┐
│                         Truthound Feature Map                               │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Phase 1-3: Core Engine                                                     │
│  ├── 289 Validators across 28 categories                                   │
│  ├── Schema inference and learning                                          │
│  ├── Pattern detection (email, phone, credit card, etc.)                   │
│  └── Statistical validation (range, distribution, outliers)                │
│                                                                             │
│  Phase 4: Infrastructure                                                    │
│  ├── Storage backends (Filesystem, S3, GCS, Database)                      │
│  └── Reporters (Console, JSON, HTML, Markdown, JUnit)                      │
│                                                                             │
│  Phase 5: Multi-Source                                                      │
│  ├── DataFrame (Polars, Pandas, Spark)                                     │
│  ├── SQL (PostgreSQL, MySQL, SQLite)                                       │
│  └── Cloud DW (BigQuery, Snowflake, Redshift, Databricks)                 │
│                                                                             │
│  Phase 6: CI/CD                                                             │
│  ├── Checkpoint orchestration                                               │
│  ├── 12 CI platform support                                                 │
│  ├── Async execution                                                        │
│  └── Transaction management (Saga pattern)                                  │
│                                                                             │
│  Phase 7: Auto-Profiling                                                    │
│  ├── Statistical profiling                                                  │
│  ├── Pattern detection                                                      │
│  └── Rule generation                                                        │
│                                                                             │
│  Phase 8: Data Docs                                                         │
│  ├── HTML report generation                                                 │
│  ├── 5 themes, 4 chart libraries                                           │
│  └── Interactive dashboard (optional)                                       │
│                                                                             │
│  Phase 9: Plugin Architecture                                               │
│  ├── Validator plugins                                                      │
│  ├── Reporter plugins                                                       │
│  ├── DataSource plugins                                                     │
│  └── Hook system                                                            │
│                                                                             │
│  Phase 10: Advanced                                                         │
│  ├── ML Module (anomaly detection, drift, rule learning)                   │
│  ├── Lineage Module (graph, tracking, impact analysis)                     │
│  └── Realtime Module (streaming, incremental, checkpointing)              │
│                                                                             │
│  VE 1-5: Validation Engine Enhancement (GX-Inspired)                       │
│  ├── VE-1: Result Format System (4-level detail control)                   │
│  ├── VE-2: Structured Validation Results (ValidationDetail)                │
│  ├── VE-3: Metric Deduplication (SharedMetricStore, CommonMetrics)         │
│  ├── VE-4: Dependency DAG Activation (SkipCondition, conditional exec)    │
│  └── VE-5: Exception Isolation & Auto Retry (3-tier fallback)             │
│                                                                             │
└────────────────────────────────────────────────────────────────────────────┘

9. Performance Architecture

Expression-Based Validator Architecture

Truthound implements an expression-based architecture that allows multiple validators to execute in a single collect() call.

┌─────────────────────────────────────────────────────────────────────────────┐
│                   Expression-Based Batch Execution                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│    Validator 1          Validator 2          Validator 3                     │
│  (NullValidator)    (RangeValidator)    (CompletenessRatio)                 │
│         │                  │                    │                            │
│         ▼                  ▼                    ▼                            │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐                      │
│  │ get_valida- │    │ get_valida- │    │ get_valida- │                      │
│  │ tion_exprs  │    │ tion_exprs  │    │ tion_exprs  │                      │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘                      │
│         │                  │                  │                              │
│         └──────────────────┼──────────────────┘                              │
│                            ▼                                                 │
│                 ┌─────────────────────┐                                      │
│                 │  Expression Batch   │                                      │
│                 │     Executor        │                                      │
│                 └──────────┬──────────┘                                      │
│                            │                                                 │
│                            ▼                                                 │
│                 ┌─────────────────────┐                                      │
│                 │  lf.select([...])   │  ◄─── Single collect() call         │
│                 │     .collect()      │                                      │
│                 └──────────┬──────────┘                                      │
│                            │                                                 │
│                            ▼                                                 │
│                 ┌─────────────────────┐                                      │
│                 │   ValidationIssue   │                                      │
│                 │       Results       │                                      │
│                 └─────────────────────┘                                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Key Components (src/truthound/validators/base.py):

Component                 Description
------------------------  ------------------------------------------------------------------------------
ValidationExpressionSpec  Defines validation expression (count_expr, non_null_expr, severity thresholds,
                          filter_expr for sample collection)
ExpressionValidatorMixin  Mixin for single-validator expression-based execution
ExpressionBatchExecutor   Batches multiple validators into single collect() with 3-tier fallback (VE-5)
SharedMetricStore         Session-scoped thread-safe metric cache for deduplication (VE-3)
SkipCondition             Conditional execution based on prior validator results (VE-4)
ExceptionInfo             Rich exception context with 4-category classification (VE-5)

Lazy Loading Architecture

The validator registry uses lazy loading to minimize startup time.

┌─────────────────────────────────────────────────────────────────────────────┐
│                      Lazy Loading Validator Registry                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│    Application Start                                                         │
│          │                                                                   │
│          ▼                                                                   │
│    ┌─────────────────┐                                                       │
│    │ VALIDATOR_      │ ◄─── 200+ validators mapped to module paths          │
│    │ IMPORT_MAP      │      (not loaded yet)                                │
│    └────────┬────────┘                                                       │
│             │                                                                │
│             ▼                                                                │
│    ┌─────────────────┐                                                       │
│    │ get_validator() │ ◄─── User requests specific validator                │
│    └────────┬────────┘                                                       │
│             │                                                                │
│             ▼                                                                │
│    ┌─────────────────┐                                                       │
│    │ LazyValidator-  │ ◄─── On-demand import                                │
│    │ Loader          │                                                       │
│    └────────┬────────┘                                                       │
│             │                                                                │
│             ▼                                                                │
│    ┌─────────────────┐                                                       │
│    │ ValidatorImport │ ◄─── Metrics tracking (success/failure/timing)       │
│    │ Metrics         │                                                       │
│    └─────────────────┘                                                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Implementation (src/truthound/validators/_lazy.py):

- VALIDATOR_IMPORT_MAP: 200+ validators mapped to their module paths
- CATEGORY_MODULES: 28 category modules for bulk loading
- ValidatorImportMetrics: Tracks import success/failure counts and timing
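
The flow in the diagram above (name lookup, on-demand import, metrics tracking) can be sketched with `importlib`. Everything named here is hypothetical: `IMPORT_MAP_SKETCH`, `LazyLoaderSketch`, and `ImportMetricsSketch` are illustrative stand-ins, and standard-library classes play the role of validators so that the example runs without Truthound installed.

```python
import importlib
import time
from dataclasses import dataclass, field

# Stand-in map: name -> (module path, attribute). Nothing is imported yet.
IMPORT_MAP_SKETCH: dict[str, tuple[str, str]] = {
    "decoder": ("json", "JSONDecoder"),
    "fraction": ("fractions", "Fraction"),
    "broken": ("module_that_does_not_exist_xyz", "Missing"),
}


@dataclass
class ImportMetricsSketch:
    successes: int = 0
    failures: int = 0
    seconds: dict[str, float] = field(default_factory=dict)


class LazyLoaderSketch:
    def __init__(self, import_map: dict[str, tuple[str, str]]) -> None:
        self._import_map = import_map
        self._loaded: dict[str, type] = {}
        self.metrics = ImportMetricsSketch()

    def get(self, name: str) -> type:
        if name in self._loaded:  # already imported: no further work
            return self._loaded[name]
        module_path, attr = self._import_map[name]  # KeyError for unknown names
        start = time.perf_counter()
        try:
            cls = getattr(importlib.import_module(module_path), attr)
        except (ImportError, AttributeError):
            self.metrics.failures += 1
            raise
        finally:
            # Record timing for both successful and failed imports.
            self.metrics.seconds[name] = time.perf_counter() - start
        self.metrics.successes += 1
        self._loaded[name] = cls
        return cls


loader = LazyLoaderSketch(IMPORT_MAP_SKETCH)
decoder_cls = loader.get("decoder")    # triggers the import
decoder_again = loader.get("decoder")  # served from the loaded cache
```

Because the map holds only strings, building it costs nothing at startup; the import cost is paid once per validator, on first use.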

Native Polars Optimizations

All data operations use native Polars expressions without Python callbacks.

| Operation | Pattern | File |
|-----------|---------|------|
| Masking (redact) | pl.when/then/otherwise, str.replace_all() | maskers.py |
| Masking (hash) | pl.col().hash().cast(pl.String) | maskers.py |
| Statistics | Single select() with all aggregations | schema.py |
| Validation | count_expr, non_null_expr expressions | validators/base.py |

Cache Optimization

Cache fingerprinting uses xxhash when it is installed, for roughly 10x faster hashing, and falls back to SHA-256 otherwise.

# Implementation in cache.py
def _fast_hash(content: str) -> str:
    if _HAS_XXHASH:
        return xxhash.xxh64(content.encode()).hexdigest()[:16]
    return hashlib.sha256(content.encode()).hexdigest()[:16]
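
The snippet above relies on a module-level `_HAS_XXHASH` flag. One common way to set such a flag is a guarded import, sketched below. This is an assumption about how the flag could be defined, not a copy of cache.py, and the function name `fingerprint` and the sample key string are made up for the example.

```python
import hashlib

try:
    import xxhash  # optional third-party package

    _HAS_XXHASH = True
except ImportError:
    _HAS_XXHASH = False


def fingerprint(content: str) -> str:
    """Return a 16-character hex fingerprint of the content."""
    if _HAS_XXHASH:
        # xxh64 produces a 64-bit digest, i.e. 16 hex characters.
        return xxhash.xxh64(content.encode()).hexdigest()[:16]
    # Fallback: truncate the SHA-256 hex digest to the same length.
    return hashlib.sha256(content.encode()).hexdigest()[:16]


key = fingerprint("orders.csv:1024:2024-01-01")
```

The two backends produce different fingerprints for the same input, so cache entries written with one backend will not match lookups made with the other.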

Report Optimization

Validation reports keep issues in a heap ordered by severity, so the most severe issue is always at the root and can be read in O(1) time.

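# Implementation in report.py
import heapq

_SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3, "info": 4}

def add_issue(self, issue: ValidationIssue) -> None:
    # The counter breaks ties, so issues of equal severity keep insertion
    # order and ValidationIssue objects are never compared directly.
    heapq.heappush(
        self._issues_heap,
        (_SEVERITY_ORDER[issue.severity], self._heap_counter, issue),
    )
    self._heap_counter += 1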
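
To show how such a heap is read back, here is a self-contained sketch. `IssueSketch`, `ReportSketch`, `most_severe`, and `sorted_issues` are hypothetical names for this example and are not taken from report.py.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Optional

SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3, "info": 4}


@dataclass
class IssueSketch:
    column: str
    severity: str


class ReportSketch:
    def __init__(self) -> None:
        self._heap: list[tuple[int, int, IssueSketch]] = []
        self._counter = itertools.count()  # unique tie-breaker per issue

    def add_issue(self, issue: IssueSketch) -> None:
        entry = (SEVERITY_ORDER[issue.severity], next(self._counter), issue)
        heapq.heappush(self._heap, entry)  # O(log n) insertion

    def most_severe(self) -> Optional[IssueSketch]:
        # The heap root is the smallest tuple, i.e. the most severe issue.
        return self._heap[0][2] if self._heap else None

    def sorted_issues(self) -> list[IssueSketch]:
        # A full ordering still costs O(n log n); only the peek is O(1).
        return [entry[2] for entry in sorted(self._heap)]


report = ReportSketch()
for column, severity in [("age", "medium"), ("id", "critical"),
                         ("email", "high"), ("phone", "high")]:
    report.add_issue(IssueSketch(column, severity))

worst = report.most_severe()
ordered = [(i.severity, i.column) for i in report.sorted_issues()]
```

The O(1) guarantee covers only peeking at the root. Popping the most severe issue or listing all issues in order costs O(log n) and O(n log n) respectively.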

Performance Summary

| Optimization | Location | Effect |
|--------------|----------|--------|
| Expression Batch Executor | validators/base.py | Multiple validators, single collect() |
| Lazy Loading Registry | validators/_lazy.py | On-demand loading of 200+ validators |
| xxhash Cache | cache.py | ~10x faster fingerprinting |
| Native Polars Masking | maskers.py | No map_elements callbacks |
| Heap-Based Sorting | report.py | O(1) severity access |
| Batched Statistics | schema.py | Single select() for all stats |
| Streaming Mode | maskers.py | engine="streaming" for >1M rows |
| Result Format Gating (VE-1) | validators/base.py | Skip enrichment phases based on ResultFormat level |
| Shared Metric Store (VE-3) | validators/metrics.py | Deduplicate metric computations across validators |
| Dependency DAG Skip (VE-4) | validators/base.py | Skip validators when upstream dependencies fail |
| 3-Tier Fallback (VE-5) | validators/base.py | Graceful degradation: batch → per-validator → per-expression |
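
The batch → per-validator → per-expression degradation listed for VE-5 can be sketched in plain Python. This is an illustration of the control flow only: plain callables stand in for Polars expressions, and `run_batch` and `run_with_fallback` are hypothetical names, not Truthound functions.

```python
from typing import Any, Callable

Expr = Callable[[list], Any]


def run_batch(exprs: dict[str, Expr], data: list) -> dict[str, Any]:
    """Evaluate every expression together; any failure aborts the batch."""
    return {name: expr(data) for name, expr in exprs.items()}


def run_with_fallback(
    validators: dict[str, dict[str, Expr]], data: list
) -> tuple[dict[str, Any], dict[str, str]]:
    results: dict[str, Any] = {}
    errors: dict[str, str] = {}
    # Tier 1: a single batch containing every expression of every validator.
    everything = {
        f"{v}.{e}": fn for v, exprs in validators.items() for e, fn in exprs.items()
    }
    try:
        return run_batch(everything, data), errors
    except Exception:
        pass  # degrade to tier 2
    for v_name, exprs in validators.items():
        named = {f"{v_name}.{e}": fn for e, fn in exprs.items()}
        # Tier 2: one batch per validator.
        try:
            results.update(run_batch(named, data))
            continue
        except Exception:
            pass  # degrade to tier 3 for this validator only
        # Tier 3: one expression at a time, isolating the failing one.
        for name, fn in named.items():
            try:
                results[name] = fn(data)
            except Exception as exc:
                errors[name] = type(exc).__name__
    return results, errors


validators = {
    "nulls": {"count": lambda d: sum(x is None for x in d)},
    "range": {
        "max": lambda d: max(x for x in d if x is not None),
        "ratio": lambda d: 1 / 0,  # deliberately failing expression
    },
}
results, errors = run_with_fallback(validators, [1, None, 5])
```

In this run the failing expression forces the fallback to tier 3 for the `range` validator, while `nulls` completes at tier 2. Only the single broken expression ends up in `errors`.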

10. Testing Architecture

Design Patterns

Protocol-Based Optional Dependencies

Optional dependencies (boto3, sqlalchemy, jinja2) use Protocol definitions for type safety:

# In _protocols.py
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class S3ClientProtocol(Protocol):
    def put_object(self, *, Bucket: str, Key: str, Body: bytes) -> dict[str, Any]: ...
    def get_object(self, *, Bucket: str, Key: str) -> dict[str, Any]: ...

Mock-Based Testing

Optional dependencies are tested using comprehensive mocks:

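import io
from typing import Any

# MockS3Object and MockS3ClientError are test helpers not shown in this excerpt.

class MockS3Client:
    """In-memory S3 client mock."""

    def __init__(self) -> None:
        # Maps bucket name -> object key -> stored object
        self._buckets: dict[str, dict[str, MockS3Object]] = {}

    # **kwargs stands in for the remaining keyword arguments elided here
    def put_object(self, *, Bucket: str, Key: str, Body: bytes, **kwargs: Any) -> dict[str, Any]:
        self._buckets[Bucket][Key] = MockS3Object(body=Body)
        return {"ETag": "mock-etag"}

    def get_object(self, *, Bucket: str, Key: str) -> dict[str, Any]:
        obj = self._buckets[Bucket].get(Key)
        if obj is None:
            raise MockS3ClientError("NoSuchKey", "Object not found")
        # Wrap the bytes in a stream, mirroring the shape of a real response body
        return {"Body": io.BytesIO(obj.body)}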
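
The two patterns combine as follows: code under test depends only on a protocol, and the test passes in an in-memory object that satisfies it. The sketch below is self-contained and uses hypothetical names (`ObjectStoreProtocol`, `InMemoryStore`, `save_report`) rather than Truthound's own classes.

```python
import io
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ObjectStoreProtocol(Protocol):
    def put_object(self, *, Bucket: str, Key: str, Body: bytes) -> dict[str, Any]: ...
    def get_object(self, *, Bucket: str, Key: str) -> dict[str, Any]: ...


class InMemoryStore:
    """Satisfies the protocol structurally; no inheritance is needed."""

    def __init__(self) -> None:
        self._objects: dict[tuple[str, str], bytes] = {}

    def put_object(self, *, Bucket: str, Key: str, Body: bytes) -> dict[str, Any]:
        self._objects[(Bucket, Key)] = Body
        return {"ETag": "mock-etag"}

    def get_object(self, *, Bucket: str, Key: str) -> dict[str, Any]:
        try:
            return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}
        except KeyError:
            raise LookupError(f"NoSuchKey: {Bucket}/{Key}") from None


def save_report(client: ObjectStoreProtocol, text: str) -> None:
    """Code under test: depends on the protocol, never on boto3 itself."""
    client.put_object(Bucket="reports", Key="latest.txt", Body=text.encode())


store = InMemoryStore()
save_report(store, "3 issues found")
stored = store.get_object(Bucket="reports", Key="latest.txt")["Body"].read()
conforms = isinstance(store, ObjectStoreProtocol)
```

Note that `isinstance` against a `runtime_checkable` protocol checks only that the methods exist, not their signatures, so a static type checker remains the stronger guarantee.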

Test Categories

Note: Test counts change as the codebase evolves. Run pytest --collect-only for current counts.

| Category | Description |
|----------|-------------|
| Unit Tests | Core functionality |
| Validator Tests | Validator implementations |
| Integration Tests | End-to-end workflows |
| Mock Backend Tests | Optional dependency testing |
| E2E Tests | Complete pipeline tests |

See Also