Creating Custom Validators

Learn how to create custom validators for your specific data quality needs.

Overview

Truthound's validator SDK lets you write custom validators that plug into the framework like the built-in ones. The SDK provides three approaches:

  1. Decorator-Based - Quick and simple for straightforward validators
  2. Class-Based - Full control with inheritance for complex validators
  3. Fluent Builder - Chainable API for one-off validators

Prerequisites

  • Basic Python knowledge
  • Truthound installed (pip install truthound)
  • Familiarity with Polars DataFrames

Method 1: Decorator-Based Validators

The @custom_validator decorator is the simplest way to create a validator:

from truthound.validators.sdk import (
    custom_validator,
    Validator,
    ValidationIssue,
)
from truthound.types import Severity
import polars as pl

@custom_validator(
    name="positive_values",
    category="numeric",
    description="Checks that all values are positive",
    tags=["numeric", "range", "positive"],
)
class PositiveValuesValidator(Validator):
    """Validate that column values are positive."""

    def __init__(self, allow_zero: bool = False, **kwargs):
        super().__init__(**kwargs)
        self.allow_zero = allow_zero

    def validate(self, lf: pl.LazyFrame) -> list[ValidationIssue]:
        issues = []
        # Resolve the schema once instead of once per column
        schema = lf.collect_schema()

        for col in self._get_target_columns(lf):
            # Skip missing and non-numeric columns
            if col not in schema or not schema[col].is_numeric():
                continue

            # A value is invalid if it is negative, or zero when zeros are not allowed
            violation = pl.col(col) < 0 if self.allow_zero else pl.col(col) <= 0
            invalid_count = (
                lf.filter(violation)
                .select(pl.len())
                .collect()
                .item()
            )

            if invalid_count > 0:
                total = lf.select(pl.len()).collect().item()
                issues.append(
                    ValidationIssue(
                        column=col,
                        issue_type="non_positive_value",
                        count=invalid_count,
                        severity=Severity.HIGH,
                        details=f"Found {invalid_count}/{total} non-positive values",
                    )
                )

        return issues

Decorator Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| name | str | Unique validator name (required) |
| category | str | Category for grouping (default: "custom") |
| description | str | Human-readable description |
| version | str | Semantic version (default: "1.0.0") |
| author | str | Author name or email |
| tags | list[str] | Tags for filtering and discovery |
| auto_register | bool | Auto-register in global registry (default: True) |

Method 2: Class-Based Validators

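For more complex validators that hold state or run multiple checks, subclass Validator, set name and category as class attributes, and combine mixins as needed: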

from truthound.validators.sdk import (
    Validator,
    ValidationIssue,
    NumericValidatorMixin,
)
from truthound.types import Severity
import polars as pl

class RangeValidator(Validator, NumericValidatorMixin):
    """Validates that values fall within a specified range."""

    name = "custom_range"
    category = "numeric"

    def __init__(
        self,
        min_value: float | None = None,
        max_value: float | None = None,
        inclusive: bool = True,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.min_value = min_value
        self.max_value = max_value
        self.inclusive = inclusive

    def validate(self, lf: pl.LazyFrame) -> list[ValidationIssue]:
        issues = []
        columns = self._get_numeric_columns(lf)

        for col in columns:
            # Build filter for violations
            conditions = []

            if self.min_value is not None:
                if self.inclusive:
                    conditions.append(pl.col(col) < self.min_value)
                else:
                    conditions.append(pl.col(col) <= self.min_value)

            if self.max_value is not None:
                if self.inclusive:
                    conditions.append(pl.col(col) > self.max_value)
                else:
                    conditions.append(pl.col(col) >= self.max_value)

            if not conditions:
                continue

            # Combine conditions with OR
            combined = conditions[0]
            for cond in conditions[1:]:
                combined = combined | cond

            invalid_count = (
                lf.filter(combined)
                .select(pl.len())
                .collect()
                .item()
            )

            if invalid_count > 0:
                total = lf.select(pl.len()).collect().item()
                issues.append(
                    ValidationIssue(
                        column=col,
                        issue_type="out_of_range",
                        count=invalid_count,
                        severity=Severity.MEDIUM,
                        details=self._format_message(col, invalid_count, total),
                    )
                )

        return issues

    def _format_message(self, column: str, invalid: int, total: int) -> str:
        # Describe the allowed range, honoring the inclusive flag
        range_str = ""
        if self.min_value is not None and self.max_value is not None:
            left, right = ("[", "]") if self.inclusive else ("(", ")")
            range_str = f"{left}{self.min_value}, {self.max_value}{right}"
        elif self.min_value is not None:
            range_str = f"{'>=' if self.inclusive else '>'} {self.min_value}"
        elif self.max_value is not None:
            range_str = f"{'<=' if self.inclusive else '<'} {self.max_value}"

        return f"{column}: {invalid}/{total} values outside range {range_str}"
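
The inclusive flag decides whether the bounds themselves count as valid. The plain-Python sketch below mirrors the filter conditions that RangeValidator builds, applied to a single value. The helper is_out_of_range is hypothetical and exists only for illustration; it is not part of the SDK.

```python
def is_out_of_range(value, min_value=None, max_value=None, inclusive=True):
    """Mirror RangeValidator's violation conditions for one scalar value.

    Hypothetical helper for illustration only, not part of the SDK.
    """
    if min_value is not None:
        # Inclusive ranges accept the bound itself; exclusive ranges reject it.
        if value < min_value or (not inclusive and value == min_value):
            return True
    if max_value is not None:
        if value > max_value or (not inclusive and value == max_value):
            return True
    return False


# With inclusive=True the bounds 0 and 100 are valid values.
print(is_out_of_range(0, 0, 100))                    # False
print(is_out_of_range(100, 0, 100))                  # False
# With inclusive=False the bounds themselves are violations.
print(is_out_of_range(0, 0, 100, inclusive=False))   # True
print(is_out_of_range(101, 0, 100))                  # True
```

In Polars, a comparison against a null yields null, and filter keeps only rows where the predicate is true, so null values are not counted as out of range by the validator above.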

Available Mixins

The SDK provides mixins for common patterns:

| Mixin | Description |
|-------|-------------|
| NumericValidatorMixin | _get_numeric_columns() helper |
| StringValidatorMixin | _get_string_columns() helper |
| DatetimeValidatorMixin | _get_datetime_columns() helper |
| FloatValidatorMixin | _get_float_columns() helper |
| RegexValidatorMixin | Safe regex execution with ReDoS protection |
| StreamingValidatorMixin | Support for streaming large datasets |

Method 3: Fluent Builder

For quick one-off validators without creating a class:

from truthound.validators.sdk import ValidatorBuilder
from truthound.types import Severity
import polars as pl

# Create validator using builder
email_domain_validator = (
    ValidatorBuilder("email_domain")
    .category("string")
    .description("Validates email domain is from allowed list")
    .for_string_columns()
    .check_column(
        lambda col, lf: lf.filter(
            ~pl.col(col).str.contains(r"@(company\.com|partner\.com)$")
        ).select(pl.len()).collect().item()
    )
    .with_issue_type("invalid_email_domain")
    .with_severity(Severity.MEDIUM)
    .with_message("Column '{column}' has {count} emails with invalid domains")
    .build()
)

# Use the validator (df is an existing Polars DataFrame)
issues = email_domain_validator.validate(df.lazy())
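
Before wiring a pattern into a validator, it can help to check it against a few sample strings. The sketch below uses Python's re module purely to illustrate what the pattern accepts and rejects; Polars evaluates str.contains with a Rust regex engine, which should treat this simple pattern the same way for these inputs. The sample addresses are made up.

```python
import re

# Same pattern as in the builder example above.
ALLOWED_DOMAIN = re.compile(r"@(company\.com|partner\.com)$")

samples = [
    "alice@company.com",       # allowed
    "bob@partner.com",         # allowed
    "carol@company.com.evil",  # rejected: allowed domain is not at the end
    "dave@gmail.com",          # rejected: domain is not in the list
    "erin@sub.company.com",    # rejected: "@" must directly precede the domain
]

# Keep the addresses the validator would count as invalid.
invalid = [s for s in samples if not ALLOWED_DOMAIN.search(s)]
print(invalid)
# ['carol@company.com.evil', 'dave@gmail.com', 'erin@sub.company.com']
```

Note that the check in the builder example negates str.contains inside a filter. For null values str.contains returns null, and filter drops rows whose predicate is null, so missing emails are not counted as invalid domains.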

Builder Methods

| Method | Description |
|--------|-------------|
| .category(str) | Set validator category |
| .description(str) | Set description |
| .for_numeric_columns() | Filter to numeric columns |
| .for_string_columns() | Filter to string columns |
| .for_datetime_columns() | Filter to datetime columns |
| .check_column(fn) | Add check function (col, lf) -> count |
| .with_issue_type(str) | Set issue type |
| .with_severity(Severity) | Set severity level |
| .with_message(str) | Set message template |
| .build() | Build and return validator |
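
The chainable style works because every configuration method returns the builder itself, and build() freezes the configuration into a validator. The toy below illustrates that pattern on plain dictionaries of lists. ToyValidatorBuilder and ToyIssue are hypothetical names, and the assumption that the message template is filled with str.format-style {column} and {count} placeholders is ours; none of this is Truthound's implementation.

```python
from dataclasses import dataclass


@dataclass
class ToyIssue:
    column: str
    issue_type: str
    count: int
    message: str


class ToyValidatorBuilder:
    """Hypothetical mini-builder; each setter returns self to allow chaining."""

    def __init__(self, name: str):
        self.name = name
        self._check = None
        self._issue_type = "custom"
        self._message = "Column '{column}' has {count} issues"

    def check_column(self, fn):
        self._check = fn  # fn(column, data) -> number of violations
        return self

    def with_issue_type(self, issue_type: str):
        self._issue_type = issue_type
        return self

    def with_message(self, template: str):
        self._message = template
        return self

    def build(self):
        if self._check is None:
            raise ValueError("check_column() must be called before build()")
        check, issue_type, template = self._check, self._issue_type, self._message

        def validate(data: dict[str, list]) -> list[ToyIssue]:
            issues = []
            for column in data:
                count = check(column, data)
                if count > 0:
                    message = template.format(column=column, count=count)
                    issues.append(ToyIssue(column, issue_type, count, message))
            return issues

        return validate


validate = (
    ToyValidatorBuilder("no_negatives")
    .check_column(lambda col, data: sum(v < 0 for v in data[col]))
    .with_issue_type("negative_value")
    .with_message("Column '{column}' has {count} negative values")
    .build()
)

issues = validate({"amount": [1, -2, -3], "qty": [4, 5]})
print(issues[0].message)  # Column 'amount' has 2 negative values
```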

Registering Validators

Automatic Registration

Using @custom_validator with auto_register=True (default) automatically registers the validator:

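import truthound as th
from truthound.validators.sdk import custom_validator, Validator

@custom_validator(name="my_validator", category="custom")
class MyValidator(Validator):
    ...

# The validator can now be referenced by name (df is an existing Polars DataFrame)
report = th.check(df, validators=["my_validator"])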

Manual Registration

For validators defined without @custom_validator, apply register_validator instead:

import polars as pl
from truthound.validators.sdk import Validator, ValidationIssue, register_validator

@register_validator
class MyValidator(Validator):
    name = "my_validator"
    category = "custom"

    def validate(self, lf: pl.LazyFrame) -> list[ValidationIssue]:
        ...

Using Validator Instances

You can also pass validator instances directly:

import truthound as th

# Create validator instance (RangeValidator is the class defined in Method 2)
validator = RangeValidator(min_value=0, max_value=100)

# Use with th.check() (df is an existing Polars DataFrame)
report = th.check(df, validators=[validator])

# Or call directly
issues = validator.validate(df.lazy())

Testing Your Validator

The SDK provides testing utilities:

import pytest
import polars as pl
from truthound.validators.sdk import (
    ValidatorTestCase,
    create_test_dataframe,
    assert_no_issues,
    assert_has_issue,
    assert_issue_count,
)
from my_validators import PositiveValuesValidator

class TestPositiveValuesValidator(ValidatorTestCase):
    """Tests for PositiveValuesValidator."""

    def test_passes_for_positive_values(self):
        """Test that validator passes for positive values."""
        df = create_test_dataframe({
            "amount": [1, 2, 3, 4, 5]
        })

        validator = PositiveValuesValidator()
        issues = validator.validate(df.lazy())

        assert_no_issues(issues)

    def test_fails_for_negative_values(self):
        """Test that validator fails for negative values."""
        df = create_test_dataframe({
            "amount": [1, -2, 3, -4, 5]
        })

        validator = PositiveValuesValidator()
        issues = validator.validate(df.lazy())

        assert_issue_count(issues, 1)
        assert_has_issue(issues, column="amount", issue_type="non_positive_value")

    def test_allow_zero_option(self):
        """Test zero handling with allow_zero option."""
        df = create_test_dataframe({
            "amount": [0, 1, 2]
        })

        # Without allow_zero (default)
        validator = PositiveValuesValidator(allow_zero=False)
        issues = validator.validate(df.lazy())
        assert_issue_count(issues, 1)

        # With allow_zero
        validator = PositiveValuesValidator(allow_zero=True)
        issues = validator.validate(df.lazy())
        assert_no_issues(issues)

Best Practices

1. Use Lazy Evaluation

Keep operations lazy and collect only the final aggregate, so Polars can optimize the query and avoid materializing the full dataset:

# Good - filters and counts lazily, collects a single number
def validate(self, lf: pl.LazyFrame) -> list[ValidationIssue]:
    for col in self._get_target_columns(lf):
        count = (
            lf.filter(pl.col(col).is_null())
            .select(pl.len())
            .collect()
            .item()
        )
        ...

# Avoid - materializes the whole dataset up front
def validate(self, lf: pl.LazyFrame) -> list[ValidationIssue]:
    data = lf.collect()  # Don't do this!
    ...

2. Provide Clear Messages

Make validation messages actionable:

# Good
details = f"Column '{column}' has {count} values below minimum threshold {min_val}"

# Bad
details = "Validation failed"

3. Include Relevant Details

Include information that helps with debugging, such as the violated bounds and a few sample values:

ValidationIssue(
    column=col,
    issue_type="out_of_range",
    count=invalid_count,
    severity=Severity.MEDIUM,
    details=f"Found {invalid_count} values outside [{min_val}, {max_val}]",
    sample_values=samples[:5] if samples else None,
)

4. Handle Edge Cases

Account for empty data and missing columns:

def validate(self, lf: pl.LazyFrame) -> list[ValidationIssue]:
    issues = []
    schema = lf.collect_schema()

    # Nothing to validate in an empty dataset; check once, not per column
    if lf.select(pl.len()).collect().item() == 0:
        return issues

    for col in self._get_target_columns(lf):
        # Skip columns that are missing from the data
        if col not in schema:
            continue

        # Regular validation
        ...

    return issues

5. Use Type Filters

Leverage the SDK's type filtering:

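import polars as pl
from truthound.validators.sdk import Validator, ValidationIssue, NumericValidatorMixin
from truthound.validators.sdk import NUMERIC_TYPES, STRING_TYPES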

class MyValidator(Validator, NumericValidatorMixin):
    def validate(self, lf: pl.LazyFrame) -> list[ValidationIssue]:
        # Only processes numeric columns
        for col in self._get_numeric_columns(lf):
            ...

Enterprise Features

For production environments, the SDK includes enterprise features:

from truthound.validators.sdk import (
    EnterpriseSDKManager,
    EnterpriseConfig,
    SandboxBackend,
    ResourceLimits,
)

# Configure enterprise features
config = EnterpriseConfig(
    sandbox_backend=SandboxBackend.PROCESS,
    resource_limits=ResourceLimits(
        max_memory_mb=512,
        max_cpu_percent=50,
        max_execution_time_seconds=30,
    ),
    enable_signing=True,
)

manager = EnterpriseSDKManager(config)

# Execute validator in sandbox (my_validator and df are defined elsewhere)
result = manager.execute_validator(my_validator, df.lazy())

See the API Reference for complete enterprise SDK documentation.

Next Steps