Pattern Matching

This document describes the Polars native pattern matching system.

Overview

The pattern matching system, implemented in src/truthound/profiler/native_patterns.py, detects patterns with Polars' vectorized str.contains() operations rather than per-row Python callbacks, which keeps detection fast on large columns.

PatternSpec

A dataclass for defining pattern specifications.

from dataclasses import dataclass, field

@dataclass
class PatternSpec:
    """Pattern specification definition"""

    name: str                      # Pattern name (e.g., "email", "phone")
    regex: str                     # Regular expression pattern
    data_type: DataType            # Data type to infer upon matching
    priority: int = 0              # Priority (higher values match first)
    examples: list[str] = field(default_factory=list)  # Example values
    description: str = ""          # Pattern description
    category: str = "general"      # Pattern category

PatternBuilder

PatternBuilder defines patterns through a fluent (method-chaining) API.

from truthound.profiler.native_patterns import PatternBuilder

# Create pattern using fluent style
pattern = (
    PatternBuilder("korean_mobile")
    .regex(r"^01[0-9]-?[0-9]{3,4}-?[0-9]{4}$")
    .data_type(DataType.KOREAN_PHONE)
    .priority(100)
    .examples(["010-1234-5678", "01012345678"])
    .description("Korean mobile phone number")
    .category("korean")
    .build()
)
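To illustrate how a fluent builder of this kind can work, here is a minimal stand-alone sketch. It does not reproduce truthound's implementation: `SketchPatternSpec`, `SketchPatternBuilder`, and the string-valued `data_type` are hypothetical stand-ins. Each setter stores a value and returns `self`, which is what allows the chained calls, and `build()` produces the finished specification.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SketchPatternSpec:
    """Hypothetical stand-in mirroring the PatternSpec fields shown above."""

    name: str
    regex: str
    data_type: str  # assumption: a plain string replaces the DataType enum
    priority: int = 0
    examples: list[str] = field(default_factory=list)
    description: str = ""
    category: str = "general"


class SketchPatternBuilder:
    """Hypothetical fluent builder: every setter returns self for chaining."""

    def __init__(self, name: str) -> None:
        self._fields: dict = {"name": name}

    def regex(self, value: str) -> SketchPatternBuilder:
        self._fields["regex"] = value
        return self

    def data_type(self, value: str) -> SketchPatternBuilder:
        self._fields["data_type"] = value
        return self

    def priority(self, value: int) -> SketchPatternBuilder:
        self._fields["priority"] = value
        return self

    def examples(self, values: list[str]) -> SketchPatternBuilder:
        self._fields["examples"] = list(values)
        return self

    def build(self) -> SketchPatternSpec:
        # Refuse to build a specification that lacks its required fields.
        missing = [k for k in ("regex", "data_type") if k not in self._fields]
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        return SketchPatternSpec(**self._fields)


spec = (
    SketchPatternBuilder("korean_mobile")
    .regex(r"^01[0-9]-?[0-9]{3,4}-?[0-9]{4}$")
    .data_type("KOREAN_PHONE")
    .priority(100)
    .examples(["010-1234-5678", "01012345678"])
    .build()
)
print(spec)
```

Fields that are never set keep the dataclass defaults, so `spec.category` here is `"general"`.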

NativePatternMatcher

A pattern matcher using native Polars operations.

from truthound.profiler.native_patterns import NativePatternMatcher

# Create matcher
matcher = NativePatternMatcher()

# Match patterns in column
results = matcher.match(lf, "email_column")

for result in results:
    print(f"Pattern: {result.pattern_name}")
    print(f"Match ratio: {result.match_ratio:.2%}")
    print(f"Data type: {result.data_type}")

Internal Implementation

class NativePatternMatcher:
    """Polars native pattern matcher"""

    def match(self, lf: pl.LazyFrame, column: str) -> list[PatternMatch]:
        """
        Perform high-performance pattern matching using
        Polars' vectorized str.contains()
        """
        col = pl.col(column)

        for pattern in self._patterns:
            # Native Polars operation (no map_elements)
            match_expr = col.str.contains(pattern.regex)
            match_count = match_expr.sum()
            # ...

Built-in Patterns

General Patterns

Pattern Name	Data Type	Description
`email`	`EMAIL`	Email addresses
`url`	`URL`	URL/URI
`uuid`	`UUID`	UUID (v1-v5)
`ip_address`	`IP_ADDRESS`	IPv4/IPv6
`phone`	`PHONE`	International phone numbers
`date_iso`	`DATE`	ISO 8601 dates
`datetime_iso`	`DATETIME`	ISO 8601 datetime
`json`	`JSON`	JSON objects/arrays
`currency`	`CURRENCY`	Currency amounts
`percentage`	`PERCENTAGE`	Percentages

Korean-Specific Patterns

Pattern Name	Data Type	Description
`korean_rrn`	`KOREAN_RRN`	Resident registration number
`korean_phone`	`KOREAN_PHONE`	Korean phone numbers
`korean_mobile`	`KOREAN_PHONE`	Korean mobile phone numbers
`korean_business_number`	`KOREAN_BUSINESS_NUMBER`	Business registration number

Pattern Registry

from truthound.profiler.native_patterns import PatternRegistry

# Retrieve default patterns
email_pattern = PatternRegistry.get("email")

# Retrieve patterns by category
korean_patterns = PatternRegistry.get_by_category("korean")

# Register custom patterns
PatternRegistry.register(
    PatternSpec(
        name="custom_id",
        regex=r"^[A-Z]{2}\d{6}$",
        data_type=DataType.IDENTIFIER,
        priority=50,
        examples=["AB123456"],
        description="Company-specific ID format",
    )
)

# Remove pattern
PatternRegistry.unregister("custom_id")
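The registry calls above can be pictured as operations on a name-keyed mapping. The following is a minimal sketch, not truthound's code: `SketchRegistry` and `SketchSpec` are hypothetical, and behaviors such as rejecting duplicate names or silently ignoring an unknown name on removal are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SketchSpec:
    """Hypothetical, reduced pattern specification."""

    name: str
    regex: str
    category: str = "general"


class SketchRegistry:
    """Hypothetical registry backed by a dict keyed on pattern name."""

    def __init__(self) -> None:
        self._specs: dict[str, SketchSpec] = {}

    def register(self, spec: SketchSpec) -> None:
        # Assumption: duplicate names are treated as an error.
        if spec.name in self._specs:
            raise ValueError(f"pattern already registered: {spec.name}")
        self._specs[spec.name] = spec

    def get(self, name: str) -> SketchSpec | None:
        return self._specs.get(name)

    def get_by_category(self, category: str) -> list[SketchSpec]:
        return [s for s in self._specs.values() if s.category == category]

    def unregister(self, name: str) -> None:
        # Assumption: removing an unknown name is a no-op.
        self._specs.pop(name, None)


registry = SketchRegistry()
registry.register(SketchSpec("custom_id", r"^[A-Z]{2}\d{6}$"))
registry.register(SketchSpec("korean_mobile", r"^01[0-9]-?[0-9]{3,4}-?[0-9]{4}$", "korean"))

korean_names = [s.name for s in registry.get_by_category("korean")]
registry.unregister("custom_id")
```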

PatternMatch Result

from dataclasses import dataclass

@dataclass
class PatternMatch:
    """Pattern matching result"""

    pattern_name: str       # Matched pattern name
    regex: str              # Regular expression used
    data_type: DataType     # Inferred data type
    match_count: int        # Number of matched rows
    total_count: int        # Total rows (excluding null)
    match_ratio: float      # Match ratio (0.0-1.0)
    confidence: float       # Confidence level
    sample_matches: list[str]  # Sample matched values
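The relationship between the count fields can be sketched as follows, assuming that `match_ratio` is `match_count` divided by the non-null `total_count` (the field comments suggest this, but it is not confirmed here). The helper `ratio_excluding_nulls` and the simplified email expression are hypothetical, and the sketch uses Python's `re` module rather than Polars. How `confidence` is derived is not covered.

```python
from __future__ import annotations

import re
from typing import Optional, Sequence


def ratio_excluding_nulls(
    values: Sequence[Optional[str]], regex: str
) -> tuple[int, int, float]:
    """Return (match_count, total_count, match_ratio), ignoring None values."""
    compiled = re.compile(regex)
    non_null = [v for v in values if v is not None]
    match_count = sum(1 for v in non_null if compiled.search(v))
    total_count = len(non_null)
    # Guard against a column that is empty or entirely null.
    match_ratio = match_count / total_count if total_count else 0.0
    return match_count, total_count, match_ratio


values = ["a@example.com", "b@example.org", None, "not-an-email"]
counts = ratio_excluding_nulls(values, r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
print(counts)
```

With this sample, two of the three non-null values match, so the null row does not dilute the ratio.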

Priority-Based Matching

When multiple patterns match, they are evaluated in descending priority order, so a specific pattern with a higher priority takes precedence over a more general one.

# Priority example
patterns = [
    PatternSpec("korean_mobile", ..., priority=100),  # Checked first
    PatternSpec("phone", ..., priority=50),           # General phone number
    PatternSpec("numeric", ..., priority=10),         # Numeric
]

# Korean mobile numbers match before general phone numbers
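One way to picture priority-based matching is to evaluate candidates in descending priority order and keep the first one that matches. This is a simplified sketch for a single value using Python's `re`; the candidate tuples, the simplified regular expressions for `phone` and `numeric`, and the function `first_match` are illustrative assumptions rather than truthound's definitions.

```python
import re

# (name, regex, priority); the list order is deliberately not sorted.
CANDIDATES = [
    ("numeric", r"^[0-9]+$", 10),
    ("phone", r"^\+?[0-9][0-9-]{6,14}$", 50),
    ("korean_mobile", r"^01[0-9]-?[0-9]{3,4}-?[0-9]{4}$", 100),
]


def first_match(value):
    """Return the name of the highest-priority matching pattern, or None."""
    ordered = sorted(CANDIDATES, key=lambda c: c[2], reverse=True)
    for name, regex, _priority in ordered:
        if re.search(regex, value):
            return name
    return None


for sample in ["01012345678", "0212345678", "12345", "abc"]:
    print(sample, "->", first_match(sample))
```

The value `01012345678` satisfies all three expressions, but `korean_mobile` is reported because it has the highest priority.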

Performance Optimization

Vectorized Operations

# Internal implementation - no Python callbacks
def _count_matches(self, lf: pl.LazyFrame, column: str, pattern: str) -> int:
    return (
        lf.select(
            pl.col(column)
            .str.contains(pattern)  # Polars native
            .sum()
        )
        .collect()
        .item()
    )

Combining with Sampling

from truthound.profiler.native_patterns import NativePatternMatcher
from truthound.profiler.sampling import Sampler, SamplingConfig

# Sample from large data then perform pattern matching
sampler = Sampler(SamplingConfig(max_rows=10_000))
sampled_result = sampler.sample(lf)

matcher = NativePatternMatcher()
patterns = matcher.match(sampled_result.data.lazy(), "email")

CLI Usage

# Profile with pattern detection
th profile data.csv --include-patterns

# Disable pattern detection
th profile data.csv --no-patterns

# Pattern detection for specific columns only
th profile data.csv --pattern-columns email,phone

Custom Pattern Files

Custom patterns can be defined in YAML format.

# custom_patterns.yaml
patterns:
  - name: employee_id
    regex: "^EMP\\d{5}$"
    data_type: identifier
    priority: 80
    examples:
      - EMP00001
      - EMP12345
    description: Employee ID

  - name: product_sku
    regex: "^[A-Z]{3}-\\d{4}-[A-Z]$"
    data_type: identifier
    priority: 70
    examples:
      - ABC-1234-X
    description: Product SKU

from truthound.profiler.native_patterns import PatternRegistry, load_patterns_from_yaml

# Load custom patterns
patterns = load_patterns_from_yaml("custom_patterns.yaml")
PatternRegistry.register_all(patterns)
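Because each YAML entry carries its own `examples`, a quick sanity check before registration is to confirm that every example matches its regular expression. The sketch below works on plain dictionaries shaped like the YAML entries above (after parsing, the double-quoted `"^EMP\\d{5}$"` becomes `^EMP\d{5}$`). The function `find_bad_examples` is hypothetical, and it uses Python's `re`, whose syntax overlaps with but is not identical to the Rust regex engine that Polars uses.

```python
import re

# Dictionaries mirroring the parsed YAML entries (reduced to the fields used here).
entries = [
    {"name": "employee_id", "regex": r"^EMP\d{5}$", "examples": ["EMP00001", "EMP12345"]},
    {"name": "product_sku", "regex": r"^[A-Z]{3}-\d{4}-[A-Z]$", "examples": ["ABC-1234-X"]},
]


def find_bad_examples(entries):
    """Return (pattern name, example) pairs where the example does not match."""
    bad = []
    for entry in entries:
        compiled = re.compile(entry["regex"])
        for example in entry.get("examples", []):
            if compiled.search(example) is None:
                bad.append((entry["name"], example))
    return bad


problems = find_bad_examples(entries)
print(problems or "all examples match")
```

An entry whose example fails the check usually points to an escaping mistake in the YAML or to an expression that is stricter than intended.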

Next Steps

Rule Generation - Generate validation rules from detected patterns
ML Inference - ML-based type inference