ML-based Type Inference¶

This document describes the ML-based semantic type inference system.

Overview¶

The ML type inference system implemented in src/truthound/profiler/ml_inference.py combines column names, value patterns, and statistical characteristics to infer semantic types.

Feature¶

A dataclass for feature definitions.

@dataclass
class Feature:
    """Single feature"""

    name: str
    value: float
    feature_type: FeatureType  # NAME, VALUE, STATISTICAL, CONTEXTUAL
    importance: float = 1.0
    metadata: dict[str, Any] = field(default_factory=dict)

FeatureVector¶

A feature vector collection.

@dataclass
class FeatureVector:
    """Feature vector"""

    column_name: str
    features: list[Feature]
    raw_values: dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> dict[str, float]:
        """Convert to dictionary"""
        return {f.name: f.value for f in self.features}

    def to_array(self) -> list[float]:
        """Convert to array"""
        return [f.value for f in self.features]

    def get_feature(self, name: str) -> Feature | None:
        """Retrieve feature by name"""
        for f in self.features:
            if f.name == name:
                return f
        return None

InferenceResult¶

@dataclass
class InferenceResult:
    """Type inference result"""

    column_name: str                  # Column name
    inferred_type: DataType           # Inferred type
    confidence: float                 # Confidence (0.0-1.0)
    alternatives: list[tuple[DataType, float]] = field(default_factory=list)  # Alternative types
    reasoning: list[str] = field(default_factory=list)  # Inference reasoning
    features_used: list[str] = field(default_factory=list)  # Features used
    model_version: str = "1.0"        # Model version
    inference_time_ms: float = 0.0

ContextFeatureExtractor¶

Extracts various types of features. Combines all extractors to generate a comprehensive feature vector.

from truthound.profiler.ml_inference import ContextFeatureExtractor
import polars as pl

extractor = ContextFeatureExtractor()

# Extract features from column (Series)
column = pl.Series("email", ["user@example.com", "test@domain.org"])
context = {"table_name": "users"}
features = extractor.extract(column, context)

for feature in features.features:
    print(f"{feature.name}: {feature.value:.4f} (type: {feature.feature_type})")

Extracted Feature Types¶

Category	Feature	Description
NAME	`name_has_email`	Column name contains 'email'
NAME	`name_has_phone`	Column name contains 'phone'
NAME	`name_has_date`	Column name contains 'date'
NAME	`name_has_id`	Column name contains 'id'
NAME	`name_has_name`	Column name contains 'name'
NAME	`name_has_address`	Column name contains 'address'
NAME	`name_has_url`	Column name contains 'url'
VALUE	`value_has_at_symbol`	Ratio containing '@' symbol
VALUE	`value_has_dot`	Ratio containing '.'
VALUE	`value_digit_ratio`	Numeric character ratio
VALUE	`value_alpha_ratio`	Alphabetic ratio
VALUE	`value_has_dash`	Ratio containing '-'
VALUE	`value_has_slash`	Ratio containing '/'
STATISTICAL	`stat_avg_length`	Average length
STATISTICAL	`stat_std_length`	Length standard deviation
STATISTICAL	`stat_unique_ratio`	Unique value ratio
STATISTICAL	`stat_null_ratio`	Null ratio
STATISTICAL	`stat_min_length`	Minimum length
STATISTICAL	`stat_max_length`	Maximum length

MLTypeInferrer¶

The ML-based type inferrer. Supports multiple models (RuleBasedModel, NaiveBayesModel, EnsembleModel).

from truthound.profiler.ml_inference import MLTypeInferrer
import polars as pl

inferrer = MLTypeInferrer()  # Default: ensemble model

# Single column type inference
column = pl.Series("email", ["user@example.com", "test@domain.org"])
context = {"table_name": "users"}
result = inferrer.infer(column, context)

print(f"Column: {result.column_name}")
print(f"Type: {result.inferred_type}")
print(f"Confidence: {result.confidence:.2%}")

# Alternative types
for dtype, prob in result.alternatives[:3]:
    print(f"  {dtype}: {prob:.2%}")

# Inference reasoning
for reason in result.reasoning:
    print(f"  - {reason}")

Model Types¶

Three built-in models are supported, and custom models can be added by implementing the InferenceModel protocol.

class MLTypeInferrer:
    """ML type inferrer"""

    def __init__(
        self,
        model: str = "ensemble",  # "rule", "naive_bayes", "ensemble"
        config: InferrerConfig | None = None,
    ):
        self._config = config or InferrerConfig()
        self._model = model_registry.get(model)
        self._extractor = ContextFeatureExtractor()
        self._cache: dict[str, InferenceResult] = {}

    def infer(
        self,
        column: pl.Series,
        context: dict[str, Any] | None = None,
    ) -> InferenceResult:
        context = context or {}
        features = self._extractor.extract(column, context)

        # Caching support
        if self._config.use_caching:
            cache_key = self._get_cache_key(column, context)
            if cache_key in self._cache:
                return self._cache[cache_key]

        # Model prediction
        result = self._model.predict(features)
        return result

Full Table Type Inference¶

from truthound.profiler.ml_inference import infer_table_types_ml
import polars as pl

# Infer types for all columns from DataFrame
df = pl.DataFrame({
    "email": ["user@example.com", "test@domain.org"],
    "phone": ["010-1234-5678", "02-123-4567"],
    "age": [25, 30],
})

results = infer_table_types_ml(df, table_name="users", model="ensemble")

for column, result in results.items():
    print(f"{column}: {result.inferred_type} ({result.confidence:.0%})")

Feature Extractor Types¶

All extractors implement the FeatureExtractor protocol and provide the extract(column: pl.Series, context: dict) -> list[Feature] method.

NameFeatureExtractor¶

Extracts features from column names.

from truthound.profiler.ml_inference import NameFeatureExtractor
import polars as pl

extractor = NameFeatureExtractor()
column = pl.Series("customer_email_address", ["user@example.com"])
features = extractor.extract(column, {})

# Extracted features:
# - name_has_email: 1.0
# - name_has_customer: 1.0
# - name_has_address: 1.0

ValueFeatureExtractor¶

Extracts features from column values.

from truthound.profiler.ml_inference import ValueFeatureExtractor
import polars as pl

extractor = ValueFeatureExtractor()
column = pl.Series("email", ["user@example.com", "test@domain.org"])
features = extractor.extract(column, {})

# Extracted features:
# - value_has_at_symbol: 1.0 (100% contain @)
# - value_avg_length: 18.5
# - value_digit_ratio: 0.0

StatisticalFeatureExtractor¶

Extracts statistical features.

from truthound.profiler.ml_inference import StatisticalFeatureExtractor
import polars as pl

extractor = StatisticalFeatureExtractor()
column = pl.Series("email", ["user@example.com", "test@domain.org", None])
features = extractor.extract(column, {})

# Extracted features:
# - stat_unique_ratio: 0.67
# - stat_null_ratio: 0.33
# - stat_count: 3

Custom Model Registration¶

from truthound.profiler.ml_inference import model_registry, InferenceModel, FeatureVector, InferenceResult

# Implement InferenceModel protocol
class MyCustomModel:
    """Custom inference model"""

    @property
    def name(self) -> str:
        return "my_model"

    @property
    def version(self) -> str:
        return "1.0.0"

    def predict(self, features: FeatureVector) -> InferenceResult:
        # Custom inference logic
        ...

# Register in model registry
model_registry.register(MyCustomModel())

# Use registered model
inferrer = MLTypeInferrer(model="my_model")

Inference Configuration¶

from truthound.profiler.ml_inference import InferrerConfig, MLTypeInferrer

config = InferrerConfig(
    model="ensemble",              # Model type ("rule", "naive_bayes", "ensemble")
    confidence_threshold=0.5,      # Minimum confidence
    use_caching=True,              # Enable caching
    cache_size=1000,               # Cache size
    enable_learning=True,          # Enable learning mode
    model_path=None,               # Custom model path (optional)
)

inferrer = MLTypeInferrer(config=config)

CLI Usage¶

# Enable ML-based type inference
th profile data.csv --ml-inference

# Specify model type
th profile data.csv --ml-inference --model ensemble

# Set confidence threshold
th profile data.csv --ml-inference --confidence-threshold 0.8

RuleBasedModel (Rule-Based Inference)¶

The default rule-based model. Infers types based on feature values.

class RuleBasedModel:
    """Rule-based type inference model"""

    @property
    def name(self) -> str:
        return "rule"

    def predict(self, features: FeatureVector) -> InferenceResult:
        """Infer type using feature-based rules"""
        scores: dict[DataType, float] = {}

        # Check email pattern
        at_feature = features.get_feature("value_has_at_symbol")
        dot_feature = features.get_feature("value_has_dot")
        if at_feature and at_feature.value > 0.8:
            if dot_feature and dot_feature.value > 0.9:
                scores[DataType.EMAIL] = 0.85

        # Check phone number pattern
        digit_feature = features.get_feature("value_digit_ratio")
        if digit_feature and digit_feature.value > 0.7:
            scores[DataType.PHONE] = 0.75

        # Return highest scoring type
        best_type = max(scores, key=scores.get, default=DataType.STRING)
        return InferenceResult(
            column_name=features.column_name,
            inferred_type=best_type,
            confidence=scores.get(best_type, 0.5),
            # ...
        )

Next Steps¶

Threshold Tuning - Optimize inference thresholds
Pattern Matching - Pattern-based type detection