
Data Profiling Tutorial

Learn how to profile your data and generate validation rules automatically.

Overview

Truthound's profiler analyzes your data to generate statistical summaries and automatically infer validation rules. This is essential for:

  • Understanding data quality before validation
  • Generating baseline schemas for drift detection
  • Auto-generating validation rules from data characteristics

Prerequisites

  • Truthound installed (pip install truthound)
  • Sample data file (CSV, Parquet, or JSON)

Basic Profiling

Using the API

The th.profile() function returns a ProfileReport with basic statistics:

import truthound as th

# Profile a file - returns ProfileReport
profile = th.profile("data.csv")

# View summary
print(f"Rows: {profile.row_count}")
print(f"Columns: {profile.column_count}")
print(f"Size: {profile.size_bytes / 1024:.2f} KB")

# Column details (columns is a list of dicts)
# Note: Both 'null_pct'/'unique_pct' and 'null_count'/'unique_count' are available
for col in profile.columns:
    print(f"\n{col['name']} ({col['dtype']}):")
    print(f"  Null %: {col['null_pct']}")       # or col['null_count'] for count
    print(f"  Unique %: {col['unique_pct']}")   # or col['unique_count'] for count
    if col.get('min') is not None:  # explicit None check so a minimum of 0 is still reported
        print(f"  Range: [{col['min']}, {col['max']}]")

# Print formatted report
profile.print()
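
To make the null_pct and unique_pct fields concrete, the sketch below computes comparable numbers for a single column in plain Python. It is an illustration only: column_summary is a hypothetical helper, not part of Truthound, and the exact definitions Truthound uses (for example, whether nulls count toward uniqueness, or how percentages are formatted) may differ.

```python
def column_summary(values):
    """Return null and unique counts/percentages for a list (None means null)."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    null_count = total - len(non_null)
    unique_count = len(set(non_null))  # assumption: nulls are not counted as a value
    return {
        "null_count": null_count,
        "null_pct": 100.0 * null_count / total if total else 0.0,
        "unique_count": unique_count,
        "unique_pct": 100.0 * unique_count / total if total else 0.0,
    }


summary = column_summary([10, 20, 20, None])
print(summary)
```

Counts are useful for exact thresholds, while percentages are easier to compare across datasets of different sizes.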

Using the CLI

# Basic profile
truthound profile data.csv

# Output as JSON
truthound profile data.csv --format json > profile.json

# Auto-profile with rule generation
truthound auto-profile data.csv -o profile.json

Advanced Profiling with DataProfiler

For detailed profiling with pattern detection and correlations, use the DataProfiler class:

from truthound.profiler.table_profiler import DataProfiler
from truthound.profiler.base import ProfilerConfig
import polars as pl

# Configure profiler
config = ProfilerConfig(
    sample_size=10000,         # Sample for large datasets
    include_patterns=True,     # Detect patterns (email, phone, etc.)
    include_correlations=True, # Calculate correlations
    n_jobs=4,                  # Parallel processing threads
)

# Create profiler
profiler = DataProfiler(config=config)

# Profile data - returns TableProfile
df = pl.read_parquet("data.parquet")
table_profile = profiler.profile(df.lazy(), name="my_data")

# Access results
print(f"Row count: {table_profile.row_count}")
print(f"Duplicate rows: {table_profile.duplicate_row_count}")
print(f"Duration: {table_profile.profile_duration_ms:.2f}ms")

# Column profiles (TableProfile.columns is a tuple of ColumnProfile)
for col_profile in table_profile.columns:
    print(f"\n{col_profile.name}:")
    print(f"  Physical type: {col_profile.physical_type}")
    print(f"  Inferred type: {col_profile.inferred_type.value}")
    print(f"  Nulls: {col_profile.null_count} ({col_profile.null_ratio:.2%})")
    print(f"  Unique: {col_profile.distinct_count} ({col_profile.unique_ratio:.2%})")

    # Distribution stats for numeric columns
    if col_profile.distribution:
        dist = col_profile.distribution
        print(f"  Mean: {dist.mean:.2f}, Std: {dist.std:.2f}")
        print(f"  Range: [{dist.min}, {dist.max}]")

    # Detected patterns (email, phone, URL, etc.)
    if col_profile.detected_patterns:
        patterns = [p.pattern for p in col_profile.detected_patterns]
        print(f"  Detected patterns: {patterns}")
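
Pattern detection of this kind can be pictured as matching each string value against a set of regular expressions and reporting the patterns that most values satisfy. The sketch below shows that idea with two deliberately simple regexes. PATTERNS, detect_patterns, and the 0.8 minimum ratio are assumptions made for illustration; they are not Truthound's actual patterns or thresholds.

```python
import re

# Assumed, intentionally loose patterns; production detectors are usually stricter.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "url": re.compile(r"^https?://\S+$"),
}


def detect_patterns(values, min_ratio=0.8):
    """Return (pattern_name, match_ratio) for patterns matched by enough values."""
    strings = [v for v in values if isinstance(v, str)]
    if not strings:
        return []
    found = []
    for name, regex in PATTERNS.items():
        ratio = sum(1 for s in strings if regex.match(s)) / len(strings)
        if ratio >= min_ratio:
            found.append((name, ratio))
    return found


emails = ["a@example.com", "b@example.org", "not-an-email", "c@example.net", "d@example.com"]
matches = detect_patterns(emails)
print(matches)
```

Using a ratio rather than requiring every value to match keeps a few malformed entries from hiding the dominant pattern of a column.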

Convenience Functions

from truthound.profiler.table_profiler import profile_file, profile_dataframe

# Profile from file - returns TableProfile
profile = profile_file("data.parquet")

# Profile DataFrame
import polars as pl
df = pl.read_csv("data.csv")
profile = profile_dataframe(df, name="my_data")

# Convert to dict for serialization
profile_dict = profile.to_dict()

Specialized Table Analyzers

The profiler provides modular analyzers for specific analysis tasks. You can use them on their own or together with DataProfiler for fine-grained control over profiling. The examples in this section assume df is a Polars DataFrame, such as the one loaded in the DataProfiler example above.

DuplicateRowAnalyzer

Identifies duplicate rows within the dataset:

from truthound.profiler import DuplicateRowAnalyzer, ProfilerConfig

analyzer = DuplicateRowAnalyzer()
result = analyzer.analyze(df.lazy(), ProfilerConfig())

print(f"Duplicate rows: {result['duplicate_row_count']}")
print(f"Duplicate ratio: {result['duplicate_row_ratio']:.2%}")
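
As a rough picture of what a duplicate count and ratio mean, the sketch below counts repeated rows in plain Python. It assumes that every occurrence after the first counts as a duplicate; Truthound's own definition may differ, and duplicate_stats is a hypothetical helper that only mirrors the result keys shown above.

```python
from collections import Counter


def duplicate_stats(rows):
    """Count rows that repeat an earlier row, and their share of all rows."""
    counts = Counter(tuple(row) for row in rows)
    # Assumption: each occurrence beyond the first is one duplicate row.
    duplicate_rows = sum(n - 1 for n in counts.values())
    ratio = duplicate_rows / len(rows) if rows else 0.0
    return {"duplicate_row_count": duplicate_rows, "duplicate_row_ratio": ratio}


rows = [(1, "a"), (2, "b"), (1, "a"), (1, "a")]
stats = duplicate_stats(rows)
print(stats)
```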

MemoryEstimator

Estimates memory consumption for datasets:

from truthound.profiler import MemoryEstimator, ProfilerConfig

estimator = MemoryEstimator()
result = estimator.analyze(df.lazy(), ProfilerConfig())

size_mb = result["estimated_memory_bytes"] / (1024 * 1024)
print(f"Estimated memory: {size_mb:.2f} MB")

CorrelationAnalyzer

Computes pairwise correlations between numeric columns:

from truthound.profiler import CorrelationAnalyzer, ProfilerConfig

# Configure correlation threshold
analyzer = CorrelationAnalyzer(threshold=0.5)
config = ProfilerConfig(correlation_threshold=0.5)
result = analyzer.analyze(df.lazy(), config)

for col1, col2, corr in result["correlations"]:
    direction = "positive" if corr > 0 else "negative"
    print(f"{col1} <-> {col2}: {corr:.3f} ({direction} correlation)")
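
The following sketch shows one way pairwise correlation with a threshold can work: compute the Pearson coefficient for every pair of numeric columns and keep the pairs whose absolute value reaches the threshold. The helper names are hypothetical, and both the choice of Pearson and the use of the absolute value are assumptions; the source does not state which coefficient CorrelationAnalyzer computes.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant column; treated as 0 in this sketch
    return cov / sqrt(var_x * var_y)


def strong_correlations(columns, threshold=0.5):
    """Return (col1, col2, corr) for pairs with |corr| >= threshold."""
    result = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        corr = pearson(a, b)
        if abs(corr) >= threshold:
            result.append((name_a, name_b, corr))
    return result


columns = {
    "x": [1, 2, 3, 4],
    "double_x": [2, 4, 6, 8],
    "noise": [1, -1, -1, 1],
}
pairs = strong_correlations(columns)
print(pairs)
```

Filtering on the absolute value keeps strong negative relationships, which matches the positive/negative labelling in the loop above.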

Custom TableAnalyzer Implementation

The TableAnalyzer protocol enables extension of profiling capabilities through custom implementations:

from truthound.profiler import DataProfiler, TableAnalyzer, ProfilerConfig
import polars as pl

class RowCountAnalyzer(TableAnalyzer):
    """Custom analyzer for row count categorization."""
    name = "row_count_custom"

    def analyze(self, lf: pl.LazyFrame, config: ProfilerConfig) -> dict:
        row_count = lf.select(pl.len()).collect().item()
        return {
            "custom_row_count": row_count,
            "row_category": (
                "small" if row_count < 100
                else "medium" if row_count < 1000
                else "large"
            ),
        }

# Register custom analyzer with DataProfiler
profiler = DataProfiler()
profiler.add_table_analyzer(RowCountAnalyzer())
profile = profiler.profile(df.lazy(), name="custom_analysis")

Generating Validation Rules

From Profile to Rules

# Generate validation suite from profile
truthound generate-suite profile.json -o rules.yaml

# One-step: profile + generate suite
truthound quick-suite data.csv -o rules.yaml

# With specific categories
truthound quick-suite data.csv -o rules.yaml --categories completeness,uniqueness,range

Using the API

from truthound.profiler.suite_export import SuiteExporter
from truthound.profiler.table_profiler import profile_file

# Profile data
profile = profile_file("data.csv")

# Export as validation suite
exporter = SuiteExporter()
suite = exporter.export(profile)

# Save suite
suite.save("validation_suite.yaml")

# Use for validation
import truthound as th
report = th.check("new_data.csv", schema="validation_suite.yaml")

Schema Learning

Auto-Learn Schema with Constraints

import truthound as th

# Learn schema with constraint inference
schema = th.learn(
    "baseline.csv",
    infer_constraints=True,
    categorical_threshold=20  # Max unique values for categorical
)

# View inferred constraints
for col in schema.columns.values():
    print(f"{col.name}:")
    print(f"  Type: {col.dtype}")
    print(f"  Nullable: {col.nullable}")
    if col.min_value is not None:
        print(f"  Range: [{col.min_value}, {col.max_value}]")
    if col.allowed_values:
        print(f"  Allowed: {col.allowed_values}")

# Save schema
schema.save("schema.yaml")

# Validate new data against schema
report = th.check("new_data.csv", schema=schema)
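
To illustrate what constraint inference involves, the sketch below derives nullability, a numeric range, and a set of allowed values from one column of sample data. It is a simplified stand-in, not Truthound's algorithm: infer_column_constraints is a hypothetical function, and only the categorical_threshold parameter name is taken from the example above.

```python
def infer_column_constraints(values, categorical_threshold=20):
    """Infer simple constraints from a list of values (None means null)."""
    non_null = [v for v in values if v is not None]
    constraints = {
        "nullable": len(non_null) < len(values),
        "min_value": None,
        "max_value": None,
        "allowed_values": None,
    }
    is_numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    is_text = bool(non_null) and all(isinstance(v, str) for v in non_null)
    if is_numeric:
        # Numeric columns get a range constraint.
        constraints["min_value"] = min(non_null)
        constraints["max_value"] = max(non_null)
    elif is_text:
        # Text columns with few distinct values are treated as categorical.
        distinct = sorted(set(non_null))
        if len(distinct) <= categorical_threshold:
            constraints["allowed_values"] = distinct
    return constraints


age = infer_column_constraints([34, 27, None, 51])
status = infer_column_constraints(["active", "inactive", "active"])
print(age)
print(status)
```

Constraints learned this way only reflect the baseline sample, so a range or value set inferred from a small file may be too tight for production data.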

Zero-Configuration with Auto Caching

import truthound as th

# First run: learns and caches schema
report = th.check("data.csv", auto_schema=True)

# Subsequent runs: uses cached schema
report = th.check("data.csv", auto_schema=True)

# Cache is invalidated when file changes (based on fingerprint)

Data Drift Detection

Truthound provides multiple approaches for detecting data drift between datasets. The high-level th.compare() API offers simplified access, while the ProfileComparator class provides comprehensive control over the comparison process.

High-Level API: th.compare()

The th.compare() function provides a streamlined interface for drift detection:

import truthound as th

# Compare baseline and current data
drift = th.compare("train.csv", "production.csv")
print(drift)

if drift.has_drift:
    print("Data drift detected!")
    for col_drift in drift.columns:
        if col_drift.result.drifted:
            print(f"  - {col_drift.column}: {col_drift.result.method} = {col_drift.result.statistic:.4f}")

# Check for high drift
if drift.has_high_drift:
    print("WARNING: High drift detected!")

# Get list of drifted column names
drifted_cols = drift.get_drifted_columns()
print(f"Drifted columns: {drifted_cols}")

Advanced API: ProfileComparator

For fine-grained control over drift detection, the ProfileComparator class operates on pre-computed TableProfile objects:

from truthound.profiler import (
    DataProfiler,
    ProfileComparator,
    compare_profiles,
    DriftType,
    DriftSeverity,
    DriftThresholds,
)

# Generate profiles (baseline_df and current_df are Polars DataFrames loaded beforehand)
profiler = DataProfiler()
baseline_profile = profiler.profile(baseline_df.lazy(), name="baseline")
current_profile = profiler.profile(current_df.lazy(), name="current")

# Compare using ProfileComparator
comparator = ProfileComparator()
comparison = comparator.compare(baseline_profile, current_profile)

print(f"Has Drift: {comparison.has_drift}")
print(f"Total Drifts: {comparison.drift_count}")

# Alternatively, use the convenience function
comparison = compare_profiles(baseline_profile, current_profile)

Filtering by Drift Type and Severity

# Filter by drift type
completeness_drifts = comparison.get_by_type(DriftType.COMPLETENESS)
distribution_drifts = comparison.get_by_type(DriftType.DISTRIBUTION)
range_drifts = comparison.get_by_type(DriftType.RANGE)
cardinality_drifts = comparison.get_by_type(DriftType.CARDINALITY)

# Filter by severity
critical_drifts = comparison.get_by_severity(DriftSeverity.CRITICAL)
warning_drifts = comparison.get_by_severity(DriftSeverity.WARNING)
info_drifts = comparison.get_by_severity(DriftSeverity.INFO)

# Get specific column comparison
age_comparison = comparison.get_column("age")
if age_comparison and age_comparison.has_drift:
    for drift in age_comparison.drifts:
        print(f"  {drift.drift_type}: {drift.severity}")

Custom Drift Thresholds

# Configure sensitive thresholds
sensitive_thresholds = DriftThresholds(
    null_ratio_warning=0.01,   # 1% change triggers warning
    null_ratio_critical=0.05,
    mean_warning=0.05,
    mean_critical=0.1,
)

# Configure lenient thresholds
lenient_thresholds = DriftThresholds(
    null_ratio_warning=0.2,    # 20% change required for warning
    null_ratio_critical=0.5,
    mean_warning=0.3,
    mean_critical=0.5,
)

# Apply thresholds to comparator
comparator_sensitive = ProfileComparator(thresholds=sensitive_thresholds)
comparison = comparator_sensitive.compare(baseline_profile, current_profile)
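
The effect of warning and critical thresholds can be illustrated with a small classifier for null-ratio changes. The sketch assumes the thresholds apply to the absolute difference between the two ratios and that reaching a threshold is enough to trigger it; neither detail is confirmed by the source. Thresholds and classify_null_drift are hypothetical stand-ins that reuse the default values listed in the DriftThresholds reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical stand-in for DriftThresholds (null-ratio fields only)."""
    null_ratio_warning: float = 0.05
    null_ratio_critical: float = 0.1


def classify_null_drift(baseline_ratio, current_ratio, thresholds=Thresholds()):
    """Classify a null-ratio change as 'critical', 'warning', or 'none'."""
    # Assumption: thresholds compare against the absolute change in ratio.
    change = abs(current_ratio - baseline_ratio)
    if change >= thresholds.null_ratio_critical:
        return "critical"
    if change >= thresholds.null_ratio_warning:
        return "warning"
    return "none"


sensitive = Thresholds(null_ratio_warning=0.01, null_ratio_critical=0.05)

# The same 2-point increase in nulls is ignored by default but flagged when sensitive.
default_result = classify_null_drift(0.01, 0.03)
sensitive_result = classify_null_drift(0.01, 0.03, sensitive)
print(default_result, sensitive_result)
```

Tighter thresholds suit columns where even small changes matter, such as keys or required fields, while lenient ones reduce noise on volatile columns.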

Generating Drift Reports

# Generate text report
report = comparison.to_report()
print(report)

# Output includes:
# - Summary with drift counts by severity
# - Detailed breakdown of critical and warning drifts
# - Per-column change descriptions

Specifying Detection Method

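import truthound as th

# baseline and current are the two data sources to compare,
# e.g. "train.csv" and "production.csv" as in the earlier example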

# Auto-select based on data type (default, recommended)
drift = th.compare(baseline, current, method="auto")

# Kolmogorov-Smirnov test (numeric columns only)
drift = th.compare(baseline, current, method="ks")

# Population Stability Index (numeric columns only)
drift = th.compare(baseline, current, method="psi")

# Chi-square test (categorical columns)
drift = th.compare(baseline, current, method="chi2")

# Jensen-Shannon divergence (works with any column type)
drift = th.compare(baseline, current, method="js")

# Custom threshold
drift = th.compare(baseline, current, threshold=0.2)

# With sampling for large datasets
drift = th.compare(baseline, current, sample_size=10000)

Note: The ks and psi methods work only with numeric columns. If your data contains non-numeric columns (e.g., strings), pass the columns parameter (--columns on the CLI) to restrict the comparison to numeric columns, or use method="auto", which selects an appropriate method for each column.

# Compare only numeric columns with PSI
drift = th.compare(baseline, current, method="psi", columns=["age", "salary", "score"])
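
For readers unfamiliar with the Population Stability Index, the sketch below computes it from scratch: both samples are binned on the baseline's range, and PSI is the sum over bins of (q - p) * ln(q / p), where p and q are the baseline and current bin proportions. The equal-width binning, the bin count, and the small eps used to avoid division by zero are assumptions of this sketch; Truthound's implementation may bin differently.

```python
import math


def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index of `current` relative to `baseline`."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        # Replace empty bins with eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


baseline_ages = list(range(100))
shifted_ages = [v + 30 for v in baseline_ages]

same_score = psi(baseline_ages, baseline_ages)
shifted_score = psi(baseline_ages, shifted_ages)
print(same_score, shifted_score)
```

Because the calculation needs an ordered numeric range to bin on, it has no direct meaning for string columns, which is why the method is limited to numeric data.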

Best Practices

1. Profile Before Validation

Always profile new data sources before setting up validation:

import truthound as th
from truthound.profiler.table_profiler import profile_file

# Profile first to understand the data
profile = profile_file("new_dataset.csv")
print(f"Rows: {profile.row_count}, Columns: {profile.column_count}")

# Then set up appropriate validation
schema = th.learn("new_dataset.csv", infer_constraints=True)
schema.save("new_dataset_schema.yaml")

2. Use Sampling for Large Datasets

from truthound.profiler.table_profiler import DataProfiler
from truthound.profiler.base import ProfilerConfig

config = ProfilerConfig(
    sample_size=50_000,  # Profile 50K rows
    random_seed=42,       # Reproducible sampling
)
profiler = DataProfiler(config=config)
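
The role of the seed can be shown with the standard library alone: a seeded random generator draws the same rows on every run, so repeated profiles of an unchanged dataset stay comparable. sample_rows is a hypothetical helper for illustration; it does not reflect how Truthound or Polars sample internally.

```python
import random


def sample_rows(rows, sample_size, seed=42):
    """Return a reproducible random sample, or all rows if the sample would cover them."""
    if sample_size is None or sample_size >= len(rows):
        return list(rows)
    # A dedicated Random instance keeps the global random state untouched.
    return random.Random(seed).sample(rows, sample_size)


rows = list(range(1_000))
first = sample_rows(rows, 50)
second = sample_rows(rows, 50)
different_seed = sample_rows(rows, 50, seed=7)
print(first == second)
```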

3. Store Profiles for Historical Analysis

import json
from datetime import datetime
from pathlib import Path
from truthound.profiler.table_profiler import profile_file

# Profile and save with timestamp
profile = profile_file("data.csv")
profile_dict = profile.to_dict()

# Make sure the output directory exists before writing
Path("profiles").mkdir(parents=True, exist_ok=True)

filename = f"profiles/data_{datetime.now():%Y%m%d_%H%M%S}.json"
with open(filename, "w") as f:
    json.dump(profile_dict, f, indent=2, default=str)
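
Once several timestamped profiles exist, they can be read back in chronological order to track a metric over time. The sketch below does this for row counts. It assumes the saved dictionary has a top-level "row_count" key, which the source does not confirm for to_dict() output, and it uses small stand-in files rather than real profiles. row_count_history is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def row_count_history(profile_dir):
    """Return (filename, row_count) pairs, oldest first."""
    history = []
    # Timestamps in the form YYYYMMDD_HHMMSS sort chronologically as plain strings.
    for path in sorted(Path(profile_dir).glob("data_*.json")):
        with open(path) as f:
            profile = json.load(f)
        history.append((path.name, profile["row_count"]))  # assumed key
    return history


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in profile files for demonstration only.
    for stamp, rows in [("20240102_000000", 140), ("20240101_000000", 100)]:
        (Path(tmp) / f"data_{stamp}.json").write_text(json.dumps({"row_count": rows}))
    history = row_count_history(tmp)

print(history)
```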

4. Parallel Processing

Enable parallel column profiling for faster results:

from truthound.profiler.table_profiler import DataProfiler
from truthound.profiler.base import ProfilerConfig

config = ProfilerConfig(
    n_jobs=4,  # Use 4 threads for parallel column profiling
)
profiler = DataProfiler(config=config)
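
The idea behind n_jobs is that columns can be profiled independently, so the work can be spread across a pool of workers. The sketch below shows that structure with a thread pool from the standard library. The helpers are hypothetical and this is not Truthound's implementation; for pure-Python work like this, the global interpreter lock limits the actual speedup, so the example demonstrates the pattern rather than the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def profile_column(item):
    """Compute a tiny profile for one (name, values) pair."""
    name, values = item
    non_null = [v for v in values if v is not None]
    return name, {
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
    }


def profile_columns(columns, n_jobs=4):
    """Profile every column using up to n_jobs worker threads."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map() yields results in input order, so column order is preserved.
        return dict(pool.map(profile_column, columns.items()))


table = {
    "id": [1, 2, 3, 4],
    "city": ["Oslo", None, "Oslo", "Bergen"],
}
result = profile_columns(table, n_jobs=2)
print(result)
```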

Data Structures Reference

ProfileReport (from th.profile)

The simple profile report returned by th.profile():

Attribute     Type        Description
source        str         Source file or data name
row_count     int         Number of rows
column_count  int         Number of columns
size_bytes    int         Estimated size in bytes
columns       list[dict]  Column summary dicts with name, dtype, null_pct, unique_pct, min, max

Methods:

  • print() - Print formatted report to console
  • to_dict() - Convert to dictionary
  • to_json() - Convert to JSON string

TableProfile (from DataProfiler)

The detailed profile returned by DataProfiler.profile():

Attribute               Type                  Description
name                    str                   Table/dataset name
row_count               int                   Number of rows
column_count            int                   Number of columns
estimated_memory_bytes  int                   Memory estimate
columns                 tuple[ColumnProfile]  Detailed column profiles
duplicate_row_count     int                   Number of duplicate rows
duplicate_row_ratio     float                 Duplicate row ratio
correlations            tuple                 Column correlation pairs
profile_duration_ms     float                 Profiling time

ColumnProfile

Detailed profile for a single column:

Attribute                Type                   Description
name                     str                    Column name
physical_type            str                    Polars dtype as string
inferred_type            DataType               Semantic type (email, url, phone, etc.)
null_count               int                    Null value count
null_ratio               float                  Null ratio (0.0-1.0)
distinct_count           int                    Unique value count
unique_ratio             float                  Uniqueness ratio
is_unique                bool                   True if all values unique
is_constant              bool                   True if all values same
distribution             DistributionStats      Numeric statistics (mean, std, min, max, etc.)
top_values               tuple[ValueFrequency]  Most frequent values
min_length / max_length  int                    String length bounds
detected_patterns        tuple[PatternMatch]    Detected data patterns
suggested_validators     tuple[str]             Recommended validators

ProfilerConfig

Configuration options for profiling:

Attribute              Type        Default  Description
sample_size            int | None  None     Rows to sample
random_seed            int         42       Sampling seed
include_patterns       bool        True     Detect patterns
include_correlations   bool        False    Calculate correlations
include_distributions  bool        True     Calculate distribution stats
top_n_values           int         10       Top N frequent values
n_jobs                 int         1        Parallel threads
correlation_threshold  float       0.7      Minimum correlation to report

TableAnalyzer Protocol

Base protocol for implementing custom table-level analyzers:

Method   Parameters                              Return  Description
analyze  lf: LazyFrame, config: ProfilerConfig   dict    Execute analysis and return results

Attribute  Type  Description
name       str   Unique identifier for the analyzer

DriftReport (from th.compare)

The drift report returned by th.compare():

Attribute        Type               Description
baseline_source  str                Baseline data source name
current_source   str                Current data source name
baseline_rows    int                Number of rows in baseline
current_rows     int                Number of rows in current
columns          list[ColumnDrift]  Per-column drift results

Properties:

  • has_drift - True if any column has drift
  • has_high_drift - True if any column has high drift

Methods:

  • print() - Print formatted report to console
  • to_dict() - Convert to dictionary
  • to_json() - Convert to JSON string
  • get_drifted_columns() - Get list of drifted column names

ProfileComparison (from ProfileComparator)

The detailed comparison result returned by ProfileComparator.compare():

Attribute    Type                    Description
has_drift    bool                    Whether any drift was detected
drift_count  int                     Total number of detected drifts
columns      list[ColumnComparison]  Per-column comparison results
all_drifts   list[Drift]             Flattened list of all drifts

Methods:

  • get_by_type(drift_type) - Filter drifts by DriftType
  • get_by_severity(severity) - Filter drifts by DriftSeverity
  • get_column(name) - Get comparison for specific column
  • to_report() - Generate formatted text report

DriftType Enumeration

Value         Description
COMPLETENESS  Changes in null ratio
DISTRIBUTION  Changes in statistical distribution
RANGE         Changes in min/max values
CARDINALITY   Changes in distinct value count
UNIQUENESS    Changes in uniqueness ratio

DriftSeverity Enumeration

Value     Description
INFO      Minor change, informational only
WARNING   Moderate change, may require attention
CRITICAL  Significant change, likely requires action

DriftThresholds

Configuration for drift detection sensitivity:

Attribute            Type   Default  Description
null_ratio_warning   float  0.05     Null ratio change for warning
null_ratio_critical  float  0.1      Null ratio change for critical
mean_warning         float  0.1      Mean change ratio for warning
mean_critical        float  0.2      Mean change ratio for critical

CLI Commands Reference

Command                                 Description
truthound profile <file>                Basic profile
truthound auto-profile <file>           Profile with pattern detection
truthound generate-suite <profile>      Generate rules from profile
truthound quick-suite <file>            One-step profile + rules
truthound compare <baseline> <current>  Compare datasets for drift

API Summary

The following table summarizes the two complementary approaches for data profiling and comparison:

Use Case              High-Level API                Advanced API
Basic profiling       th.profile() → ProfileReport  DataProfiler.profile() → TableProfile
Drift detection       th.compare() → DriftReport    ProfileComparator.compare() → ProfileComparison
Convenience function  -                             compare_profiles()

The high-level API is recommended for standard use cases, while the advanced API provides additional capabilities for custom analysis pipelines, threshold configuration, and integration with scheduling systems.

Next Steps