Quick Start¶

Get started with Truthound in 5 minutes!

Supported File Formats¶

Truthound CLI supports the following file formats:

Format	Extension	Description
CSV	`.csv`	Comma-separated values
JSON	`.json`	Parsed as newline-delimited JSON (via Polars `scan_ndjson`)
Parquet	`.parquet`	Columnar storage format
NDJSON	`.ndjson`	Newline-delimited JSON
JSONL	`.jsonl`	JSON Lines (same as NDJSON)
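
The table above amounts to a lookup from file extension to reader. As a minimal sketch, a script could pre-check paths before handing them to the CLI. The `detect_format` helper and its return labels are hypothetical and are not part of Truthound's API.

```python
from pathlib import Path

# Extensions taken from the table above. The labels on the right are
# illustrative only; they are not Truthound identifiers.
SUPPORTED_FORMATS = {
    ".csv": "csv",
    ".json": "ndjson",  # per the table, .json is read as newline-delimited JSON
    ".parquet": "parquet",
    ".ndjson": "ndjson",
    ".jsonl": "ndjson",
}


def detect_format(path: str) -> str:
    """Return a format label for a path, or raise ValueError if unsupported."""
    suffix = Path(path).suffix.lower()
    try:
        return SUPPORTED_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None


print(detect_format("sample_data.csv"))  # csv
```

Matching on the lowercased suffix is a design choice of this sketch; whether the CLI itself treats extensions case-insensitively is not stated here.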

Database and Cloud Data Sources

For SQL databases, Spark, or Cloud Data Warehouses (BigQuery, Snowflake, Redshift, Databricks), use the Python API with the `source=` parameter. See Data Sources for details.

Create Sample Data¶

First, let's create a sample CSV file:

import polars as pl

df = pl.DataFrame({
    "id": range(1, 101),
    "name": ["Alice", "Bob", None, "Charlie"] * 25,
    "email": ["alice@example.com", "bob@test.com", "invalid-email", "charlie@example.org"] * 25,
    "age": [25, 30, -5, 150] * 25,  # Contains invalid values
    "created_at": ["2024-01-01", "2024-13-45", "2024-02-28", "not-a-date"] * 25,
})

df.write_csv("sample_data.csv")
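
The sample repeats the same four rows 25 times, so each deliberately bad value appears 25 times. The following sketch counts them using only the standard library, as a rough expectation to compare against later output. The email pattern is a simplification and the 0 to 120 age range is an assumption made for illustration; neither is taken from Truthound's validators.

```python
import re
from datetime import datetime

# The same four-row pattern used in the sample above, repeated 25 times.
names = ["Alice", "Bob", None, "Charlie"] * 25
emails = ["alice@example.com", "bob@test.com", "invalid-email", "charlie@example.org"] * 25
ages = [25, 30, -5, 150] * 25
dates = ["2024-01-01", "2024-13-45", "2024-02-28", "not-a-date"] * 25

# Simplified, illustrative email pattern (not Truthound's own).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_valid_date(text: str) -> bool:
    """Return True if text is a real calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


defects = {
    "null_names": sum(n is None for n in names),
    "bad_emails": sum(not EMAIL_RE.match(e) for e in emails),
    "ages_out_of_range": sum(not 0 <= a <= 120 for a in ages),  # assumed range
    "bad_dates": sum(not is_valid_date(d) for d in dates),
}
print(defects)
# {'null_names': 25, 'bad_emails': 25, 'ages_out_of_range': 50, 'bad_dates': 50}
```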

CLI Quick Start¶

1. Learn Schema¶

truthound learn sample_data.csv

This command analyzes the data and saves a schema file:

Schema saved to schema.yaml
  Columns: 5
  Rows: 100

Options:

Option	Description
`-o`, `--output`	Output schema file path (default: `schema.yaml`)
`--no-constraints`	Don't infer constraints from data

2. Check Data Quality¶

truthound check sample_data.csv

Validates data and displays issues found:

Truthound Report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┓
┃ Column     ┃ Issue              ┃ Count ┃ Severity ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━┩
│ name       │ null               │    25 │   high   │
│ email      │ invalid_format     │    25 │   high   │
│ age        │ out_of_range       │    50 │  medium  │
└────────────┴────────────────────┴───────┴──────────┘

Summary: 3 issues found

Options:

Option	Description
`-v`, `--validators`	Comma-separated list of validators to run
`-s`, `--min-severity`	Minimum severity level (`low`, `medium`, `high`, `critical`)
`--schema`	Schema file for validation
`--auto-schema`	Auto-learn and cache schema (zero-config mode)
`-f`, `--format`	Output format (`console`, `json`, `html`)
`-o`, `--output`	Output file path
`--strict`	Exit with code 1 if issues are found

HTML format requires jinja2

Install with: `pip install "truthound[reports]"` or `pip install jinja2` (the quotes keep shells such as zsh from expanding the brackets)
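
Because `--strict` makes the command exit with code 1 when issues are found, a calling script can gate on the exit status. The sketch below shows one way to do that with `subprocess`; the `passes_quality_gate` helper is hypothetical, and the demonstration uses stand-in commands so it runs without Truthound installed.

```python
import subprocess
import sys
from typing import Sequence


def passes_quality_gate(command: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0


# Stand-in commands that simply exit with a chosen status code.
clean = [sys.executable, "-c", "raise SystemExit(0)"]
dirty = [sys.executable, "-c", "raise SystemExit(1)"]
print(passes_quality_gate(clean), passes_quality_gate(dirty))  # True False
```

With the CLI on `PATH`, the same helper could be called as `passes_quality_gate(["truthound", "check", "sample_data.csv", "--strict"])`. Note that this sketch treats any non-zero exit status as a failure, so it does not distinguish data issues from other errors.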

3. Scan for PII¶

truthound scan sample_data.csv

Detects personally identifiable information:

Truthound PII Scan
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┓
┃ Column     ┃ PII Type      ┃ Count ┃ Confidence ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━┩
│ email      │ Email Address │    75 │    99%     │
│ name       │ Person Name   │    75 │    85%     │
└────────────┴───────────────┴───────┴────────────┘

Warning: Found 2 columns with potential PII

Options:

Option	Description
`-f`, `--format`	Output format (`console`, `json`, `html`)
`-o`, `--output`	Output file path

HTML format requires jinja2

Install with: `pip install "truthound[reports]"` or `pip install jinja2` (the quotes keep shells such as zsh from expanding the brackets)

Console output is the default

If no format is specified, console format is used with Rich formatting for better readability.

4. Profile Data¶

truthound profile sample_data.csv

Generates a statistical profile:

Truthound Profile
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Dataset: sample_data.csv
Rows: 100 | Columns: 5 | Size: 4.4 KB

┏━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━┳━━━━━┓
┃ Column     ┃ Type   ┃ Nulls ┃ Unique ┃ Min ┃ Max ┃
┡━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━╇━━━━━┩
│ id         │ Int64  │  0.0% │   100% │   1 │ 100 │
│ name       │ String │ 25.0% │     4% │   - │   - │
│ email      │ String │  0.0% │     4% │   - │   - │
│ age        │ Int64  │  0.0% │     4% │  -5 │ 150 │
│ created_at │ String │  0.0% │     4% │   - │   - │
└────────────┴────────┴───────┴────────┴─────┴─────┘

Options:

Option	Description
`-f`, `--format`	Output format (`console`, `json`)
`-o`, `--output`	Output file path

Basic vs Advanced Profiling

The `profile` command provides basic statistics. For advanced profiling with pattern detection and type inference, use `auto-profile` instead.

5. Advanced Profiling (auto-profile)¶

For comprehensive profiling with pattern detection and type inference:

truthound auto-profile sample_data.csv -f json -o profile.json

Options:

Option	Description
`-f`, `--format`	Output format (`console`, `json`, `yaml`)
`-o`, `--output`	Output file path
`--patterns/--no-patterns`	Include pattern detection (default: enabled)
`--correlations/--no-correlations`	Include correlation analysis (default: disabled)
`-s`, `--sample`	Sample size for profiling (default: all rows)
`--top-n`	Number of top/bottom values to include (default: 10)

Python API Quick Start¶

Basic Usage¶

import truthound as th

# Check data quality
report = th.check("sample_data.csv")

if report.has_issues:
    print(f"Found {len(report.issues)} issues")
    for issue in report.issues:
        print(f"  [{issue.severity.value}] {issue.column}: {issue.issue_type}")
else:
    print("No issues found!")

# Print formatted report (uses Rich for pretty output)
report.print()

# Or get string representation
print(report)
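
The loop above reads `severity.value`, `column`, and `issue_type` from each issue. As a sketch of what can be built on those attributes, the helpers below count issues per severity and filter by a minimum level, using the order listed for `--min-severity`. The `Severity` and `Issue` classes are hypothetical stand-ins so the example runs without Truthound; they are not the library's own types.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

# Severity order as listed for --min-severity: low < medium < high < critical.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


class Severity(Enum):  # hypothetical stand-in for the library's severity enum
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class Issue:  # hypothetical stand-in exposing the attributes used above
    column: str
    issue_type: str
    severity: Severity


def at_least(issues, min_severity: str):
    """Keep issues whose severity is at or above min_severity."""
    threshold = SEVERITY_ORDER.index(min_severity)
    return [i for i in issues if SEVERITY_ORDER.index(i.severity.value) >= threshold]


def count_by_severity(issues) -> Counter:
    """Count issues per severity label."""
    return Counter(i.severity.value for i in issues)


# Stand-in data mirroring the CLI report shown earlier.
issues = [
    Issue("name", "null", Severity.HIGH),
    Issue("email", "invalid_format", Severity.HIGH),
    Issue("age", "out_of_range", Severity.MEDIUM),
]
print(count_by_severity(issues))                     # Counter({'high': 2, 'medium': 1})
print([i.column for i in at_least(issues, "high")])  # ['name', 'email']
```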

With Schema Validation¶

import truthound as th
from truthound.schema import learn

# Learn schema from good data
schema = learn("reference_data.csv")
schema.save("schema.yaml")

# Validate new data against schema
report = th.check(
    "new_data.csv",
    schema="schema.yaml"
)

Profiling¶

import truthound as th

# Basic profiling - returns ProfileReport
profile = th.profile("sample_data.csv")

print(f"Source: {profile.source}")
print(f"Rows: {profile.row_count:,}")
print(f"Columns: {profile.column_count}")
print(f"Size: {profile.size_bytes:,} bytes")

# Column details
for col in profile.columns:
    print(f"\n{col['name']} ({col['dtype']}):")
    print(f"  Nulls: {col['null_pct']}")
    print(f"  Unique: {col['unique_pct']}")
    # Compare against None explicitly so a legitimate minimum of 0 is still printed
    if col.get('min') is not None and col['min'] != "-":
        print(f"  Range: [{col['min']}, {col['max']}]")

# Print formatted report (uses Rich for pretty output)
profile.print()
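
Since each entry in `profile.columns` is accessed like a dictionary, simple checks can be layered on top. The sketch below lists columns whose null percentage exceeds a threshold. It assumes `null_pct` may arrive either as a number or as a string such as `"25.0%"` (the exact type is not confirmed here), and it uses stand-in dictionaries that mirror the CLI profile table rather than a real `ProfileReport`.

```python
def parse_pct(value) -> float:
    """Accept 25.0, "25.0", or "25.0%" and return 25.0."""
    if isinstance(value, str):
        return float(value.strip().rstrip("%"))
    return float(value)


def columns_with_nulls(columns, threshold_pct: float = 0.0):
    """Return names of columns whose null percentage exceeds threshold_pct."""
    return [c["name"] for c in columns if parse_pct(c["null_pct"]) > threshold_pct]


# Stand-in rows mirroring the CLI profile table (not real library output).
columns = [
    {"name": "id", "dtype": "Int64", "null_pct": "0.0%", "unique_pct": "100%"},
    {"name": "name", "dtype": "String", "null_pct": "25.0%", "unique_pct": "4%"},
    {"name": "age", "dtype": "Int64", "null_pct": 0.0, "unique_pct": "4%"},
]
print(columns_with_nulls(columns))  # ['name']
```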

Advanced Profiling with Pattern Detection¶

For comprehensive profiling including pattern detection and type inference:

import polars as pl

from truthound.profiler import DataProfiler, ProfilerConfig

# Configure profiler
config = ProfilerConfig(
    include_patterns=True,
    include_correlations=False,
    sample_size=None,  # Use all rows
    top_n_values=10,
)

profiler = DataProfiler(config=config)

# Profile data from a lazily scanned CSV
lf = pl.scan_csv("sample_data.csv")
profile_result = profiler.profile(lf, name="sample", source="sample_data.csv")

# Access detailed profile
print(f"Columns: {profile_result.column_count}")
for col in profile_result.columns:
    print(f"{col.name}: {col.inferred_type.value}")
    if col.detected_patterns:
        print(f"  Patterns: {[p.pattern for p in col.detected_patterns[:3]]}")

Or use the CLI for advanced profiling:

truthound auto-profile sample_data.csv -f json -o profile.json

CI/CD Integration¶

GitHub Actions¶

# .github/workflows/data-quality.yml
name: Data Quality Check

on: [push, pull_request]

jobs:
  quality-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install Truthound
        run: pip install truthound

      - name: Check Data Quality
        run: truthound check data/*.csv --strict

Checkpoint Configuration¶

Create a checkpoint configuration file to define validation pipelines:

# truthound.yaml
checkpoints:
- name: daily_data_validation
  data_source: data/production.csv
  validators:
  - 'null'
  - duplicate
  - range
  - regex
  validator_config:
    regex:
      patterns:
        # Single quotes keep backslashes and brackets literal for the YAML parser
        email: '^[\w.+-]+@[\w-]+\.[\w.-]+$'
        product_code: '^[A-Z]{2,4}[-_][0-9]{3,6}$'
        phone: '^(\+\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$'
    range:
      columns:
        age:
          min_value: 0
          max_value: 150
        price:
          min_value: 0
  min_severity: medium
  auto_schema: true
  tags:
    environment: production
    team: data-platform
  actions:
  - type: store_result
    store_path: ./truthound_results
    partition_by: date
  - type: update_docs
    site_path: ./truthound_docs
    include_history: true
  - type: slack
    webhook_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
    notify_on: failure
    channel: '#data-quality'
  triggers:
  - type: schedule
    interval_hours: 24
    run_on_weekdays: [0, 1, 2, 3, 4]

Run with:

truthound checkpoint run daily_data_validation --config truthound.yaml

Or run ad-hoc without a config file:

truthound checkpoint run quick_check \
    --data data/production.csv \
    --validators null,range \
    --strict \
    --slack https://hooks.slack.com/services/...
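
Before relying on the `regex` patterns from the checkpoint configuration, it can help to try them against a few known-good and known-bad values. The sketch below does this with Python's `re` module. The sample values and the `check_patterns` helper are made up for illustration, and Truthound may evaluate patterns with a different regex engine, so treat the result as a sanity check rather than a guarantee.

```python
import re

# Patterns copied from the checkpoint configuration.
PATTERNS = {
    "email": r"^[\w.+-]+@[\w-]+\.[\w.-]+$",
    "product_code": r"^[A-Z]{2,4}[-_][0-9]{3,6}$",
    "phone": r"^(\+\d{1,3}[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$",
}

# Illustrative sample values mapped to whether they are expected to match.
SAMPLES = {
    "email": {"alice@example.com": True, "invalid-email": False},
    "product_code": {"AB-123": True, "abcde-1": False},
    "phone": {"+1 555-123-4567": True, "(555) 123-4567": True, "12345": False},
}


def check_patterns(patterns, samples):
    """Return (name, value) pairs where the pattern disagrees with the expectation."""
    mismatches = []
    for name, cases in samples.items():
        regex = re.compile(patterns[name])
        for value, expected in cases.items():
            if bool(regex.search(value)) != expected:
                mismatches.append((name, value))
    return mismatches


print(check_patterns(PATTERNS, SAMPLES))  # []
```

An empty list means every sample behaved as expected; any tuple in the result points at a pattern worth revisiting before the checkpoint runs.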

Checkpoint CLI Options¶

Option	Description
`-c`, `--config`	Checkpoint configuration file (YAML/JSON)
`-d`, `--data`	Override data source path
`-v`, `--validators`	Override validators (comma-separated)
`-o`, `--output`	Output file for results (JSON)
`-f`, `--format`	Output format (`console`, `json`)
`--strict`	Exit with code 1 if issues are found
`--store`	Store results to directory
`--slack`	Slack webhook URL for notifications
`--webhook`	Webhook URL for notifications
`--github-summary`	Write GitHub Actions job summary

Next Steps¶

First Validation Tutorial - Detailed walkthrough
Validators Guide - All 289 built-in validators
CI/CD Integration - Advanced pipeline setup