truthound auto-profile¶

Generate an advanced statistical profile of data with pattern detection and correlation analysis.

Synopsis¶

truthound auto-profile <file> [OPTIONS]

Arguments¶

Argument	Required	Description
`file`	Yes	Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL)

Options¶

Option	Short	Default	Description
`--output`	`-o`	None	Output file path
`--format`	`-f`	`console`	Output format (console, json, yaml)
`--patterns/--no-patterns`		`true`	Enable/disable pattern detection
`--correlations/--no-correlations`		`false`	Enable/disable correlation analysis
`--sample`	`-s`	None	Sample size for large datasets
`--top-n`		`10`	Number of top items to display

Description¶

The auto-profile command provides advanced data profiling beyond basic statistics:

Statistical Analysis: Mean, std, percentiles, skewness, kurtosis
Pattern Detection: Email, phone, URL, date formats, custom patterns
Correlation Analysis: Numeric column correlations
Distribution Analysis: Value frequency, entropy, cardinality
Anomaly Detection: Outlier identification

Examples¶

Basic Profiling¶

truthound auto-profile data.csv

Output:

Advanced Data Profile
=====================
File: data.csv
Rows: 10,000
Columns: 8

Column Analysis
───────────────────────────────────────────────────────────────────
Column        Type      Nulls    Unique    Patterns    Stats
───────────────────────────────────────────────────────────────────
id            Int64     0.0%     100.0%    -           μ=5000.5
email         String    1.0%     99.8%     email       -
phone         String    2.5%     95.0%     phone       -
age           Int64     5.0%     0.7%      -           μ=35.2 σ=12.8
salary        Float64   3.0%     45.0%     -           μ=65432 σ=15234
category      String    0.0%     0.05%     -           5 values
───────────────────────────────────────────────────────────────────

Top Patterns Detected:
  email: email (confidence: 98%)
  phone: phone_us (confidence: 95%)

With Correlation Analysis¶

truthound auto-profile data.csv --correlations

Additional output:

Correlation Matrix (numeric columns)
────────────────────────────────────────
           age      salary    tenure
age        1.00     0.45      0.72
salary     0.45     1.00      0.38
tenure     0.72     0.38      1.00
────────────────────────────────────────

Disable Pattern Detection¶

For faster profiling without pattern analysis:

truthound auto-profile data.csv --no-patterns

Sample Large Datasets¶

Profile a sample of large datasets:

truthound auto-profile large_data.parquet --sample 100000

JSON Output¶

truthound auto-profile data.csv --format json -o profile.json

Output file:

href="#__codelineno-8-1">{ "file": "data.csv", "row_count": 10000, "column_count": 8, "columns": [ { "name": "email", "dtype": "String", "null_ratio": 0.01, "unique_ratio": 0.998, "patterns": [ { "type": "email", "confidence": 0.98, "match_ratio": 0.97 } ], "top_values": [ {"value": "john@example.com", "count": 5}, {"value": "jane@test.org", "count": 3} ] }, { "name": "age", "dtype": "Int64", "null_ratio": 0.05, "statistics": { "mean": 35.2, "std": 12.8, "min": 18, "max": 85, "q25": 26, "median": 34, "q75": 44, "skewness": 0.45, "kurtosis": -0.32 } } ], "correlations": { "age_salary": 0.45, "age_tenure": 0.72 } }

YAML Output¶

truthound auto-profile data.csv --format yaml -o profile.yaml

Custom Top-N¶

Show top 20 values per column:

truthound auto-profile data.csv --top-n 20

Profile Contents¶

Basic Metrics¶

Metric	Description
`row_count`	Total number of rows
`column_count`	Total number of columns
`memory_usage`	Estimated memory usage

Column Metrics¶

Metric	Description	Types
`dtype`	Data type	All
`null_count`	Number of null values	All
`null_ratio`	Proportion of nulls	All
`unique_count`	Number of unique values	All
`unique_ratio`	Proportion of unique values	All
`top_values`	Most frequent values	All

Numeric Statistics¶

Metric	Description
`mean`	Arithmetic mean
`std`	Standard deviation
`min` / `max`	Range
`q25` / `median` / `q75`	Quartiles
`skewness`	Distribution asymmetry
`kurtosis`	Distribution tailedness
`entropy`	Information entropy

Pattern Detection¶

Automatically detected patterns:

Pattern	Description	Example
`email`	Email addresses	`john@example.com`
`phone`	Phone numbers	`+1-555-123-4567`
`phone_us`	US phone format	`(555) 123-4567`
`url`	URLs	`https://example.com`
`ip_address`	IP addresses	`192.168.1.1`
`date_iso`	ISO date format	`2024-01-15`
`uuid`	UUID format	`550e8400-e29b-41d4-a716-446655440000`
`credit_card`	Credit card numbers	`4111-1111-1111-1111`
`ssn`	US SSN	`123-45-6789`

Use Cases¶

1. Data Discovery¶

Understand unknown datasets:

truthound auto-profile new_dataset.csv --patterns --correlations --format json -o discovery.json

2. Pre-Processing Analysis¶

Identify data quality issues before processing:

truthound auto-profile raw_data.csv
# Check for high null ratios, unexpected patterns, etc.

3. Feature Engineering¶

Identify correlated features:

truthound auto-profile features.csv --correlations

4. Profile for Rule Generation¶

Generate profile for validation suite creation:

# Step 1: Profile
truthound auto-profile data.csv --format json -o profile.json

# Step 2: Generate rules
truthound generate-suite profile.json -o rules.yaml

Comparison with `profile`¶

Feature	`profile`	`auto-profile`
Basic statistics	Yes	Yes
Pattern detection	Basic	Advanced
Correlation analysis	No	Yes
Sampling	No	Yes
Top-N values	No	Yes
Output formats	console, json	console, json, yaml
Performance	Faster	More detailed

profile - Basic data profiling
generate-suite - Generate rules from profile
quick-suite - Profile + generate in one step