truthound auto-profile
Generate an advanced statistical profile of data with pattern detection and correlation analysis.
Synopsis
`truthound auto-profile <file> [OPTIONS]`
Arguments
| Argument | Required | Description |
|---|---|---|
| `file` | Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) |
Options
| Option | Short | Default | Description |
|---|---|---|---|
| `--output` | `-o` | None | Output file path |
| `--format` | `-f` | `console` | Output format (console, json, yaml) |
| `--patterns/--no-patterns` | | `true` | Enable/disable pattern detection |
| `--correlations/--no-correlations` | | `false` | Enable/disable correlation analysis |
| `--sample` | `-s` | None | Sample size for large datasets |
| `--top-n` | | `10` | Number of top items to display |
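When the command is driven from a script, the options above map naturally onto an argument list. The sketch below is a hypothetical helper (`build_auto_profile_cmd` is not part of truthound); it only assembles the flags listed in the table and assumes `truthound` is on the `PATH` if the result is actually executed.

```python
import shlex
from typing import List, Optional


def build_auto_profile_cmd(
    file: str,
    output: Optional[str] = None,
    fmt: str = "console",
    patterns: bool = True,
    correlations: bool = False,
    sample: Optional[int] = None,
    top_n: int = 10,
) -> List[str]:
    """Assemble an argv list for `truthound auto-profile` (hypothetical helper)."""
    if fmt not in ("console", "json", "yaml"):
        raise ValueError(f"unsupported format: {fmt}")
    cmd = ["truthound", "auto-profile", file, "--format", fmt]
    if output is not None:
        cmd += ["--output", output]
    # Boolean options are on/off flag pairs in the options table.
    cmd.append("--patterns" if patterns else "--no-patterns")
    cmd.append("--correlations" if correlations else "--no-correlations")
    if sample is not None:
        cmd += ["--sample", str(sample)]
    cmd += ["--top-n", str(top_n)]
    return cmd


cmd = build_auto_profile_cmd(
    "data.csv", output="profile.json", fmt="json", correlations=True
)
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)`; keeping it as a list rather than a single string avoids shell quoting problems with unusual file names.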
Description
The `auto-profile` command provides advanced data profiling beyond basic statistics:
- Statistical Analysis: Mean, std, percentiles, skewness, kurtosis
- Pattern Detection: Email, phone, URL, date formats, custom patterns
- Correlation Analysis: Numeric column correlations
- Distribution Analysis: Value frequency, entropy, cardinality
- Anomaly Detection: Outlier identification
Examples
Basic Profiling
Run `truthound auto-profile data.csv` to profile a file with the default options. Output:
```
Advanced Data Profile
=====================
File: data.csv
Rows: 10,000
Columns: 8

Column Analysis
───────────────────────────────────────────────────────────────────
Column     Type      Nulls   Unique   Patterns   Stats
───────────────────────────────────────────────────────────────────
id         Int64     0.0%    100.0%   -          μ=5000.5
email      String    1.0%    99.8%    email      -
phone      String    2.5%    95.0%    phone      -
age        Int64     5.0%    0.7%     -          μ=35.2 σ=12.8
salary     Float64   3.0%    45.0%    -          μ=65432 σ=15234
category   String    0.0%    0.05%    -          5 values
───────────────────────────────────────────────────────────────────

Top Patterns Detected:
  email: email (confidence: 98%)
  phone: phone_us (confidence: 95%)
```
With Correlation Analysis
Pass `--correlations` to add a correlation matrix for the numeric columns. Additional output:
```
Correlation Matrix (numeric columns)
────────────────────────────────────────
          age     salary  tenure
age       1.00    0.45    0.72
salary    0.45    1.00    0.38
tenure    0.72    0.38    1.00
────────────────────────────────────────
```
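This page does not say which coefficient the matrix reports. As a rough illustration of how such a matrix can be produced, the sketch below assumes the Pearson coefficient and computes it for every pair of numeric columns. The function names and the toy values are made up for this example and are not truthound code or data.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_pairs(columns: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Correlation for every unordered pair of numeric columns."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


# Toy values for illustration only; not taken from any real dataset.
columns = {
    "age": [25, 32, 41, 50, 58],
    "salary": [40000, 52000, 61000, 58000, 75000],
    "tenure": [1, 4, 9, 15, 20],
}
for (a, b), r in correlation_pairs(columns).items():
    print(f"{a:>6} vs {b:<6} {r:+.2f}")
```

Only the upper triangle is computed because the matrix is symmetric and its diagonal is always 1.00.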
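Disable Pattern Detection
For faster profiling without pattern analysis, pass `--no-patterns`.
Sample Large Datasets
To profile a sample of a large dataset, pass a sample size with `--sample` (`-s`).
JSON Output
Pass `--format json` with `-o` to write the profile to a file, for example `truthound auto-profile data.csv --format json -o profile.json`. Output file: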
```json
{
  "file": "data.csv",
  "row_count": 10000,
  "column_count": 8,
  "columns": [
    {
      "name": "email",
      "dtype": "String",
      "null_ratio": 0.01,
      "unique_ratio": 0.998,
      "patterns": [
        {
          "type": "email",
          "confidence": 0.98,
          "match_ratio": 0.97
        }
      ],
      "top_values": [
        {"value": "john@example.com", "count": 5},
        {"value": "jane@test.org", "count": 3}
      ]
    },
    {
      "name": "age",
      "dtype": "Int64",
      "null_ratio": 0.05,
      "statistics": {
        "mean": 35.2,
        "std": 12.8,
        "min": 18,
        "max": 85,
        "q25": 26,
        "median": 34,
        "q75": 44,
        "skewness": 0.45,
        "kurtosis": -0.32
      }
    }
  ],
  "correlations": {
    "age_salary": 0.45,
    "age_tenure": 0.72
  }
}
```
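A JSON profile is straightforward to post-process. The sketch below assumes the field names shown in the example above (`columns`, `null_ratio`, `patterns`) and treats `patterns` and `statistics` as optional per column; the `summarize` helper and the 3% null threshold are illustrative choices, not part of truthound.

```python
import json

# Trimmed copy of the example profile; a real script would load the file
# written with -o instead of using an inline string.
profile_text = """
{
  "file": "data.csv",
  "row_count": 10000,
  "columns": [
    {"name": "email", "null_ratio": 0.01,
     "patterns": [{"type": "email", "confidence": 0.98, "match_ratio": 0.97}]},
    {"name": "age", "null_ratio": 0.05,
     "statistics": {"mean": 35.2, "std": 12.8}}
  ]
}
"""


def summarize(profile: dict, null_threshold: float = 0.03) -> dict:
    """Collect detected patterns and columns whose null ratio exceeds a threshold."""
    patterns = {}
    high_nulls = []
    for col in profile.get("columns", []):
        # "patterns" is only present on columns where something was detected.
        for p in col.get("patterns", []):
            patterns.setdefault(col["name"], []).append((p["type"], p["confidence"]))
        if col.get("null_ratio", 0.0) > null_threshold:
            high_nulls.append(col["name"])
    return {"patterns": patterns, "high_nulls": high_nulls}


summary = summarize(json.loads(profile_text))
print(summary)
```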
YAML Output
Pass `--format yaml` to write the profile as YAML.
Custom Top-N
Show the top 20 values per column with `--top-n 20`.
Profile Contents
Basic Metrics
| Metric | Description |
|---|---|
| `row_count` | Total number of rows |
| `column_count` | Total number of columns |
| `memory_usage` | Estimated memory usage |
Column Metrics
| Metric | Description | Types |
|---|---|---|
| `dtype` | Data type | All |
| `null_count` | Number of null values | All |
| `null_ratio` | Proportion of nulls | All |
| `unique_count` | Number of unique values | All |
| `unique_ratio` | Proportion of unique values | All |
| `top_values` | Most frequent values | All |
Numeric Statistics
| Metric | Description |
|---|---|
| `mean` | Arithmetic mean |
| `std` | Standard deviation |
| `min` / `max` | Range |
| `q25` / `median` / `q75` | Quartiles |
| `skewness` | Distribution asymmetry |
| `kurtosis` | Distribution tailedness |
| `entropy` | Information entropy |
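The table does not state which estimators truthound uses (sample or population moments, raw or excess kurtosis, quantile interpolation, logarithm base). To make the metrics concrete, the sketch below computes one common set of definitions with the standard library: population moments, excess kurtosis, inclusive quartiles, and Shannon entropy in bits. Treat these as assumptions, and expect the values to differ from truthound's if it uses other conventions.

```python
import math
import statistics
from collections import Counter
from typing import Dict, List


def numeric_profile(values: List[float]) -> Dict[str, float]:
    """Population-moment versions of the metrics in the table above."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 values")
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        raise ValueError("skewness and kurtosis are undefined for constant data")
    q25, median, q75 = statistics.quantiles(values, n=4, method="inclusive")
    # Third and fourth central moments, standardized below.
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    # Shannon entropy (bits) of the empirical value distribution.
    counts = Counter(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "mean": mean, "std": std, "min": min(values), "max": max(values),
        "q25": q25, "median": median, "q75": q75,
        "skewness": m3 / std ** 3,
        "kurtosis": m4 / std ** 4 - 3.0,  # excess kurtosis: 0 for a normal distribution
        "entropy": entropy,
    }


# Toy values for illustration only.
stats = numeric_profile([18, 22, 26, 30, 34, 38, 44, 52, 85])
for name, value in stats.items():
    print(f"{name:>8}: {value:.3f}")
```

A positive skewness indicates a longer right tail, as in the toy data where a single large value pulls the mean above the median.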
Pattern Detection
Automatically detected patterns:
| Pattern | Description | Example |
|---|---|---|
| `email` | Email addresses | john@example.com |
| `phone` | Phone numbers | +1-555-123-4567 |
| `phone_us` | US phone format | (555) 123-4567 |
| `url` | URLs | https://example.com |
| `ip_address` | IP addresses | 192.168.1.1 |
| `date_iso` | ISO date format | 2024-01-15 |
| `uuid` | UUID format | 550e8400-e29b-41d4-a716-446655440000 |
| `credit_card` | Credit card numbers | 4111-1111-1111-1111 |
| `ssn` | US SSN | 123-45-6789 |
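Truthound's detection rules are not shown on this page. One plausible way to arrive at a pattern label and a match ratio is to test every non-null value against a set of regular expressions and keep the best-scoring name. The sketch below does that for a few of the patterns; the expressions, the 0.9 threshold, and the helper names are simplified assumptions for illustration only.

```python
import re
from typing import Dict, Iterable, Optional

# Deliberately simple, illustrative expressions; real detectors are
# likely to be stricter than these.
PATTERNS: Dict[str, re.Pattern] = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "url": re.compile(r"https?://\S+"),
    "date_iso": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "uuid": re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def match_ratios(values: Iterable[Optional[str]]) -> Dict[str, float]:
    """Fraction of non-null values that fully match each pattern."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return {}
    return {
        name: sum(1 for v in non_null if rx.fullmatch(v)) / len(non_null)
        for name, rx in PATTERNS.items()
    }


def best_pattern(values: Iterable[Optional[str]], threshold: float = 0.9) -> Optional[str]:
    """Name of the highest-scoring pattern, or None if nothing reaches the threshold."""
    ratios = match_ratios(values)
    if not ratios:
        return None
    name = max(ratios, key=ratios.get)
    return name if ratios[name] >= threshold else None


emails = ["john@example.com", "jane@test.org", None, "not-an-email"]
print(match_ratios(emails)["email"])
print(best_pattern(["2024-01-15", "2024-02-01"]))
```

Using `fullmatch` rather than `search` avoids counting values that merely contain a matching substring, such as a sentence with an email address in it.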
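Use Cases
1. Data Discovery
Profile an unfamiliar dataset to learn its column types, null rates, and detected patterns.
2. Pre-Processing Analysis
Identify data quality issues, such as high null ratios or outliers, before processing.
3. Feature Engineering
Identify correlated features by enabling `--correlations`.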
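If the profile is written as JSON with correlation analysis enabled, the `correlations` object from the JSON Output example can be filtered for strongly related pairs. The `strong_pairs` helper and the 0.7 cutoff below are illustrative assumptions, not truthound defaults.

```python
from typing import Dict, List, Tuple

# The "correlations" object as it appears in the JSON Output example.
correlations: Dict[str, float] = {"age_salary": 0.45, "age_tenure": 0.72}


def strong_pairs(corr: Dict[str, float], threshold: float = 0.7) -> List[Tuple[str, float]]:
    """Pairs whose absolute correlation meets the threshold, strongest first."""
    hits = [(pair, r) for pair, r in corr.items() if abs(r) >= threshold]
    return sorted(hits, key=lambda item: abs(item[1]), reverse=True)


print(strong_pairs(correlations))
```

Comparing absolute values keeps strongly negative correlations in the result, which matter as much as positive ones when pruning redundant features.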
4. Profile for Rule Generation
Generate a profile to use as input for validation suite creation:
```bash
# Step 1: Profile
truthound auto-profile data.csv --format json -o profile.json
# Step 2: Generate rules
truthound generate-suite profile.json -o rules.yaml
```
Comparison with profile
| Feature | `profile` | `auto-profile` |
|---|---|---|
| Basic statistics | Yes | Yes |
| Pattern detection | Basic | Advanced |
| Correlation analysis | No | Yes |
| Sampling | No | Yes |
| Top-N values | No | Yes |
| Output formats | console, json | console, json, yaml |
| Performance | Faster | More detailed |
Related Commands
- `profile` - Basic data profiling
- `generate-suite` - Generate rules from profile
- `quick-suite` - Profile + generate in one step