Data Sources¶
The Data Sources module provides systematic management capabilities for connecting to, configuring, and validating data sources within Truthound Dashboard.
Overview¶
Data sources are the foundation for all validation, profiling, and quality monitoring operations. The system supports a diverse range of source types, including file-based sources (CSV, Parquet, JSON) and database connections (PostgreSQL, MySQL, Snowflake, BigQuery).
Source Listing Interface¶
Source Listing¶
The primary Sources page renders all registered data sources in a card-based layout. Each source card presents the following informational elements:
| Element | Description |
|---|---|
| Source Name | User-defined identifier for the data source |
| Type Badge | Visual indicator of the connection type |
| Description | Optional descriptive text clarifying the source's purpose |
| Last Validation | Timestamp of the most recent validation execution |
| Status Indicator | Color-coded badge reflecting validation status |
Available Actions¶
From the source listing, practitioners may execute the following operations:
- Add Source: Opens a dialog for registering a new data source
- Validate: Initiates validation using the default validator configuration
- Delete: Removes the source and all associated metadata (with confirmation)
- View Details: Navigates to the comprehensive Source Detail Management page
Data Source Registration¶
Source Creation Dialog¶
The source registration workflow collects the following information:
- Source Name (required): A unique identifier for the data source
- Source Type (required): Selection from supported connection types
- Description (optional): Explanatory text for documentation purposes
- Configuration (required): Type-specific connection parameters
Supported Source Types¶
| Type | Configuration Parameters |
|---|---|
| CSV | path: File system path to the CSV file |
| Parquet | path: File system path to the Parquet file |
| JSON | path: File system path to the JSON file |
| PostgreSQL | host, port, database, username, password, table |
| MySQL | host, port, database, username, password, table |
| Snowflake | account, warehouse, database, schema, username, password, table |
| BigQuery | project, dataset, table, credentials_path |
Configuration Examples¶
File Source (CSV):
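A minimal configuration sketch; the file path shown is illustrative:
{
  "path": "/data/customer_orders.csv"
}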
Database Source (PostgreSQL):
{
"host": "localhost",
"port": 5432,
"database": "analytics",
"username": "readonly_user",
"password": "secure_password",
"table": "customer_orders"
}
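Warehouse Source (Snowflake):
For warehouse sources, the configuration follows the parameters listed in the table above; all values below are illustrative placeholders:
{
  "account": "xy12345.us-east-1",
  "warehouse": "ANALYTICS_WH",
  "database": "ANALYTICS",
  "schema": "PUBLIC",
  "username": "readonly_user",
  "password": "secure_password",
  "table": "customer_orders"
}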
Source Detail Management Interface¶
The Source Detail Management page provides systematic management and monitoring capabilities for individual data sources.
Information Tabs¶
Connection Info Tab¶
This tab displays the source configuration with appropriate security measures applied:
- Sensitive fields (passwords, tokens, API keys) are masked by default
- A toggle visibility option is provided for authorized review
- Connection type and configuration summary are presented
Validation History Tab¶
A chronological record of all validation executions is presented in tabular form:
| Column | Description |
|---|---|
| Timestamp | Date and time of validation execution |
| Status | Pass/fail indicator |
| Issues Count | Total number of identified issues |
| Duration | Execution time in seconds |
| Actions | View detailed results |
Schema Tab¶
The current schema definition for the source is displayed, comprising:
- Column names and data types
- Nullable constraints
- Unique constraints
- Value constraints (min/max, allowed values)
Supported Operations¶
Test Connection¶
Connectivity to the data source is verified without executing validation:
- Click the Test Connection button
- The system attempts to establish a connection using stored credentials
- A success or failure notification is displayed
- For failures, error details are provided to assist in troubleshooting
All registered source types support connection testing. For file-based sources (csv, parquet, json, ndjson, jsonl), the test verifies that the specified file path exists and reports the file size. For database and external service sources, the test establishes a live connection and retrieves metadata including column count and row count.
Learn Schema¶
A schema definition is automatically generated by analyzing the data source:
- Click the Learn Schema button
- The system samples the data source to infer column types and constraints
- The generated schema is displayed for review
- The schema may be modified manually if required
Quick Validate¶
Validation is executed using the default validator configuration:
- Click the Quick Validate button
- The system executes all applicable validators
- Results are displayed upon completion
- A validation record is appended to the history
Configure & Run Validation¶
Granular control over the validation execution process is provided:
- Click the Configure & Run button
- Select validators to execute from the validator registry (150+ available)
- Configure validator-specific parameters (thresholds, columns, etc.)
- Execute validation with the custom configuration
- Review results with a detailed issue breakdown
Preset Templates¶
The validator configuration dialog offers preset templates for commonly encountered use cases:
| Template | Description |
|---|---|
| All Validators | Executes all applicable validators |
| Quick Check | Essential validators for rapid assessment |
| Schema Only | Schema structure validation only |
| Data Quality | Comprehensive data quality validators |
Edit Source¶
Source configuration may be modified as follows:
- Click the Edit button
- Update the source name, description, or configuration
- Save changes
- Re-test the connection if the configuration was modified
Validation Status Indicators¶
| Status | Color | Description |
|---|---|---|
| Success | Green | Validation completed with no critical or high-severity issues |
| Failed | Red | Validation identified critical or high-severity issues |
| Warning | Yellow | Validation completed with medium or low-severity issues |
| Pending | Gray | No validation has been executed |
Security Architecture¶
Credential Storage¶
Connection credentials are encrypted using Fernet symmetric encryption prior to storage. The encryption key is automatically generated and stored with restricted file permissions within the Truthound data directory.
Credential Display¶
Sensitive configuration fields are masked within the user interface by default. Practitioners must explicitly toggle visibility to inspect credential values, thereby providing protection against shoulder-surfing attacks.
API Reference¶
| Endpoint | Method | Description |
|---|---|---|
| `/sources` | GET | List all data sources |
| `/sources` | POST | Create a new data source |
| `/sources/{id}` | GET | Retrieve source details |
| `/sources/{id}` | PUT | Update source configuration |
| `/sources/{id}` | DELETE | Delete a data source |
| `/sources/{id}/test` | POST | Test connection |
| `/validations/sources/{id}/validate` | POST | Execute validation |
| `/sources/{id}/learn` | POST | Generate schema automatically |
| `/sources/{id}/schema` | GET | Retrieve current schema |
| `/sources/{id}/profile` | POST | Generate basic data profile |
| `/sources/{id}/profile/latest` | GET | Retrieve the most recent profile result |
| `/sources/{id}/profile/advanced` | POST | Generate data profile with advanced configuration |
| `/scans/sources/{id}/scan` | POST | Scan for PII |
| `/masks/sources/{id}/mask` | POST | Mask sensitive data |
| `/drift/compare` | POST | Compare two sources for drift |
Extended API Parameter Specifications¶
The Dashboard extends the core Truthound library functions with additional parameters for greater flexibility. These extensions are exposed through the REST API.
Schema Learning (/sources/{id}/learn)¶
This endpoint wraps th.learn() for automatic schema generation.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `infer_constraints` | `bool` | `true` | Infer min/max and allowed values from data |
| `categorical_threshold` | `int` | `20` | Max unique values for categorical detection (1-1000) |
Example Request:
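An illustrative request body using the parameters documented above:
{
  "infer_constraints": true,
  "categorical_threshold": 50
}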
Validation (/validations/sources/{id}/validate)¶
This endpoint wraps th.check() for data validation with configurable parameters.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `validators` | `list[str]` | `null` | Specific validators to run |
| `validator_config` | `dict` | `null` | Per-validator configuration (truthound 2.x format) |
| `min_severity` | `str` | `null` | Minimum severity to report (low/medium/high/critical) |
| `parallel` | `bool` | `false` | Enable parallel execution |
| `max_workers` | `int` | `null` | Max threads for parallel execution |
| `pushdown` | `bool` | `null` | Enable query pushdown for SQL sources |
| `schema` | `str` | `null` | Path to schema YAML file |
| `auto_schema` | `bool` | `false` | Auto-learn schema if not present |
| `custom_validators` | `list` | `null` | Custom validator configurations |
Example Request:
{
"validators": ["null", "duplicate", "range"],
"validator_config": {
"range": {"columns": {"age": {"min": 0, "max": 150}}}
},
"min_severity": "medium",
"parallel": true,
"max_workers": 4
}
PII Scanning (/scans/sources/{id}/scan)¶
This endpoint wraps th.scan() for PII detection.
Note: truthound's `th.scan()` does not support configuration parameters. The scan runs on all columns with default settings and detects all supported PII types.
Example Request:
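Because the scan accepts no configuration parameters, the request presumably requires no body; an empty JSON object is shown for illustration:
{}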
Data Masking (/masks/sources/{id}/mask)¶
This endpoint wraps th.mask() for data protection.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `columns` | `list[str]` | `null` | Columns to mask (auto-detect if null) |
| `strategy` | `str` | `"redact"` | Masking strategy (redact/hash/fake) |
Note: truthound's `th.mask()` does not support output format selection; masked output is always generated in CSV format.
Example Request:
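An illustrative request body; the column names are placeholders:
{
  "columns": ["email", "phone_number"],
  "strategy": "hash"
}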
Data Profiling (/sources/{id}/profile)¶
This endpoint wraps th.profile() for basic data profiling with default settings.
Example Request:
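Since basic profiling runs with default settings, the request body can presumably be left empty:
{}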
Result Persistence and Automatic Retrieval¶
Every profiling execution, whether basic or advanced, is automatically persisted to the database upon completion. This ensures that profile results remain durable across user sessions and browser navigation events.
When the Profile page is loaded, the system automatically retrieves the most recently stored profile via GET /sources/{id}/profile/latest. Consequently, practitioners observe the last profiling result immediately upon page entry without requiring re-execution. If no prior profile exists for the given source, the page is rendered in its initial empty state, prompting the practitioner to initiate profiling.
The profile history is independently accessible through GET /sources/{id}/profiles, which returns a paginated list of all stored profile summaries ordered by creation timestamp in descending order.
Advanced Data Profiling (/sources/{id}/profile/advanced)¶
This endpoint utilizes truthound's ProfilerConfig for fine-grained control over profiling behavior.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `sample_size` | `int` | `null` | Maximum rows to sample (null for all rows) |
| `random_seed` | `int` | `42` | Random seed for reproducible sampling |
| `include_patterns` | `bool` | `true` | Enable pattern detection (email, phone, uuid, etc.) |
| `include_correlations` | `bool` | `false` | Calculate column correlations |
| `include_distributions` | `bool` | `true` | Include value distribution histograms |
| `top_n_values` | `int` | `10` | Number of top values to return per column |
| `pattern_sample_size` | `int` | `1000` | Sample size for pattern detection |
| `correlation_threshold` | `float` | `0.7` | Minimum correlation to report |
| `min_pattern_match_ratio` | `float` | `0.8` | Minimum match ratio for pattern detection |
| `n_jobs` | `int` | `1` | Number of parallel jobs for profiling |
Example Request:
{
"sample_size": 50000,
"include_patterns": true,
"include_correlations": true,
"include_distributions": true,
"top_n_values": 20,
"n_jobs": 4
}
Note: Advanced profiling requires truthound with ProfilerConfig support. If this capability is not available, the API returns a 501 error.
The profile response encompasses:
- Column types and inferred semantic types
- Null and unique value percentages
- Statistical measures (min, max, mean, std, median, quartiles)
- String length statistics
- Detected patterns (email, phone, UUID, etc.)
- Value distribution histograms
- Column correlations (when include_correlations is set to true)
Drift Detection (/drift/compare)¶
This endpoint wraps th.compare() for distribution comparison between datasets.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `baseline_source_id` | `str` | Required | Baseline source ID |
| `current_source_id` | `str` | Required | Current source ID |
| `columns` | `list[str]` | `null` | Columns to compare |
| `method` | `str` | `"auto"` | Detection method (auto/ks/psi/chi2/js/kl/wasserstein/cvm/anderson/hellinger/bhattacharyya/tv/energy/mmd) |
| `threshold` | `float` | `null` | Custom drift threshold |
| `sample_size` | `int` | `null` | Sample size for large datasets |
Detection Methods:
| Method | Description | Best For |
|---|---|---|
| `auto` | Automatic selection based on dtype | General use |
| `ks` | Kolmogorov-Smirnov test | Continuous numeric |
| `psi` | Population Stability Index | ML monitoring |
| `chi2` | Chi-squared test | Categorical |
| `js` | Jensen-Shannon divergence | Any distribution |
| `kl` | Kullback-Leibler divergence | Information-theoretic |
| `wasserstein` | Wasserstein distance | Distribution shape |
| `cvm` | Cramér-von Mises test | Continuous distributions |
| `anderson` | Anderson-Darling test | Tail-sensitive detection |
| `hellinger` | Hellinger distance | Bounded metric |
| `bhattacharyya` | Bhattacharyya distance | Classification bounds |
| `tv` | Total Variation distance | Maximum difference |
| `energy` | Energy distance | Location/scale |
| `mmd` | Maximum Mean Discrepancy | High-dimensional |
Example Request:
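An illustrative request body; the source IDs and column name are placeholders:
{
  "baseline_source_id": "src-baseline",
  "current_source_id": "src-current",
  "columns": ["order_total"],
  "method": "ks",
  "sample_size": 100000
}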