Data Sources

The Data Sources module provides systematic management capabilities for connecting to, configuring, and validating data sources within Truthound Dashboard.

Overview

Data sources constitute the fundamental entities upon which all validation, profiling, and quality monitoring operations are conducted. The system accommodates a diverse range of data source types, encompassing file-based sources (CSV, Parquet, JSON) and database connections (PostgreSQL, MySQL, Snowflake, BigQuery).

Source Listing Interface

Source Listing

The primary Sources page renders all registered data sources in a card-based layout. Each source card presents the following informational elements:

Element            Description
Source Name        User-defined identifier for the data source
Type Badge         Visual indicator of the connection type
Description        Optional descriptive text clarifying the source's purpose
Last Validation    Timestamp of the most recent validation execution
Status Indicator   Color-coded badge reflecting validation status

Available Actions

From the source listing, practitioners may execute the following operations:

  • Add Source: Opens a dialog for registering a new data source
  • Validate: Initiates validation using the default validator configuration
  • Delete: Removes the source and all associated metadata (with confirmation)
  • View Details: Navigates to the comprehensive Source Detail Management page

Data Source Registration

Source Creation Dialog

The source registration workflow collects the following information:

  1. Source Name (required): A unique identifier for the data source
  2. Source Type (required): Selection from supported connection types
  3. Description (optional): Explanatory text for documentation purposes
  4. Configuration (required): Type-specific connection parameters

Supported Source Types

Type         Configuration Parameters
CSV          path: File system path to the CSV file
Parquet      path: File system path to the Parquet file
JSON         path: File system path to the JSON file
PostgreSQL   host, port, database, username, password, table
MySQL        host, port, database, username, password, table
Snowflake    account, warehouse, database, schema, username, password, table
BigQuery     project, dataset, table, credentials_path

Configuration Examples

File Source (CSV):

{
  "path": "/data/sales/transactions.csv"
}

Database Source (PostgreSQL):

{
  "host": "localhost",
  "port": 5432,
  "database": "analytics",
  "username": "readonly_user",
  "password": "secure_password",
  "table": "customer_orders"
}
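
Database Source (Snowflake), an illustrative sketch with placeholder values, using the parameters listed in the table above:

{
  "account": "my_account",
  "warehouse": "compute_wh",
  "database": "analytics",
  "schema": "public",
  "username": "readonly_user",
  "password": "secure_password",
  "table": "customer_orders"
}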

Source Detail Management Interface

The Source Detail Management page provides systematic management and monitoring capabilities for individual data sources.

Information Tabs

Connection Info Tab

This tab displays the source configuration with appropriate security measures applied:

  • Sensitive fields (passwords, tokens, API keys) are masked by default
  • A toggle visibility option is provided for authorized review
  • Connection type and configuration summary are presented

Validation History Tab

A chronological record of all validation executions is presented in tabular form:

Column         Description
Timestamp      Date and time of validation execution
Status         Pass/fail indicator
Issues Count   Total number of identified issues
Duration       Execution time in seconds
Actions        View detailed results

Schema Tab

The current schema definition for the source is displayed, comprising:

  • Column names and data types
  • Nullable constraints
  • Unique constraints
  • Value constraints (min/max, allowed values)

Supported Operations

Test Connection

Connectivity to the data source is verified without executing validation:

  1. Click the Test Connection button
  2. The system attempts to establish a connection using stored credentials
  3. A success or failure notification is displayed
  4. For failures, error details are provided to assist in troubleshooting

All registered source types support connection testing. For file-based sources (csv, parquet, json, ndjson, jsonl), the test verifies that the specified file path exists and reports the file size. For database and external service sources, the test establishes a live connection and retrieves metadata including column count and row count.
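
Connection testing is also exposed through the REST API (see the API Reference below). The following is a minimal sketch using Python's requests library; the base URL, lack of authentication, and source ID are assumptions for illustration:

import requests

BASE_URL = "http://localhost:8000"  # assumed dashboard address; adjust for your deployment
SOURCE_ID = "abc-123"               # ID of a registered source

# POST /sources/{id}/test checks connectivity without running any validators
resp = requests.post(f"{BASE_URL}/sources/{SOURCE_ID}/test", timeout=30)
resp.raise_for_status()
print(resp.json())  # success/failure details; exact fields depend on the source type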

Learn Schema

A schema definition is automatically generated by analyzing the data source:

  1. Click the Learn Schema button
  2. The system samples the data source to infer column types and constraints
  3. The generated schema is displayed for review
  4. The schema may be modified manually if required

Quick Validate

Validation is executed using the default validator configuration:

  1. Click the Quick Validate button
  2. The system executes all applicable validators
  3. Results are displayed upon completion
  4. A validation record is appended to the history

Configure & Run Validation

Granular control over the validation execution process is provided:

  1. Click the Configure & Run button
  2. Select validators to execute from the validator registry (150+ available)
  3. Configure validator-specific parameters (thresholds, columns, etc.)
  4. Execute validation with the custom configuration
  5. Review results with a detailed issue breakdown

Preset Templates

The validator configuration dialog offers preset templates for commonly encountered use cases:

Template         Description
All Validators   Executes all applicable validators
Quick Check      Essential validators for rapid assessment
Schema Only      Schema structure validation only
Data Quality     Comprehensive data quality validators

Edit Source

Source configuration may be modified as follows:

  1. Click the Edit button
  2. Update the source name, description, or configuration
  3. Save changes
  4. Re-test the connection if the configuration was modified

Validation Status Indicators

Status    Color    Description
Success   Green    Validation completed with no critical or high-severity issues
Failed    Red      Validation identified critical or high-severity issues
Warning   Yellow   Validation completed with medium or low-severity issues
Pending   Gray     No validation has been executed

Security Architecture

Credential Storage

Connection credentials are encrypted using Fernet symmetric encryption prior to storage. The encryption key is automatically generated and stored with restricted file permissions within the Truthound data directory.
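
For reference, Fernet is the symmetric scheme provided by the Python cryptography package. The sketch below illustrates the general pattern (generate a key once, restrict its permissions, encrypt on write, decrypt on read); it is not the dashboard's actual implementation, and the key path is purely illustrative:

from pathlib import Path
from cryptography.fernet import Fernet

KEY_PATH = Path.home() / ".truthound" / "fernet.key"  # illustrative location

# Generate the key once and store it with owner-only permissions
if not KEY_PATH.exists():
    KEY_PATH.parent.mkdir(parents=True, exist_ok=True)
    KEY_PATH.write_bytes(Fernet.generate_key())
    KEY_PATH.chmod(0o600)

fernet = Fernet(KEY_PATH.read_bytes())

# Encrypt credentials before persisting them; decrypt when a connection is needed
token = fernet.encrypt(b"secure_password")
plaintext = fernet.decrypt(token)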

Credential Display

Sensitive configuration fields are masked within the user interface by default. Practitioners must explicitly toggle visibility to inspect credential values, thereby providing protection against shoulder-surfing attacks.

API Reference

Endpoint Method Description
/sources GET List all data sources
/sources POST Create a new data source
/sources/{id} GET Retrieve source details
/sources/{id} PUT Update source configuration
/sources/{id} DELETE Delete a data source
/sources/{id}/test POST Test connection
/validations/sources/{id}/validate POST Execute validation
/sources/{id}/learn POST Generate schema automatically
/sources/{id}/schema GET Retrieve current schema
/sources/{id}/profile POST Generate basic data profile
/sources/{id}/profile/latest GET Retrieve the most recent profile result
/sources/{id}/profile/advanced POST Generate data profile with advanced configuration
/scans/sources/{id}/scan POST Scan for PII
/masks/sources/{id}/mask POST Mask sensitive data
/drift/compare POST Compare two sources for drift
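
These endpoints are plain HTTP and can be scripted directly. The sketch below uses Python's requests library and assumes the dashboard is reachable at http://localhost:8000 without authentication, and that GET /sources returns a JSON list whose entries expose an "id" field:

import requests

BASE_URL = "http://localhost:8000"  # assumed dashboard address

# GET /sources lists every registered data source
sources = requests.get(f"{BASE_URL}/sources", timeout=30).json()

if sources:
    source_id = sources[0]["id"]  # assumes each entry exposes an "id" field
    # POST /validations/sources/{id}/validate with an empty body runs the
    # default validator configuration (see the parameter defaults below)
    run = requests.post(
        f"{BASE_URL}/validations/sources/{source_id}/validate",
        json={},
        timeout=300,
    )
    print(run.json())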

Extended API Parameter Specifications

The Dashboard extends the core Truthound library functions with additional parameters for greater flexibility. These extensions are exposed through the REST API.

Schema Learning (/sources/{id}/learn)

This endpoint wraps th.learn() for automatic schema generation.

Parameter               Type   Default   Description
infer_constraints       bool   true      Infer min/max and allowed values from data
categorical_threshold   int    20        Max unique values for categorical detection (1-1000)

Example Request:

{
  "infer_constraints": true,
  "categorical_threshold": 50
}

Validation (/validations/sources/{id}/validate)

This endpoint wraps th.check() for data validation with configurable parameters.

Parameter           Type        Default   Description
validators          list[str]   null      Specific validators to run
validator_config    dict        null      Per-validator configuration (truthound 2.x format)
min_severity        str         null      Minimum severity to report (low/medium/high/critical)
parallel            bool        false     Enable parallel execution
max_workers         int         null      Max threads for parallel execution
pushdown            bool        null      Enable query pushdown for SQL sources
schema              str         null      Path to schema YAML file
auto_schema         bool        false     Auto-learn schema if not present
custom_validators   list        null      Custom validator configurations

Example Request:

{
  "validators": ["null", "duplicate", "range"],
  "validator_config": {
    "range": {"columns": {"age": {"min": 0, "max": 150}}}
  },
  "min_severity": "medium",
  "parallel": true,
  "max_workers": 4
}

PII Scanning (/scans/sources/{id}/scan)

This endpoint wraps th.scan() for PII detection.

Note: truthound's th.scan() does not support configuration parameters. The scan is automatically executed on all columns with default settings, detecting all supported PII types.

Example Request:

{}

Data Masking (/masks/sources/{id}/mask)

This endpoint wraps th.mask() for data protection.

Parameter   Type        Default    Description
columns     list[str]   null       Columns to mask (auto-detect if null)
strategy    str         "redact"   Masking strategy (redact/hash/fake)

Note: truthound's th.mask() does not support output format selection. The output is invariably generated in CSV format.

Example Request:

{
  "columns": ["ssn", "credit_card", "email"],
  "strategy": "hash"
}

Data Profiling (/sources/{id}/profile)

This endpoint wraps th.profile() for basic data profiling with default settings.

Example Request:

{}

Result Persistence and Automatic Retrieval

Every profiling execution, whether basic or advanced, is automatically persisted to the database upon completion. This design ensures that profile results remain durable across user sessions and browser navigation events.

When the Profile page is loaded, the system automatically retrieves the most recently stored profile via GET /sources/{id}/profile/latest. Consequently, practitioners observe the last profiling result immediately upon page entry without requiring re-execution. If no prior profile exists for the given source, the page is rendered in its initial empty state, prompting the practitioner to initiate profiling.

The profile history is independently accessible through GET /sources/{id}/profiles, which returns a paginated list of all stored profile summaries ordered by creation timestamp in descending order.
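
As a sketch of the retrieval pattern described above, again using Python's requests against an assumed http://localhost:8000 deployment (the behavior when no profile exists, treated here as a non-2xx response, is an assumption):

import requests

BASE_URL = "http://localhost:8000"  # assumed dashboard address
SOURCE_ID = "abc-123"

# GET /sources/{id}/profile/latest returns the most recently stored profile
resp = requests.get(f"{BASE_URL}/sources/{SOURCE_ID}/profile/latest", timeout=30)

if resp.ok:
    print(resp.json())
else:
    # Assumption: a non-2xx response means no profile has been stored yet,
    # so trigger a basic profiling run via POST /sources/{id}/profile
    requests.post(f"{BASE_URL}/sources/{SOURCE_ID}/profile", json={}, timeout=300)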

Advanced Data Profiling (/sources/{id}/profile/advanced)

This endpoint utilizes truthound's ProfilerConfig for fine-grained control over profiling behavior.

Parameter                 Type    Default   Description
sample_size               int     null      Maximum rows to sample (null for all rows)
random_seed               int     42        Random seed for reproducible sampling
include_patterns          bool    true      Enable pattern detection (email, phone, uuid, etc.)
include_correlations      bool    false     Calculate column correlations
include_distributions     bool    true      Include value distribution histograms
top_n_values              int     10        Number of top values to return per column
pattern_sample_size       int     1000      Sample size for pattern detection
correlation_threshold     float   0.7       Minimum correlation to report
min_pattern_match_ratio   float   0.8       Minimum match ratio for pattern detection
n_jobs                    int     1         Number of parallel jobs for profiling

Example Request:

{
  "sample_size": 50000,
  "include_patterns": true,
  "include_correlations": true,
  "include_distributions": true,
  "top_n_values": 20,
  "n_jobs": 4
}

Note: Advanced profiling requires truthound with ProfilerConfig support. If this capability is not available, the API returns a 501 error.

The profile response encompasses:

  • Column types and inferred semantic types
  • Null and unique value percentages
  • Statistical measures (min, max, mean, std, median, quartiles)
  • String length statistics
  • Detected patterns (email, phone, UUID, etc.)
  • Value distribution histograms
  • Column correlations (when include_correlations is set to true)

Drift Detection (/drift/compare)

This endpoint wraps th.compare() for distribution comparison between datasets.

Parameter            Type        Default    Description
baseline_source_id   str         Required   Baseline source ID
current_source_id    str         Required   Current source ID
columns              list[str]   null       Columns to compare
method               str         "auto"     Detection method (auto/ks/psi/chi2/js/kl/wasserstein/cvm/anderson/hellinger/bhattacharyya/tv/energy/mmd)
threshold            float       null       Custom drift threshold
sample_size          int         null       Sample size for large datasets

Detection Methods:

Method          Description                          Best For
auto            Automatic selection based on dtype   General use
ks              Kolmogorov-Smirnov test              Continuous numeric
psi             Population Stability Index           ML monitoring
chi2            Chi-squared test                     Categorical
js              Jensen-Shannon divergence            Any distribution
kl              Kullback-Leibler divergence          Information-theoretic
wasserstein     Wasserstein distance                 Distribution shape
cvm             Cramer-von Mises test                Continuous distributions
anderson        Anderson-Darling test                Tail-sensitive detection
hellinger       Hellinger distance                   Bounded metric
bhattacharyya   Bhattacharyya distance               Classification bounds
tv              Total Variation distance             Maximum difference
energy          Energy distance                      Location/scale
mmd             Maximum Mean Discrepancy             High-dimensional

Example Request:

{
  "baseline_source_id": "abc-123",
  "current_source_id": "def-456",
  "columns": ["age", "income", "score"],
  "method": "psi",
  "sample_size": 10000
}