Upstream Source
This page is part of Truthound Orchestration 3.x.
Source repository: seadonggyun4/truthound-orchestration
Upstream docs path: docs/airflow/index.md
Edit upstream page: Edit in orchestration
Airflow¶
Airflow is the best fit when Truthound validation needs to live inside scheduled DAGs, gate downstream tasks, or reuse existing Airflow connection and alerting patterns.
The Airflow package follows provider-style boundaries: operators, sensors, hooks, and SLA callbacks stay Airflow-native, while engine resolution, source normalization, and preflight behavior stay in the shared runtime.
Who This Is For¶
- you already operate DAG-based pipelines
- you want quality checks to behave like first-class Airflow tasks
- you need sensor-based gating before downstream work continues
- you want connection-backed SQL sources but still keep local-path onboarding simple
When To Use It¶
Use Airflow when:
- the quality boundary naturally belongs in a DAG
- downstream work must block, branch, or alert based on validation results
- existing Airflow connections and alerting policies should remain the operational control plane
- the team wants operators, hooks, and sensors instead of a new orchestration abstraction
Prefer Dagster for asset-graph-first work, Prefect for Python-first flows with persisted blocks, and dbt for warehouse-native test execution.
Prerequisites¶
truthound-orchestration[airflow]installed in the Airflow runtime- a supported Airflow/Python tuple from the compatibility matrix
- either a local/remote data path or an Airflow connection for SQL-backed checks
Minimal Quickstart¶
Install the supported Airflow surface:
Then create a basic operator:
from airflow import DAG
from truthound_airflow import DataQualityCheckOperator
with DAG("quality_pipeline", schedule="@daily", catchup=False) as dag:
check_users = DataQualityCheckOperator(
task_id="check_users",
data_path="/opt/airflow/data/users.parquet",
rules=[
{"column": "user_id", "type": "not_null"},
{"column": "email", "type": "unique"},
],
)
Add a sensor when a later task should wait on a quality threshold:
from truthound_airflow import DeferrableDataQualitySensor
wait_for_users = DeferrableDataQualitySensor(
task_id="wait_for_users_quality",
data_path="/opt/airflow/data/users.parquet",
rules=[{"column": "id", "check": "not_null"}],
)
Decision Table¶
| Need | Recommended Airflow Surface | Why |
|---|---|---|
| run a hard quality gate inside a DAG | DataQualityCheckOperator |
behaves like a normal task failure boundary |
| wait until a quality condition becomes true | DataQualitySensor or DeferrableDataQualitySensor |
keeps gating host-native |
| execute connection-aware loads | DataQualityHook |
preserves Airflow connection patterns |
| emit threshold-based notifications | SLA callbacks | keeps alert routing separate from task logic |
| move large data through bounded-memory validation | DataQualityStreamOperator |
avoids loading everything into memory |
Execution Lifecycle¶
flowchart LR
A["DAG task starts"] --> B["Source resolved by shared runtime"]
B --> C["Preflight and compatibility check"]
C --> D["Engine created for the selected operation"]
D --> E["Check/Profile/Learn/Stream execution"]
E --> F["Shared result serialized"]
F --> G["Airflow XCom, logs, callbacks, and downstream tasks"]
Result Surface¶
- check tasks push shared Truthound results into Airflow-native transport layers such as XCom
- the default check XCom key is
data_quality_result - sensors and callbacks should consume the documented serialized payload instead of reconstructing status from logs alone
- downstream tasks can read the same canonical status, count, and metadata fields used in the other adapters
Config Surface¶
| Config Area | Airflow Boundary |
|---|---|
| source location | data_path= or sql= on the operator/sensor |
| credentials | connection_id= and Airflow connection metadata |
| execution engine | engine_name= or the default Truthound path |
| timeout and strictness | operator config such as timeout_seconds and operation-specific settings |
| result transport | xcom_push_key= and callback wiring |
What Zero-Config Covers¶
- local file paths do not need an Airflow connection
- Truthound remains the default engine
- omitted persistence stays ephemeral
- shared preflight still runs before execution
What it does not cover:
- SQL execution without an Airflow connection
- secret discovery outside standard Airflow patterns
- persistent Truthound workspace behavior unless you configure it explicitly
Primary Components¶
| Component | Use It For |
|---|---|
DataQualityCheckOperator |
row-level validation and pass/fail gating |
DataQualityProfileOperator |
profiling and shape inspection |
DataQualityLearnOperator |
learning rules from baseline data |
DataQualityStreamOperator |
bounded-memory streaming checks |
DataQualitySensor / DeferrableDataQualitySensor |
waiting for a quality threshold before continuing |
DataQualityHook / TruthoundHook |
source loading and connection-aware execution |
| SLA callbacks and monitors | warning, escalation, and alert routing |
Production Pattern¶
- Use operators for execution and hooks for source or connection concerns.
- Prefer deferrable or reschedule-friendly patterns for long waits.
- Keep XCom payload handling on the shared result contract instead of ad hoc serialization.
- Treat local files as the onboarding path and connections as the production path.
- Keep alerting in callbacks instead of duplicating it in every downstream task.
Production Checklist¶
- pin a supported Airflow/Python tuple from the compatibility matrix
- confirm whether the source path requires an Airflow connection
- use shared rule modules for repeated DAG patterns
- choose explicit XCom keys when downstream tasks rely on them
- use deferrable sensors for long waits
- wire alerting through callbacks or SLA monitors before going live
Failure Modes and Troubleshooting¶
| Symptom | Likely Cause | What To Do |
|---|---|---|
| SQL checks fail while file checks pass | missing or incorrect connection_id |
verify the Airflow connection and secret source |
| downstream tasks cannot find the result | custom xcom_push_key differs from the consumer expectation |
align the key across producer and consumer |
| sensor ties up workers | active waiting in a long-running quality gate | switch to the deferrable sensor where available |
| alerting is inconsistent across DAGs | callbacks are configured ad hoc | centralize callback chains and reuse them |