Upstream Source
This page is part of Truthound Orchestration 3.x.
Source repository: seadonggyun4/truthound-orchestration
Upstream docs path: docs/common/source-resolution-cookbook.md
Source Resolution Cookbook¶
resolve_data_source(...) is one of the most important shared-runtime entry
points because every adapter uses it to decide whether an input is a file, SQL
statement, dataframe, stream, callable, or URI-like reference.
Who This Is For¶
- platform teams normalizing data inputs across multiple hosts
- users moving the same quality intent between Python adapters and dbt
- operators diagnosing source-shape mismatches
When To Use It¶
Use this page when you need a practical mapping from “what I handed the adapter” to “what the shared runtime thinks this source is”.
Prerequisites¶
- familiarity with Source Resolution
- knowledge of whether your workflow starts from a file, SQL relation, dataframe, or stream
Minimal Quickstart¶
```python
from common.runtime import resolve_data_source

local_source = resolve_data_source(data_path="data/users.parquet")
sql_source = resolve_data_source(sql="SELECT * FROM users")
```
The first call resolves as a local-path source. The second resolves as a SQL source, which requires a host-native connection path.
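To make the "shape first" decision concrete, here is a minimal, hypothetical sketch of the dispatch order a resolver like `resolve_data_source` might apply. The real shared runtime's internals may differ; `ResolvedSource` and `resolve_data_source_sketch` are illustrative names, not part of the API.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ResolvedSource:
    kind: str      # e.g. "local_path", "sql", "dataframe", "stream"
    payload: Any


def resolve_data_source_sketch(
    data_path: Optional[str] = None,
    sql: Optional[str] = None,
    dataframe: Any = None,
    stream: Any = None,
) -> ResolvedSource:
    # Exactly one source shape must be supplied; ambiguity is an error.
    supplied = [x is not None for x in (data_path, sql, dataframe, stream)]
    if sum(supplied) != 1:
        raise ValueError("provide exactly one source shape")
    if data_path is not None:
        return ResolvedSource("local_path", data_path)
    if sql is not None:
        # SQL sources still need a host-native connection at execution time.
        return ResolvedSource("sql", sql)
    if dataframe is not None:
        return ResolvedSource("dataframe", dataframe)
    return ResolvedSource("stream", stream)
```

The key design point the sketch captures is that the shape is decided once, up front, so every adapter downstream can branch on `kind` instead of re-inspecting the raw input.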
Cookbook¶
Local Paths¶
Use local or mounted paths when you want the fastest onboarding.
Best for:
- Airflow operators reading mounted files
- local Prefect and Mage development
- Kestra script tasks with staged files
SQL Statements¶
Use SQL when the host already has a native connection story.
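The troubleshooting table below notes that zero-config runs cannot execute SQL, so a preflight check should reject a SQL source with no connection before execution time. A hedged sketch of that idea, assuming a `connection_id`-style handle as in Airflow (`preflight_sql_source` is a hypothetical helper, not the runtime's API):

```python
from typing import Optional


def preflight_sql_source(sql: str, connection_id: Optional[str]) -> str:
    # Zero-config runs have no connection surface, so a SQL source must fail
    # fast at preflight rather than mid-run at execution time.
    if connection_id is None:
        raise RuntimeError(
            "SQL sources need a host-native connection (e.g. a connection_id)"
        )
    return sql
```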
Best for:
- Airflow + `connection_id`
- warehouse-backed quality tasks in host adapters
Not ideal for:
- local zero-config runs with no connection surface
Dataframes¶
If the host already loaded data into memory, the shared runtime accepts the frame directly.
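The point of accepting the frame directly is that re-reading from disk would duplicate work the host already did. A minimal sketch of that acceptance check, modeling rows as a plain sized collection for illustration (`resolve_frame` is hypothetical; the real runtime accepts actual dataframe types):

```python
def resolve_frame(rows):
    # A materialized frame is already sized and in memory; a lazy iterator is
    # not, and belongs to the streaming path instead.
    if not hasattr(rows, "__len__"):
        raise TypeError("expected a materialized, sized collection, not a lazy iterator")
    return {"kind": "dataframe", "row_count": len(rows)}
```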
Best for:
- Dagster assets and ops
- Prefect tasks and flows
- Mage transformers
Streams¶
Streaming and bounded-memory checks use the orchestration streaming surfaces:
```python
from common.orchestration import StreamRequest, run_stream_check

request = StreamRequest(stream=my_iterable)
```
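To show what "bounded-memory" means in practice, here is a hypothetical sketch of a stream check that consumes the input chunk by chunk, keeping only counters in memory regardless of stream length. The real `run_stream_check` behavior may differ; `run_stream_check_sketch` and its parameters are assumptions for illustration.

```python
def run_stream_check_sketch(stream, predicate, chunk_size=1000):
    # Consume the stream in fixed-size chunks so memory stays bounded,
    # accumulating only pass/fail counters rather than the rows themselves.
    passed = failed = 0
    chunk = []

    def flush():
        nonlocal passed, failed
        for row in chunk:
            if predicate(row):
                passed += 1
            else:
                failed += 1
        chunk.clear()

    for row in stream:
        chunk.append(row)
        if len(chunk) >= chunk_size:
            flush()
    flush()  # drain the final partial chunk
    return {"passed": passed, "failed": failed}
```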
Best for:
- Airflow streaming operators
- Kestra stream scripts
- shared runtime streaming envelopes
Warehouse Relations In dbt¶
The dbt adapter does not resolve local files or in-memory frames. The runtime boundary is always the compiled relation:
- `ref("model_name")`
- `source("pkg", "table")`
- adapter-dispatched SQL relation names
Production Pattern¶
- Keep source shape explicit in runbooks and examples.
- Use local paths only when the runtime environment guarantees those paths.
- Use SQL only when credentials, retry policy, and timeout semantics are already host-native.
- Prefer dataframes in Dagster, Prefect, and Mage when upstream code already materialized the dataset.
- Use streaming APIs intentionally; do not fake streaming by chunking unbounded iterators without checkpoint semantics.
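The last bullet above deserves a concrete picture: "checkpoint semantics" means persisting an offset so a restart resumes instead of reprocessing rows. A minimal, hypothetical sketch (the real streaming surfaces may track state differently):

```python
def process_with_checkpoint(rows, checkpoint, handle):
    # Resume from the persisted offset so a restart does not reprocess rows.
    start = checkpoint.get("offset", 0)
    for offset, row in enumerate(rows):
        if offset < start:
            continue  # already handled before the crash/restart
        handle(row)
        checkpoint["offset"] = offset + 1  # persist progress after each row
    return checkpoint
```

Chunking an unbounded iterator without this kind of persisted offset gives you the memory profile of streaming but not its restart safety, which is exactly the reprocessing symptom in the troubleshooting table below.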
Failure Modes And Troubleshooting¶
| Symptom | Likely Cause | Read Next |
|---|---|---|
| SQL preflight says a connection is required | zero-config does not cover SQL execution | Preflight and Compatibility |
| local path works on laptop but fails in the host | runtime filesystem does not expose that path | host deployment docs and secret/volume config |
| dataframe metadata is missing | host bypassed result serialization helpers | Result Serialization |
| streaming job reprocesses rows after restart | checkpointing or resume state is not wired in | Streaming Validation |