Upstream Source

This page is part of Truthound Orchestration 3.x.

Source repository: seadonggyun4/truthound-orchestration Upstream docs path: docs/common/source-resolution-cookbook.md Edit upstream page: Edit in orchestration

Source Resolution Cookbook¶

resolve_data_source(...) is one of the most important shared-runtime entry points because every adapter uses it to decide whether an input is a file, SQL statement, dataframe, stream, callable, or URI-like reference.

Who This Is For¶

platform teams normalizing data inputs across multiple hosts
users moving the same quality intent between Python adapters and dbt
operators diagnosing source-shape mismatches

When To Use It¶

Use this page when you need a practical mapping from “what I handed the adapter” to “what the shared runtime thinks this source is”.

Prerequisites¶

familiarity with Source Resolution
knowledge of whether your workflow starts from a file, SQL relation, dataframe, or stream

Minimal Quickstart¶

from common.runtime import resolve_data_source

local_source = resolve_data_source(data_path="data/users.parquet")
sql_source = resolve_data_source(sql="SELECT * FROM users")

The first resolves as a local-path style source. The second resolves as a SQL source that requires a host-native connection path.

Cookbook¶

Local Paths¶

Use local or mounted paths for the fastest onboarding path:

source = resolve_data_source(data_path="/opt/data/users.parquet")

Best for:

Airflow operators reading mounted files
local Prefect and Mage development
Kestra script tasks with staged files

SQL Statements¶

Use SQL when the host already has a native connection story:

source = resolve_data_source(sql="SELECT * FROM users WHERE ds = CURRENT_DATE")

Best for:

Airflow + connection_id
warehouse-backed quality tasks in host adapters

Not ideal for:

local zero-config runs with no connection surface

Dataframes¶

If the host already loaded data into memory, the shared runtime accepts the frame directly:

result = data_quality.check(dataframe, rules=[{"column": "id", "type": "not_null"}])

Best for:

Dagster assets and ops
Prefect tasks and flows
Mage transformers

Streams¶

Streaming and bounded-memory checks use the orchestration streaming surfaces:

from common.orchestration import StreamRequest, run_stream_check

request = StreamRequest(stream=my_iterable)

Best for:

Airflow streaming operators
Kestra stream scripts
shared runtime streaming envelopes

Warehouse Relations In dbt¶

The dbt adapter does not resolve local files or in-memory frames. The runtime boundary is always the compiled relation:

ref("model_name")
source("pkg", "table")
adapter-dispatched SQL relation names

Production Pattern¶

Keep source shape explicit in runbooks and examples.
Use local paths only when the runtime environment guarantees those paths.
Use SQL only when credentials, retry policy, and timeout semantics are already host-native.
Prefer dataframes in Dagster, Prefect, and Mage when upstream code already materialized the dataset.
Use streaming APIs intentionally; do not fake streaming by chunking unbounded iterators without checkpoint semantics.

Failure Modes And Troubleshooting¶

Symptom	Likely Cause	Read Next
SQL preflight says a connection is required	zero-config does not cover SQL execution	Preflight and Compatibility
local path works on laptop but fails in the host	runtime filesystem does not expose that path	host deployment docs and secret/volume config
dataframe metadata is missing	host bypassed result serialization helpers	Result Serialization
streaming job reprocesses rows after restart	checkpointing or resume state is not wired in	Streaming Validation

Source Resolution Cookbook¶

Who This Is For¶

When To Use It¶

Prerequisites¶

Minimal Quickstart¶

Cookbook¶

Local Paths¶

SQL Statements¶

Dataframes¶

Streams¶

Warehouse Relations In dbt¶

Production Pattern¶

Failure Modes And Troubleshooting¶

Related Pages¶