Upstream Source
This page is part of Truthound Orchestration 3.x.
Source repository: seadonggyun4/truthound-orchestration
Upstream docs path: docs/dagster/alerting-automation.md
Edit upstream page: Edit in orchestration
Dagster Alerting and Automation¶
Dagster can surface quality information through hooks, schedules, automation policies, and external operators. Truthound's Dagster package focuses on keeping the payload and SLA semantics stable while letting Dagster own the automation layer.
Who This Is For¶
- platform teams routing quality failures into operational alerting
- operators deciding how much automation should happen in Dagster itself
- teams moving from manual reruns to automation-driven response
When To Use It¶
Use this page when:
- you need SLA-aware quality operations in Dagster
- failures should trigger notifications or follow-up jobs
- you want to standardize response to warning vs failure severity
Prerequisites¶
SLAResourceor another quality result evaluation path- a Dagster deployment with jobs, schedules, or automation rules
- a defined notification or escalation policy
Minimal Quickstart¶
Use the SLA surface as the policy boundary:
from truthound_dagster import SLAConfig, SLAResource
sla = SLAResource(
default_config=SLAConfig(max_failure_rate=0.05).to_dict()
)
For richer environments, pair the SLA resource with metrics or logging hooks:
from truthound_dagster import CompositeSLAHook, LoggingSLAHook, MetricsSLAHook
hooks = CompositeSLAHook(hooks=[LoggingSLAHook(), MetricsSLAHook()])
Production Pattern¶
Recommended automation split:
| Concern | Owner |
|---|---|
| validation execution | Truthound resource/op/asset helper |
| run orchestration | Dagster job or asset automation |
| alert routing | SLA hooks plus operator tooling |
Checklist:
- define whether warnings should notify or only log
- route high-noise checks to metrics before paging
- keep remediation jobs separate from primary validation jobs
Failure Modes and Troubleshooting¶
| Symptom | Likely Cause | What To Do |
|---|---|---|
| too many alerts | low-value checks are treated as incidents | downgrade to warn or metrics-only |
| no operator context in alerts | automation reads only check pass/fail | include dataset and partition metadata |
| remediation loops | automated rerun policy restarts unstable quality jobs | cap retries and preserve manual review gates |