Upstream Source

This page is part of Truthound Orchestration 3.x.

Source repository: seadonggyun4/truthound-orchestration
Upstream docs path: docs/dagster/alerting-automation.md

Dagster Alerting and Automation

Dagster can surface quality information through hooks, schedules, automation policies, and external operators. Truthound's Dagster package focuses on keeping the payload and SLA semantics stable while letting Dagster own the automation layer.

Who This Is For

  • platform teams routing quality failures into operational alerting
  • operators deciding how much automation should happen in Dagster itself
  • teams moving from manual reruns to automation-driven response

When To Use It

Use this page when:

  • you need SLA-aware quality operations in Dagster
  • failures should trigger notifications or follow-up jobs
  • you want to standardize response to warning vs failure severity

Prerequisites

  • SLAResource or another quality result evaluation path
  • a Dagster deployment with jobs, schedules, or automation rules
  • a defined notification or escalation policy

Minimal Quickstart

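Treat the SLA configuration as the policy boundary. Thresholds are declared once on the SLA resource, and Dagster decides what happens when they are breached: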

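from truthound_dagster import SLAConfig, SLAResource

# max_failure_rate=0.05 caps the tolerated failure rate at 5%.
# The resource takes its default policy as a plain dict, hence to_dict().
sla = SLAResource(
    default_config=SLAConfig(max_failure_rate=0.05).to_dict()
)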
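
To make the threshold semantics concrete, here is a minimal sketch of a failure-rate check in plain Python. It does not use or reproduce the truthound_dagster API: `FailureRatePolicy` and `breaches_sla` are hypothetical names, and the sketch assumes the rate is computed as failed checks divided by total checks, which the package may define differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureRatePolicy:
    """Hypothetical stand-in for an SLA policy (not the truthound_dagster API)."""

    max_failure_rate: float = 0.05


def breaches_sla(failed_checks: int, total_checks: int, policy: FailureRatePolicy) -> bool:
    """Return True when the observed failure rate exceeds the policy threshold.

    Assumption: the rate is failed checks over total checks.
    """
    if total_checks <= 0:
        # With no checks there is no rate to compare, so refuse to guess.
        raise ValueError("total_checks must be positive")
    if not 0 <= failed_checks <= total_checks:
        raise ValueError("failed_checks must be between 0 and total_checks")
    return failed_checks / total_checks > policy.max_failure_rate


policy = FailureRatePolicy(max_failure_rate=0.05)
within_budget = breaches_sla(failed_checks=3, total_checks=100, policy=policy)
over_budget = breaches_sla(failed_checks=6, total_checks=100, policy=policy)
```

Keeping the comparison in one small function is what lets jobs, schedules, and sensors share a single definition of "breached" instead of re-deriving it in each place.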

For richer environments, pair the SLA resource with metrics or logging hooks:

from truthound_dagster import CompositeSLAHook, LoggingSLAHook, MetricsSLAHook

# Combine the logging and metrics hooks behind a single composite hook.
hooks = CompositeSLAHook(hooks=[LoggingSLAHook(), MetricsSLAHook()])
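
The composite pattern used above can be illustrated without the package. The sketch below is an assumption about how such a fan-out could work, not a description of how `CompositeSLAHook` is implemented: `LogHook`, `CounterHook`, `FanOutHook`, and the `on_violation` method are all hypothetical names chosen for the example.

```python
import logging
from collections import Counter
from typing import Iterable, Protocol

logger = logging.getLogger("sla")


class ViolationHook(Protocol):
    """Hypothetical hook interface; the real package may use other method names."""

    def on_violation(self, check_name: str, failure_rate: float) -> None: ...


class LogHook:
    """Writes each violation to the standard logging system."""

    def on_violation(self, check_name: str, failure_rate: float) -> None:
        logger.warning("SLA violation on %s: failure_rate=%.3f", check_name, failure_rate)


class CounterHook:
    """Counts violations per check, standing in for a metrics backend."""

    def __init__(self) -> None:
        self.violations: Counter = Counter()

    def on_violation(self, check_name: str, failure_rate: float) -> None:
        self.violations[check_name] += 1


class FanOutHook:
    """Forwards each violation to every child hook, isolating their failures."""

    def __init__(self, hooks: Iterable[ViolationHook]) -> None:
        self.hooks = list(hooks)

    def on_violation(self, check_name: str, failure_rate: float) -> None:
        for hook in self.hooks:
            try:
                hook.on_violation(check_name, failure_rate)
            except Exception:
                # One broken sink should not silence the remaining ones.
                logger.exception("hook %r failed", hook)


metrics = CounterHook()
fan_out = FanOutHook([LogHook(), metrics])
fan_out.on_violation("orders_not_null", 0.12)
```

Catching exceptions per child hook is a design choice in this sketch: an outage in the metrics backend then cannot suppress the log line that operators rely on.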

Production Pattern

Recommended automation split:

Concern | Owner
validation execution | Truthound resource/op/asset helper
run orchestration | Dagster job or asset automation
alert routing | SLA hooks plus operator tooling

Checklist:

  • define whether warnings should notify or only log
  • route high-noise checks to metrics before paging
  • keep remediation jobs separate from primary validation jobs
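
The first two checklist items can be written down as an explicit routing rule. The following is a minimal sketch under stated assumptions: `Severity`, `Route`, `QualityEvent`, and `route_event` are hypothetical names, and the rule that noisy checks go to metrics regardless of severity is one possible policy, not one prescribed by Truthound.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Collection


class Severity(Enum):
    WARNING = "warning"
    FAILURE = "failure"


class Route(Enum):
    LOG = "log"
    METRICS = "metrics"
    NOTIFY = "notify"
    PAGE = "page"


@dataclass(frozen=True)
class QualityEvent:
    """Hypothetical event shape; dataset and partition give operators context."""

    check_name: str
    severity: Severity
    dataset: str
    partition: str


def route_event(
    event: QualityEvent,
    noisy_checks: Collection[str],
    notify_on_warning: bool = False,
) -> Route:
    """Pick a destination for a quality event."""
    # High-noise checks are observed through metrics before they may page anyone.
    if event.check_name in noisy_checks:
        return Route.METRICS
    if event.severity is Severity.FAILURE:
        return Route.PAGE
    # Warnings either notify or only log, depending on the team's decision.
    return Route.NOTIFY if notify_on_warning else Route.LOG


event = QualityEvent("orders_not_null", Severity.FAILURE, "orders", "2024-01-01")
destination = route_event(event, noisy_checks={"row_count_drift"})
```

Making `notify_on_warning` an explicit parameter forces the warning-versus-failure decision to be recorded in one place rather than scattered across individual alerts.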

Failure Modes and Troubleshooting

Symptom | Likely Cause | What To Do
too many alerts | low-value checks are treated as incidents | downgrade to warn or metrics-only
no operator context in alerts | automation reads only check pass/fail | include dataset and partition metadata
remediation loops | automated rerun policy restarts unstable quality jobs | cap retries and preserve manual review gates
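
The remediation-loop row suggests capping retries while keeping a manual review gate. A minimal sketch of that idea follows; `RerunGate` and its methods are hypothetical and independent of Dagster's own retry configuration, which would still need to be wired to something like this.

```python
from dataclasses import dataclass


@dataclass
class RerunGate:
    """Hypothetical gate that caps automatic reruns and then waits for a human."""

    max_automatic_reruns: int = 2
    attempts: int = 0
    needs_manual_review: bool = False

    def request_rerun(self) -> bool:
        """Return True if an automatic rerun is allowed right now."""
        if self.needs_manual_review:
            return False
        if self.attempts >= self.max_automatic_reruns:
            # Budget exhausted: stop looping and hand the decision to an operator.
            self.needs_manual_review = True
            return False
        self.attempts += 1
        return True

    def approve(self) -> None:
        """An operator has reviewed the failures; reset the rerun budget."""
        self.attempts = 0
        self.needs_manual_review = False


gate = RerunGate(max_automatic_reruns=2)
decisions = [gate.request_rerun() for _ in range(3)]
```

Once the budget is spent, every further request is refused until `approve()` is called, so an unstable quality job cannot restart itself indefinitely.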