Skip to content

Advanced Notification Orchestration

The Advanced Notifications module provides a comprehensive notification orchestration framework encompassing content-based routing, deduplication, rate limiting, and multi-level escalation policies. This document presents the theoretical foundations, algorithmic formulations, and practical implementation guidelines that underpin enterprise-scale notification management.

1. Introduction

1.1 Problem Statement

Contemporary data quality monitoring systems are observed to generate substantial volumes of alerts during routine operation. In the absence of appropriate orchestration mechanisms, organizations are subjected to a number of well-documented operational challenges:

Challenge Impact
Alert Fatigue Operators become desensitized to frequent notifications
Duplicate Notifications Same issue triggers multiple redundant alerts
Notification Storms Cascading failures overwhelm communication channels
Delayed Escalation Critical issues not escalated to appropriate personnel

1.2 Solution Architecture

The aforementioned challenges are addressed through the provision of four complementary subsystems, each of which is supported by a corresponding module within the truthound library's checkpoint infrastructure:

Component Truthound Module Purpose
Routing truthound.checkpoint.routing Content-based notification distribution
Deduplication truthound.checkpoint.deduplication Duplicate suppression via fingerprinting
Throttling truthound.checkpoint.throttling Rate limiting via token bucket algorithms
Escalation truthound.checkpoint.escalation Progressive notification with state machine

1.3 Processing Pipeline

┌─────────────────────────────────────────────────────────────────────┐
│                  Notification Processing Pipeline                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐                                                   │
│  │    Event     │  ValidationFailedEvent, DriftDetectedEvent,       │
│  │   Trigger    │  ScheduleFailedEvent, SchemaChangedEvent          │
│  └──────┬───────┘                                                   │
│         │                                                            │
│         ▼                                                            │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │           TruthoundNotificationAdapter                       │    │
│  │  ┌────────────┐  ┌─────────────┐  ┌────────────┐            │    │
│  │  │  Routing   │→ │Deduplication│→ │ Throttling │            │    │
│  │  │(ActionRouter│  │(Fingerprint)│  │(TokenBucket)│           │    │
│  │  └────────────┘  └─────────────┘  └────────────┘            │    │
│  └─────────────────────────┬───────────────────────────────────┘    │
│                            │                                         │
│                            ▼                                         │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │              Channel Delivery Layer                          │    │
│  │  ┌──────┐ ┌───────┐ ┌───────┐ ┌─────────┐ ┌────────┐       │    │
│  │  │Slack │ │ Email │ │ Teams │ │PagerDuty│ │Webhook │       │    │
│  │  └──────┘ └───────┘ └───────┘ └─────────┘ └────────┘       │    │
│  └─────────────────────────┬───────────────────────────────────┘    │
│                            │                                         │
│                            ▼                                         │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │         Escalation Engine (State Machine)                    │    │
│  │  PENDING → ACTIVE → ESCALATING → ACKNOWLEDGED → RESOLVED    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

2. Routing Engine Architecture

2.1 Theoretical Foundation

Content-based routing (CBR) is a well-established message distribution paradigm in which routing decisions are determined by the content of messages rather than by predetermined destination addresses. The truthound routing engine implements a rule-based CBR system that is characterized by the following properties:

  • Declarative Rules: Routing rules are specified as predicates evaluated over event attributes
  • Priority-Ordered Evaluation: Rules are evaluated in strict priority order to ensure deterministic behavior
  • Composable Conditions: Rules may be combined through the application of standard logical operators

2.2 Rule Taxonomy

The truthound ActionRouter provides 11 built-in rule types and 3 combinators, which are enumerated below.

2.2.1 Primitive Rules

Rule Type Parameters Matching Logic
AlwaysRule None Always matches (default route)
NeverRule None Never matches (disabled route)
SeverityRule min_severity, max_severity, min_count Matches issues by severity level
IssueCountRule min_issues, max_issues, count_type Matches by issue count threshold
StatusRule statuses, negate Matches by checkpoint status
TagRule tags, match_all, negate Matches by resource tags
DataAssetRule pattern, is_regex, case_sensitive Matches by data asset name pattern
MetadataRule key_path, expected_value, comparator Matches by metadata values
TimeWindowRule start_time, end_time, days_of_week, timezone Matches by time period
PassRateRule min_rate, max_rate Matches by validation pass rate
ErrorRule pattern, negate Matches by error message pattern

2.2.2 Logical Combinators

Combinator Semantics Example Use Case
AllOf Logical conjunction (AND) Critical AND Production
AnyOf Logical disjunction (OR) Critical OR Error
NotRule Logical negation (NOT) NOT Development environment

2.3 Evaluation Modes

The ActionRouter supports three distinct evaluation modes, each of which is suited to particular operational requirements:

Mode Behavior Use Case
FIRST_MATCH Execute only the first matching route Mutually exclusive channels
ALL_MATCHES Execute all matching routes Multi-channel broadcast
PRIORITY_GROUP Execute all routes in the highest priority group Tiered response

2.4 Route Context Specification

The RouteContext data class encapsulates the complete set of attributes that are made available for rule evaluation:

@dataclass(frozen=True)
class RouteContext:
    checkpoint_name: str      # Checkpoint identifier
    run_id: str               # Execution run identifier
    status: str               # Result status (success, failure, error)
    data_asset: str           # Data asset name
    run_time: datetime        # Execution timestamp
    total_issues: int         # Total issue count
    critical_issues: int      # Critical severity count
    high_issues: int          # High severity count
    medium_issues: int        # Medium severity count
    low_issues: int           # Low severity count
    info_issues: int          # Info severity count
    pass_rate: float          # Validation pass rate (0-100)
    tags: dict[str, str]      # Resource tags
    metadata: dict[str, Any]  # Additional metadata
    validation_duration_ms: float  # Execution duration
    error: str | None         # Error message if applicable

2.5 Implementation Guidelines

Principles for Effective Route Construction

  1. Specificity Principle: It is recommended that more specific rules be assigned lower priority numbers to ensure preferential evaluation
  2. Default Route Provision: A catch-all route configured with AlwaysRule should always be included to guarantee notification delivery
  3. Pre-Deployment Validation: Rules should be validated using the /notifications/routing/rules/test endpoint prior to production deployment

Example Configuration

routes:
  - name: critical_production
    priority: 10
    rule:
      type: all_of
      rules:
        - type: severity
          min_severity: critical
        - type: tag
          tags: { env: production }
    actions:
      - pagerduty_channel_id
      - slack_critical_channel_id

  - name: high_severity_alerts
    priority: 50
    rule:
      type: severity
      min_severity: high
    actions:
      - slack_alerts_channel_id

  - name: default_route
    priority: 100
    rule:
      type: always
    actions:
      - email_team_channel_id

3. Deduplication Engine Architecture

3.1 Theoretical Foundation

Notification deduplication is concerned with the elimination of redundant alert generation. The approach employed herein is based on fingerprint-based duplicate detection within sliding time windows. This technique is analogous to content-addressable storage systems and bears similarity to bloom filter applications that have been widely studied in the context of distributed systems.

3.1.1 Fingerprint Generation

A notification fingerprint is defined as a unique identifier derived from a specified subset of notification attributes:

fingerprint = hash(checkpoint_name || action_type || severity || data_asset)

The fingerprint function is required to satisfy the following properties: - Determinism: Identical inputs must produce identical outputs across all invocations - Collision Resistance: Distinct notifications should yield distinct fingerprints with high probability - Computational Efficiency: Generation must be achievable in O(1) time complexity

3.2 Deduplication Policies

The truthound NotificationDeduplicator supports five policies, arranged in order of increasing specificity:

Policy Fingerprint Components Use Case
NONE Disabled No deduplication
BASIC checkpoint_name + action_type Suppress same alert to same channel
SEVERITY BASIC + severity Differentiate by severity level
ISSUE_BASED SEVERITY + issue_types Differentiate by issue categories
STRICT Full notification hash Maximum differentiation

3.3 Windowing Strategies

Four time-based windowing strategies are provided, each of which exhibits distinct temporal characteristics.

3.3.1 Sliding Window Strategy

The sliding window approach maintains a fixed-duration window that advances continuously with time:

Time: ─────────────────────────────────────────────►
      │◄──── Window (5 min) ────►│
      │                          │
      │  Notification A          │  Notification B (duplicate)
      │  t=0                     │  t=2 min → SUPPRESSED
      │                          │

Characteristics: - The window is initiated from each notification event - The implementation is straightforward in nature - Memory consumption is considered efficient

3.3.2 Tumbling Window Strategy

This strategy employs non-overlapping fixed-duration buckets:

Time: ─────────────────────────────────────────────►
      │◄── Bucket 1 ──►│◄── Bucket 2 ──►│
      │   15 minutes    │   15 minutes    │
      │                 │                 │
      │  A (allowed)    │  A (allowed)    │
      │  A (suppressed) │                 │

Characteristics: - Bucket boundaries are fixed and predetermined - Suppression behavior is predictable and deterministic - Potential edge effects may be observed at bucket boundaries

3.3.3 Session Window Strategy

This approach employs event-driven sessions with gap-based expiration semantics:

Time: ─────────────────────────────────────────────►
      │◄─ Session 1 ─►│  gap  │◄─ Session 2 ─►│
      │               │ >10min │               │
      │ A B C         │        │ A             │
      │ (A,B,C dedup) │        │ (new session) │

Characteristics: - Window duration is dynamically determined based on notification activity - This strategy is particularly well-suited to bursty notification patterns - Elevated memory consumption may be observed relative to other strategies

3.4 Implementation Guidelines

Configuration Parameters

Parameter Recommended Value Rationale
Window Duration 300 seconds (5 min) Balances suppression vs. visibility
Policy SEVERITY Good balance of differentiation
Strategy sliding Simplest, most predictable behavior

Observational Metrics

Metric Formula Target Range
Suppression Ratio suppressed / total_evaluated 10-40%
Active Fingerprints Count of active entries < 1000

4. Throttling Engine Architecture

4.1 Theoretical Foundation

Rate limiting constitutes a fundamental technique for the protection of systems against overload conditions. The truthound throttling engine implements the Token Bucket Algorithm, a well-established approach that has been extensively studied and deployed in the domains of network traffic shaping and API rate limiting.

4.1.1 Token Bucket Algorithm

The token bucket model maintains a bucket with a maximum capacity of B tokens. Tokens are replenished at a rate of r tokens per second. Each notification consumes exactly one token upon processing. In the event that no tokens are available, the notification is subjected to throttling.

Mathematical Formulation:

tokens(t) = min(B, tokens(t-1) + r × Δt)

Where: - B = burst capacity - r = token replenishment rate - Δt = time elapsed since the last state update

Algorithmic Behavior:

Tokens: ████████████ (12 tokens, capacity)
        ─────────────────────────────────────►
        │ Request │ Request │ Request │ ...
        │ 12→11   │ 11→10   │ 10→9    │
        │         │         │ +0.5/sec │  (replenishment)

4.2 Throttler Implementations

Five throttler implementations are provided within the truthound library, each based on a distinct algorithmic approach:

Throttler Algorithm Characteristics
TokenBucketThrottler Token Bucket Allows burst, smooth rate limiting
SlidingWindowThrottler Sliding Window More accurate, no boundary effects
FixedWindowThrottler Fixed Window Simple, potential 2x burst at boundaries
CompositeThrottler Multi-level Combines multiple rate limits
NoOpThrottler Pass-through Testing/disable mode

4.3 Rate Limit Scopes

Rate limits may be applied at varying levels of granularity, as enumerated in the following table:

Scope Bucket Key Use Case
GLOBAL Single bucket Total notification limit
PER_ACTION action_type Per-channel limits
PER_CHECKPOINT checkpoint_name Per-source limits
PER_ACTION_CHECKPOINT action + checkpoint Fine-grained control
PER_SEVERITY severity Severity-based limits
PER_DATA_ASSET data_asset Asset-specific limits

4.4 Hierarchical Multi-Level Rate Limiting

The CompositeThrottler facilitates the construction of hierarchical rate limit configurations:

throttling:
  per_minute_limit: 10    # Short-term burst control
  per_hour_limit: 100     # Medium-term control
  per_day_limit: 500      # Long-term budget
  burst_multiplier: 1.5   # 50% burst allowance

Evaluation Logic:

Request arrives
  ├─► Check per_minute limit
  │     │
  │     ├─► PASS → Check per_hour limit
  │     │           │
  │     │           ├─► PASS → Check per_day limit
  │     │           │           │
  │     │           │           ├─► PASS → ALLOWED
  │     │           │           └─► FAIL → THROTTLED
  │     │           └─► FAIL → THROTTLED
  │     └─► FAIL → THROTTLED

4.5 Implementation Guidelines

Channel Type per_minute per_hour per_day Rationale
PagerDuty 5 20 100 On-call fatigue prevention
Slack 20 200 1000 Chat noise reduction
Email 5 50 200 Inbox management

Priority Bypass Mechanism

It is possible to configure critical notifications to bypass throttling constraints entirely:

priority_bypass: true
priority_threshold: critical

When this mechanism is enabled, notifications bearing severity=critical are permitted to bypass all rate limits.


5. Escalation Engine Architecture

5.1 Theoretical Foundation

Escalation management is implemented through a Finite State Machine (FSM) that governs the lifecycle of incident tracking. This approach is derived from established incident management frameworks, including ITIL, as well as contemporary Site Reliability Engineering (SRE) practices.

5.1.1 Formal State Machine Definition

The escalation state machine is formally defined as follows:

FSM = (S, Σ, δ, s₀, F)

Where:
  S = {PENDING, ACTIVE, ESCALATING, ACKNOWLEDGED, RESOLVED, CANCELLED, TIMED_OUT, FAILED}
  Σ = {start, ack, resolve, cancel, timeout, escalate, error}
  s₀ = PENDING
  F = {RESOLVED, CANCELLED, TIMED_OUT, FAILED}

5.1.2 State Transition Diagram

                    ┌─────────────────────────┐
                    │        PENDING          │
                    └────────────┬────────────┘
                                 │ start()
        ┌───────────────────────────────────────────────┐
        │                    ACTIVE                     │
        └────┬─────────────┬────────────────┬───────────┘
             │             │                │
             │ ack()       │ timeout        │ escalate()
             ▼             ▼                ▼
    ┌────────────┐  ┌────────────┐  ┌────────────────┐
    │ACKNOWLEDGED│  │ TIMED_OUT  │  │   ESCALATING   │
    └──────┬─────┘  └────────────┘  └───────┬────────┘
           │                                │
           │ resolve()                      │ next_level
           ▼                                ▼
    ┌────────────┐                  ┌────────────────┐
    │  RESOLVED  │                  │     ACTIVE     │
    └────────────┘                  │ (next level)   │
                                    └────────────────┘

        cancel() from any state → CANCELLED
        error from any state → FAILED

5.2 Escalation Level Configuration

Each escalation level is defined by the following parameter set:

Parameter Type Description
level int Level number (1 = first)
delay_minutes int Delay before escalating to next level
targets list[EscalationTarget] Notification recipients
repeat_count int Number of times to repeat at this level
repeat_interval_minutes int Interval between repeats
require_ack bool Whether acknowledgment is required
auto_resolve_minutes int Auto-resolve timeout (0 = disabled)

5.3 Target Type Classification

Escalation targets represent the notification recipients to which alerts are dispatched:

Target Type Identifier Format Description
user User ID Individual user
team Team ID Team/group
channel Channel ID Slack channel, etc.
schedule Schedule ID On-call schedule
webhook URL Webhook endpoint
email Email address Direct email
phone Phone number SMS/voice call

5.4 Trigger Condition Taxonomy

Escalation may be initiated by a variety of conditions, which are categorized as follows:

Trigger Description
UNACKNOWLEDGED Alert not acknowledged within timeout
UNRESOLVED Incident not resolved within timeout
SEVERITY_UPGRADE Severity level increased
REPEATED_FAILURE Same issue recurring
THRESHOLD_BREACH Metric exceeded threshold
MANUAL Manual trigger by operator
SCHEDULED Time-based trigger

5.5 Automated Escalation by Event Severity

The NotificationDispatcher is configured to automatically initiate escalation procedures for events of elevated severity:

Event Type Condition Escalation Policy
ValidationFailedEvent has_critical=true critical_alert
ValidationFailedEvent has_high=true high_alert
DriftDetectedEvent has_high_drift=true high_alert
ScheduleFailedEvent Always high_alert
SchemaChangedEvent has_breaking_changes=true high_alert

5.6 Implementation Guidelines

Principles for Escalation Policy Design

Principle Recommendation
Level Count 3-4 levels typically sufficient
Timeout Progression Exponential backoff (5min → 15min → 30min)
Final Level Must reach decision makers
Acknowledgment Require ack at all levels

Example Policy Configuration

escalation_policies:
  - name: critical_production
    description: Critical production issue escalation
    levels:
      - level: 1
        delay_minutes: 0
        targets:
          - type: user
            identifier: team-lead
            name: Team Lead
          - type: channel
            identifier: "#alerts-critical"
            name: Critical Alerts
        repeat_count: 2
        repeat_interval_minutes: 5
        require_ack: true

      - level: 2
        delay_minutes: 15
        targets:
          - type: user
            identifier: engineering-manager
            name: Engineering Manager
          - type: schedule
            identifier: primary-oncall
            name: Primary On-call

      - level: 3
        delay_minutes: 30
        targets:
          - type: user
            identifier: director
            name: Director of Engineering
          - type: email
            identifier: leadership@company.com
            name: Leadership Team

    triggers:
      - unacknowledged
    severity_filter:
      - critical
      - high
    cooldown_minutes: 60
    max_escalations: 5

6. Notification Channel Integration

6.1 Supported Channels

The dashboard is integrated with the following notification action implementations provided by the truthound library:

Channel Truthound Action Protocol
Slack SlackNotification Incoming Webhook
Email EmailNotification SMTP/SendGrid/SES
Microsoft Teams TeamsNotification Adaptive Cards
Discord DiscordNotification Embed Webhook
Telegram TelegramNotification Bot API
PagerDuty PagerDutyAction Events API v2
OpsGenie OpsGenieAction REST API
Webhook WebhookAction HTTP POST
GitHub GitHubAction Issues API

6.2 Channel Configuration Specifications

Each channel type is associated with specific configuration requirements, as detailed below.

Slack Configuration

Parameter Required Description
webhook_url Yes Slack Incoming Webhook URL
channel No Channel override (#channel)
username No Bot display name
icon_emoji No Bot icon (:emoji:)
mention_on_failure No User IDs to mention

Email Configuration

Parameter Required Description
from_address Yes Sender email address
to_addresses Yes Recipient email addresses
smtp_host Conditional SMTP server (if provider=smtp)
provider No smtp, sendgrid, or ses
api_key Conditional API key (if provider=sendgrid/ses)

7. Statistical Monitoring and Observability

7.1 Metric Categories

The Advanced Notifications system exposes two distinct categories of operational metrics:

Category Source Description
Dashboard Stats SQLite Database Configuration counts
Runtime Stats Truthound Library Processing metrics

7.2 Key Performance Indicators

KPI Formula Target Action if Exceeded
Dedup Ratio suppressed / evaluated 10-40% Review fingerprint strategy
Throttle Rate throttled / checked < 10% Increase rate limits
Ack Rate acknowledged / triggered > 80% Review escalation timeouts
MTTA avg(ack_time - trigger_time) < 15 min Review escalation targets
MTTR avg(resolve_time - trigger_time) < 60 min Review resolution process

7.3 Statistics API

GET /notifications/advanced/stats

Response:

{
  "routing": {
    "total_routes": 12,
    "mode": "all_matches"
  },
  "deduplication": {
    "total_evaluated": 1250,
    "suppressed": 340,
    "suppression_ratio": 0.272,
    "active_fingerprints": 45
  },
  "throttling": {
    "total_checked": 1250,
    "total_allowed": 1150,
    "total_throttled": 100,
    "throttle_rate": 0.08
  },
  "escalation": {
    "total_escalations": 23,
    "active_escalations": 2,
    "acknowledged_count": 18,
    "resolved_count": 15,
    "acknowledgment_rate": 0.78,
    "avg_time_to_acknowledge": 420
  }
}


8. Configuration Management

8.1 Configuration Export

All Advanced Notification configurations may be exported via the following endpoint:

GET /notifications/config/export?include_routing_rules=true&include_deduplication=true&include_throttling=true&include_escalation=true

8.2 Configuration Import

Configurations may be imported with configurable conflict resolution semantics:

POST /notifications/config/import
Content-Type: application/json

{
  "bundle": { ... },
  "conflict_resolution": "skip" | "overwrite" | "rename"
}

8.3 Configuration Bundle Schema

{
  "version": "1.0",
  "exported_at": "2025-01-29T12:00:00Z",
  "routing_rules": [...],
  "deduplication_configs": [...],
  "throttling_configs": [...],
  "escalation_policies": [...]
}

9. Template Library

The Template Library constitutes a curated registry of pre-built notification configurations, thereby enabling the rapid deployment of proven orchestration patterns without the necessity for manual parameter tuning.

9.1 Overview

Each template encapsulates a complete configuration for one of the four notification subsystems. Templates are organized by their target subsystem and are annotated with descriptive metadata to facilitate efficient discovery.

Attribute Description
Name Human-readable template identifier
Category Target subsystem (routing, deduplication, throttling, escalation)
Description Functional summary of the template's purpose
Tags Searchable keywords for discovery
Configuration Pre-defined parameter values for the target subsystem

9.2 Available Template Categories

Category Purpose Example Templates
Routing Content-based notification distribution rules Severity-based routing, source-type routing
Deduplication Duplicate suppression configurations Aggressive dedup, conservative dedup
Throttling Rate limiting presets Burst-friendly throttle, strict rate limit
Escalation Multi-level alert policies On-call escalation, business-hours policy

9.3 Template Selection Workflow

The Template Library implements an integrated workflow that bridges template selection with configuration editing. The process is conducted as follows:

  1. Browse and Search: The Template Library panel is opened to browse available templates. The search field or category tabs may be employed to locate a relevant template.
  2. Preview: The template's description, tags, and configuration summary are reviewed prior to selection.
  3. Apply: A template is selected to initiate the application process. The system performs the following actions automatically:
  4. Tab Navigation: The active tab is switched to the subsystem matching the template's category (e.g., selecting a throttling template activates the Throttling tab).
  5. Dialog Auto-Open: The corresponding configuration dialog is opened with the template's pre-filled values, thereby allowing the user to review and adjust parameters before saving.
  6. Quick Templates Hidden: When a template is applied from the Template Library, the in-dialog Quick Templates selector is hidden to avoid confusion between the externally applied template and the dialog's built-in presets.
  7. Save or Discard: The pre-filled values may be modified and saved, or the dialog may be closed to discard the template application.

9.4 Active Template Indicator

When a template has been actively applied, the Template Library panel displays an indicator banner comprising the following elements:

Element Description
Category Icon Visual icon corresponding to the template's subsystem
Template Name Name of the currently applied template
Category Badge Labeled badge showing the target subsystem
Dismiss Button Allows the user to clear the active template selection

This indicator persists for the duration of the current page session. Navigation away from the Advanced Notifications page results in the clearance of the active template state, as templates are designed to serve as ephemeral configuration aids rather than persistent selections.

9.5 Relationship to Quick Templates

Each tab's configuration dialog also provides an independent Quick Templates selector for rapid in-context configuration. The Template Library and Quick Templates serve complementary roles, as delineated below:

Feature Template Library Quick Templates
Scope Cross-subsystem, centralized Per-subsystem, contextual
Access From the page header Within each configuration dialog
Behavior Switches tab + opens dialog Fills form within current dialog
Visibility Always visible Hidden when Template Library applies a template

10. API Reference

10.1 Routing Rules API

Endpoint Method Description
/notifications/routing/rules GET List routing rules
/notifications/routing/rules POST Create routing rule
/notifications/routing/rules/{id} GET Get routing rule
/notifications/routing/rules/{id} PUT Update routing rule
/notifications/routing/rules/{id} DELETE Delete routing rule
/notifications/routing/rules/types GET List available rule types
/notifications/routing/rules/test POST Test rule against context

10.2 Deduplication API

Endpoint Method Description
/notifications/deduplication/configs GET List configs
/notifications/deduplication/configs POST Create config
/notifications/deduplication/configs/{id} GET Get config
/notifications/deduplication/configs/{id} PUT Update config
/notifications/deduplication/configs/{id} DELETE Delete config
/notifications/deduplication/stats GET Get runtime stats

10.3 Throttling API

Endpoint Method Description
/notifications/throttling/configs GET List configs
/notifications/throttling/configs POST Create config
/notifications/throttling/configs/{id} GET Get config
/notifications/throttling/configs/{id} PUT Update config
/notifications/throttling/configs/{id} DELETE Delete config
/notifications/throttling/stats GET Get runtime stats

10.4 Escalation API

Endpoint Method Description
/notifications/escalation/policies GET List policies
/notifications/escalation/policies POST Create policy
/notifications/escalation/policies/{id} GET Get policy
/notifications/escalation/policies/{id} PUT Update policy
/notifications/escalation/policies/{id} DELETE Delete policy
/notifications/escalation/incidents GET List incidents
/notifications/escalation/incidents/active GET List active incidents
/notifications/escalation/incidents/{id} GET Get incident
/notifications/escalation/incidents/{id}/acknowledge POST Acknowledge incident
/notifications/escalation/incidents/{id}/resolve POST Resolve incident
/notifications/escalation/stats GET Get runtime stats

11.1 Routing Practices

Practice Description
Prioritize specificity in rule ordering Lower priority numbers should be assigned to more specific conditions
Ensure provision of a default route All notifications must be guaranteed a destination
Conduct pre-deployment validation The test endpoint should be utilized to validate rules prior to deployment
Maintain comprehensive documentation Routing logic should be documented for team reference and auditability

11.2 Deduplication Practices

Practice Description
Commence with moderate window durations A 5-minute default is recommended, with subsequent adjustment based on observed data
Monitor the suppression ratio A ratio of 10-40% is indicative of healthy deduplication behavior
Employ the SEVERITY policy This policy provides an appropriate balance between deduplication and visibility
Review active fingerprint counts Elevated counts may be indicative of memory pressure

11.3 Throttling Practices

Practice Description
Establish conservative initial limits It is considered preferable to relax constraints than to tighten them retrospectively
Configure per-channel limits Channels of differing urgency levels necessitate distinct limit configurations
Enable the priority bypass mechanism Critical alerts should not be subjected to throttling
Monitor the throttle rate A rate exceeding 10% may be indicative of configuration deficiencies

11.4 Escalation Practices

Practice Description
Design for progressive urgency Subsequent levels should be configured to reach personnel of higher authority
Employ reasonable timeout intervals A progression of 5-15-30 minutes has been found to be effective in practice
Mandate acknowledgment at all levels This ensures that human operators remain engaged in the resolution process
Conduct periodic escalation path testing Regular testing is essential to verify continued operational functionality
Configure appropriate cooldown periods Cooldowns are necessary to prevent the occurrence of escalation storms

References

  1. Beyer, B., et al. (2016). Site Reliability Engineering. O'Reilly Media.
  2. Turner, J. (1986). New directions in communications (or which way to the information age?). IEEE Communications Magazine, 24(10), 8-15. [Token Bucket Algorithm]
  3. ITIL Foundation (2019). ITIL 4 Foundation. Axelos.
  4. truthound Documentation. https://truthound.readthedocs.io/