Advanced Notification Orchestration¶
The Advanced Notifications module provides a comprehensive notification orchestration framework encompassing content-based routing, deduplication, rate limiting, and multi-level escalation policies. This document presents the theoretical foundations, algorithmic formulations, and practical implementation guidelines that underpin enterprise-scale notification management.
1. Introduction¶
1.1 Problem Statement¶
Contemporary data quality monitoring systems are observed to generate substantial volumes of alerts during routine operation. In the absence of appropriate orchestration mechanisms, organizations are subjected to a number of well-documented operational challenges:
| Challenge | Impact |
|---|---|
| Alert Fatigue | Operators become desensitized to frequent notifications |
| Duplicate Notifications | Same issue triggers multiple redundant alerts |
| Notification Storms | Cascading failures overwhelm communication channels |
| Delayed Escalation | Critical issues not escalated to appropriate personnel |
1.2 Solution Architecture¶
The aforementioned challenges are addressed through the provision of four complementary subsystems, each of which is supported by a corresponding module within the truthound library's checkpoint infrastructure:
| Component | Truthound Module | Purpose |
|---|---|---|
| Routing | truthound.checkpoint.routing |
Content-based notification distribution |
| Deduplication | truthound.checkpoint.deduplication |
Duplicate suppression via fingerprinting |
| Throttling | truthound.checkpoint.throttling |
Rate limiting via token bucket algorithms |
| Escalation | truthound.checkpoint.escalation |
Progressive notification with state machine |
1.3 Processing Pipeline¶
┌─────────────────────────────────────────────────────────────────────┐
│ Notification Processing Pipeline │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ │
│ │ Event │ ValidationFailedEvent, DriftDetectedEvent, │
│ │ Trigger │ ScheduleFailedEvent, SchemaChangedEvent │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ TruthoundNotificationAdapter │ │
│ │ ┌────────────┐ ┌─────────────┐ ┌────────────┐ │ │
│ │ │ Routing │→ │Deduplication│→ │ Throttling │ │ │
│ │ │(ActionRouter│ │(Fingerprint)│ │(TokenBucket)│ │ │
│ │ └────────────┘ └─────────────┘ └────────────┘ │ │
│ └─────────────────────────┬───────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Channel Delivery Layer │ │
│ │ ┌──────┐ ┌───────┐ ┌───────┐ ┌─────────┐ ┌────────┐ │ │
│ │ │Slack │ │ Email │ │ Teams │ │PagerDuty│ │Webhook │ │ │
│ │ └──────┘ └───────┘ └───────┘ └─────────┘ └────────┘ │ │
│ └─────────────────────────┬───────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Escalation Engine (State Machine) │ │
│ │ PENDING → ACTIVE → ESCALATING → ACKNOWLEDGED → RESOLVED │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
2. Routing Engine Architecture¶
2.1 Theoretical Foundation¶
Content-based routing (CBR) is a well-established message distribution paradigm in which routing decisions are determined by the content of messages rather than by predetermined destination addresses. The truthound routing engine implements a rule-based CBR system that is characterized by the following properties:
- Declarative Rules: Routing rules are specified as predicates evaluated over event attributes
- Priority-Ordered Evaluation: Rules are evaluated in strict priority order to ensure deterministic behavior
- Composable Conditions: Rules may be combined through the application of standard logical operators
2.2 Rule Taxonomy¶
The truthound ActionRouter provides 11 built-in rule types and 3 combinators, which are enumerated below.
2.2.1 Primitive Rules¶
| Rule Type | Parameters | Matching Logic |
|---|---|---|
AlwaysRule |
None | Always matches (default route) |
NeverRule |
None | Never matches (disabled route) |
SeverityRule |
min_severity, max_severity, min_count |
Matches issues by severity level |
IssueCountRule |
min_issues, max_issues, count_type |
Matches by issue count threshold |
StatusRule |
statuses, negate |
Matches by checkpoint status |
TagRule |
tags, match_all, negate |
Matches by resource tags |
DataAssetRule |
pattern, is_regex, case_sensitive |
Matches by data asset name pattern |
MetadataRule |
key_path, expected_value, comparator |
Matches by metadata values |
TimeWindowRule |
start_time, end_time, days_of_week, timezone |
Matches by time period |
PassRateRule |
min_rate, max_rate |
Matches by validation pass rate |
ErrorRule |
pattern, negate |
Matches by error message pattern |
2.2.2 Logical Combinators¶
| Combinator | Semantics | Example Use Case |
|---|---|---|
AllOf |
Logical conjunction (AND) | Critical AND Production |
AnyOf |
Logical disjunction (OR) | Critical OR Error |
NotRule |
Logical negation (NOT) | NOT Development environment |
2.3 Evaluation Modes¶
The ActionRouter supports three distinct evaluation modes, each of which is suited to particular operational requirements:
| Mode | Behavior | Use Case |
|---|---|---|
FIRST_MATCH |
Execute only the first matching route | Mutually exclusive channels |
ALL_MATCHES |
Execute all matching routes | Multi-channel broadcast |
PRIORITY_GROUP |
Execute all routes in the highest priority group | Tiered response |
2.4 Route Context Specification¶
The RouteContext data class encapsulates the complete set of attributes that are made available for rule evaluation:
@dataclass(frozen=True)
class RouteContext:
checkpoint_name: str # Checkpoint identifier
run_id: str # Execution run identifier
status: str # Result status (success, failure, error)
data_asset: str # Data asset name
run_time: datetime # Execution timestamp
total_issues: int # Total issue count
critical_issues: int # Critical severity count
high_issues: int # High severity count
medium_issues: int # Medium severity count
low_issues: int # Low severity count
info_issues: int # Info severity count
pass_rate: float # Validation pass rate (0-100)
tags: dict[str, str] # Resource tags
metadata: dict[str, Any] # Additional metadata
validation_duration_ms: float # Execution duration
error: str | None # Error message if applicable
2.5 Implementation Guidelines¶
Principles for Effective Route Construction¶
- Specificity Principle: It is recommended that more specific rules be assigned lower priority numbers to ensure preferential evaluation
- Default Route Provision: A catch-all route configured with
AlwaysRuleshould always be included to guarantee notification delivery - Pre-Deployment Validation: Rules should be validated using the
/notifications/routing/rules/testendpoint prior to production deployment
Example Configuration¶
routes:
- name: critical_production
priority: 10
rule:
type: all_of
rules:
- type: severity
min_severity: critical
- type: tag
tags: { env: production }
actions:
- pagerduty_channel_id
- slack_critical_channel_id
- name: high_severity_alerts
priority: 50
rule:
type: severity
min_severity: high
actions:
- slack_alerts_channel_id
- name: default_route
priority: 100
rule:
type: always
actions:
- email_team_channel_id
3. Deduplication Engine Architecture¶
3.1 Theoretical Foundation¶
Notification deduplication is concerned with the elimination of redundant alert generation. The approach employed herein is based on fingerprint-based duplicate detection within sliding time windows. This technique is analogous to content-addressable storage systems and bears similarity to bloom filter applications that have been widely studied in the context of distributed systems.
3.1.1 Fingerprint Generation¶
A notification fingerprint is defined as a unique identifier derived from a specified subset of notification attributes:
The fingerprint function is required to satisfy the following properties: - Determinism: Identical inputs must produce identical outputs across all invocations - Collision Resistance: Distinct notifications should yield distinct fingerprints with high probability - Computational Efficiency: Generation must be achievable in O(1) time complexity
3.2 Deduplication Policies¶
The truthound NotificationDeduplicator supports five policies, arranged in order of increasing specificity:
| Policy | Fingerprint Components | Use Case |
|---|---|---|
NONE |
Disabled | No deduplication |
BASIC |
checkpoint_name + action_type | Suppress same alert to same channel |
SEVERITY |
BASIC + severity | Differentiate by severity level |
ISSUE_BASED |
SEVERITY + issue_types | Differentiate by issue categories |
STRICT |
Full notification hash | Maximum differentiation |
3.3 Windowing Strategies¶
Four time-based windowing strategies are provided, each of which exhibits distinct temporal characteristics.
3.3.1 Sliding Window Strategy¶
The sliding window approach maintains a fixed-duration window that advances continuously with time:
Time: ─────────────────────────────────────────────►
│◄──── Window (5 min) ────►│
│ │
│ Notification A │ Notification B (duplicate)
│ t=0 │ t=2 min → SUPPRESSED
│ │
Characteristics: - The window is initiated from each notification event - The implementation is straightforward in nature - Memory consumption is considered efficient
3.3.2 Tumbling Window Strategy¶
This strategy employs non-overlapping fixed-duration buckets:
Time: ─────────────────────────────────────────────►
│◄── Bucket 1 ──►│◄── Bucket 2 ──►│
│ 15 minutes │ 15 minutes │
│ │ │
│ A (allowed) │ A (allowed) │
│ A (suppressed) │ │
Characteristics: - Bucket boundaries are fixed and predetermined - Suppression behavior is predictable and deterministic - Potential edge effects may be observed at bucket boundaries
3.3.3 Session Window Strategy¶
This approach employs event-driven sessions with gap-based expiration semantics:
Time: ─────────────────────────────────────────────►
│◄─ Session 1 ─►│ gap │◄─ Session 2 ─►│
│ │ >10min │ │
│ A B C │ │ A │
│ (A,B,C dedup) │ │ (new session) │
Characteristics: - Window duration is dynamically determined based on notification activity - This strategy is particularly well-suited to bursty notification patterns - Elevated memory consumption may be observed relative to other strategies
3.4 Implementation Guidelines¶
Configuration Parameters¶
| Parameter | Recommended Value | Rationale |
|---|---|---|
| Window Duration | 300 seconds (5 min) | Balances suppression vs. visibility |
| Policy | SEVERITY |
Good balance of differentiation |
| Strategy | sliding |
Simplest, most predictable behavior |
Observational Metrics¶
| Metric | Formula | Target Range |
|---|---|---|
| Suppression Ratio | suppressed / total_evaluated | 10-40% |
| Active Fingerprints | Count of active entries | < 1000 |
4. Throttling Engine Architecture¶
4.1 Theoretical Foundation¶
Rate limiting constitutes a fundamental technique for the protection of systems against overload conditions. The truthound throttling engine implements the Token Bucket Algorithm, a well-established approach that has been extensively studied and deployed in the domains of network traffic shaping and API rate limiting.
4.1.1 Token Bucket Algorithm¶
The token bucket model maintains a bucket with a maximum capacity of B tokens. Tokens are replenished at a rate of r tokens per second. Each notification consumes exactly one token upon processing. In the event that no tokens are available, the notification is subjected to throttling.
Mathematical Formulation:
Where:
- B = burst capacity
- r = token replenishment rate
- Δt = time elapsed since the last state update
Algorithmic Behavior:
Tokens: ████████████ (12 tokens, capacity)
─────────────────────────────────────►
│ Request │ Request │ Request │ ...
│ 12→11 │ 11→10 │ 10→9 │
│ │ │ +0.5/sec │ (replenishment)
4.2 Throttler Implementations¶
Five throttler implementations are provided within the truthound library, each based on a distinct algorithmic approach:
| Throttler | Algorithm | Characteristics |
|---|---|---|
TokenBucketThrottler |
Token Bucket | Allows burst, smooth rate limiting |
SlidingWindowThrottler |
Sliding Window | More accurate, no boundary effects |
FixedWindowThrottler |
Fixed Window | Simple, potential 2x burst at boundaries |
CompositeThrottler |
Multi-level | Combines multiple rate limits |
NoOpThrottler |
Pass-through | Testing/disable mode |
4.3 Rate Limit Scopes¶
Rate limits may be applied at varying levels of granularity, as enumerated in the following table:
| Scope | Bucket Key | Use Case |
|---|---|---|
GLOBAL |
Single bucket | Total notification limit |
PER_ACTION |
action_type | Per-channel limits |
PER_CHECKPOINT |
checkpoint_name | Per-source limits |
PER_ACTION_CHECKPOINT |
action + checkpoint | Fine-grained control |
PER_SEVERITY |
severity | Severity-based limits |
PER_DATA_ASSET |
data_asset | Asset-specific limits |
4.4 Hierarchical Multi-Level Rate Limiting¶
The CompositeThrottler facilitates the construction of hierarchical rate limit configurations:
throttling:
per_minute_limit: 10 # Short-term burst control
per_hour_limit: 100 # Medium-term control
per_day_limit: 500 # Long-term budget
burst_multiplier: 1.5 # 50% burst allowance
Evaluation Logic:
Request arrives
│
├─► Check per_minute limit
│ │
│ ├─► PASS → Check per_hour limit
│ │ │
│ │ ├─► PASS → Check per_day limit
│ │ │ │
│ │ │ ├─► PASS → ALLOWED
│ │ │ └─► FAIL → THROTTLED
│ │ └─► FAIL → THROTTLED
│ └─► FAIL → THROTTLED
4.5 Implementation Guidelines¶
Recommended Configuration Parameters¶
| Channel Type | per_minute | per_hour | per_day | Rationale |
|---|---|---|---|---|
| PagerDuty | 5 | 20 | 100 | On-call fatigue prevention |
| Slack | 20 | 200 | 1000 | Chat noise reduction |
| 5 | 50 | 200 | Inbox management |
Priority Bypass Mechanism¶
It is possible to configure critical notifications to bypass throttling constraints entirely:
When this mechanism is enabled, notifications bearing severity=critical are permitted to bypass all rate limits.
5. Escalation Engine Architecture¶
5.1 Theoretical Foundation¶
Escalation management is implemented through a Finite State Machine (FSM) that governs the lifecycle of incident tracking. This approach is derived from established incident management frameworks, including ITIL, as well as contemporary Site Reliability Engineering (SRE) practices.
5.1.1 Formal State Machine Definition¶
The escalation state machine is formally defined as follows:
FSM = (S, Σ, δ, s₀, F)
Where:
S = {PENDING, ACTIVE, ESCALATING, ACKNOWLEDGED, RESOLVED, CANCELLED, TIMED_OUT, FAILED}
Σ = {start, ack, resolve, cancel, timeout, escalate, error}
s₀ = PENDING
F = {RESOLVED, CANCELLED, TIMED_OUT, FAILED}
5.1.2 State Transition Diagram¶
┌─────────────────────────┐
│ PENDING │
└────────────┬────────────┘
│ start()
▼
┌───────────────────────────────────────────────┐
│ ACTIVE │
└────┬─────────────┬────────────────┬───────────┘
│ │ │
│ ack() │ timeout │ escalate()
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────────┐
│ACKNOWLEDGED│ │ TIMED_OUT │ │ ESCALATING │
└──────┬─────┘ └────────────┘ └───────┬────────┘
│ │
│ resolve() │ next_level
▼ ▼
┌────────────┐ ┌────────────────┐
│ RESOLVED │ │ ACTIVE │
└────────────┘ │ (next level) │
└────────────────┘
cancel() from any state → CANCELLED
error from any state → FAILED
5.2 Escalation Level Configuration¶
Each escalation level is defined by the following parameter set:
| Parameter | Type | Description |
|---|---|---|
level |
int | Level number (1 = first) |
delay_minutes |
int | Delay before escalating to next level |
targets |
list[EscalationTarget] | Notification recipients |
repeat_count |
int | Number of times to repeat at this level |
repeat_interval_minutes |
int | Interval between repeats |
require_ack |
bool | Whether acknowledgment is required |
auto_resolve_minutes |
int | Auto-resolve timeout (0 = disabled) |
5.3 Target Type Classification¶
Escalation targets represent the notification recipients to which alerts are dispatched:
| Target Type | Identifier Format | Description |
|---|---|---|
user |
User ID | Individual user |
team |
Team ID | Team/group |
channel |
Channel ID | Slack channel, etc. |
schedule |
Schedule ID | On-call schedule |
webhook |
URL | Webhook endpoint |
email |
Email address | Direct email |
phone |
Phone number | SMS/voice call |
5.4 Trigger Condition Taxonomy¶
Escalation may be initiated by a variety of conditions, which are categorized as follows:
| Trigger | Description |
|---|---|
UNACKNOWLEDGED |
Alert not acknowledged within timeout |
UNRESOLVED |
Incident not resolved within timeout |
SEVERITY_UPGRADE |
Severity level increased |
REPEATED_FAILURE |
Same issue recurring |
THRESHOLD_BREACH |
Metric exceeded threshold |
MANUAL |
Manual trigger by operator |
SCHEDULED |
Time-based trigger |
5.5 Automated Escalation by Event Severity¶
The NotificationDispatcher is configured to automatically initiate escalation procedures for events of elevated severity:
| Event Type | Condition | Escalation Policy |
|---|---|---|
ValidationFailedEvent |
has_critical=true |
critical_alert |
ValidationFailedEvent |
has_high=true |
high_alert |
DriftDetectedEvent |
has_high_drift=true |
high_alert |
ScheduleFailedEvent |
Always | high_alert |
SchemaChangedEvent |
has_breaking_changes=true |
high_alert |
5.6 Implementation Guidelines¶
Principles for Escalation Policy Design¶
| Principle | Recommendation |
|---|---|
| Level Count | 3-4 levels typically sufficient |
| Timeout Progression | Exponential backoff (5min → 15min → 30min) |
| Final Level | Must reach decision makers |
| Acknowledgment | Require ack at all levels |
Example Policy Configuration¶
escalation_policies:
- name: critical_production
description: Critical production issue escalation
levels:
- level: 1
delay_minutes: 0
targets:
- type: user
identifier: team-lead
name: Team Lead
- type: channel
identifier: "#alerts-critical"
name: Critical Alerts
repeat_count: 2
repeat_interval_minutes: 5
require_ack: true
- level: 2
delay_minutes: 15
targets:
- type: user
identifier: engineering-manager
name: Engineering Manager
- type: schedule
identifier: primary-oncall
name: Primary On-call
- level: 3
delay_minutes: 30
targets:
- type: user
identifier: director
name: Director of Engineering
- type: email
identifier: leadership@company.com
name: Leadership Team
triggers:
- unacknowledged
severity_filter:
- critical
- high
cooldown_minutes: 60
max_escalations: 5
6. Notification Channel Integration¶
6.1 Supported Channels¶
The dashboard is integrated with the following notification action implementations provided by the truthound library:
| Channel | Truthound Action | Protocol |
|---|---|---|
| Slack | SlackNotification |
Incoming Webhook |
EmailNotification |
SMTP/SendGrid/SES | |
| Microsoft Teams | TeamsNotification |
Adaptive Cards |
| Discord | DiscordNotification |
Embed Webhook |
| Telegram | TelegramNotification |
Bot API |
| PagerDuty | PagerDutyAction |
Events API v2 |
| OpsGenie | OpsGenieAction |
REST API |
| Webhook | WebhookAction |
HTTP POST |
| GitHub | GitHubAction |
Issues API |
6.2 Channel Configuration Specifications¶
Each channel type is associated with specific configuration requirements, as detailed below.
Slack Configuration¶
| Parameter | Required | Description |
|---|---|---|
webhook_url |
Yes | Slack Incoming Webhook URL |
channel |
No | Channel override (#channel) |
username |
No | Bot display name |
icon_emoji |
No | Bot icon (:emoji:) |
mention_on_failure |
No | User IDs to mention |
Email Configuration¶
| Parameter | Required | Description |
|---|---|---|
from_address |
Yes | Sender email address |
to_addresses |
Yes | Recipient email addresses |
smtp_host |
Conditional | SMTP server (if provider=smtp) |
provider |
No | smtp, sendgrid, or ses |
api_key |
Conditional | API key (if provider=sendgrid/ses) |
7. Statistical Monitoring and Observability¶
7.1 Metric Categories¶
The Advanced Notifications system exposes two distinct categories of operational metrics:
| Category | Source | Description |
|---|---|---|
| Dashboard Stats | SQLite Database | Configuration counts |
| Runtime Stats | Truthound Library | Processing metrics |
7.2 Key Performance Indicators¶
| KPI | Formula | Target | Action if Exceeded |
|---|---|---|---|
| Dedup Ratio | suppressed / evaluated | 10-40% | Review fingerprint strategy |
| Throttle Rate | throttled / checked | < 10% | Increase rate limits |
| Ack Rate | acknowledged / triggered | > 80% | Review escalation timeouts |
| MTTA | avg(ack_time - trigger_time) | < 15 min | Review escalation targets |
| MTTR | avg(resolve_time - trigger_time) | < 60 min | Review resolution process |
7.3 Statistics API¶
Response:
{
"routing": {
"total_routes": 12,
"mode": "all_matches"
},
"deduplication": {
"total_evaluated": 1250,
"suppressed": 340,
"suppression_ratio": 0.272,
"active_fingerprints": 45
},
"throttling": {
"total_checked": 1250,
"total_allowed": 1150,
"total_throttled": 100,
"throttle_rate": 0.08
},
"escalation": {
"total_escalations": 23,
"active_escalations": 2,
"acknowledged_count": 18,
"resolved_count": 15,
"acknowledgment_rate": 0.78,
"avg_time_to_acknowledge": 420
}
}
8. Configuration Management¶
8.1 Configuration Export¶
All Advanced Notification configurations may be exported via the following endpoint:
GET /notifications/config/export?include_routing_rules=true&include_deduplication=true&include_throttling=true&include_escalation=true
8.2 Configuration Import¶
Configurations may be imported with configurable conflict resolution semantics:
POST /notifications/config/import
Content-Type: application/json
{
"bundle": { ... },
"conflict_resolution": "skip" | "overwrite" | "rename"
}
8.3 Configuration Bundle Schema¶
{
"version": "1.0",
"exported_at": "2025-01-29T12:00:00Z",
"routing_rules": [...],
"deduplication_configs": [...],
"throttling_configs": [...],
"escalation_policies": [...]
}
9. Template Library¶
The Template Library constitutes a curated registry of pre-built notification configurations, thereby enabling the rapid deployment of proven orchestration patterns without the necessity for manual parameter tuning.
9.1 Overview¶
Each template encapsulates a complete configuration for one of the four notification subsystems. Templates are organized by their target subsystem and are annotated with descriptive metadata to facilitate efficient discovery.
| Attribute | Description |
|---|---|
| Name | Human-readable template identifier |
| Category | Target subsystem (routing, deduplication, throttling, escalation) |
| Description | Functional summary of the template's purpose |
| Tags | Searchable keywords for discovery |
| Configuration | Pre-defined parameter values for the target subsystem |
9.2 Available Template Categories¶
| Category | Purpose | Example Templates |
|---|---|---|
| Routing | Content-based notification distribution rules | Severity-based routing, source-type routing |
| Deduplication | Duplicate suppression configurations | Aggressive dedup, conservative dedup |
| Throttling | Rate limiting presets | Burst-friendly throttle, strict rate limit |
| Escalation | Multi-level alert policies | On-call escalation, business-hours policy |
9.3 Template Selection Workflow¶
The Template Library implements an integrated workflow that bridges template selection with configuration editing. The process is conducted as follows:
- Browse and Search: The Template Library panel is opened to browse available templates. The search field or category tabs may be employed to locate a relevant template.
- Preview: The template's description, tags, and configuration summary are reviewed prior to selection.
- Apply: A template is selected to initiate the application process. The system performs the following actions automatically:
- Tab Navigation: The active tab is switched to the subsystem matching the template's category (e.g., selecting a throttling template activates the Throttling tab).
- Dialog Auto-Open: The corresponding configuration dialog is opened with the template's pre-filled values, thereby allowing the user to review and adjust parameters before saving.
- Quick Templates Hidden: When a template is applied from the Template Library, the in-dialog Quick Templates selector is hidden to avoid confusion between the externally applied template and the dialog's built-in presets.
- Save or Discard: The pre-filled values may be modified and saved, or the dialog may be closed to discard the template application.
9.4 Active Template Indicator¶
When a template has been actively applied, the Template Library panel displays an indicator banner comprising the following elements:
| Element | Description |
|---|---|
| Category Icon | Visual icon corresponding to the template's subsystem |
| Template Name | Name of the currently applied template |
| Category Badge | Labeled badge showing the target subsystem |
| Dismiss Button | Allows the user to clear the active template selection |
This indicator persists for the duration of the current page session. Navigation away from the Advanced Notifications page results in the clearance of the active template state, as templates are designed to serve as ephemeral configuration aids rather than persistent selections.
9.5 Relationship to Quick Templates¶
Each tab's configuration dialog also provides an independent Quick Templates selector for rapid in-context configuration. The Template Library and Quick Templates serve complementary roles, as delineated below:
| Feature | Template Library | Quick Templates |
|---|---|---|
| Scope | Cross-subsystem, centralized | Per-subsystem, contextual |
| Access | From the page header | Within each configuration dialog |
| Behavior | Switches tab + opens dialog | Fills form within current dialog |
| Visibility | Always visible | Hidden when Template Library applies a template |
10. API Reference¶
10.1 Routing Rules API¶
| Endpoint | Method | Description |
|---|---|---|
/notifications/routing/rules |
GET | List routing rules |
/notifications/routing/rules |
POST | Create routing rule |
/notifications/routing/rules/{id} |
GET | Get routing rule |
/notifications/routing/rules/{id} |
PUT | Update routing rule |
/notifications/routing/rules/{id} |
DELETE | Delete routing rule |
/notifications/routing/rules/types |
GET | List available rule types |
/notifications/routing/rules/test |
POST | Test rule against context |
10.2 Deduplication API¶
| Endpoint | Method | Description |
|---|---|---|
/notifications/deduplication/configs |
GET | List configs |
/notifications/deduplication/configs |
POST | Create config |
/notifications/deduplication/configs/{id} |
GET | Get config |
/notifications/deduplication/configs/{id} |
PUT | Update config |
/notifications/deduplication/configs/{id} |
DELETE | Delete config |
/notifications/deduplication/stats |
GET | Get runtime stats |
10.3 Throttling API¶
| Endpoint | Method | Description |
|---|---|---|
/notifications/throttling/configs |
GET | List configs |
/notifications/throttling/configs |
POST | Create config |
/notifications/throttling/configs/{id} |
GET | Get config |
/notifications/throttling/configs/{id} |
PUT | Update config |
/notifications/throttling/configs/{id} |
DELETE | Delete config |
/notifications/throttling/stats |
GET | Get runtime stats |
10.4 Escalation API¶
| Endpoint | Method | Description |
|---|---|---|
/notifications/escalation/policies |
GET | List policies |
/notifications/escalation/policies |
POST | Create policy |
/notifications/escalation/policies/{id} |
GET | Get policy |
/notifications/escalation/policies/{id} |
PUT | Update policy |
/notifications/escalation/policies/{id} |
DELETE | Delete policy |
/notifications/escalation/incidents |
GET | List incidents |
/notifications/escalation/incidents/active |
GET | List active incidents |
/notifications/escalation/incidents/{id} |
GET | Get incident |
/notifications/escalation/incidents/{id}/acknowledge |
POST | Acknowledge incident |
/notifications/escalation/incidents/{id}/resolve |
POST | Resolve incident |
/notifications/escalation/stats |
GET | Get runtime stats |
11. Recommended Operational Practices¶
11.1 Routing Practices¶
| Practice | Description |
|---|---|
| Prioritize specificity in rule ordering | Lower priority numbers should be assigned to more specific conditions |
| Ensure provision of a default route | All notifications must be guaranteed a destination |
| Conduct pre-deployment validation | The test endpoint should be utilized to validate rules prior to deployment |
| Maintain comprehensive documentation | Routing logic should be documented for team reference and auditability |
11.2 Deduplication Practices¶
| Practice | Description |
|---|---|
| Commence with moderate window durations | A 5-minute default is recommended, with subsequent adjustment based on observed data |
| Monitor the suppression ratio | A ratio of 10-40% is indicative of healthy deduplication behavior |
| Employ the SEVERITY policy | This policy provides an appropriate balance between deduplication and visibility |
| Review active fingerprint counts | Elevated counts may be indicative of memory pressure |
11.3 Throttling Practices¶
| Practice | Description |
|---|---|
| Establish conservative initial limits | It is considered preferable to relax constraints than to tighten them retrospectively |
| Configure per-channel limits | Channels of differing urgency levels necessitate distinct limit configurations |
| Enable the priority bypass mechanism | Critical alerts should not be subjected to throttling |
| Monitor the throttle rate | A rate exceeding 10% may be indicative of configuration deficiencies |
11.4 Escalation Practices¶
| Practice | Description |
|---|---|
| Design for progressive urgency | Subsequent levels should be configured to reach personnel of higher authority |
| Employ reasonable timeout intervals | A progression of 5-15-30 minutes has been found to be effective in practice |
| Mandate acknowledgment at all levels | This ensures that human operators remain engaged in the resolution process |
| Conduct periodic escalation path testing | Regular testing is essential to verify continued operational functionality |
| Configure appropriate cooldown periods | Cooldowns are necessary to prevent the occurrence of escalation storms |
References¶
- Beyer, B., et al. (2016). Site Reliability Engineering. O'Reilly Media.
- Turner, J. (1986). New directions in communications (or which way to the information age?). IEEE Communications Magazine, 24(10), 8-15. [Token Bucket Algorithm]
- ITIL Foundation (2019). ITIL 4 Foundation. Axelos.
- truthound Documentation. https://truthound.readthedocs.io/