Skip to content

Observability

The Observability module provides comprehensive system monitoring capabilities through three interconnected pillars: Audit Logging, Metrics Collection, and Distributed Tracing. This facility leverages the observability infrastructure native to truthound's store layer, thereby furnishing enterprise-grade visibility into the operational behavior of the system.

General Overview

Observability within contemporary data quality systems transcends the scope of traditional monitoring by affording deep insight into system behavior, performance characteristics, and operational patterns. The Truthound Dashboard implements the three canonical pillars of observability, enabling administrators to ascertain not merely what transpired, but why it transpired.

The Three Pillars of Observability

Pillar Purpose Key Questions Answered
Audit Logging Immutable record of all operations Who did what, when, and what was the outcome?
Metrics Quantitative measurements over time How is the system performing? What are the trends?
Tracing Request flow across components Where are bottlenecks? How do operations flow?

Theoretical Foundations

Audit Logging

Audit logging furnishes a chronological, immutable record of all significant operations performed within the system. In contrast to application logs, which are designed primarily for debugging purposes, audit logs are intended to serve compliance, security, and operational analysis objectives.

Audit Event Model

Each audit event captures contextual information in accordance with the W5 principle:

Dimension Field Description
Who user_id, session_id Identity of the actor
What event_type, store_type Operation performed
When timestamp Precise occurrence time
Where store_id, item_id Target of the operation
Why/Result status, error_message Outcome and context

Event Type Taxonomy

Category Event Types Description
CRUD Operations create, read, update, delete Standard data operations
Batch Operations batch_create, batch_delete Bulk data modifications
Query Operations query, list, count Data retrieval operations
Lifecycle Events initialize, close, flush Store lifecycle management
Sync Operations replicate, sync, migrate, rollback Data synchronization
Access Control access_denied, access_granted Security-related events
Errors error, validation_error Failure conditions

Audit Status Classification

Status Description Use Case
Success Operation completed normally Standard operations
Failure Operation failed with error Error analysis
Partial Batch operation partially succeeded Batch processing
Denied Operation rejected by access control Security monitoring

Metrics Collection

Metrics provide quantitative measurements that facilitate trend analysis, capacity planning, and performance optimization. The system collects four distinct metric types in accordance with the RED and USE methodologies.

Metric Type Definitions

Type Description Example
Counter Monotonically increasing value Total operations, errors
Gauge Point-in-time measurement Active connections, cache size
Histogram Distribution of values Request latency distribution
Summary Statistical summary with quantiles Response time percentiles

Store-Level Metrics

Metric Category Metrics Purpose
Operations operations_total, operations_by_type Throughput analysis
I/O bytes_read_total, bytes_written_total Data transfer volume
Connections active_connections Resource utilization
Cache cache_hits, cache_misses, cache_hit_rate Cache effectiveness
Errors errors_total, errors_by_type Reliability analysis
Latency avg_operation_duration_ms Performance tracking

Cache Hit Rate Interpretation

The cache hit rate is regarded as a critical indicator for evaluating system efficiency:

Hit Rate Interpretation Recommended Action
> 90% Excellent Maintain current configuration
70-90% Good Monitor for degradation
50-70% Acceptable Consider cache size increase
< 50% Poor Review access patterns, increase cache

Distributed Tracing

Distributed tracing provides visibility into the flow of requests across system components, thereby enabling the identification of latency bottlenecks and failure points within the execution path.

Fundamental Tracing Concepts

Concept Description
Trace End-to-end journey of a request
Span Single unit of work within a trace
Context Propagated metadata (trace_id, span_id)
Parent Span The span that initiated the current span

Span Classification (SpanKind)

Kind Description Use Case
Internal Internal operation Business logic processing
Server Server-side handler API endpoint processing
Client Client-side request External service calls
Producer Message producer Async message sending
Consumer Message consumer Async message processing

Observability Interface Specification

The Observability page presents a unified view organized into five distinct tabs, each of which is described in the subsections that follow.

1. Overview Tab

This tab displays key summary statistics drawn from all three observability pillars:

Card Metrics Purpose
Total Events Audit event count Volume indicator
Events Today Today's event count Current activity
Error Rate Failure percentage System health
Cache Hit Rate Cache effectiveness Performance indicator

2. Audit Tab

This tab provides audit event exploration with comprehensive filtering capabilities.

Available Filter Options

Filter Description
Event Type Filter by specific operation type
Status Filter by outcome (success, failure, partial, denied)
Time Range Filter by start and end time
Item ID Filter by specific data item

Audit Table Column Definitions

Column Description
Event ID Unique identifier
Type Operation type
Timestamp When the event occurred
Status Operation outcome
Store Target store type
Duration Operation duration in milliseconds

3. Metrics Tab

This tab displays store-level metrics, organized by category as detailed below.

Operations Metrics

Metric Description
Operations Total Cumulative operation count
Operations by Type Breakdown by operation type

I/O Metrics

Metric Description
Bytes Read Total data read volume
Bytes Written Total data written volume

Cache Metrics

Metric Description
Cache Hits Successful cache retrievals
Cache Misses Cache misses requiring data fetch
Hit Rate Percentage of successful cache hits

Error Metrics

Metric Description
Errors Total Cumulative error count
Errors by Type Breakdown by error category

4. Tracing Tab

This tab displays distributed tracing statistics when tracing has been enabled in the system configuration.

Metric Description
Total Traces Number of complete traces
Total Spans Number of individual spans
Avg Trace Duration Average end-to-end latency
Traces Today Today's trace count
Error Rate Percentage of failed spans
By Service Breakdown by service name

5. Configuration Tab

This tab enables the configuration of observability features through the following parameters.

Configuration Parameters

Setting Type Default Description
Enable Audit Boolean true Toggle audit logging
Enable Metrics Boolean true Toggle metrics collection
Enable Tracing Boolean false Toggle distributed tracing
Audit Log Path String null File path for audit log persistence
Audit Rotate Daily Boolean false Enable daily log rotation
Audit Max Events Integer 10000 Maximum events in memory
Redact Fields Array [] Fields to redact from audit logs
Metrics Prefix String "truthound" Prefix for metric names
Tracing Service Name String "dashboard" Service identifier in traces
Tracing Endpoint String null OpenTelemetry collector endpoint

Data Privacy and Field-Level Redaction

The observability system incorporates field-level redaction mechanisms to ensure that sensitive data is appropriately protected within audit logs.

Redaction Configuration

The following field types may be configured for automatic redaction:

Field Type Example Fields Redaction Behavior
Authentication password, token, api_key Replace with [REDACTED]
PII ssn, email, phone Replace with [REDACTED]
Financial credit_card, account_number Replace with [REDACTED]

Redaction Guidelines

Practice Recommendation
Default Redaction Configure common sensitive fields globally
Audit Review Periodically review logs for data leakage
Compliance Alignment Match redaction to regulatory requirements

Audit Log Management

Practice Recommendation
Retention Define retention aligned with compliance
Rotation Enable daily rotation for large deployments
Analysis Regular review of error and denial patterns
Archival Archive old logs to cold storage

Metrics Utilization

Practice Recommendation
Baseline Establish normal operating baselines
Alerting Configure alerts for metric anomalies
Trending Track metrics over time for capacity planning
Dashboard Create role-specific metric dashboards

Tracing Implementation

Practice Recommendation
Sampling Use sampling for high-volume systems
Context Propagation Ensure trace context flows across boundaries
Span Naming Use consistent, descriptive span names
Error Tagging Tag spans with error information

Integration with the Truthound Core Library

The Observability module is integrated with truthound's store observability infrastructure, as described in the following subsections.

Store Manager Integration

The dashboard's StoreManager component provides layered observability through the following architectural stack:

Layer Component Function
Base Store Core data operations
Versioning VersionedStore Change tracking
Caching CachedStore Performance optimization
Tiering TieredStore Data lifecycle management
Observability AuditLogger, Metrics Monitoring

Audit Logger

The truthound AuditLogger is responsible for automatically capturing:

  • All store CRUD operations
  • Operation timing and duration
  • Success and failure outcomes
  • User and session context

Metrics Collector

The truthound metrics subsystem provides the following capabilities:

  • Automatic metric instrumentation
  • Prometheus-compatible metric format
  • Histogram buckets for latency distribution
  • Label-based metric dimensions

Diagnostic and Troubleshooting Procedures

Common Issues and Resolutions

Issue Resolution
Missing Audit Events Verify audit logging is enabled
High Error Rate Review error_message in audit events
Low Cache Hit Rate Increase cache size or TTL
Tracing Not Working Verify tracing endpoint configuration

Performance Considerations

Concern Mitigation
Audit Log Growth Enable rotation, configure max_events
Metrics Overhead Use appropriate collection interval
Tracing Volume Implement sampling for high throughput

API Reference

Configuration Endpoints

Endpoint Method Description
/observability/config GET Retrieve observability configuration
/observability/config PUT Update observability configuration

Statistics Endpoints

Endpoint Method Description
/observability/stats GET Get combined observability statistics

Audit Endpoints

Endpoint Method Description
/observability/audit/events GET List audit events with filters
/observability/audit/stats GET Get audit statistics

Metrics Endpoints

Endpoint Method Description
/observability/metrics GET Get all metrics
/observability/metrics/store GET Get store-specific metrics

Tracing Endpoints

Endpoint Method Description
/observability/tracing/stats GET Get tracing statistics
/observability/tracing/spans GET List spans with pagination

References

Industry Standards

Standard Description
OpenTelemetry Unified observability framework
Prometheus Metrics collection and alerting
Jaeger/Zipkin Distributed tracing systems