Observability¶

The Observability module provides comprehensive system monitoring capabilities through three interconnected pillars: Audit Logging, Metrics Collection, and Distributed Tracing. This facility leverages the observability infrastructure native to truthound's store layer, thereby furnishing enterprise-grade visibility into the operational behavior of the system.

General Overview¶

Observability within contemporary data quality systems transcends the scope of traditional monitoring by affording deep insight into system behavior, performance characteristics, and operational patterns. The Truthound Dashboard implements the three canonical pillars of observability, enabling administrators to ascertain not merely what transpired, but why it transpired.

The Three Pillars of Observability¶

Pillar	Purpose	Key Questions Answered
Audit Logging	Immutable record of all operations	Who did what, when, and what was the outcome?
Metrics	Quantitative measurements over time	How is the system performing? What are the trends?
Tracing	Request flow across components	Where are bottlenecks? How do operations flow?

Theoretical Foundations¶

Audit Logging¶

Audit logging furnishes a chronological, immutable record of all significant operations performed within the system. In contrast to application logs, which are designed primarily for debugging purposes, audit logs are intended to serve compliance, security, and operational analysis objectives.

Audit Event Model¶

Each audit event captures contextual information in accordance with the W5 principle:

Dimension	Field	Description
Who	`user_id`, `session_id`	Identity of the actor
What	`event_type`, `store_type`	Operation performed
When	`timestamp`	Precise occurrence time
Where	`store_id`, `item_id`	Target of the operation
Why/Result	`status`, `error_message`	Outcome and context

Event Type Taxonomy¶

Category	Event Types	Description
CRUD Operations	create, read, update, delete	Standard data operations
Batch Operations	batch_create, batch_delete	Bulk data modifications
Query Operations	query, list, count	Data retrieval operations
Lifecycle Events	initialize, close, flush	Store lifecycle management
Sync Operations	replicate, sync, migrate, rollback	Data synchronization
Access Control	access_denied, access_granted	Security-related events
Errors	error, validation_error	Failure conditions

Audit Status Classification¶

Status	Description	Use Case
Success	Operation completed normally	Standard operations
Failure	Operation failed with error	Error analysis
Partial	Batch operation partially succeeded	Batch processing
Denied	Operation rejected by access control	Security monitoring

Metrics Collection¶

Metrics provide quantitative measurements that facilitate trend analysis, capacity planning, and performance optimization. The system collects four distinct metric types in accordance with the RED and USE methodologies.

Metric Type Definitions¶

Type	Description	Example
Counter	Monotonically increasing value	Total operations, errors
Gauge	Point-in-time measurement	Active connections, cache size
Histogram	Distribution of values	Request latency distribution
Summary	Statistical summary with quantiles	Response time percentiles

Store-Level Metrics¶

Metric Category	Metrics	Purpose
Operations	operations_total, operations_by_type	Throughput analysis
I/O	bytes_read_total, bytes_written_total	Data transfer volume
Connections	active_connections	Resource utilization
Cache	cache_hits, cache_misses, cache_hit_rate	Cache effectiveness
Errors	errors_total, errors_by_type	Reliability analysis
Latency	avg_operation_duration_ms	Performance tracking

Cache Hit Rate Interpretation¶

The cache hit rate is regarded as a critical indicator for evaluating system efficiency:

Hit Rate	Interpretation	Recommended Action
> 90%	Excellent	Maintain current configuration
70-90%	Good	Monitor for degradation
50-70%	Acceptable	Consider cache size increase
< 50%	Poor	Review access patterns, increase cache

Distributed Tracing¶

Distributed tracing provides visibility into the flow of requests across system components, thereby enabling the identification of latency bottlenecks and failure points within the execution path.

Fundamental Tracing Concepts¶

Concept	Description
Trace	End-to-end journey of a request
Span	Single unit of work within a trace
Context	Propagated metadata (trace_id, span_id)
Parent Span	The span that initiated the current span

Span Classification (SpanKind)¶

Kind	Description	Use Case
Internal	Internal operation	Business logic processing
Server	Server-side handler	API endpoint processing
Client	Client-side request	External service calls
Producer	Message producer	Async message sending
Consumer	Message consumer	Async message processing

Observability Interface Specification¶

The Observability page presents a unified view organized into five distinct tabs, each of which is described in the subsections that follow.

1. Overview Tab¶

This tab displays key summary statistics drawn from all three observability pillars:

Card	Metrics	Purpose
Total Events	Audit event count	Volume indicator
Events Today	Today's event count	Current activity
Error Rate	Failure percentage	System health
Cache Hit Rate	Cache effectiveness	Performance indicator

2. Audit Tab¶

This tab provides audit event exploration with comprehensive filtering capabilities.

Filter	Description
Event Type	Filter by specific operation type
Status	Filter by outcome (success, failure, partial, denied)
Time Range	Filter by start and end time
Item ID	Filter by specific data item

Audit Table Column Definitions¶

Column	Description
Event ID	Unique identifier
Type	Operation type
Timestamp	When the event occurred
Status	Operation outcome
Store	Target store type
Duration	Operation duration in milliseconds

3. Metrics Tab¶

This tab displays store-level metrics, organized by category as detailed below.

Operations Metrics¶

Metric	Description
Operations Total	Cumulative operation count
Operations by Type	Breakdown by operation type

I/O Metrics¶

Metric	Description
Bytes Read	Total data read volume
Bytes Written	Total data written volume

Cache Metrics¶

Metric	Description
Cache Hits	Successful cache retrievals
Cache Misses	Cache misses requiring data fetch
Hit Rate	Percentage of successful cache hits

Error Metrics¶

Metric	Description
Errors Total	Cumulative error count
Errors by Type	Breakdown by error category

4. Tracing Tab¶

This tab displays distributed tracing statistics when tracing has been enabled in the system configuration.

Metric	Description
Total Traces	Number of complete traces
Total Spans	Number of individual spans
Avg Trace Duration	Average end-to-end latency
Traces Today	Today's trace count
Error Rate	Percentage of failed spans
By Service	Breakdown by service name

5. Configuration Tab¶

This tab enables the configuration of observability features through the following parameters.

Configuration Parameters¶

Setting	Type	Default	Description
Enable Audit	Boolean	true	Toggle audit logging
Enable Metrics	Boolean	true	Toggle metrics collection
Enable Tracing	Boolean	false	Toggle distributed tracing
Audit Log Path	String	null	File path for audit log persistence
Audit Rotate Daily	Boolean	false	Enable daily log rotation
Audit Max Events	Integer	10000	Maximum events in memory
Redact Fields	Array	[]	Fields to redact from audit logs
Metrics Prefix	String	"truthound"	Prefix for metric names
Tracing Service Name	String	"dashboard"	Service identifier in traces
Tracing Endpoint	String	null	OpenTelemetry collector endpoint

Data Privacy and Field-Level Redaction¶

The observability system incorporates field-level redaction mechanisms to ensure that sensitive data is appropriately protected within audit logs.

Redaction Configuration¶

The following field types may be configured for automatic redaction:

Field Type	Example Fields	Redaction Behavior
Authentication	password, token, api_key	Replace with `[REDACTED]`
PII	ssn, email, phone	Replace with `[REDACTED]`
Financial	credit_card, account_number	Replace with `[REDACTED]`

Redaction Guidelines¶

Practice	Recommendation
Default Redaction	Configure common sensitive fields globally
Audit Review	Periodically review logs for data leakage
Compliance Alignment	Match redaction to regulatory requirements

Recommended Operational Practices¶

Audit Log Management¶

Practice	Recommendation
Retention	Define retention aligned with compliance
Rotation	Enable daily rotation for large deployments
Analysis	Regular review of error and denial patterns
Archival	Archive old logs to cold storage

Metrics Utilization¶

Practice	Recommendation
Baseline	Establish normal operating baselines
Alerting	Configure alerts for metric anomalies
Trending	Track metrics over time for capacity planning
Dashboard	Create role-specific metric dashboards

Tracing Implementation¶

Practice	Recommendation
Sampling	Use sampling for high-volume systems
Context Propagation	Ensure trace context flows across boundaries
Span Naming	Use consistent, descriptive span names
Error Tagging	Tag spans with error information

Integration with the Truthound Core Library¶

The Observability module is integrated with truthound's store observability infrastructure, as described in the following subsections.

Store Manager Integration¶

The dashboard's StoreManager component provides layered observability through the following architectural stack:

Layer	Component	Function
Base	Store	Core data operations
Versioning	VersionedStore	Change tracking
Caching	CachedStore	Performance optimization
Tiering	TieredStore	Data lifecycle management
Observability	AuditLogger, Metrics	Monitoring

Audit Logger¶

The truthound AuditLogger is responsible for automatically capturing:

All store CRUD operations
Operation timing and duration
Success and failure outcomes
User and session context

Metrics Collector¶

The truthound metrics subsystem provides the following capabilities:

Automatic metric instrumentation
Prometheus-compatible metric format
Histogram buckets for latency distribution
Label-based metric dimensions

Diagnostic and Troubleshooting Procedures¶

Common Issues and Resolutions¶

Issue	Resolution
Missing Audit Events	Verify audit logging is enabled
High Error Rate	Review error_message in audit events
Low Cache Hit Rate	Increase cache size or TTL
Tracing Not Working	Verify tracing endpoint configuration

Performance Considerations¶

Concern	Mitigation
Audit Log Growth	Enable rotation, configure max_events
Metrics Overhead	Use appropriate collection interval
Tracing Volume	Implement sampling for high throughput

API Reference¶

Configuration Endpoints¶

Endpoint	Method	Description
`/observability/config`	GET	Retrieve observability configuration
`/observability/config`	PUT	Update observability configuration

Statistics Endpoints¶

Endpoint	Method	Description
`/observability/stats`	GET	Get combined observability statistics

Audit Endpoints¶

Endpoint	Method	Description
`/observability/audit/events`	GET	List audit events with filters
`/observability/audit/stats`	GET	Get audit statistics

Metrics Endpoints¶

Endpoint	Method	Description
`/observability/metrics`	GET	Get all metrics
`/observability/metrics/store`	GET	Get store-specific metrics

Tracing Endpoints¶

Endpoint	Method	Description
`/observability/tracing/stats`	GET	Get tracing statistics
`/observability/tracing/spans`	GET	List spans with pagination

References¶

Industry Standards¶

Standard	Description
OpenTelemetry	Unified observability framework
Prometheus	Metrics collection and alerting
Jaeger/Zipkin	Distributed tracing systems

Maintenance - System maintenance and cleanup
Storage Tiering - Data lifecycle management
Alerts - Alert management and notification