Sampling Strategies
This document describes sampling strategies for processing large datasets.
Overview
The sampling system implemented in src/truthound/profiler/sampling.py provides 8 different strategies.
SamplingMethod Enum
class SamplingMethod(str, Enum):
    """Sampling strategies"""
    NONE = "none"              # No sampling (full data)
    RANDOM = "random"          # Random sampling
    SYSTEMATIC = "systematic"  # Systematic sampling (every Nth row)
    STRATIFIED = "stratified"  # Stratified sampling
    RESERVOIR = "reservoir"    # Reservoir sampling (streaming)
    ADAPTIVE = "adaptive"      # Adaptive sampling (automatic selection)
    HEAD = "head"              # First N rows
    HASH = "hash"              # Hash-based (reproducible)
SamplingConfig
@dataclass
class SamplingConfig:
    """Sampling configuration"""
    strategy: SamplingMethod = SamplingMethod.ADAPTIVE
    max_rows: int = 100_000          # Maximum sample size
    confidence_level: float = 0.95   # Confidence level (0.0-1.0)
    random_seed: int | None = None   # Random seed (for reproducibility)
    # Stratified sampling options
    stratify_column: str | None = None
    # Hash sampling options
    hash_column: str | None = None
SamplingMetrics
@dataclass
class SamplingMetrics:
    """Sampling result metrics"""
    original_row_count: int   # Original row count
    sampled_row_count: int    # Sampled row count
    sampling_ratio: float     # Sampling ratio
    confidence_level: float   # Confidence level
    margin_of_error: float    # Margin of error
    strategy_used: SamplingMethod
    execution_time_ms: float
Strategy-Specific Usage
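NONE - No Sampling
import polars as pl
from truthound.profiler.sampling import Sampler, SamplingConfig, SamplingMethod
lf = pl.scan_csv("data.csv")  # Any polars LazyFrame; reused as `lf` in the examples below
config = SamplingConfig(strategy=SamplingMethod.NONE)
sampler = Sampler(config)
result = sampler.sample(lf)
# Returns the full data, unsampled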
RANDOM - Random Sampling
config = SamplingConfig(
    strategy=SamplingMethod.RANDOM,
    max_rows=10_000,
    random_seed=42,
)
sampler = Sampler(config)
result = sampler.sample(lf)
print(f"Sampled: {result.metrics.sampled_row_count}")
print(f"Margin of error: {result.metrics.margin_of_error:.2%}")
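This page does not say how `margin_of_error` is computed. One common choice for a sampled proportion is the normal approximation with a finite population correction. The sketch below assumes that formula and the worst-case proportion p = 0.5; the function name `estimate_margin_of_error` is hypothetical and not part of truthound.

```python
import math
from statistics import NormalDist


def estimate_margin_of_error(
    population_size: int,
    sample_size: int,
    confidence_level: float = 0.95,
    proportion: float = 0.5,  # worst case: maximizes the variance p * (1 - p)
) -> float:
    """Hypothetical helper: margin of error for a sampled proportion."""
    if not 0 < sample_size <= population_size:
        raise ValueError("sample_size must be in (0, population_size]")
    # Two-sided z-score, roughly 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    # Finite population correction: close to 1 for a small sampling ratio,
    # exactly 0 when the whole population is sampled.
    if population_size > 1:
        fpc = math.sqrt((population_size - sample_size) / (population_size - 1))
    else:
        fpc = 0.0
    return z * standard_error * fpc


moe = estimate_margin_of_error(population_size=1_000_000, sample_size=10_000)
print(f"Margin of error: {moe:.2%}")
```

Under these assumptions a 10,000-row sample from one million rows gives a margin of error just under 1%, which shows why a modest `max_rows` is often enough for profiling.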
SYSTEMATIC - Systematic Sampling
Selects every Nth row.
config = SamplingConfig(
    strategy=SamplingMethod.SYSTEMATIC,
    max_rows=10_000,
)
sampler = Sampler(config)
result = sampler.sample(lf)
# Rows are evenly spaced across the input, which works well for sorted data
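To make "every Nth row" concrete, here is a minimal pure-Python sketch. It assumes the step N is derived from the row count and `max_rows`, which this page does not confirm; `systematic_sample` and its `offset` parameter are hypothetical names.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def systematic_sample(rows: Sequence[T], max_rows: int, offset: int = 0) -> List[T]:
    """Hypothetical helper: keep every Nth row, with N derived from max_rows."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    if len(rows) <= max_rows:
        return list(rows)
    # Round the step up so the result never exceeds max_rows.
    step = math.ceil(len(rows) / max_rows)
    # The offset picks the starting row inside the first step-sized window.
    return list(rows[offset % step :: step])


sample = systematic_sample(list(range(100)), max_rows=10)
print(sample)
```

If the data has a periodic pattern whose period lines up with the step, systematic sampling can be biased, so a random `offset` is a common safeguard.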
STRATIFIED - Stratified Sampling
Samples while preserving the distribution of a chosen column.
config = SamplingConfig(
    strategy=SamplingMethod.STRATIFIED,
    max_rows=10_000,
    stratify_column="category",  # Maintain this column's distribution
)
sampler = Sampler(config)
result = sampler.sample(lf)
# Category proportions in the sample stay close to those of the original data
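The idea of preserving a column's distribution can be sketched with proportional allocation: each category contributes rows in proportion to its share of the data. This is an illustration over plain dictionaries, not truthound's implementation, and `stratified_sample` is a hypothetical name.

```python
import random
from collections import Counter, defaultdict
from typing import Any, Dict, List, Optional


def stratified_sample(
    rows: List[Dict[str, Any]],
    stratify_column: str,
    max_rows: int,
    seed: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: proportional allocation across strata."""
    if len(rows) <= max_rows:
        return list(rows)
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[stratify_column]].append(row)
    sample = []
    for group in strata.values():
        # Integer arithmetic avoids float rounding surprises. Flooring keeps the
        # total at or below max_rows; very rare categories may get zero rows
        # in this simplified version.
        take = len(group) * max_rows // len(rows)
        sample.extend(rng.sample(group, take))
    return sample


rows = [{"id": i, "category": "a" if i < 800 else "b"} for i in range(1000)]
sample = stratified_sample(rows, "category", max_rows=100, seed=42)
counts = Counter(row["category"] for row in sample)
print(counts)
```

The 80/20 split of the input carries over to the sample. A production version would usually also guarantee a minimum number of rows per stratum.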
RESERVOIR - Reservoir Sampling
An algorithm suited to streaming data whose total size is not known in advance.
config = SamplingConfig(
    strategy=SamplingMethod.RESERVOIR,
    max_rows=10_000,
)
sampler = Sampler(config)
result = sampler.sample(lf)
# Equal-probability sampling in a single pass, with O(k) memory for a sample of k rows
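For readers unfamiliar with the technique, the classic form is Algorithm R, sketched below in pure Python. Whether truthound uses this exact variant is an assumption; `reservoir_sample` is a hypothetical name.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Algorithm R: uniform sample of up to k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # randint is inclusive on both ends, so the item at position `index`
            # lands in the reservoir with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir


sample = reservoir_sample(iter(range(1_000)), k=10, seed=42)
print(sample)
```

The stream is read once and only the k reservoir items are held in memory, which is what makes the method attractive when the input cannot be materialized.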
ADAPTIVE - Adaptive Sampling
Automatically selects a strategy based on data size.
config = SamplingConfig(
    strategy=SamplingMethod.ADAPTIVE,
    max_rows=50_000,
    confidence_level=0.95,
)
sampler = Sampler(config)
result = sampler.sample(lf)
# Automatic selection logic:
# - Small datasets: NONE
# - Medium datasets: RANDOM
# - Large datasets: RESERVOIR or HASH
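The selection rules above could be expressed as a small dispatch function. The thresholds are not documented on this page, so the cut-off below (`LARGE_DATASET_ROWS`) and the rule for choosing between RESERVOIR and HASH are purely illustrative assumptions, and `choose_strategy` is a hypothetical name.

```python
from enum import Enum
from typing import Optional


class SamplingMethod(str, Enum):
    # Local copy of the relevant enum members so the sketch runs standalone.
    NONE = "none"
    RANDOM = "random"
    RESERVOIR = "reservoir"
    HASH = "hash"


# Hypothetical threshold; the real cut-off is not stated in this document.
LARGE_DATASET_ROWS = 10_000_000


def choose_strategy(
    row_count: int,
    max_rows: int,
    hash_column: Optional[str] = None,
) -> SamplingMethod:
    """Hypothetical helper mirroring the small/medium/large rules above."""
    if row_count <= max_rows:
        # The data already fits within the sample budget.
        return SamplingMethod.NONE
    if row_count < LARGE_DATASET_ROWS:
        return SamplingMethod.RANDOM
    # Assumption: prefer HASH only when a hash column has been configured.
    return SamplingMethod.HASH if hash_column else SamplingMethod.RESERVOIR


print(choose_strategy(row_count=50_000_000, max_rows=50_000))
```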
HEAD - First N Rows
The fastest sampling method. The result is representative only if row order is unrelated to the values being profiled.
config = SamplingConfig(
    strategy=SamplingMethod.HEAD,
    max_rows=1_000,
)
sampler = Sampler(config)
result = sampler.sample(lf)
# Returns only the first 1,000 rows
HASH - Hash-Based Sampling
Deterministic, reproducible sampling based on the hash of a column value.
config = SamplingConfig(
    strategy=SamplingMethod.HASH,
    max_rows=10_000,
    hash_column="id",  # Column whose values are hashed
)
sampler = Sampler(config)
result = sampler.sample(lf)
# A given ID is either always included or always excluded, on every run
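One way to get this determinism is to hash each key into a number in [0, 1) and keep the row when that number falls below the sampling ratio. The sketch below assumes this approach and SHA-256 as the hash function, neither of which is confirmed by this page; `hash_keep` is a hypothetical name.

```python
import hashlib


def hash_keep(key: object, sampling_ratio: float) -> bool:
    """Hypothetical helper: decide deterministically whether a key is sampled."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sampling_ratio


ids = range(100_000)
sample = [i for i in ids if hash_keep(i, sampling_ratio=0.1)]
print(len(sample))
```

The sample size is only approximately `ratio * row_count`, since each key is decided independently. `hashlib` is used instead of the built-in `hash()`, which is salted per process for strings and would break reproducibility across runs.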
SamplingMethodRegistry
A thread-safe registry of sampling strategies.
import polars as pl
# SamplingResult is assumed to be importable from the same module
from truthound.profiler.sampling import (
    SamplingConfig, SamplingMethod, SamplingMethodRegistry, SamplingResult,
)
# Retrieve a strategy
strategy_class = SamplingMethodRegistry.get(SamplingMethod.RANDOM)
# Register a custom strategy
@SamplingMethodRegistry.register("my_strategy")
class MyCustomStrategy:
    def sample(self, lf: pl.LazyFrame, config: SamplingConfig) -> SamplingResult:
        # Custom sampling logic goes here
        raise NotImplementedError
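As an illustration of what a thread-safe, decorator-based registry involves, here is a minimal standalone sketch. It is not truthound's implementation: `StrategyRegistry`, the duplicate-name check, and the simplified `sample` signature are all assumptions made for the example.

```python
import threading
from typing import Callable, Dict


class StrategyRegistry:
    """Hypothetical stand-in for a thread-safe, decorator-based registry."""

    _lock = threading.Lock()
    _strategies: Dict[str, type] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[type], type]:
        def decorator(strategy_class: type) -> type:
            # The lock makes the check-then-insert step atomic across threads.
            with cls._lock:
                if name in cls._strategies:
                    raise ValueError(f"strategy {name!r} is already registered")
                cls._strategies[name] = strategy_class
            return strategy_class

        return decorator

    @classmethod
    def get(cls, name: str) -> type:
        with cls._lock:
            if name not in cls._strategies:
                raise KeyError(f"unknown strategy {name!r}")
            return cls._strategies[name]


@StrategyRegistry.register("my_strategy")
class MyCustomStrategy:
    def sample(self, rows: list, max_rows: int) -> list:
        # Simplified signature for the sketch: take the first max_rows rows.
        return rows[:max_rows]


strategy = StrategyRegistry.get("my_strategy")()
print(strategy.sample([1, 2, 3, 4], max_rows=2))
```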
Statistical Sample Size Calculation
from truthound.profiler.sampling import calculate_sample_size
# 95% confidence level, 5% margin of error
sample_size = calculate_sample_size(
    population_size=1_000_000,
    confidence_level=0.95,
    margin_of_error=0.05,
)
print(f"Required sample size: {sample_size}")  # ~385
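The figure of roughly 385 is consistent with Cochran's formula for a proportion at p = 0.5. The sketch below assumes that formula plus a finite population correction; it is not necessarily what `calculate_sample_size` does internally, and `cochran_sample_size` is a hypothetical name.

```python
import math
from statistics import NormalDist


def cochran_sample_size(
    population_size: int,
    confidence_level: float = 0.95,
    margin_of_error: float = 0.05,
    proportion: float = 0.5,  # worst case: gives the largest required sample
) -> int:
    """Hypothetical helper: Cochran's formula with finite population correction."""
    # Two-sided z-score, roughly 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    # Sample size for an effectively infinite population.
    n0 = z**2 * proportion * (1 - proportion) / margin_of_error**2
    # The correction shrinks the requirement for small populations.
    n = n0 / (1 + (n0 - 1) / population_size)
    return min(population_size, math.ceil(n))


print(cochran_sample_size(population_size=1_000_000))
print(cochran_sample_size(population_size=500))
```

For one million rows the result lands within a row or two of the value quoted above; the exact integer depends on how z is rounded and whether the correction is applied. The requirement barely grows with population size, which is why a few hundred rows suffice for a 5% margin.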
Memory-Safe Sampling
To avoid out-of-memory (OOM) errors, the Sampler internally applies .head(limit) before .collect(), so the row limit is part of the lazy query:
# Safe implementation (internal)
def _safe_sample(self, lf: pl.LazyFrame) -> pl.DataFrame:
    # Apply the row limit first instead of collecting the full dataset
    return lf.head(self.config.max_rows).collect()
CLI Usage
# Random sampling
th profile data.csv --sample-size 10000 --sample-strategy random
# Hash-based sampling
th profile data.csv --sample-size 10000 --sample-strategy hash --hash-column id
# Adaptive sampling (default)
th profile data.csv --sample-size 50000
Strategy Selection Guide
| Scenario | Recommended Strategy |
|---|---|
| Small data (<100K rows) | NONE |
| Quick preview | HEAD |
| General analysis | RANDOM or ADAPTIVE |
| Preserve distribution | STRATIFIED |
| Streaming data | RESERVOIR |
| Reproducibility needed | HASH |
| Sorted data | SYSTEMATIC |
Next Steps
- Pattern Matching - Detect patterns in sampled data
- Distributed Processing - Parallel processing for large data