truthound mask¶
Mask sensitive data in a file. This command anonymizes PII and sensitive columns using various masking strategies.
Synopsis¶
Arguments¶
| Argument | Required | Description |
|---|---|---|
file |
Yes | Path to the data file (CSV, JSON, Parquet, NDJSON, JSONL) |
Options¶
| Option | Short | Default | Description |
|---|---|---|---|
--output |
-o |
Required | Output file path |
--columns |
-c |
Auto-detect | Comma-separated columns to mask |
--strategy |
-s |
redact |
Masking strategy (redact, hash, fake) |
--strict |
false |
Fail if specified columns don't exist |
Description¶
The mask command provides data anonymization capabilities:
- Auto-Detection Mode: Automatically detects PII columns and masks them
- Explicit Mode: Mask specific columns you specify
- Multiple Strategies: Choose how data is transformed
Masking Strategies¶
| Strategy | Description | Reversible | Example |
|---|---|---|---|
redact |
Replace with asterisks | No | john@example.com → **** |
hash |
SHA-256 hash (truncated) | No | john@example.com → a8b9c0d1e2f3... |
fake |
Realistic fake data | No | john@example.com → alice@fake.net |
Examples¶
Basic Masking (Auto-Detection)¶
Automatically detect and mask PII columns:
Output:
Note: If columns are not specified, the command auto-detects PII columns. Warnings are displayed if any specified columns don't exist (when --strict is not used).
Mask Specific Columns¶
Explicitly specify columns to mask:
Hash Strategy¶
Use hashing for consistent anonymization:
Benefits of hashing: - Same input → same output (deterministic) - Can join tables on hashed values - Cannot reverse to original value
Fake Data Strategy¶
Replace with realistic fake data:
Benefits: - Data looks realistic for testing - Maintains data type and format - Useful for demo environments
Strict Mode¶
Fail if a specified column doesn't exist:
# This will fail if 'nonexistent' column doesn't exist
truthound mask data.csv -o masked.csv -c email,nonexistent --strict
Without --strict, missing columns generate a warning but continue:
# This will warn but continue with existing columns
truthound mask data.csv -o masked.csv -c email,nonexistent
# Warning: Column 'nonexistent' not found. Skipping.
Strategy Comparison¶
Redact¶
Original: john.doe@example.com
Redacted: ****
Original: +1-555-123-4567
Redacted: ****
Original: 123-45-6789
Redacted: ****
Best for: Maximum privacy, compliance reports
Hash¶
Original: john.doe@example.com
Hashed: a8b9c0d1e2f34567
Original: +1-555-123-4567
Hashed: b2c3d4e5f6a78901
Original: 123-45-6789
Hashed: c3d4e5f6a7890123
Best for: Data analysis, joining datasets, referential integrity
Fake¶
Original: john.doe@example.com
Fake: alice.smith@fake.net
Original: +1-555-123-4567
Fake: +1-555-987-6543
Original: 123-45-6789
Fake: 987-65-4321
Best for: Testing, demos, development environments
Output Formats¶
The output format is determined by the file extension:
# CSV output
truthound mask data.csv -o masked.csv
# Parquet output
truthound mask data.csv -o masked.parquet
# JSON output
truthound mask data.csv -o masked.json
Use Cases¶
1. Data Sharing¶
Prepare data for external sharing:
# Scan to identify PII
truthound scan production.csv
# Mask detected PII
truthound mask production.csv -o shareable.csv --strategy hash
2. Test Data Generation¶
Create test data from production:
3. Compliance Pipeline¶
GDPR/CCPA compliance workflow:
# Export data with PII masked
truthound mask customer_export.csv -o compliant_export.csv -c email,phone,address
4. CI/CD Integration¶
Mask data before deployment:
- name: Anonymize Test Data
run: |
truthound mask fixtures/users.csv -o fixtures/users_safe.csv --strategy hash
truthound mask fixtures/orders.csv -o fixtures/orders_safe.csv --strategy hash
Workflow Example¶
graph TD
A[Production Data] --> B[truthound scan]
B --> C{PII Found?}
C -->|Yes| D[Identify Columns]
C -->|No| E[Use As-Is]
D --> F[truthound mask]
F --> G[Verify: truthound scan]
G --> H{Clean?}
H -->|Yes| I[Safe Data]
H -->|No| J[Review Masking]
J --> F
Performance¶
For large datasets, masking uses:
- Streaming Mode: Files >1M rows use Polars streaming engine
- Native Expressions: All operations use vectorized Polars expressions
- Memory Efficient: Processes data in chunks
# Large file masking (handled efficiently)
truthound mask large_file.parquet -o masked_large.parquet --strategy hash