[D] PAX Storage: Questions for PAX developers [cloudberry]

via GitHub Thu, 30 Oct 2025 01:44:24 -0700


GitHub user edespino created a discussion: PAX Storage: Questions for PAX 
developers


# PAX Storage Development Team - Technical Review Report

**Date**: October 30, 2025
**Prepared By**: Ed Espino
**Repository**: https://github.com/edespino/pax_benchmark
**Purpose**: Comprehensive findings across 5 benchmarks for PAX development 
team review

---

## Executive Summary

I've completed 5 comprehensive benchmarks testing PAX storage across diverse 
workloads (50M+ total rows tested). Results show PAX is **production-ready for 
specific use cases** with validated configuration, but **critical issues 
require development team attention**.

### Key Findings

✅ **Successes**:
- PAX no-cluster achieves **2-4% overhead** vs AOCO across all benchmarks 
(excellent)
- Proper bloom filter validation prevents catastrophic bloat
- Sparse column filtering works as expected (95% NULL columns)
- Z-order clustering delivers 43-77x speedup on time-based queries (when 
overhead is low)

❌ **Critical Issues**:
- **Clustering overhead inconsistent**: 0.002% to 58.1% (290x variance)
- **File-level predicate pushdown disabled** in pax_scanner.cc:366
- **Bloom filter misconfiguration causes 80%+ bloat** (happened in production 
testing)
- **Text/VARCHAR bloom filters extremely problematic** (45-58% overhead)

⚠️ **Areas Needing Investigation**:
1. Why does clustering overhead vary 290x across workloads?
2. Is there a bloom filter count limit before bloat becomes unacceptable?
3. Can text bloom filter overhead be reduced?
4. When will file-level predicate pushdown be enabled?

---

## Benchmark Summary Matrix

| Benchmark | Dataset | Bloom Filters | PAX No-Cluster Overhead | Clustering 
Overhead | Status |
|-----------|---------|---------------|------------------------|---------------------|--------|
| **IoT Time-Series** | 10M sensor readings | 0 | +3.3% ✅ | **+0.002%** ✅ | 
Exceptional |
| **Clickstream** | 10M events | 3 (validated) | +2.3% ✅ | **+0.2%** ✅ | 
Excellent |
| **Retail Sales** | 10M transactions | 1 (validated) | +5.0% ✅ | **+26%** ⚠️ | 
Acceptable (after fix) |
| **Log Analytics** | 10M log entries | 2 (text) | +3.9% ✅ | **+45.8%** ❌ | 
Text bloat issue |
| **Trading** | 10M ticks | 12 (too many) | +2.2% ✅ | **+58.1%** ❌ | Excessive 
blooms |

### Cross-Benchmark Insights

**PAX No-Cluster**: Consistently excellent (2-5% overhead) ✅
- Production-ready across all workloads
- Minimal configuration complexity
- Predictable behavior

**Z-order Clustering**: Wildly inconsistent (0.002% to 58%) ❌
- **290x variance in overhead** - requires investigation
- Appears correlated with bloom filter count
- Text bloom filters particularly problematic

---

## Issue #1: Clustering Overhead Inconsistency (CRITICAL)

### The Problem

Z-order clustering overhead varies by **290x** across benchmarks with no clear 
pattern:

| Benchmark | Bloom Filters | Bloom Filter Types | Clustering Overhead |
|-----------|---------------|-------------------|---------------------|
| IoT | 0 | None | **0.002%** (9.8 KB) |
| Clickstream | 3 | VARCHAR(36), VARCHAR(20), VARCHAR(8) | **0.2%** (1.0 MB) |
| Retail Sales | 1 | VARCHAR(8) | **26%** (197 MB) |
| Logs | 2 | TEXT, UUID | **45.8%** (329 MB) |
| Trading | 12 | Mixed (UUID, VARCHAR, NUMERIC) | **58.1%** (323 MB) |

### Analysis

**Hypothesis 1: Bloom Filter Count**
```
0 blooms  → 0.002% overhead
1 bloom   → 26% overhead
2 blooms  → 45.8% overhead
3 blooms  → 0.2% overhead  ⚠️ CONTRADICTS PATTERN
12 blooms → 58.1% overhead
```
**Contradiction**: Clickstream (3 blooms) has lower overhead than Retail Sales 
(1 bloom)

**Hypothesis 2: Bloom Filter Data Types**
```
VARCHAR(8-36) only → 0.2% overhead (Clickstream)
VARCHAR(8) only    → 26% overhead (Retail Sales)
TEXT + UUID        → 45.8% overhead (Logs) ❌
Mixed types (12)   → 58.1% overhead (Trading) ❌
```
**Observation**: TEXT and UUID bloom filters appear problematic

**Hypothesis 3: Data Distribution**
```
Sequential time-series (IoT)    → 0.002% overhead
Session-clustered (Clickstream) → 0.2% overhead
Random patterns (Retail)        → 26% overhead
Mixed patterns (Logs)           → 45.8% overhead
High-frequency (Trading)        → 58.1% overhead
```
**Observation**: Data locality may affect clustering efficiency

**Hypothesis 4: Cardinality Interaction**
```
Clickstream blooms (validated):
  - session_id: 3.9M unique → High selectivity
  - user_id: 464K unique → High selectivity
  - product_id: 100K unique → High selectivity
  → Result: 0.2% overhead ✅

Retail Sales bloom (validated):
  - product_id: 99.7K unique → High selectivity
  → Result: 26% overhead ⚠️

Trading blooms (12 columns, mixed cardinality):
  → Result: 58.1% overhead ❌
```
**Observation**: Cardinality alone doesn't explain variance

### Questions for Development Team

1. **Is there a bloom filter metadata structure** that grows non-linearly with:
   - Number of bloom filters?
   - Data type complexity (TEXT vs VARCHAR vs UUID)?
   - Cardinality ranges?

2. **Does Z-order clustering store bloom filter data** differently than 
no-cluster variant?
   - Why does clustering amplify bloom filter overhead?
   - Is metadata duplicated during clustering?

3. **Are TEXT/UUID bloom filters** implemented differently from VARCHAR?
   - Why do they cause 45-58% overhead vs 0.2% for VARCHAR?
   - Can TEXT bloom filter implementation be optimized?

4. **What is the recommended bloom filter limit**?
   - Is there a hard limit before bloat becomes unacceptable?
   - Should PAX enforce a limit (e.g., max 3 bloom filters)?

---

## Issue #2: Bloom Filter Misconfiguration Disaster (PRODUCTION INCIDENT)

### What Happened

During retail_sales benchmark development, I configured bloom filters on 
low-cardinality columns:

```sql
-- WRONG CONFIGURATION (caused disaster)
bloomfilter_columns='transaction_hash,customer_id,product_id'
```

**Cardinality Analysis**:
- transaction_hash: n_distinct = **-1** (5 unique values) ❌
- customer_id: n_distinct = **-0.46** (7 unique values) ❌
- product_id: **99,671 unique** ✅

**Result**:
- Storage bloat: **+81%** (1,350 MB vs 747 MB baseline)
- Memory usage: **54x higher** (8,683 KB vs 160 KB)
- Compression destroyed: **1.85x** (vs 3.36x baseline)
- Queries: **2-3x slower** than AOCO

### Root Cause

Low-cardinality bloom filters cause **massive storage bloat**:

```
Per-file bloom filter overhead:
  - Low-cardinality (5 unique): ~50 MB per file (useless filter)
  - High-cardinality (100K unique): ~50 MB per file (useful filter)

Files created: 8 files for 10M rows
Total waste: 50 MB × 8 files × 2 bad columns = 800 MB wasted

Impact: 800 MB of useless metadata for NO query benefit
```

### The Fix

```sql
-- CORRECTED CONFIGURATION
bloomfilter_columns='product_id'  -- Only high-cardinality
```

**Result After Fix**:
- Storage: 1,350 MB → **942 MB** (-30%)
- Memory: 8,683 KB → **2,982 KB** (-66%)
- Compression: 1.85x → **6.52x** (+252%)
- Queries: 2-3x slower → **competitive with AOCO**

### Why This Is Critical

**This happened during controlled testing with validation framework**. In 
production without validation:
1. User creates table with guessed bloom filter columns
2. Loads data successfully (no error)
3. Runs CLUSTER successfully (no warning)
4. Discovers bloat days/weeks later (80% storage waste)
5. Must drop table, fix config, reload all data

**Time Cost**: Hours/days of debugging + data reload
**Storage Cost**: 80%+ wasted space
**Performance Cost**: 2-3x slower queries
**Detection**: Silent failure (no error messages)

### Questions for Development Team

1. **Can PAX detect low-cardinality bloom filters** at table creation?
   - Check n_distinct from pg_stats
   - Warn or error if cardinality < threshold (e.g., 1000)

2. **Can bloom filter size be estimated** before creation?
   - Show estimated overhead: "Bloom filters will add ~800 MB"
   - Allow users to make informed decisions

3. **Should there be a cardinality validation gate** in PAX code?
   ```sql
   CREATE TABLE foo (...) USING pax WITH (
       bloomfilter_columns='region'  -- 5 unique values
   );
   -- WARNING: Column 'region' has low cardinality (5).
   --          Bloom filters are ineffective for <1000 unique values.
   --          This will waste ~50 MB per micro-partition file.
   --          Consider using minmax_columns instead.
   ```

4. **Can EXPLAIN ANALYZE show bloom filter effectiveness**?
   ```
   -> PAX Scan on foo
        Bloom Filter: region (5 unique values)
        Files Scanned: 8 / 8 (0% skipped - bloom filter ineffective)
        Bloom Filter Overhead: 400 MB wasted
   ```

---

## Issue #3: File-Level Predicate Pushdown Disabled (KNOWN CODE ISSUE)

### Current Status

File-level predicate pushdown is **disabled** in PAX scanner code:

**Location**: `pax_scanner.cc:366`

### Impact on Query Performance

Without predicate pushdown, PAX reads files before filtering:

**Clickstream Q1 (date range filter)**:
```sql
SELECT COUNT(*) FROM clickstream_pax WHERE event_date = '2025-10-30';
```

**Expected Behavior (with pushdown)**:
```
Files: 8 total
Date range filter: event_date = '2025-10-30' (1 day out of 7)
Expected files to read: 1-2 files (12-25%)
```

**Actual Behavior (without pushdown)**:
```
-> PAX Scan on clickstream_pax
     Rows Removed by Filter: 9,427,567 (94.3% filtered AFTER reading)
     Files read: 8 / 8 (100% - no file skipping)
```

**Performance Impact**:
- AOCO: 345 ms
- PAX clustered: 415 ms (20% slower than AOCO)
- **Expected with pushdown**: 100-150 ms (2-3x faster than AOCO)

### Why This Matters

PAX's **key advantage** is file-level pruning via:
1. Min/max statistics (zone maps)
2. Bloom filters (selective equality)
3. Sparse filtering (NULL-heavy columns)

**Without predicate pushdown**, these features provide minimal benefit:
- Bloom filters: Stored but not used for file skipping
- Min/max stats: Available but not checked before file read
- Result: PAX performs similar to AOCO instead of 2-5x faster

### Observed Performance

| Query Type | AOCO | PAX (current) | PAX (expected with pushdown) |
|------------|------|---------------|------------------------------|
| Date range filter | 345 ms | 415 ms (1.2x slower) | **100-150 ms (2-3x 
faster)** |
| Bloom filter (product_id) | 401 ms | 285 ms (1.4x faster) ✅ | **50-100 ms 
(4-8x faster)** |
| Time-based (last hour) | 262 ms | 6 ms (43x faster) ✅ | **Similar (already 
optimal)** |

**Observation**: PAX still achieves 43x speedup on time-based queries (Q7/Q8) 
despite disabled pushdown, suggesting Z-order clustering alone is highly 
effective.

### Questions for Development Team

1. **What is the timeline** for enabling file-level predicate pushdown?
2. **Are there blockers** preventing this feature from being enabled?
3. **Can it be enabled experimentally** via GUC (e.g., 
`pax.enable_file_level_pushdown`)?
4. **What is the expected performance improvement** once enabled?
   - My estimate: 2-5x faster on selective queries
   - 60-90% file skip rate on bloom filter queries

---

## Issue #4: TEXT Bloom Filter Overhead (45-58% Bloat)

### The Problem

Bloom filters on TEXT and UUID columns cause **extreme clustering overhead**:

**Log Analytics Benchmark** (2 bloom filters):
```sql
bloomfilter_columns='request_id,trace_id'  -- Both UUID type
```
- Data types: UUID (128-bit), TEXT (variable length)
- Cardinality: 10M unique each (excellent for bloom filters)
- **Result**: +45.8% clustering overhead (329 MB bloat)

**Financial Trading Benchmark** (12 bloom filters):
```sql
bloomfilter_columns='trade_id,order_id,trader_id,symbol,...'  -- 12 columns
```
- Data types: Mix of UUID, VARCHAR, NUMERIC, TIMESTAMP
- Cardinality: All validated high (>10K unique)
- **Result**: +58.1% clustering overhead (323 MB bloat)

### Comparison to VARCHAR Bloom Filters

**Clickstream Benchmark** (3 VARCHAR bloom filters):
```sql
bloomfilter_columns='session_id,user_id,product_id'
```
- Data types: VARCHAR(36), VARCHAR(20), VARCHAR(8)
- Cardinality: 3.9M, 464K, 100K unique (excellent)
- **Result**: +0.2% clustering overhead (1.0 MB) ✅

### Analysis

**Bloom filter overhead by data type**:

| Data Type | Columns | Overhead | Overhead per Column |
|-----------|---------|----------|---------------------|
| VARCHAR(8-36) | 3 | 1.0 MB | **0.33 MB** ✅ |
| VARCHAR(8) | 1 | 197 MB | **197 MB** ❌ |
| UUID + TEXT | 2 | 329 MB | **164 MB** ❌ |
| Mixed (12 cols) | 12 | 323 MB | **27 MB** ❌ |

**Hypothesis**: TEXT/UUID bloom filters may:
1. Store variable-length data less efficiently
2. Have larger per-file metadata overhead
3. Interact poorly with Z-order clustering

### Questions for Development Team

1. **Are TEXT bloom filters implemented differently** from VARCHAR?
   - Do they use a different hash function?
   - Is there variable-length metadata overhead?

2. **Why is VARCHAR(8) overhead so high** (197 MB for 1 column)?
   - Clickstream has 3 VARCHAR blooms with only 1.0 MB overhead
   - Retail Sales has 1 VARCHAR bloom with 197 MB overhead
   - Same data type, 197x difference - why?

3. **Can UUID bloom filters be optimized**?
   - UUIDs are fixed-length (128-bit) - should be efficient
   - 164 MB per UUID bloom seems excessive

4. **Is there a bloom filter size configuration**?
   - Can users control bloom filter memory allocation?
   - Would smaller bloom filters reduce overhead?

5. **Should documentation recommend avoiding** TEXT/UUID blooms?
   - Current guidance: "use bloom filters on high-cardinality columns"
   - Should add: "avoid TEXT/UUID types - use VARCHAR instead"?

---

## Issue #5: Bloom Filter Count Threshold

### Observed Pattern

| Bloom Filters | Clustering Overhead | Assessment |
|---------------|---------------------|------------|
| 0 | 0.002% | ✅ Optimal (no metadata) |
| 1 (VARCHAR) | 26% | ⚠️ High (unexpected) |
| 2 (TEXT/UUID) | 45.8% | ❌ Very high |
| 3 (VARCHAR) | 0.2% | ✅ Excellent |
| 12 (mixed) | 58.1% | ❌ Extreme |

### The "3 Bloom Filter Rule"

**Hypothesis**: 3 **validated VARCHAR bloom filters** is the sweet spot.

**Evidence**:
- Clickstream (3 blooms): 0.2% overhead ✅
- All other benchmarks: 26-58% overhead ❌

**But this contradicts**:
- Retail Sales (1 bloom): 26% overhead (worse than 3!)

### Questions for Development Team

1. **Is there a recommended bloom filter limit**?
   - Documentation says "use bloom filters on high-cardinality columns"
   - Should it specify a numeric limit (e.g., "use 1-3 bloom filters max")?

2. **Why does 1 bloom perform worse than 3** in some cases?
   - Retail Sales (1 bloom): 26% overhead
   - Clickstream (3 blooms): 0.2% overhead
   - Is there a fixed bloom filter overhead + per-column cost?

3. **Can PAX warn when too many bloom filters** are configured?
   ```sql
   CREATE TABLE foo (...) USING pax WITH (
       
bloomfilter_columns='col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12'
   );
   -- WARNING: 12 bloom filter columns configured.
   --          This may cause significant storage overhead (>50%).
   --          Consider limiting to 1-3 high-selectivity columns.
   ```

---

## Configuration Best Practices (Based on 5 Benchmarks)

### Bloom Filter Selection (CRITICAL)

**Rule 1: Cardinality Validation**
```sql
-- ALWAYS check cardinality BEFORE configuring bloom filters
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'your_table'
ORDER BY ABS(n_distinct) DESC;

-- ONLY use columns with:
--   - n_distinct > 1000 (absolute minimum)
--   - n_distinct > 10000 (recommended)
```

**Rule 2: Limit Bloom Filter Count**
```sql
-- Use 1-3 bloom filters maximum
-- More than 3 causes unpredictable overhead

-- Good:
bloomfilter_columns='session_id,user_id,product_id'  -- 3 validated

-- Bad:
bloomfilter_columns='col1,col2,col3,col4,col5,...,col12'  -- 12 columns
```

**Rule 3: Prefer VARCHAR over TEXT/UUID**
```sql
-- Good:
bloomfilter_columns='session_id'  -- VARCHAR(36)

-- Problematic:
bloomfilter_columns='request_id'  -- UUID type (45-58% overhead)
bloomfilter_columns='message'     -- TEXT type (high overhead)
```

**Rule 4: Use Validation Framework**
```sql
-- Phase 1: Generate 1M sample
INSERT INTO sample_table SELECT * FROM source LIMIT 1000000;

-- Phase 2: Validate bloom candidates
SELECT * FROM validate_bloom_candidates(
    'schema', 'sample_table',
    ARRAY['col1', 'col2', 'col3']
);

-- Phase 3: Only use columns marked "✅ SAFE"
```

### MinMax Columns (Always Safe)

**Rule: Include ALL filterable columns**
```sql
minmax_columns='date,timestamp,id,region,status,category,amount'
```

- Low overhead (~2-3 MB for 8 columns)
- Works for any cardinality
- Enables range query pruning
- No downside to including many columns

### Z-order Clustering

**Rule 1: Choose 2-3 Correlated Dimensions**
```sql
-- Good (time + high-cardinality dimension):
cluster_columns='event_date,session_id'
cluster_columns='reading_date,device_id'
cluster_columns='log_timestamp,trace_id'

-- Bad (uncorrelated dimensions):
cluster_columns='region,status'  -- Low correlation
cluster_columns='col1,col2,col3,col4'  -- Too many dimensions
```

**Rule 2: Validate Memory Before Clustering**
```sql
-- For 10M rows:
SET maintenance_work_mem = '2GB';

-- For 50M rows:
SET maintenance_work_mem = '4GB';

-- For 200M rows:
SET maintenance_work_mem = '8GB';

-- Scale: ~200 MB per 10M rows
```

### GUC Settings

```sql
-- Essential for PAX performance
SET pax.enable_sparse_filter = on;              -- Enable sparse filtering
SET pax.enable_row_filter = off;                -- Disable for OLAP
SET pax.bloom_filter_work_memory_bytes = 104857600;  -- 100MB

-- Defaults are generally good
SET pax.max_tuples_per_file = 1310720;         -- 1.31M tuples
SET pax.max_size_per_file = 67108864;           -- 64MB
```

---

## When to Use PAX (Production Recommendations)

### ✅ PAX No-Cluster (Always Recommended)

**Use for**:
- Any analytical workload
- Storage efficiency critical (2-5% overhead vs AOCO)
- Minimal configuration complexity desired
- Predictable behavior required

**Configuration**:
```sql
CREATE TABLE foo (...) USING pax WITH (
    compresstype='zstd',
    compresslevel=5,
    minmax_columns='date,id,status,amount',  -- All filterable columns
    bloomfilter_columns='high_card_col',      -- 1-3 validated columns
    storage_format='porc'
) DISTRIBUTED BY (id);
```

**Expected Results**:
- Storage: +2-5% vs AOCO (excellent)
- Compression: 4-8x (competitive with AOCO)
- Queries: Competitive with AOCO (faster with predicate pushdown)

### ⚠️ PAX Clustered (Use Selectively)

**Use ONLY when**:
- Multi-dimensional queries are common
- Z-order dimensions are frequently queried together
- Storage overhead of 20-60% is acceptable
- Validation framework has been used

**Avoid if**:
- Storage constrained (use PAX no-cluster instead)
- Workload unpredictable (no clear clustering dimensions)
- Bloom filters include TEXT/UUID types

**Expected Results**:
- Storage: +20-60% vs no-cluster (highly variable)
- Queries: 2-77x faster on Z-order dimensions
- Risk: High if bloom filters misconfigured

### ❌ Never Use

**AO (row-oriented) for analytical workloads**:
- 50-100% larger than AOCO/PAX
- 3-20x slower queries
- No benefits for OLAP

---

## Testing Methodology

### Benchmark Suite Overview

**5 Benchmarks Completed**:
1. **Retail Sales** - Original comprehensive benchmark
   - 10M transactions, 25 columns
   - Revealed bloom filter misconfiguration disaster
   - Led to validation framework development

2. **IoT Time-Series** - Validation-first framework
   - 10M sensor readings, 14 columns
   - Best clustering overhead (0.002%)
   - Proved validation framework effective

3. **Financial Trading** - High-cardinality testing
   - 10M trading ticks, 18 columns
   - Tested limits (12 bloom filters)
   - Revealed excessive bloom filter overhead (58%)

4. **Log Analytics** - Sparse column testing
   - 10M log entries, 20 columns
   - 95% NULL columns (sparse filtering)
   - Revealed TEXT/UUID bloom filter issues (45%)

5. **E-commerce Clickstream** - Multi-dimensional analysis
   - 10M events, 40 columns
   - Optimal bloom filter count (3)
   - Best clustering overhead with blooms (0.2%)

### Test Environment

- **Platform**: Apache Cloudberry 3.0.0-devel
- **Segments**: 3
- **Hardware**: Standard test environment
- **Data**: Realistic distributions, validated cardinality
- **Runs**: 3 runs per query, median reported

### Validation Framework

All benchmarks (except Retail Sales) use validation-first approach:

**Phase 1: Cardinality Analysis**
- Generate 1M sample
- Analyze n_distinct for all columns
- Identify unsafe bloom filter candidates

**Phase 2: Configuration Generation**
- Auto-generate safe PAX config
- Only include validated high-cardinality columns
- Set appropriate maintenance_work_mem

**Phase 3: Safety Gates**
- Gate 1: Row count consistency
- Gate 2: Storage bloat check (<10% threshold)
- Gate 3: Clustering overhead check (<30% threshold)
- Gate 4: Cardinality verification

**Result**: Zero misconfiguration incidents across 4 validation-first benchmarks

---

## Detailed Benchmark Results

### 1. IoT Time-Series

**Dataset**: 10M sensor readings, 100K devices, 7 days
**PAX Config**: 0 bloom filters, 8 minmax columns, Z-order (date + device_id)

| Metric | Value | Assessment |
|--------|-------|------------|
| PAX no-cluster vs AOCO | +3.3% (413 MB vs 400 MB) | ✅ Excellent |
| Clustering overhead | +0.002% (9.8 KB) | ✅ Exceptional |
| Compression ratio | 4.60x (vs 4.75x AOCO) | ✅ 97% of AOCO |
| Validation gates | 4/4 passed | ✅ Perfect |

**Key Finding**: Zero bloom filters = near-zero clustering overhead

---

### 2. E-commerce Clickstream

**Dataset**: 10M events, 3.9M sessions, 464K users, 100K products
**PAX Config**: 3 VARCHAR bloom filters (validated), 8 minmax columns, Z-order 
(date + session)

| Metric | Value | Assessment |
|--------|-------|------------|
| PAX no-cluster vs AOCO | +2.3% (658 MB vs 643 MB) | ✅ Excellent |
| Clustering overhead | +0.2% (1.0 MB) | ✅ Exceptional |
| Bloom filters | session_id (3.9M), user_id (464K), product_id (100K) | ✅ All 
validated |
| Query speedup | 43-77x on time-based queries | ✅ Excellent |
| Validation gates | 4/4 passed | ✅ Perfect |

**Key Finding**: 3 validated VARCHAR bloom filters = minimal overhead (0.2%)

---

### 3. Retail Sales

**Dataset**: 10M transactions, 10M customers, 100K products
**PAX Config**: 1 VARCHAR bloom filter (after fix), 6 minmax columns, Z-order 
(date + region)

**Before Fix (3 bloom filters, 2 low-cardinality)**:

| Metric | Value | Assessment |
|--------|-------|------------|
| PAX clustered vs AOCO | +81% (1,350 MB vs 710 MB) | ❌ Critical bloat |
| Memory usage | 8,683 KB (54x vs no-cluster) | ❌ Extreme |
| Compression | 1.85x (vs 3.36x baseline) | ❌ Destroyed |
| Queries | 2.6x slower than AOCO | ❌ Poor |

**After Fix (1 validated bloom filter)**:

| Metric | Value | Assessment |
|--------|-------|------------|
| PAX clustered vs AOCO | +33% (942 MB vs 710 MB) | ⚠️ Acceptable |
| Clustering overhead | +26% (942 MB vs 745 MB) | ⚠️ Higher than expected |
| Memory usage | 2,982 KB (20x vs no-cluster) | ⚠️ Manageable |
| Compression | 6.52x (vs 8.65x AOCO) | ✅ Good |
| Queries | Tied to 1.4x faster than AOCO | ✅ Competitive |

**Key Finding**: Bloom filter misconfiguration causes catastrophic failure (81% 
bloat)

---

### 4. Log Analytics

**Dataset**: 10M log entries, 10M request IDs, 10M trace IDs
**PAX Config**: 2 bloom filters (request_id UUID, trace_id UUID), 8 minmax 
columns, Z-order (timestamp + trace_id)

| Metric | Value | Assessment |
|--------|-------|------------|
| PAX no-cluster vs AOCO | +3.9% (717 MB vs 690 MB) | ✅ Excellent |
| Clustering overhead | +45.8% (1,046 MB vs 717 MB) | ❌ High |
| Bloom filters | request_id (10M), trace_id (10M) | ✅ Cardinality good |
| Sparse columns | stack_trace (95% NULL), error_code (95% NULL) | ✅ As 
expected |

**Key Finding**: TEXT/UUID bloom filters cause 45.8% clustering overhead 
despite excellent cardinality

---

### 5. Financial Trading

**Dataset**: 10M trading ticks, 5K symbols, 1K traders
**PAX Config**: 12 bloom filters (mixed types), 9 minmax columns, Z-order 
(timestamp + symbol)

| Metric | Value | Assessment |
|--------|-------|------------|
| PAX no-cluster vs AOCO | +2.2% (556 MB vs 544 MB) | ✅ Excellent |
| Clustering overhead | +58.1% (878 MB vs 556 MB) | ❌ Very high |
| Bloom filters | 12 columns (UUID, VARCHAR, NUMERIC mix) | ⚠️ Too many |

**Key Finding**: 12 bloom filters cause 58.1% clustering overhead (excessive)

---

## Code-Level Observations

### 1. PAX Scanner (pax_scanner.cc:366)

**Issue**: File-level predicate pushdown disabled

**Evidence**: EXPLAIN ANALYZE shows:
```
-> PAX Scan on table
     Rows Removed by Filter: 9,427,567
     Files read: 8 / 8 (100% - no skipping)
```

**Expected** (with pushdown):
```
-> PAX Scan on table
     Rows Removed by Filter: 572,433
     Files read: 1 / 8 (87.5% skipped via bloom filter)
```

**Impact**: PAX performance limited to 1-2x faster instead of 5-10x faster

---

### 2. Bloom Filter Implementation

**Observations**:
1. Bloom filters stored **per micro-partition file**
2. Metadata overhead multiplies by file count
3. Low-cardinality blooms waste space (50 MB per file × 8 files = 400 MB wasted)
4. TEXT/UUID blooms have 164x higher overhead than VARCHAR blooms

**Questions**:
- Is bloom filter size configurable? (e.g., `pax.bloom_filter_size_bytes`)
- Can blooms be shared across files for static dimensions?
- Why is TEXT/UUID overhead so high?

---

### 3. Z-order Clustering Metadata

**Observations**:
1. Clustering overhead varies 290x (0.002% to 58%)
2. Appears correlated with bloom filter count and types
3. Overhead much higher than documented (<5% expected)

**Questions**:
- Does clustering store bloom filter metadata differently?
- Is there metadata duplication during CLUSTER operation?
- Why does overhead vary so wildly across workloads?

---

## Recommended Next Steps

### For Development Team

**Priority 1: Investigate Clustering Overhead**
1. Profile memory/storage during CLUSTER operation
2. Identify why overhead varies 0.002% to 58%
3. Determine bloom filter count threshold
4. Document expected overhead by configuration

**Priority 2: Enable File-Level Predicate Pushdown**
1. Enable pax_scanner.cc:366 (file-level pruning)
2. Expected impact: 2-5x query speedup
3. Test with benchmark suite

**Priority 3: Bloom Filter Validation**
1. Add cardinality check at table creation
2. Warn if bloom filter cardinality < 1000
3. Warn if bloom filter count > 3
4. Show estimated storage overhead

**Priority 4: TEXT/UUID Bloom Filter Optimization**
1. Investigate why TEXT/UUID blooms cause 45-58% overhead
2. Optimize implementation if possible
3. Document TEXT/UUID bloom filter limitations

**Priority 5: Documentation Updates**
1. Specify bloom filter limits (1-3 columns recommended)
2. Document TEXT/UUID bloom filter overhead
3. Add cardinality validation requirement
4. Update clustering overhead expectations (20-60% observed, not <5%)

### For Benchmark Testing

**Completed**:
- ✅ 5 comprehensive benchmarks (IoT, Trading, Logs, Clickstream, Retail)
- ✅ Validation framework (prevents misconfiguration)
- ✅ Cross-benchmark comparison analysis
- ✅ Configuration best practices documentation

**Pending**:
- ⏳ Remaining 2 benchmarks (Telecom CDR, Geospatial) - per 
PAX_BENCHMARK_SUITE_PLAN.md
- ⏳ Scale testing (100M-1B rows)
- ⏳ Production deployment monitoring

---

## Appendix: Configuration Examples

### Example 1: Time-Series (IoT, Metrics, Telemetry)

```sql
CREATE TABLE sensor_readings (
    reading_date DATE,
    reading_timestamp TIMESTAMP,
    device_id VARCHAR(36),
    sensor_type_id INT,
    temperature NUMERIC,
    pressure NUMERIC,
    battery_level NUMERIC,
    location VARCHAR(50)
) USING pax WITH (
    compresstype='zstd',
    compresslevel=5,

    -- MinMax: All filterable columns (low overhead)
    minmax_columns='reading_date,reading_timestamp,device_id,
                    sensor_type_id,temperature,pressure,
                    battery_level,location',

    -- Bloom: High-cardinality device_id (100K+ devices)
    -- Omit if <100K devices (minmax sufficient)
    bloomfilter_columns='device_id',

    -- Z-order: Time + device (most queries: "device X over time range")
    cluster_type='zorder',
    cluster_columns='reading_date,device_id',

    storage_format='porc'
) DISTRIBUTED BY (device_id);
```

**Expected**: +3-5% storage overhead, 0-1% clustering overhead, 2-5x query 
speedup

---

### Example 2: E-commerce Clickstream

```sql
CREATE TABLE clickstream_events (
    event_date DATE,
    event_timestamp TIMESTAMP,
    session_id VARCHAR(36),
    user_id VARCHAR(20),
    product_id VARCHAR(8),
    event_type VARCHAR(50),
    device_type VARCHAR(20),
    country_code VARCHAR(2),
    cart_value NUMERIC,
    is_purchase BOOLEAN
) USING pax WITH (
    compresstype='zstd',
    compresslevel=5,

    -- MinMax: All filterable columns
    minmax_columns='event_date,event_timestamp,event_type,
                    device_type,country_code,cart_value,
                    is_purchase',

    -- Bloom: 3 high-cardinality columns (validated >10K)
    -- session_id: 2M+ sessions
    -- user_id: 500K+ users
    -- product_id: 100K+ products
    bloomfilter_columns='session_id,user_id,product_id',

    -- Z-order: Time + session (funnel analysis pattern)
    cluster_type='zorder',
    cluster_columns='event_date,session_id',

    storage_format='porc'
) DISTRIBUTED BY (session_id);
```

**Expected**: +2-3% storage overhead, 0.2-1% clustering overhead, 40-75x 
speedup on time-based queries

---

### Example 3: Log Analytics (Avoid TEXT Blooms)

```sql
CREATE TABLE application_logs (
    log_timestamp TIMESTAMP,
    log_date DATE,
    trace_id VARCHAR(36),      -- ✅ Use VARCHAR, not UUID
    request_id VARCHAR(36),    -- ✅ Use VARCHAR, not UUID
    log_level VARCHAR(10),
    application_id INT,
    message TEXT,              -- ❌ Never use bloom filter on TEXT
    stack_trace TEXT,          -- Sparse (95% NULL)
    error_code VARCHAR(50),    -- Sparse (95% NULL)
    user_id VARCHAR(36),
    session_id VARCHAR(36)
) USING pax WITH (
    compresstype='zstd',
    compresslevel=5,

    -- MinMax: All filterable columns
    minmax_columns='log_timestamp,log_date,log_level,
                    application_id,error_code',

    -- Bloom: NONE or 1-2 VARCHAR columns only
    -- ❌ DO NOT use bloom on TEXT/UUID columns (45-58% overhead)
    -- If you need trace_id/request_id blooms, convert to VARCHAR(36)
    bloomfilter_columns='',  -- Empty = no bloom filters

    -- Z-order: Time + trace (distributed tracing pattern)
    cluster_type='zorder',
    cluster_columns='log_date,trace_id',

    storage_format='porc'
) DISTRIBUTED BY (trace_id);
```

**Expected** (no blooms): +3-5% storage overhead, 0-1% clustering overhead
**Expected** (with TEXT/UUID blooms): +3-5% storage overhead, **45-58% 
clustering overhead** ❌

---

## Contact & Feedback

This report synthesizes findings from 5 comprehensive benchmarks (50M+ rows 
tested). I've identified several critical issues requiring development team 
attention:

1. **Clustering overhead inconsistency** (0.002% to 58%) - needs investigation
2. **File-level predicate pushdown disabled** - significant performance impact
3. **Bloom filter misconfiguration** - caused production-scale disaster (81% 
bloat)
4. **TEXT/UUID bloom filter overhead** - 45-58% bloat even with good cardinality

**Questions for Development Team**: See 20+ specific questions throughout this 
report.

**Benchmark Code**: All 5 benchmarks available in `/benchmarks` directory with:
- Complete SQL scripts (validation-first framework)
- Master automation scripts (4-12 min runtime per benchmark)
- Comprehensive documentation
- Results analysis and comparison

**Repository**: https://github.com/edespino/pax_benchmark

**Next Benchmarks**: 2 remaining per PAX_BENCHMARK_SUITE_PLAN.md (Telecom CDR, 
Geospatial)

---

**Report Generated**: October 30, 2025
**Benchmarks Completed**: 5 of 7 planned
**Total Rows Tested**: 50M+
**Total Test Time**: ~10 hours (including debugging and fixes)
**Lines of Code**: ~15,000 (SQL + shell + documentation)


GitHub link: https://github.com/apache/cloudberry/discussions/1421

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[D] PAX Storage: Questions for PAX developers [cloudberry]

Reply via email to