GitHub user gfphoenix78 added a comment to the discussion: PAX Storage:
Questions for PAX developers
> 1. Can PAX detect low-cardinality bloom filters at table creation?
> Check n_distinct from pg_stats
> Warn or error if cardinality < threshold (e.g., 1000)
The bits of the bloom filter are updated synchronously while tuples are
being INSERTed. The code can't know which tuples will be inserted in the
future. The practical approach would be to build the bloom filter after all
tuples have been inserted, but that conflicts with the current design, in
which a micro-partition file contains both the tuples and the bloom
filter/minmax info, etc.
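
To illustrate the point, here is a minimal sketch of a writer that sets bloom filter bits row by row and keeps the filter together with the tuples. It is a toy model, not the PAX code: `ToyBloomFilter`, `write_group`, the filter size, and the hashing scheme are all assumptions made up for this example.

```python
import hashlib


class ToyBloomFilter:
    """Illustrative bloom filter; NOT the PAX implementation."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: str):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def write_group(rows):
    """Hypothetical writer: the filter is updated as each row is appended."""
    bloom = ToyBloomFilter()
    stored = []
    for row in rows:
        stored.append(row)
        bloom.add(row)  # synchronous update; later rows are unknown here
    # Tuples and filter are returned together, mirroring a single-file layout.
    return {"tuples": stored, "bloom": bloom}


group = write_group(["us", "eu", "apac"])
print(group["bloom"].might_contain("eu"))
```

In this model the number of distinct values is only known once the loop has finished, which is why a check at table creation has nothing to look at.
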
> 2. Can bloom filter size be estimated before creation?
> Show estimated overhead: "Bloom filters will add ~800 MB"
> Allow users to make informed decisions
No. The storage size of the bloom filters grows linearly with the number of
files/groups. We can't know how much data will be inserted when the table is
created.
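
The linear growth can be written down as a small calculation. This is only a sketch under assumptions: it supposes one fixed-size filter per bloom filter column per file/group, and the function name `bloom_overhead_bytes` and the 64 KiB filter size are hypothetical, not values taken from PAX.

```python
def bloom_overhead_bytes(num_units: int, bloom_columns: int,
                         bytes_per_filter: int) -> int:
    """Total bloom filter bytes, assuming one fixed-size filter
    per column per unit (a unit being a file or a group)."""
    return num_units * bloom_columns * bytes_per_filter


# Hypothetical filter size, chosen only to show the linear growth.
BYTES_PER_FILTER = 64 * 1024

for units in (10, 100, 1000):
    total = bloom_overhead_bytes(units, bloom_columns=1,
                                 bytes_per_filter=BYTES_PER_FILTER)
    print(f"{units:>5} files/groups -> {total} bytes")
```

The only input that dominates the result is the number of files/groups, and that depends on how much data is eventually loaded, which is not available at CREATE TABLE time.
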
> 3. Should there be a cardinality validation gate in PAX code?
> CREATE TABLE foo (...) USING pax WITH (
>     bloomfilter_columns='region' -- 5 unique values
> );
> -- WARNING: Column 'region' has low cardinality (5).
> -- Bloom filters are ineffective for <1000 unique values.
> -- This will waste ~50 MB per micro-partition file.
> -- Consider using minmax_columns instead.
No, we can't know the cardinality of a column in advance.
> 4. Can EXPLAIN ANALYZE show bloom filter effectiveness?
```
-> PAX Scan on foo
Bloom Filter: region (5 unique values)
Files Scanned: 8 / 8 (0% skipped - bloom filter ineffective)
Bloom Filter Overhead: 400 MB wasted
```
Good suggestion. For now, the filtered files/groups can be seen in the log
when `pax.enable_debug` is enabled.
GitHub link:
https://github.com/apache/cloudberry/discussions/1421#discussioncomment-14827267