GitHub user gfphoenix78 added a comment to the discussion: PAX Storage: 
Questions for PAX developers

#### Issue 4

> 1 Are TEXT bloom filters implemented differently from VARCHAR?
Do they use a different hash function?
Is there variable-length metadata overhead?
Why is VARCHAR(8) overhead so high (197 MB for 1 column)?

No, the hash function is the same for all column types now.
There is no additional overhead for variable-length types.

> Is there a bloom filter size configuration?

The GUC pax.bloom_filter_work_memory_bytes may control the size of
the bloom filter meta structure. Note that the total storage overhead of
bloom filters is proportional to the number of micro-partition files/groups.
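
As a rough illustration of that proportionality, the sketch below estimates the total bloom filter overhead from the number of groups. The function name and all figures are hypothetical. Treating every group as holding one fixed-size filter per bloom-enabled column is a simplifying assumption, not a description of the PAX on-disk format.

```python
def estimate_bloom_overhead_bytes(num_files: int, groups_per_file: int,
                                  bloom_columns: int, filter_bytes: int) -> int:
    """Estimate total bloom filter storage, assuming one fixed-size filter
    per bloom-enabled column in every group (a simplifying assumption)."""
    return num_files * groups_per_file * bloom_columns * filter_bytes


# Hypothetical figures, chosen only to show the scaling.
FILTER_BYTES = 10 * 1024  # assumed per-filter size, not a documented default

small_table = estimate_bloom_overhead_bytes(100, 4, 1, FILTER_BYTES)
large_table = estimate_bloom_overhead_bytes(1000, 4, 1, FILTER_BYTES)

print(f"100 files:  {small_table / 2**20:.1f} MiB")
print(f"1000 files: {large_table / 2**20:.1f} MiB")
```

Under these assumptions, ten times as many micro-partition files means ten times the bloom filter overhead, regardless of how well the column data itself compresses.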

> Should documentation recommend avoiding TEXT/UUID blooms?
Current guidance: "use bloom filters on high-cardinality columns"
Should add: "avoid TEXT/UUID types - use VARCHAR instead"?

For string types, especially with low cardinality, the storage overhead of
bloom filters is amplified: the column values may compress to a small size,
but the bloom filter meta structure keeps its original size, because it is
not compressed yet.
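
The amplification can be sketched with a toy column. Everything here is an assumption made for illustration: the sample values, the row count, the fixed filter size, and the use of Python's zlib as a stand-in for whatever column compression is configured.

```python
import zlib

# Hypothetical low-cardinality string column: 100,000 rows, 4 distinct values.
distinct_values = [b"red", b"green", b"blue", b"gold"]
rows = [distinct_values[i % len(distinct_values)] for i in range(100_000)]

raw_bytes = sum(len(v) for v in rows)
# zlib stands in for the column's compression codec (an assumption).
compressed_bytes = len(zlib.compress(b"".join(rows)))

# Assumed fixed, uncompressed filter size for the group (not a PAX default).
filter_bytes = 10 * 1024

print(f"raw column data:        {raw_bytes} bytes")
print(f"compressed column data: {compressed_bytes} bytes")
print(f"bloom filter (fixed):   {filter_bytes} bytes")
print(f"filter / compressed:    {filter_bytes / compressed_bytes:.1f}x")
```

Because the repetitive values compress to a small fraction of their raw size while the filter stays at its full size, the filter can end up larger than the data it describes.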

GitHub link: 
https://github.com/apache/cloudberry/discussions/1421#discussioncomment-14827470

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

