I just reviewed the PR carefully and I think it is quite clever and correct.
My summary of the feature is that it means writers who want bloom filters
don't ever have to know / calculate the NDV (number of distinct values) for
a column. Instead the writer can build a bloom filter and adapt the size to
match a target false positive probability (FPP).

The downside is, as Steve points out, that it requires more work / memory
during writing. However, it does *NOT* require calculating / estimating the
NDV which I think is a key benefit.

The blog[1] (referenced in the PR) is quite a good read.

[1]: https://www.tigerdata.com/blog/blocked-bloom-filters-speeding-up-point-lookups-in-tiger-postgres-native-columnstore

On Tue, Mar 31, 2026 at 3:17 PM Steve Loughran <[email protected]> wrote:

> in that case all that matters is the extra memory consumption during writes
> (not something to ignore...imagine many threads generating files at the
> same time, and the extra compression delay.
>
> On Tue, 31 Mar 2026 at 17:24, Adrian Garcia Badaracco via dev <
> [email protected]> wrote:
>
> > I think from a readers perspective there would be no indication of how
> > the bloom filters were created. The folded versions are identical to
> > having started with that size in the first place.
> >
> > > On Mar 31, 2026, at 10:36 AM, Steve Loughran <[email protected]>
> > > wrote:
> > >
> > > Assuming it compresses before writing, you wouldn't be able to tell
> > > when you read a file how it was actually created, would you?
> > >
> > > On Tue, 31 Mar 2026 at 00:57, Micah Kornfield <[email protected]>
> > > wrote:
> > >
> > >> Hi Adrian,
> > >> Very interesting idea, I don't recall seeing this used in any of the
> > >> reference implementations. On the surface I agree it looks compatible
> > >> but I need to think a little bit more deeply about it.
> > >>
> > >> Cheers,
> > >> Micah
> > >>
> > >> On Mon, Mar 30, 2026 at 3:27 PM Adrian Garcia Badaracco <
> > >> [email protected]> wrote:
> > >>
> > >>> I think I've found a neat trick for making smaller bloom filters:
> > >>> https://github.com/apache/arrow-rs/pull/9628
> > >>>
> > >>> The idea is that you choose a largeish initial bloom filter size and
> > >>> once you're done populating it you compress it by folding it onto
> > >>> itself if it is sparse.
> > >>>
> > >>> Does anyone know if this trick is used in any other Parquet
> > >>> implementation?
> > >>> As far as I can tell it is compatible with the spec and should cause
> > >>> no issues, but I haven't heard of anyone doing this before.
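
For readers new to the thread, the folding idea quoted above can be sketched
roughly as follows. This is only an illustration under simplifying
assumptions, not the code from the PR: it uses a plain bit-array Bloom filter
with power-of-two sizes and mask-based indexing rather than Parquet's
split-block layout, and the names `FoldableBloom`, `fold_once` and
`fold_to_target` are hypothetical. The FPP estimate is the usual
approximation (fraction of set bits raised to the number of probes).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Bloom filter over a plain bit array (hypothetical; NOT Parquet's
/// split-block layout and NOT the implementation in the PR).
struct FoldableBloom {
    words: Vec<u64>, // bit array; length is always a power of two
    k: u32,          // number of hash probes per value
}

impl FoldableBloom {
    fn new(num_bits: usize, k: u32) -> Self {
        assert!(num_bits.is_power_of_two() && num_bits >= 64);
        Self { words: vec![0u64; num_bits / 64], k }
    }

    fn num_bits(&self) -> usize {
        self.words.len() * 64
    }

    /// Bit position for one probe: hash masked to the current size.
    /// Because the size is a power of two, the index at size m/2 is the
    /// index at size m with the top bit dropped, which is what makes
    /// folding equivalent to having started at the smaller size.
    fn bit_index(&self, value: &str, probe: u32) -> usize {
        let mut hasher = DefaultHasher::new();
        (value, probe).hash(&mut hasher);
        (hasher.finish() as usize) & (self.num_bits() - 1)
    }

    fn insert(&mut self, value: &str) {
        for probe in 0..self.k {
            let bit = self.bit_index(value, probe);
            self.words[bit / 64] |= 1u64 << (bit % 64);
        }
    }

    fn might_contain(&self, value: &str) -> bool {
        (0..self.k).all(|probe| {
            let bit = self.bit_index(value, probe);
            self.words[bit / 64] & (1u64 << (bit % 64)) != 0
        })
    }

    /// Approximate FPP: (fraction of bits set) ^ k.
    fn estimated_fpp(&self) -> f64 {
        let set: u32 = self.words.iter().map(|w| w.count_ones()).sum();
        (set as f64 / self.num_bits() as f64).powi(self.k as i32)
    }

    /// Halve the filter by OR-ing the upper half onto the lower half.
    fn fold_once(&mut self) {
        let half = self.words.len() / 2;
        for i in 0..half {
            let upper = self.words[i + half];
            self.words[i] |= upper;
        }
        self.words.truncate(half);
    }

    /// Keep folding while the folded filter still meets the target FPP.
    fn fold_to_target(&mut self, target_fpp: f64) {
        while self.words.len() > 1 {
            let mut candidate = FoldableBloom { words: self.words.clone(), k: self.k };
            candidate.fold_once();
            if candidate.estimated_fpp() > target_fpp {
                break;
            }
            *self = candidate;
        }
    }
}
```

In this sketch a writer would call `FoldableBloom::new` with a generous size,
`insert` every value, then call `fold_to_target` with the desired FPP before
serializing; no NDV estimate is needed up front. The Parquet spec selects a
block with a different index computation than the mask used here, so a real
writer would have to fold in a pattern that matches that mapping.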
