I just reviewed the PR carefully and I think it is quite clever and
correct.

My summary of the feature: writers who want bloom filters never have to
know or estimate the NDV (number of distinct values) of a column up
front. Instead, the writer builds a bloom filter and adapts its size to
match a target false positive probability (FPP).

The downside is, as Steve points out, that it requires more work and
memory during writing. However, it does *NOT* require calculating or
estimating the NDV, which I think is a key benefit.

The blog post [1] (referenced in the PR) is quite a good read.

[1]:
https://www.tigerdata.com/blog/blocked-bloom-filters-speeding-up-point-lookups-in-tiger-postgres-native-columnstore
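
To make the folding idea concrete, here is a minimal Python sketch (my
own illustration, not the arrow-rs implementation, and a plain bit-array
filter rather than Parquet's split-block bloom filter). The key property
is that with a power-of-two size, a bit set at `h % n` always lands on
`h % (n/2)` after OR-ing the top half onto the bottom half, so folding
can never introduce false negatives:

```python
import hashlib

def indexes(item: bytes, num_bits: int, k: int = 4):
    # Derive k bit positions from a hash. num_bits must be a power of
    # two so that folding preserves membership (h % n folds to h % (n/2)).
    digest = hashlib.sha256(item).digest()
    for i in range(k):
        h = int.from_bytes(digest[i * 4:(i + 1) * 4], "little")
        yield h % num_bits

def insert(bits, item):
    for ix in indexes(item, len(bits)):
        bits[ix] = 1

def contains(bits, item):
    # Positions are recomputed against the *current* size, so lookups
    # work the same on the folded filter.
    return all(bits[ix] for ix in indexes(item, len(bits)))

def fold(bits):
    # OR the top half onto the bottom half, halving the filter size.
    half = len(bits) // 2
    return [a | b for a, b in zip(bits[:half], bits[half:])]

# Demo: start oversized, insert values, then fold down while sparse.
bits = [0] * 1024
items = [f"row-{i}".encode() for i in range(50)]
for it in items:
    insert(bits, it)
while len(bits) > 256:
    bits = fold(bits)
assert all(contains(bits, it) for it in items)  # no false negatives
```

A real writer would decide how far to fold by tracking the fraction of
set bits (or an FPP estimate) rather than a fixed target size, but the
mechanics of the fold are the same.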

On Tue, Mar 31, 2026 at 3:17 PM Steve Loughran <[email protected]> wrote:

> in that case all that matters is the extra memory consumption during writes
> (not something to ignore...imagine many threads generating files at the
> same time, and the extra compression delay.
>
> On Tue, 31 Mar 2026 at 17:24, Adrian Garcia Badaracco via dev <
> [email protected]> wrote:
>
> > I think from a readers perspective there would be no indication of how
> the
> > bloom filters were created. The folded versions are identical to having
> > started with that size in the first place.
> >
> > > On Mar 31, 2026, at 10:36 AM, Steve Loughran <[email protected]>
> > wrote:
> > >
> > > Assuming it compresses before writing, you wouldn't be able to tell
> when
> > > you read a file how it was actually created, would you?
> > >
> > > On Tue, 31 Mar 2026 at 00:57, Micah Kornfield <[email protected]>
> > wrote:
> > >
> > >> Hi Adrian,
> > >> Very interesting idea, I don't recall seeing this used in any of the
> > >> reference implementations.  On the surface I agree it looks compatible
> > but
> > >> I need to think a little bit more deeply about it.
> > >>
> > >> Cheers,
> > >> Micah
> > >>
> > >> On Mon, Mar 30, 2026 at 3:27 PM Adrian Garcia Badaracco <
> > >> [email protected]>
> > >> wrote:
> > >>
> > >>> I think I've found a neat trick for making smaller bloom filters:
> > >>> https://github.com/apache/arrow-rs/pull/9628
> > >>>
> > >>> The idea is that you choose a largeish initial bloom filter size and
> > once
> > >>> you're done populating it you compress it by folding it onto itself
> if
> > it
> > >>> is sparse.
> > >>>
> > >>> Does anyone know if this trick is used in any other Parquet
> > >> implementation?
> > >>> As far as I can tell it is compatible with the spec and should cause
> no
> > >>> issues, but I haven't heard of anyone doing this before.
> > >>>
> > >>
> >
> >
>
