Re: Compressing bloom filters

Adrian Garcia Badaracco via dev Tue, 07 Apr 2026 05:12:39 -0700

We've merged this into arrow-rs. Anecdotally, my company has had this in
production for several weeks with no issues and great results. We plan to
write a blog post <https://github.com/apache/arrow-rs/issues/9671>, which
should help any other implementation looking to adopt this feature.


On Wed, 1 Apr 2026 at 10:00, Adrian Garcia Badaracco <[email protected]>
wrote:

> For the case of DataFusion / arrow-rs the default NDV was already ~ max
> rows per row group. Thus implementing this doesn’t change memory usage at
> all. I wonder what the default NDV is for other implementations.
>
> > On Apr 1, 2026, at 9:29 AM, Andrew Lamb <[email protected]> wrote:
> >
> > I just reviewed the PR carefully and I think it is quite clever and
> > correct.
> >
> > My summary of the feature is that it means writers who want bloom filters
> > don't ever have to know / calculate the NDV (number of distinct values)
> > for a column. Instead the writer can build a bloom filter and adapt the
> > size to match a target false positive probability (FPP).
> >
> > The downside is, as Steve points out, that it requires more work / memory
> > during writing. However, it does *NOT* require calculating / estimating
> > the NDV, which I think is a key benefit.
> >
> > The blog[1] (referenced in the PR) is quite a good read
> >
> > [1]:
> > https://www.tigerdata.com/blog/blocked-bloom-filters-speeding-up-point-lookups-in-tiger-postgres-native-columnstore
> >
> > On Tue, Mar 31, 2026 at 3:17 PM Steve Loughran <[email protected]>
> wrote:
> >
> >> In that case all that matters is the extra memory consumption during
> >> writes (not something to ignore: imagine many threads generating files
> >> at the same time) and the extra compression delay.
> >>
> >> On Tue, 31 Mar 2026 at 17:24, Adrian Garcia Badaracco via dev <
> >> [email protected]> wrote:
> >>
> >>> I think from a reader's perspective there would be no indication of how
> >>> the bloom filters were created. The folded versions are identical to
> >>> having started with that size in the first place.
> >>>
> >>>> On Mar 31, 2026, at 10:36 AM, Steve Loughran <[email protected]>
> >>> wrote:
> >>>>
> >>>> Assuming it compresses before writing, you wouldn't be able to tell
> >>>> when you read a file how it was actually created, would you?
> >>>>
> >>>> On Tue, 31 Mar 2026 at 00:57, Micah Kornfield <[email protected]>
> >>> wrote:
> >>>>
> >>>>> Hi Adrian,
> >>>>> Very interesting idea, I don't recall seeing this used in any of the
> >>>>> reference implementations.  On the surface I agree it looks compatible
> >>>>> but I need to think a little bit more deeply about it.
> >>>>>
> >>>>> Cheers,
> >>>>> Micah
> >>>>>
> >>>>> On Mon, Mar 30, 2026 at 3:27 PM Adrian Garcia Badaracco <
> >>>>> [email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> I think I've found a neat trick for making smaller bloom filters:
> >>>>>> https://github.com/apache/arrow-rs/pull/9628
> >>>>>>
> >>>>>> The idea is that you choose a largeish initial bloom filter size and
> >>>>>> once you're done populating it you compress it by folding it onto
> >>>>>> itself if it is sparse.
> >>>>>>
> >>>>>> Does anyone know if this trick is used in any other Parquet
> >>>>>> implementation? As far as I can tell it is compatible with the spec and
> >>>>>> should cause no issues, but I haven't heard of anyone doing this before.
> >>>>>>
> >>>>>
> >>>
> >>>
> >>
>
>
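
The folding idea discussed in this thread can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the arrow-rs implementation: it uses a simplified classic bloom filter whose bit positions are taken modulo a power-of-two size, not Parquet's split-block layout. All function names, the choice of BLAKE2b, the four probes per value, and the FPP estimate (fill fraction raised to the number of probes) are assumptions made for the example.

```python
import hashlib

NUM_HASHES = 4  # assumed number of probes per value (illustrative only)


def bit_positions(value, num_bits):
    """Derive NUM_HASHES bit positions from one digest via double hashing."""
    digest = hashlib.blake2b(value, digest_size=16).digest()
    h1 = int.from_bytes(digest[:8], "little")
    h2 = int.from_bytes(digest[8:], "little") | 1
    return [(h1 + i * h2) % num_bits for i in range(NUM_HASHES)]


def build(values, num_bits):
    """Build a filter as a Python int used as a bitset.

    num_bits is assumed to be a power of two.
    """
    bits = 0
    for value in values:
        for pos in bit_positions(value, num_bits):
            bits |= 1 << pos
    return bits


def might_contain(bits, num_bits, value):
    """Return False only if the value was definitely never inserted."""
    return all((bits >> pos) & 1 for pos in bit_positions(value, num_bits))


def fold(bits, num_bits):
    """Halve the filter by OR-ing its upper half onto its lower half.

    Because positions are taken modulo a power of two, a bit at position p
    lands at p % (num_bits // 2), which is where a filter built at half the
    size would have set it.
    """
    half = num_bits // 2
    return (bits & ((1 << half) - 1)) | (bits >> half)


def fold_to_target(bits, num_bits, target_fpp, min_bits=64):
    """Fold repeatedly while the estimated FPP stays at or below target_fpp.

    The estimate fill ** NUM_HASHES is the usual approximation for a classic
    bloom filter; it is an assumption of this sketch.
    """
    while num_bits > min_bits:
        half = num_bits // 2
        folded = fold(bits, num_bits)
        fill = bin(folded).count("1") / half
        if fill ** NUM_HASHES > target_fpp:
            break
        bits, num_bits = folded, half
    return bits, num_bits
```

A writer following this sketch would call `build` with a generously sized filter, then `fold_to_target` once the values are in, and write out the smaller result. In this simplified model, the folded filter is bit-for-bit equal to one built directly at the final size, which is why a reader cannot tell the difference.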
