Re: Compressing bloom filters

Steve Loughran Tue, 31 Mar 2026 14:44:23 -0700

In that case all that matters is the extra memory consumption during writes
(not something to ignore: imagine many threads generating files at the
same time) and the extra compression delay.


On Tue, 31 Mar 2026 at 17:24, Adrian Garcia Badaracco via dev <
[email protected]> wrote:

> I think from a reader's perspective there would be no indication of how the
> bloom filters were created. The folded versions are identical to having
> started with that size in the first place.
>
> > On Mar 31, 2026, at 10:36 AM, Steve Loughran <[email protected]> wrote:
> >
> > Assuming it compresses before writing, you wouldn't be able to tell when
> > you read a file how it was actually created, would you?
> >
> > On Tue, 31 Mar 2026 at 00:57, Micah Kornfield <[email protected]> wrote:
> >
> >> Hi Adrian,
> >> Very interesting idea, I don't recall seeing this used in any of the
> >> reference implementations.  On the surface I agree it looks compatible but
> >> I need to think a little bit more deeply about it.
> >>
> >> Cheers,
> >> Micah
> >>
> >> On Mon, Mar 30, 2026 at 3:27 PM Adrian Garcia Badaracco <[email protected]> wrote:
> >>
> >>> I think I've found a neat trick for making smaller bloom filters:
> >>> https://github.com/apache/arrow-rs/pull/9628
> >>>
> >>> The idea is that you choose a largeish initial bloom filter size and once
> >>> you're done populating it you compress it by folding it onto itself if it
> >>> is sparse.
> >>>
> >>> Does anyone know if this trick is used in any other Parquet implementation?
> >>> As far as I can tell it is compatible with the spec and should cause no
> >>> issues, but I haven't heard of anyone doing this before.
> >>>
> >>
>
>
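
For illustration, a minimal sketch of the folding idea discussed in this
thread. It is not the code from the linked pull request and it does not use
the Parquet split-block layout: it assumes a plain bloom filter whose bit
positions are computed as hash modulo filter size, stored in a Python int.
The helper names and the 0.25 fill threshold are hypothetical. Under those
assumptions, OR-ing the upper half onto the lower half gives exactly the
filter you would have built at half the size, which is the property claimed
above.

```python
import hashlib


def bit_positions(item: bytes, num_bits: int, num_hashes: int = 3) -> list[int]:
    """Derive num_hashes bit positions in [0, num_bits) from one SHA-256 digest."""
    digest = hashlib.sha256(item).digest()
    return [
        int.from_bytes(digest[8 * i : 8 * i + 8], "little") % num_bits
        for i in range(num_hashes)
    ]


def build(items, num_bits: int) -> int:
    """Build a filter of num_bits bits, using a Python int as the bitset."""
    bits = 0
    for item in items:
        for pos in bit_positions(item, num_bits):
            bits |= 1 << pos
    return bits


def fold(bits: int, num_bits: int) -> tuple[int, int]:
    """Halve an even-sized filter by OR-ing its upper half onto its lower half.

    A bit at position p >= half lands on p - half, which equals p % half,
    so every item maps to the same bits it would have in a half-size filter.
    """
    half = num_bits // 2
    mask = (1 << half) - 1
    return (bits & mask) | (bits >> half), half


def fold_while_sparse(bits: int, num_bits: int, max_fill: float = 0.25):
    """Fold repeatedly while the fraction of set bits stays below max_fill.

    The threshold is an arbitrary illustrative choice, not a recommendation.
    """
    while num_bits % 2 == 0 and num_bits > 2:
        if bin(bits).count("1") / num_bits >= max_fill:
            break
        bits, num_bits = fold(bits, num_bits)
    return bits, num_bits


def might_contain(bits: int, num_bits: int, item: bytes) -> bool:
    """Standard bloom filter lookup: all of the item's bits must be set."""
    return all((bits >> pos) & 1 for pos in bit_positions(item, num_bits))
```

A reader only needs the final bitset and its size, so a folded filter is
queried exactly like one that was created at that size. The writer still has
to hold the larger initial filter in memory until folding is done.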
