This is now merged into arrow-rs. Anecdotally, my company has run it in production for several weeks with no issues and great results. We plan to write a blog post <https://github.com/apache/arrow-rs/issues/9671>, which should help any other implementation looking to adopt this feature.
On Wed, 1 Apr 2026 at 10:00, Adrian Garcia Badaracco <[email protected]> wrote:
> For the case of DataFusion / arrow-rs the default NDV was already ~ max
> rows per row group. Thus implementing this doesn’t change memory usage at
> all. I wonder what the default NDV is for other implementations.
>
> > On Apr 1, 2026, at 9:29 AM, Andrew Lamb <[email protected]> wrote:
> >
> > I just reviewed the PR carefully and I think it is quite clever and
> > correct.
> >
> > My summary of the feature is that it means writers who want bloom filters
> > don't ever have to know / calculate the NDV (number of distinct values) for
> > a column. Instead the writer can build a bloom filter and adapt the size to
> > match a target false positive probability (FPP)
> >
> > The downside is, as Steve points out, that it requires more work / memory
> > during writing. However, it does *NOT* require calculating / estimating the
> > NDV which I think is a key benefit
> >
> > The blog[1] (referenced in the PR) is quite a good read
> >
> > [1]:
> > https://www.tigerdata.com/blog/blocked-bloom-filters-speeding-up-point-lookups-in-tiger-postgres-native-columnstore
> >
> > On Tue, Mar 31, 2026 at 3:17 PM Steve Loughran <[email protected]> wrote:
> >
> >> in that case all that matters is the extra memory consumption during writes
> >> (not something to ignore...imagine many threads generating files at the
> >> same time, and the extra compression delay.
> >>
> >> On Tue, 31 Mar 2026 at 17:24, Adrian Garcia Badaracco via dev <[email protected]> wrote:
> >>
> >>> I think from a readers perspective there would be no indication of how the
> >>> bloom filters were created. The folded versions are identical to having
> >>> started with that size in the first place.
> >>>
> >>>> On Mar 31, 2026, at 10:36 AM, Steve Loughran <[email protected]> wrote:
> >>>>
> >>>> Assuming it compresses before writing, you wouldn't be able to tell when
> >>>> you read a file how it was actually created, would you?
> >>>>
> >>>> On Tue, 31 Mar 2026 at 00:57, Micah Kornfield <[email protected]> wrote:
> >>>>
> >>>>> Hi Adrian,
> >>>>> Very interesting idea, I don't recall seeing this used in any of the
> >>>>> reference implementations. On the surface I agree it looks compatible but
> >>>>> I need to think a little bit more deeply about it.
> >>>>>
> >>>>> Cheers,
> >>>>> Micah
> >>>>>
> >>>>> On Mon, Mar 30, 2026 at 3:27 PM Adrian Garcia Badaracco <[email protected]> wrote:
> >>>>>
> >>>>>> I think I've found a neat trick for making smaller bloom filters:
> >>>>>> https://github.com/apache/arrow-rs/pull/9628
> >>>>>>
> >>>>>> The idea is that you choose a largeish initial bloom filter size and once
> >>>>>> you're done populating it you compress it by folding it onto itself if it
> >>>>>> is sparse.
> >>>>>>
> >>>>>> Does anyone know if this trick is used in any other Parquet implementation?
> >>>>>> As far as I can tell it is compatible with the spec and should cause no
> >>>>>> issues, but I haven't heard of anyone doing this before.
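
For anyone skimming the thread who wants the folding idea in concrete form, here is a minimal sketch in Python. It is not the arrow-rs code and does not use Parquet's split-block layout or xxHash. It assumes a plain bit-array bloom filter with a power-of-two size and modulo indexing, and every name in it (FoldableBloom, fold_once, fold_to_target, K) is made up for illustration. Under those assumptions, halving the filter is just OR-ing the upper half onto the lower half, and the result is bit-for-bit what you would get by starting at the smaller size. The FPP estimate is the usual fill_ratio ** K approximation.

```python
import hashlib

K = 4  # probes per value; an assumed constant for this sketch


def _hashes(value: bytes):
    """Yield K 64-bit hashes for a value (illustrative; Parquet uses xxHash)."""
    for i in range(K):
        digest = hashlib.blake2b(value, digest_size=8, salt=bytes([i])).digest()
        yield int.from_bytes(digest, "little")


class FoldableBloom:
    """Plain bloom filter over a power-of-two number of bits (hypothetical type)."""

    def __init__(self, num_bits: int):
        assert num_bits > 0 and num_bits & (num_bits - 1) == 0, "size must be a power of two"
        self.num_bits = num_bits
        self.bits = 0  # Python int used as a bitset

    def insert(self, value: bytes) -> None:
        for h in _hashes(value):
            self.bits |= 1 << (h % self.num_bits)

    def might_contain(self, value: bytes) -> bool:
        return all((self.bits >> (h % self.num_bits)) & 1 for h in _hashes(value))

    def estimated_fpp(self) -> float:
        # Approximation: probability that all K probed bits happen to be set.
        fill_ratio = bin(self.bits).count("1") / self.num_bits
        return fill_ratio ** K

    def fold_once(self) -> None:
        # Because h % (m / 2) == (h % m) % (m / 2) for power-of-two m,
        # OR-ing the upper half onto the lower half preserves every inserted value.
        half = self.num_bits // 2
        low = self.bits & ((1 << half) - 1)
        high = self.bits >> half
        self.bits = low | high
        self.num_bits = half

    def fold_to_target(self, target_fpp: float, min_bits: int = 64) -> None:
        # Keep halving while the folded filter would still meet the target FPP.
        while self.num_bits > min_bits:
            trial = FoldableBloom(self.num_bits)
            trial.bits = self.bits
            trial.fold_once()
            if trial.estimated_fpp() > target_fpp:
                break
            self.bits, self.num_bits = trial.bits, trial.num_bits


if __name__ == "__main__":
    values = [f"v{i}".encode() for i in range(100)]
    bloom = FoldableBloom(1 << 16)  # "largeish" starting size
    for v in values:
        bloom.insert(v)
    bloom.fold_to_target(0.01)
    print(bloom.num_bits, bloom.estimated_fpp())
```

The writer never needs an NDV estimate here: it only looks at how full the filter ended up. The cost is the one Steve raised, namely holding the large filter in memory until the column chunk is finished and doing the fold passes at the end.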
