In the case of DataFusion / arrow-rs, the default NDV was already roughly the maximum number of rows per row group, so implementing this doesn't change memory usage at all. I wonder what the default NDV is in other implementations.
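To put a rough number on what such a default implies, here is the classic bloom-filter sizing formula, m = -n * ln(p) / (ln 2)^2. This is a hedged illustration only: Parquet implementations use split-block bloom filters whose exact sizing differs, and the NDV and FPP values below are arbitrary examples, not any implementation's defaults.

```rust
/// Classic (non-split-block) bloom-filter sizing: the number of bits m
/// needed to hold n distinct values at false-positive probability p.
/// This is the textbook formula, not necessarily what any given Parquet
/// writer uses internally.
fn classic_bloom_bits(ndv: f64, fpp: f64) -> f64 {
    -ndv * fpp.ln() / (2f64.ln() * 2f64.ln())
}

fn main() {
    // Example only: NDV = 1,048,576 (a full row group at a hypothetical
    // 1M-row default) at FPP = 1%.
    let bits = classic_bloom_bits(1_048_576.0, 0.01);
    // Roughly 10 million bits, i.e. on the order of 1.2 MiB per column
    // chunk, which is why a large default NDV already dominates memory.
    assert!(bits > 1.0e7 && bits < 1.01e7);
    println!("~{:.2} MiB", bits / 8.0 / 1024.0 / 1024.0);
}
```

Under that back-of-the-envelope estimate, a filter sized for a whole row group is already megabyte-scale, which matches the observation that starting large and folding down does not add memory over the existing default.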
> On Apr 1, 2026, at 9:29 AM, Andrew Lamb <[email protected]> wrote:
>
> I just reviewed the PR carefully and I think it is quite clever and
> correct.
>
> My summary of the feature is that it means writers who want bloom filters
> don't ever have to know / calculate the NDV (number of distinct values) for
> a column. Instead the writer can build a bloom filter and adapt the size to
> match a target false positive probability (FPP).
>
> The downside is, as Steve points out, that it requires more work / memory
> during writing. However, it does *NOT* require calculating / estimating the
> NDV, which I think is a key benefit.
>
> The blog [1] (referenced in the PR) is quite a good read.
>
> [1]:
> https://www.tigerdata.com/blog/blocked-bloom-filters-speeding-up-point-lookups-in-tiger-postgres-native-columnstore
>
> On Tue, Mar 31, 2026 at 3:17 PM Steve Loughran <[email protected]> wrote:
>
>> In that case all that matters is the extra memory consumption during
>> writes (not something to ignore... imagine many threads generating files
>> at the same time), and the extra compression delay.
>>
>> On Tue, 31 Mar 2026 at 17:24, Adrian Garcia Badaracco via dev <
>> [email protected]> wrote:
>>
>>> I think from a reader's perspective there would be no indication of how
>>> the bloom filters were created. The folded versions are identical to
>>> having started with that size in the first place.
>>>
>>>> On Mar 31, 2026, at 10:36 AM, Steve Loughran <[email protected]>
>>>> wrote:
>>>>
>>>> Assuming it compresses before writing, you wouldn't be able to tell
>>>> when you read a file how it was actually created, would you?
>>>>
>>>> On Tue, 31 Mar 2026 at 00:57, Micah Kornfield <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Adrian,
>>>>> Very interesting idea, I don't recall seeing this used in any of the
>>>>> reference implementations. On the surface I agree it looks compatible
>>>>> but I need to think a little bit more deeply about it.
>>>>>
>>>>> Cheers,
>>>>> Micah
>>>>>
>>>>> On Mon, Mar 30, 2026 at 3:27 PM Adrian Garcia Badaracco <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I think I've found a neat trick for making smaller bloom filters:
>>>>>> https://github.com/apache/arrow-rs/pull/9628
>>>>>>
>>>>>> The idea is that you choose a largeish initial bloom filter size and
>>>>>> once you're done populating it you compress it by folding it onto
>>>>>> itself if it is sparse.
>>>>>>
>>>>>> Does anyone know if this trick is used in any other Parquet
>>>>>> implementation? As far as I can tell it is compatible with the spec
>>>>>> and should cause no issues, but I haven't heard of anyone doing this
>>>>>> before.
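For readers following the thread, the folding trick can be sketched in plain Rust. This is hypothetical illustration code, not the arrow-rs PR: the real implementation operates on Parquet's 32-byte split-block filters, whereas this sketch uses a flat bit vector with four hash functions for simplicity. The key invariant is that when the bit count is a power of two, a probe position is `hash & (nbits - 1)`, so OR-ing the top half onto the bottom half maps bit `p` to `p & (nbits/2 - 1)`, exactly where lookups against the halved filter will probe.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const NUM_HASHES: u64 = 4; // arbitrary choice for this sketch

/// A flat bloom filter whose bit count is always a power of two,
/// so it can be losslessly "folded" in half while sparse.
struct FoldableBloom {
    bits: Vec<u64>, // nbits = bits.len() * 64
}

impl FoldableBloom {
    fn new(nbits: usize) -> Self {
        assert!(nbits.is_power_of_two() && nbits >= 128);
        FoldableBloom { bits: vec![0; nbits / 64] }
    }

    fn nbits(&self) -> usize {
        self.bits.len() * 64
    }

    /// Probe position for `item` under hash `seed`, masked to the
    /// *current* filter size (this is what makes folding transparent).
    fn position<T: Hash>(&self, item: &T, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        item.hash(&mut h);
        (h.finish() as usize) & (self.nbits() - 1)
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        for seed in 0..NUM_HASHES {
            let p = self.position(item, seed);
            self.bits[p / 64] |= 1u64 << (p % 64);
        }
    }

    fn contains<T: Hash>(&self, item: &T) -> bool {
        (0..NUM_HASHES).all(|seed| {
            let p = self.position(item, seed);
            self.bits[p / 64] & (1u64 << (p % 64)) != 0
        })
    }

    /// Halve the filter by OR-ing the top half onto the bottom half.
    /// A bit at position p >= nbits/2 moves to p - nbits/2, which equals
    /// p & (nbits/2 - 1): precisely the position lookups compute after
    /// the fold. No membership information is lost.
    fn fold(&mut self) {
        let half = self.bits.len() / 2;
        for i in 0..half {
            self.bits[i] |= self.bits[i + half];
        }
        self.bits.truncate(half);
    }
}

fn main() {
    let mut f = FoldableBloom::new(1 << 16); // start largeish: 64 Ki bits
    for i in 0..100 {
        f.insert(&i);
    }
    // Only 100 keys in 65,536 bits: very sparse, so fold repeatedly.
    for _ in 0..4 {
        f.fold();
    }
    assert_eq!(f.nbits(), 1 << 12); // 16x smaller
    // Bloom filters never produce false negatives, and folding
    // preserves that: every inserted key must still hit.
    assert!((0..100).all(|i| f.contains(&i)));
}
```

As the thread notes, a reader cannot distinguish a folded filter from one that was simply built at the smaller size: after `fold()`, every set bit sits exactly where a fresh filter of that size would have placed it. A writer would fold only while the filter stays sparse enough to meet its target FPP at the smaller size.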
