In the case of DataFusion / arrow-rs, the default NDV was already roughly the maximum number of rows per row group, so implementing this doesn't change memory usage at all. I wonder what the default NDV is in other implementations.
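To put a rough number on what such a default implies, here is the classic bloom-filter sizing formula, m = -n * ln(p) / (ln 2)^2. This is a hedged illustration only: Parquet implementations use split-block bloom filters whose exact sizing differs, and the NDV and FPP values below are arbitrary examples, not any implementation's defaults.

```rust
/// Classic (non-split-block) bloom-filter sizing: the number of bits m
/// needed to hold n distinct values at false-positive probability p.
/// This is the textbook formula, not necessarily what any given Parquet
/// writer uses internally.
fn classic_bloom_bits(ndv: f64, fpp: f64) -> f64 {
    -ndv * fpp.ln() / (2f64.ln() * 2f64.ln())
}

fn main() {
    // Example only: NDV = 1,048,576 (a full row group at a hypothetical
    // 1M-row default) at FPP = 1%.
    let bits = classic_bloom_bits(1_048_576.0, 0.01);
    // Roughly 10 million bits, i.e. on the order of 1.2 MiB per column
    // chunk, which is why a large default NDV already dominates memory.
    assert!(bits > 1.0e7 && bits < 1.01e7);
    println!("~{:.2} MiB", bits / 8.0 / 1024.0 / 1024.0);
}
```

Under that back-of-the-envelope estimate, a filter sized for a whole row group is already megabyte-scale, which matches the observation that starting large and folding down does not add memory over the existing default.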
> On Apr 1, 2026, at 9:29 AM, Andrew Lamb <[email protected]> wrote:
>
> I just reviewed the PR carefully and I think it is quite clever and
> correct.
>
> My summary of the feature is that it means writers who want bloom filters
> don't ever have to know / calculate the NDV (number of distinct values) for
> a column. Instead the writer can build a bloom filter and adapt the size to
> match a target false positive probability (FPP).
>
> The downside is, as Steve points out, that it requires more work / memory
> during writing. However, it does *NOT* require calculating / estimating the
> NDV, which I think is a key benefit.
>
> The blog [1] (referenced in the PR) is quite a good read.
>
> [1]:
> https://www.tigerdata.com/blog/blocked-bloom-filters-speeding-up-point-lookups-in-tiger-postgres-native-columnstore
>
> On Tue, Mar 31, 2026 at 3:17 PM Steve Loughran <[email protected]> wrote:
>
>> In that case all that matters is the extra memory consumption during
>> writes (not something to ignore... imagine many threads generating files
>> at the same time), and the extra compression delay.
>>
>> On Tue, 31 Mar 2026 at 17:24, Adrian Garcia Badaracco via dev <
>> [email protected]> wrote:
>>
>>> I think from a reader's perspective there would be no indication of how
>>> the bloom filters were created. The folded versions are identical to
>>> having started with that size in the first place.
>>>
>>>> On Mar 31, 2026, at 10:36 AM, Steve Loughran <[email protected]>
>>>> wrote:
>>>>
>>>> Assuming it compresses before writing, you wouldn't be able to tell
>>>> when you read a file how it was actually created, would you?
>>>>
>>>> On Tue, 31 Mar 2026 at 00:57, Micah Kornfield <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Adrian,
>>>>> Very interesting idea, I don't recall seeing this used in any of the
>>>>> reference implementations. On the surface I agree it looks compatible
>>>>> but I need to think a little bit more deeply about it.
>>>>>
>>>>> Cheers,
>>>>> Micah
>>>>>
>>>>> On Mon, Mar 30, 2026 at 3:27 PM Adrian Garcia Badaracco <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I think I've found a neat trick for making smaller bloom filters:
>>>>>> https://github.com/apache/arrow-rs/pull/9628
>>>>>>
>>>>>> The idea is that you choose a largeish initial bloom filter size and
>>>>>> once you're done populating it you compress it by folding it onto
>>>>>> itself if it is sparse.
>>>>>>
>>>>>> Does anyone know if this trick is used in any other Parquet
>>>>>> implementation? As far as I can tell it is compatible with the spec
>>>>>> and should cause no issues, but I haven't heard of anyone doing this
>>>>>> before.
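For readers following the thread, the folding trick can be sketched in plain Rust. This is hypothetical illustration code, not the arrow-rs PR: the real implementation operates on Parquet's 32-byte split-block filters, whereas this sketch uses a flat bit vector with four hash functions for simplicity. The key invariant is that when the bit count is a power of two, a probe position is `hash & (nbits - 1)`, so OR-ing the top half onto the bottom half maps bit `p` to `p & (nbits/2 - 1)`, exactly where lookups against the halved filter will probe.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const NUM_HASHES: u64 = 4; // arbitrary choice for this sketch

/// A flat bloom filter whose bit count is always a power of two,
/// so it can be losslessly "folded" in half while sparse.
struct FoldableBloom {
    bits: Vec<u64>, // nbits = bits.len() * 64
}

impl FoldableBloom {
    fn new(nbits: usize) -> Self {
        assert!(nbits.is_power_of_two() && nbits >= 128);
        FoldableBloom { bits: vec![0; nbits / 64] }
    }

    fn nbits(&self) -> usize {
        self.bits.len() * 64
    }

    /// Probe position for `item` under hash `seed`, masked to the
    /// *current* filter size (this is what makes folding transparent).
    fn position<T: Hash>(&self, item: &T, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        item.hash(&mut h);
        (h.finish() as usize) & (self.nbits() - 1)
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        for seed in 0..NUM_HASHES {
            let p = self.position(item, seed);
            self.bits[p / 64] |= 1u64 << (p % 64);
        }
    }

    fn contains<T: Hash>(&self, item: &T) -> bool {
        (0..NUM_HASHES).all(|seed| {
            let p = self.position(item, seed);
            self.bits[p / 64] & (1u64 << (p % 64)) != 0
        })
    }

    /// Halve the filter by OR-ing the top half onto the bottom half.
    /// A bit at position p >= nbits/2 moves to p - nbits/2, which equals
    /// p & (nbits/2 - 1): precisely the position lookups compute after
    /// the fold. No membership information is lost.
    fn fold(&mut self) {
        let half = self.bits.len() / 2;
        for i in 0..half {
            self.bits[i] |= self.bits[i + half];
        }
        self.bits.truncate(half);
    }
}

fn main() {
    let mut f = FoldableBloom::new(1 << 16); // start largeish: 64 Ki bits
    for i in 0..100 {
        f.insert(&i);
    }
    // Only 100 keys in 65,536 bits: very sparse, so fold repeatedly.
    for _ in 0..4 {
        f.fold();
    }
    assert_eq!(f.nbits(), 1 << 12); // 16x smaller
    // Bloom filters never produce false negatives, and folding
    // preserves that: every inserted key must still hit.
    assert!((0..100).all(|i| f.contains(&i)));
}
```

As the thread notes, a reader cannot distinguish a folded filter from one that was simply built at the smaller size: after `fold()`, every set bit sits exactly where a fresh filter of that size would have placed it. A writer would fold only while the filter stays sparse enough to meet its target FPP at the smaller size.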
