thomasboelens26 opened a new issue, #525: URL: https://github.com/apache/arrow-go/issues/525
### Describe the bug, including details regarding any error messages, version, and platform.

I tried using the adaptive bloom filter. Writing a Parquet file with the correct writer properties seemed to work fine, but when reading that same file back, the bloom filter did not work: it could not find any of the values I know are in the specific column. I tried debugging and came to the following conclusions:

version = v18.4.0
platform = arm mac

I believe I came across 3 issues in parquet/metadata/adaptive_bloom_filter.go:

- When adding multiple hashes via the InsertBulk function, the code does not check for duplicates among the new hashes; it only checks whether each hash is already present in the largest candidate bloom filter. So if the new hashes contain multiple identical values that are not yet present in the filter, the numDistinct field is wrongly incremented once for each of them.
- The slices.MinFunc and slices.MaxFunc calls used to determine the largest and the optimal bloom filter have a sign error in their comparison functions (it should be a-b instead of b-a).
- The bloomFilterCandidate struct should hold a pointer to the blockSplitBloomFilter. Otherwise the data is released by the finalizer before it is actually written to the file. Perhaps runtime.SetFinalizer could also be replaced by runtime.AddCleanup.

### Component(s)

Parquet
