thomasboelens26 opened a new issue, #525: URL: https://github.com/apache/arrow-go/issues/525
### Describe the bug, including details regarding any error messages, version, and platform.

I tried using the adaptive bloom filter. Writing a Parquet file with the correct writer properties seemed to work fine, but when reading that same file back, the bloom filter did not work: it could not find any of the values I know are in the specific column. I tried debugging and came to the following conclusions:

version = v18.4.0
platform = arm mac

I believe I came across 3 issues in parquet/metadata/adaptive_bloom_filter.go:

- When adding multiple hashes via the InsertBulk function, the code does not check for duplicates among the new hashes; it only checks whether each hash is already present in the largest candidate bloom filter. So if the new hashes contain multiple identical values that are not yet present in the filter, the numDistinct field is wrongly incremented once for each of them.
- The slices.MinFunc and slices.MaxFunc calls used to determine the largest and the optimal bloom filter have a sign error in their comparison functions (it should be a-b instead of b-a).
- The bloomFilterCandidate struct should hold a pointer to the blockSplitBloomFilter. Otherwise the data is released by the finalizer before it is actually written to the file. Perhaps runtime.SetFinalizer could also be replaced by runtime.AddCleanup.

### Component(s)

Parquet
