cccs-jc commented on issue #10029:
URL: https://github.com/apache/iceberg/issues/10029#issuecomment-2016525638

   @huaxingao You are absolutely correct; the issue arises also when combining 
the `statsFilter` with the `dictFilter`. It's essentially the same underlying 
problem.
   
   The crux of the issue lies in the evaluation of conditions such as `col1=1 
OR col2=1`, where we're layering the complete evaluation performed by 
`statsFilter`, with the evaluation from `dictFilter`, with the evaluation of 
`bloomFilter`.
   
   Ideally, for every column, we would want to combine the information from 
stats, dict and bloom. This sounds like a "Chain of Responsibility."
   
   So instead of the current implementation:
   
   ```java
   statsFilter = new ParquetMetricsRowGroupFilter(expectedSchema, filter, 
caseSensitive);
   dictFilter = new ParquetDictionaryRowGroupFilter(expectedSchema, filter, 
caseSensitive);
   bloomFilter = new ParquetBloomRowGroupFilter(expectedSchema, filter, 
caseSensitive);
   
   boolean shouldRead =
             filter == null
                 || (statsFilter.shouldRead(typeWithIds, rowGroup)
                     && dictFilter.shouldRead(
                         typeWithIds, rowGroup, 
reader.getDictionaryReader(rowGroup))
                     && bloomFilter.shouldRead(
                         typeWithIds, rowGroup, 
reader.getBloomFilterDataReader(rowGroup)));
   ```
   
   I propose the following:
   
   ```java
   statsFilter = new ParquetMetricsRowGroupFilter(expectedSchema, filter, 
caseSensitive);
   dictFilter = new ParquetDictionaryRowGroupFilter(expectedSchema, filter, 
caseSensitive);
   bloomFilter = new ParquetBloomRowGroupFilter(expectedSchema, filter, 
caseSensitive);
   dictFilter.setDict(reader.getDictionaryReader(rowGroup));
   bloomFilter.setBloom(reader.getBloomFilterDataReader(rowGroup));
                                          
   // Chain statsFilter -> bloomFilter -> dictFilter
   bloomFilter.setNext(dictFilter);
   statsFilter.setNext(bloomFilter);
   
   boolean shouldRead =
             filter == null
                 || statsFilter.shouldRead(typeWithIds, rowGroup);
   ```
   
   In this setup, when the `statsFilter` evaluates an expression like `eq` and 
determines that `ROWS_CANNOT_MATCH`, it returns immediately. However, if it 
determines `ROWS_MIGHT_MATCH`, it passes on the request to the next filter 
(`bloomFilter`). The `bloomFilter` then tests against the bloom information. If 
it determines `ROWS_CANNOT_MATCH`, it returns immediately. But if it determines 
`ROWS_MIGHT_MATCH`, it then asks the next filter (`dictFilter`).
   
   This approach effectively combines all the information (metrics, bloom, 
dict) for a given `col1` and accurately determines whether `ROWS_CANNOT_MATCH` 
or `ROWS_MIGHT_MATCH` for `col1=1`.
   
   Basically the chain is like this:  `statsFilter.eq(col1=1) -> 
bloomFilter.eq(col1=1) -> dictFilter.eq(col1=1)`
   
   It does the same with `col2=1`.
   
   Finally, the results for every column (`col1`, `col2`) can be combined using 
the `public Boolean or(Boolean leftResult, Boolean rightResult)` method. The 
`or` method does not need to be delegated since it does not tests against 
metrics, dict or bloom.
   
   Does this seem reasonable?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to