sentomk opened a new issue, #685:
URL: https://github.com/apache/iceberg-cpp/issues/685

   **Summary**
   
   `StrictMetricsEvaluator::CanContainNulls` and `CanContainNaNs` incorrectly 
return `false` when the `null_value_counts` / `nan_value_counts` map is 
non-empty but does not contain an entry for the queried field. This causes the 
evaluator to erroneously return `kRowsMustMatch`, potentially skipping 
row-level filtering and returning rows that do not satisfy the predicate.
   
   **Root Cause**
   
   In `src/iceberg/expression/strict_metrics_evaluator.cc`:
   
   ```cpp
   bool CanContainNulls(int32_t id) {
     if (data_file_.null_value_counts.empty()) {
       return true;
     }
     auto it = data_file_.null_value_counts.find(id);
     // When the field is missing from a non-empty map, this evaluates to false.
     return it != data_file_.null_value_counts.cend() && it->second > 0;
   }
   ```
   
   The same pattern exists in `CanContainNaNs`, which checks `nan_value_counts`.
   
   **Reproduction**
   
   ```cpp
     auto data_file = std::make_shared<DataFile>();
     data_file->record_count = 50;
     data_file->value_counts = {{14, 50L}};
     data_file->null_value_counts = {{4, 0L}, {5, 0L}};  // field 14 missing
     data_file->nan_value_counts = {{8, 0L}};             // field 14 missing
     data_file->upper_bounds = {{14, Literal::Double(100.0).Serialize().value()}};
     data_file->lower_bounds = {{14, Literal::Double(1.0).Serialize().value()}};
   
     // Evaluating: no_nan_stats < 200.0
     // Expected: kRowsMightNotMatch (null count unknown)
     // Actual:   kRowsMustMatch (incorrectly skips filtering)
   ```
   
   **Proposed Fix**
   
   - `CanContainNulls`: if the field is required per schema, return `false`; if the field is not found in a non-empty map, return `true` (conservative).
   - `CanContainNaNs`: if the field type is not float/double, return `false`; if the field is not found in a non-empty map, return `true` (conservative).
   
   This aligns with Java's StrictMetricsEvaluator.canContainNulls() / 
canContainNaNs() which return true when the field is missing from the map.
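
A minimal sketch of the proposed logic, for illustration only. `FileStats` is a hypothetical stand-in for the relevant `DataFile` fields, and the `is_required` / `is_floating` flags are assumed to come from a schema lookup that the real evaluator would perform itself; none of these names are taken from the iceberg-cpp source.

```cpp
#include <cstdint>
#include <map>

// Hypothetical stand-in for the two DataFile maps involved (not the real type).
struct FileStats {
  std::map<int32_t, int64_t> null_value_counts;
  std::map<int32_t, int64_t> nan_value_counts;
};

// is_required: assumed result of a schema lookup for field `id`.
bool CanContainNulls(const FileStats& stats, int32_t id, bool is_required) {
  if (is_required) {
    return false;  // required fields cannot hold nulls
  }
  auto it = stats.null_value_counts.find(id);
  if (it == stats.null_value_counts.cend()) {
    return true;  // no entry (or empty map): count unknown, stay conservative
  }
  return it->second > 0;
}

// is_floating: assumed to be true only for float/double fields.
bool CanContainNaNs(const FileStats& stats, int32_t id, bool is_floating) {
  if (!is_floating) {
    return false;  // only floating-point fields can hold NaN
  }
  auto it = stats.nan_value_counts.find(id);
  if (it == stats.nan_value_counts.cend()) {
    return true;  // no entry (or empty map): count unknown, stay conservative
  }
  return it->second > 0;
}
```

With the maps from the reproduction above, field 14 is absent from both, so both functions would return `true` for an optional double field and the evaluator could no longer conclude `kRowsMustMatch` from missing statistics.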

