fallintoplace opened a new issue, #1091:
URL: https://github.com/apache/iceberg-go/issues/1091

   ### Apache Iceberg version
   
   main (development)
   
   ### Please describe the bug
   
   I noticed a case where `strictMetricsEval` may be too optimistic when null 
counts are missing for a nullable field.
   
   The strict evaluator loads metrics directly from the data file:
   
   ```go
   ev.valueCounts, ev.nullCounts = file.ValueCounts(), file.NullValueCounts()
   ev.nanCounts = file.NaNValueCounts()
   ev.lowerBounds, ev.upperBounds = file.LowerBoundValues(), 
file.UpperBoundValues()
   ```
   
   For range/equality predicates, it first checks whether the field may contain 
nulls:
   
   ```go
   if m.canContainNulls(fieldID) || m.canContainNans(fieldID) {
       return rowsMightNotMatch
   }
   ```
   
   But `canContainNulls` returns false when the field is missing from 
`NullValueCounts()`:
   
   ```go
   func (m *strictMetricsEval) canContainNulls(fieldID int) bool {
       cnt, exists := m.nullCounts[fieldID]
   
       return exists && cnt > 0
   }
   ```
   
   For nullable fields, I think a missing null-count entry should be treated as 
unknown rather than as zero nulls.
   
   Example:
   
   ```text
   predicate: id > 0
   
   file metrics:
     lower_bounds[id] = 1
     upper_bounds[id] = 2
     null_value_counts[id] missing
   
   possible rows:
     1
     2
     NULL
   ```
   
   The lower bound proves that all non-null values are greater than zero, but 
it does not prove that there are no nulls. In this case, strict evaluation 
should not conclude that every row matches.
   
   This matters because filtered overwrite/delete uses the strict evaluator to 
decide whether a file can be deleted entirely or needs a partial rewrite:
   
   ```go
   if strict {
       localDelete = append(localDelete, df)
   } else {
       localRewrite = append(localRewrite, df)
   }
   ```
   
   If a nullable file is missing null counts but has bounds, it can be 
classified as a whole-file delete even though null rows should survive the 
filter.
   
   I’d be happy to work on a regression test/fix if this sounds right.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to