boroknagyz commented on issue #6709:
URL: https://github.com/apache/iceberg/issues/6709#issuecomment-1413400733

   I don't think we can assume that numRows(table) equals numRows(data files) - numRows(position delete files), for two reasons:
   
   - Concurrent deletes might create delete files that reference the same rows:
   
https://github.com/apache/iceberg/blob/cecb10bb8ab0458fb3f6a650692a8e432f08cbd2/api/src/main/java/org/apache/iceberg/RowDelta.java#L131-L133
   
   - Partial compactions, e.g.:
   
   1. The table has data files A, B, X and delete file D
   2. D references rows in A and B
   3. We rewrite the small files, A and X, into AX'
   4. The table now has data files AX' and B, and delete file D
       (AX' doesn't contain any deleted rows)
   5. D still contains entries for A, which is no longer in the table, so numRows(table) is not equal to numRows(dataFiles) - numRows(deleteFiles)
   
   (This could be fixed by rewriting the delete file D into D' so that it only references rows in B, but I'm not sure whether Iceberg does that.)
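
Both cases can be sketched with a toy model. This is not Iceberg code: the dict/list representation, the row counts, and the function names `naive_count` and `actual_count` are all made up for illustration. A data file is modeled as a name with a row count, and a position delete file as a list of (data file, position) pairs.

```python
# Toy model, not the Iceberg API. All names and numbers are hypothetical.

def naive_count(data_files, delete_files):
    """numRows(data files) - numRows(position delete files)."""
    total_rows = sum(data_files.values())
    total_deletes = sum(len(d) for d in delete_files)
    return total_rows - total_deletes


def actual_count(data_files, delete_files):
    """Row count after applying deletes: duplicate entries count once,
    and entries pointing at files no longer in the table are ignored."""
    live_deletes = {
        (name, pos)
        for d in delete_files
        for (name, pos) in d
        if name in data_files
    }
    return sum(data_files.values()) - len(live_deletes)


# Case 1: two concurrent deletes both reference row 1 of A.
data = {"A": 10}
d1 = [("A", 0), ("A", 1)]
d2 = [("A", 1), ("A", 2)]
print(naive_count(data, [d1, d2]))   # 6
print(actual_count(data, [d1, d2]))  # 7

# Case 2: partial compaction. D references two rows in A and one in B.
before = {"A": 10, "B": 10, "X": 5}
d = [("A", 0), ("A", 1), ("B", 3)]
# A and X are rewritten into AX' with A's deletes applied: 10 - 2 + 5 = 13 rows.
after = {"AX'": 13, "B": 10}
print(naive_count(after, [d]))   # 20
print(actual_count(after, [d]))  # 22, same as before the compaction
```

In both cases the naive subtraction undercounts the table, because it subtracts delete entries that either duplicate each other or no longer apply to any live data file.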


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

