simhadri-g opened a new issue, #6709: URL: https://github.com/apache/iceberg/issues/6709
### Apache Iceberg version

1.1.0 (latest release)

### Query engine

Hive

### Please describe the bug 🐞

**Issue:**

- The Hive planner uses basic stats for query planning.
- During a table scan, it obtains the total ROW_COUNT from SnapshotSummary.TOTAL_RECORDS_PROP.
- https://github.com/apache/hive/blob/master/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java#L323
- However, we noticed that **after deleting records**, SnapshotSummary.TOTAL_RECORDS_PROP does not change. Only inserting new records updates this value.
- This has a significant impact on the Hive planner.
- If a table had 1 million rows and 900k rows were deleted, the Hive planner still sees a total record count of 1 million. This is problematic for join queries because Hive relies on the total row count to choose the join strategy (map join vs. merge join, with map join being significantly faster for smaller tables). The incorrect row count throws off the Hive planner and forces it to use the slower merge join.
- I also noticed that **Apache Spark** uses the same SnapshotSummary.TOTAL_RECORDS_PROP to obtain the total record count: `long totalRecords = PropertyUtil.propertyAsLong(snapshot.summary(), SnapshotSummary.TOTAL_RECORDS_PROP, Long.MAX_VALUE);`
- I may be incorrect, but I think Spark is also affected by the same issue. https://github.com/apache/iceberg/blob/a301a1f098b56babf5d2329b2d1f7bb9af9eb832/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L149

**My questions:**

- Should we take the difference between TOTAL_RECORDS_PROP and TOTAL_POS_DELETES_PROP, as below, to get the total row count (assuming there are no equality deletes)? `long totalRecords = Long.parseLong(summary.get(SnapshotSummary.TOTAL_RECORDS_PROP)) - Long.parseLong(summary.get(SnapshotSummary.TOTAL_POS_DELETES_PROP));`
- Is there a way to know the total number of rows removed by equality deletes, so that the correct row count can be obtained?
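
To make the first question concrete, here is a minimal sketch of the adjustment I have in mind. It is not taken from Hive, Spark, or Iceberg: the class `RowCountEstimate` and its methods are hypothetical, and it works on a plain `Map<String, String>` standing in for `snapshot.summary()`. The string keys are the values I believe the `SnapshotSummary` constants resolve to (`total-records`, `total-position-deletes`, `total-equality-deletes`). It assumes each position delete removes one distinct row, and it gives up (returns empty) whenever equality deletes are present, since their count does not say how many rows they remove. Treating a missing delete counter as zero is also an assumption.

```java
import java.util.Map;
import java.util.OptionalLong;

public class RowCountEstimate {
  // Assumed summary keys (the values behind the SnapshotSummary constants).
  static final String TOTAL_RECORDS = "total-records";
  static final String TOTAL_POS_DELETES = "total-position-deletes";
  static final String TOTAL_EQ_DELETES = "total-equality-deletes";

  // Parse a long from the summary, falling back when the key is absent or malformed.
  public static long parseOrDefault(Map<String, String> summary, String key, long fallback) {
    String value = summary.get(key);
    if (value == null) {
      return fallback;
    }
    try {
      return Long.parseLong(value);
    } catch (NumberFormatException e) {
      return fallback;
    }
  }

  // Estimate live rows as total-records minus total-position-deletes.
  // Returns empty when total-records is missing or equality deletes exist,
  // because in those cases the summary alone cannot give a row count.
  public static OptionalLong estimateLiveRows(Map<String, String> summary) {
    long totalRecords = parseOrDefault(summary, TOTAL_RECORDS, -1L);
    if (totalRecords < 0) {
      return OptionalLong.empty();
    }
    long eqDeletes = parseOrDefault(summary, TOTAL_EQ_DELETES, 0L);
    if (eqDeletes > 0) {
      return OptionalLong.empty();
    }
    long posDeletes = parseOrDefault(summary, TOTAL_POS_DELETES, 0L);
    // Clamp at zero in case the counters are inconsistent.
    return OptionalLong.of(Math.max(0L, totalRecords - posDeletes));
  }

  public static void main(String[] args) {
    // The scenario from this issue: 1 million rows, 900k removed by position deletes.
    Map<String, String> summary = Map.of(
        TOTAL_RECORDS, "1000000",
        TOTAL_POS_DELETES, "900000",
        TOTAL_EQ_DELETES, "0");
    System.out.println(estimateLiveRows(summary).getAsLong()); // prints 100000
  }
}
```

A caller could fall back to the unadjusted `total-records` value when the result is empty, which at least keeps today's behavior as an upper bound on the row count.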
I would be most grateful if someone could answer my questions. I apologize in advance if my understanding is lacking. Thanks!