simhadri-g opened a new issue, #6709:
URL: https://github.com/apache/iceberg/issues/6709

   ### Apache Iceberg version
   
   1.1.0 (latest release)
   
   ### Query engine
   
   Hive
   
   ### Please describe the bug 🐞
   
   **Issue:**
   
   - The Hive planner uses basic stats for query planning.
   - It obtains the total ROW_COUNT during a table scan from SnapshotSummary.TOTAL_RECORDS_PROP:
     https://github.com/apache/hive/blob/master/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java#L323
   - But we noticed that **after deleting records**, SnapshotSummary.TOTAL_RECORDS_PROP does not change. Only inserting new records updates this value.
   - This has a significant impact on the Hive planner.
   - If a table had 1 million rows and 900k rows were deleted, the Hive planner still sees a total of 1 million records. This is problematic during a join query because Hive depends on the total row count to determine the join strategy (map join vs. merge join, with map join being significantly faster for smaller tables). The incorrect row count throws off the Hive planner and forces it to use the slower merge join.
   
   
   
   - I also noticed that **Apache Spark** uses the same SnapshotSummary.TOTAL_RECORDS_PROP to obtain the total record count:
   `long totalRecords = PropertyUtil.propertyAsLong(snapshot.summary(), SnapshotSummary.TOTAL_RECORDS_PROP, Long.MAX_VALUE);`
   https://github.com/apache/iceberg/blob/a301a1f098b56babf5d2329b2d1f7bb9af9eb832/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L149
   - I may be incorrect, but I think Spark is also affected by the same issue.
 
   
   **My questions:**
   
   - Should we take the difference between TOTAL_RECORDS_PROP and TOTAL_POS_DELETES_PROP, as below, to get the total row count (assuming there are no equality deletes)?
   `long totalRecords = Long.parseLong(summary.get(SnapshotSummary.TOTAL_RECORDS_PROP)) - Long.parseLong(summary.get(SnapshotSummary.TOTAL_POS_DELETES_PROP));`
   
   - Is there a way to know the total number of rows deleted during equality 
deletes and obtain the correct row count?
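
To make the two questions above concrete, here is a minimal sketch of the adjustment I have in mind. It is not Iceberg or Hive code: `RowCountEstimate` and `estimateLiveRows` are hypothetical names, and the string keys are what I assume the `SnapshotSummary` constants resolve to. It also assumes every position delete removes exactly one distinct row, which may undercount live rows if the same position is deleted more than once.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical helper, for illustration only; not part of Iceberg or Hive.
final class RowCountEstimate {
  // Assumed values of SnapshotSummary.TOTAL_RECORDS_PROP,
  // TOTAL_POS_DELETES_PROP and TOTAL_EQ_DELETES_PROP.
  static final String TOTAL_RECORDS = "total-records";
  static final String TOTAL_POS_DELETES = "total-position-deletes";
  static final String TOTAL_EQ_DELETES = "total-equality-deletes";

  // Returns an estimated live row count, or empty when the summary
  // cannot support one.
  static OptionalLong estimateLiveRows(Map<String, String> summary) {
    String total = summary.get(TOTAL_RECORDS);
    if (total == null) {
      return OptionalLong.empty(); // no basic stats available
    }

    // As far as I understand, this counts equality delete records, not the
    // data rows they match, so it cannot simply be subtracted.
    long eqDeletes = Long.parseLong(summary.getOrDefault(TOTAL_EQ_DELETES, "0"));
    if (eqDeletes > 0) {
      return OptionalLong.empty();
    }

    long posDeletes = Long.parseLong(summary.getOrDefault(TOTAL_POS_DELETES, "0"));
    // Clamp at zero in case the delete count exceeds the record count.
    return OptionalLong.of(Math.max(0L, Long.parseLong(total) - posDeletes));
  }

  public static void main(String[] args) {
    Map<String, String> summary = new HashMap<>();
    summary.put(TOTAL_RECORDS, "1000000");
    summary.put(TOTAL_POS_DELETES, "900000");
    System.out.println(estimateLiveRows(summary));
  }
}
```

With the numbers from the example above (1 million records, 900k position deletes), the estimate comes out to 100k rows. A caller such as the planner would fall back to the unadjusted total, or to "unknown", when the result is empty.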
   
   
   I would be most grateful if someone could answer my questions. 
   
   I apologize in advance if my understanding is lacking.
   
   Thanks! 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

