rbalamohan commented on code in PR #6432:
URL: https://github.com/apache/iceberg/pull/6432#discussion_r1051680991


##########
core/src/main/java/org/apache/iceberg/deletes/Deletes.java:
##########
@@ -144,7 +146,18 @@ public static <T extends StructLike> PositionDeleteIndex toPositionIndex(
             deletes ->
                 CloseableIterable.transform(
                     locationFilter.filter(deletes), row -> (Long) POSITION_ACCESSOR.get(row)));
-    return toPositionIndex(CloseableIterable.concat(positions));
+    return toPositionIndex(positions);
+  }
+
+  public static PositionDeleteIndex toPositionIndex(List<CloseableIterable<Long>> positions) {

Review Comment:
   Thanks @rdblue. Yes, this happens when more than one positional delete 
file qualifies for the same data file. For example, assume a trickle-feed 
job ingests data into a partition. Due to late-arriving data, another job 
updates certain records in the partition and writes positional delete (POS) 
files. Update jobs with different criteria can qualify the same data file 
again and create additional POS files. Essentially, during scanning, one 
data file may have to be checked against multiple POS files (e.g., 4 of 
them), which causes slowness. ParallelIterable helps in this case.
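   
   For reference, a minimal sketch of how the new list-based overload could 
merge the per-file position iterables with ParallelIterable before building 
the index. The use of BitmapPositionDeleteIndex and ThreadPools.getWorkerPool() 
is an assumption modeled on the existing single-iterable toPositionIndex in 
Deletes.java, not something this diff confirms:
   
       import java.io.IOException;
       import java.io.UncheckedIOException;
       import java.util.List;
       import org.apache.iceberg.io.CloseableIterable;
       import org.apache.iceberg.util.ParallelIterable;
       import org.apache.iceberg.util.ThreadPools;
   
       public static PositionDeleteIndex toPositionIndex(List<CloseableIterable<Long>> positions) {
         // Read the per-delete-file position streams concurrently instead of
         // concatenating them serially; ParallelIterable fans the reads out to
         // a worker pool. (The pool choice here is an assumption for this sketch.)
         try (CloseableIterable<Long> deletes =
             new ParallelIterable<>(positions, ThreadPools.getWorkerPool())) {
           PositionDeleteIndex positionDeleteIndex = new BitmapPositionDeleteIndex();
           deletes.forEach(positionDeleteIndex::set);
           return positionDeleteIndex;
         } catch (IOException e) {
           throw new UncheckedIOException("Failed to close position delete source", e);
         }
       }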



