amogh-jahagirdar commented on code in PR #14480:
URL: https://github.com/apache/iceberg/pull/14480#discussion_r2495484927


##########
core/src/test/java/org/apache/iceberg/rest/RESTCatalogAdapter.java:
##########
@@ -429,6 +562,68 @@ public <T extends RESTResponse> T handleRequest(
     return null;
   }
 
+  /**
+   * Do all the planning upfront but batch the file scan tasks across plan tasks. Plan Tasks have a
+   * key like <plan ID - table UUID - plan task sequence> The current implementation simply uses
+   * plan tasks as a pagination mechanism to control response sizes.
+   *
+   * @param tableScan
+   * @param planId
+   */
+  private void planFilesFor(TableScan tableScan, String planId) {
+    Iterable<List<FileScanTask>> taskGroupings =
+        Iterables.partition(
+            tableScan.planFiles(), planningBehavior.numberFileScanTasksPerPlanTask());

Review Comment:
   `planFiles` already produces one data file per file scan task. You can see 
this by tracing through these two places in `ManifestGroup`:
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestGroup.java#L172
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/ManifestGroup.java#L368
   Let me know if I misinterpreted your question.
   
   In the client-side planning that happens today, we first call `planFiles`, 
then do the split you're referring to (using either the explicit offsets in 
metadata, if applicable, or the configured split size), and then combine those 
splits into task groups.
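
   As a rough illustration of those three steps, here is a minimal sketch in 
plain Java. It is a simplified model, not Iceberg's implementation: the 
`ClientSidePlanningSketch` class, the `Split` type, and the `splitFiles` and 
`combine` helpers are hypothetical names, splitting uses only a fixed split 
size (metadata offsets are ignored), and grouping is a simple greedy pass 
rather than the bin packing a real planner would use.

```java
import java.util.ArrayList;
import java.util.List;

public class ClientSidePlanningSketch {

  /** Hypothetical stand-in for a file scan task: a whole data file or a byte range of it. */
  public static final class Split {
    public final String path;
    public final long start;
    public final long length;

    public Split(String path, long start, long length) {
      this.path = path;
      this.start = start;
      this.length = length;
    }
  }

  /** Cut each whole-file task into byte ranges of at most splitSize bytes. */
  public static List<Split> splitFiles(List<Split> fileTasks, long splitSize) {
    List<Split> splits = new ArrayList<>();
    for (Split task : fileTasks) {
      for (long offset = 0; offset < task.length; offset += splitSize) {
        long length = Math.min(splitSize, task.length - offset);
        splits.add(new Split(task.path, task.start + offset, length));
      }
    }
    return splits;
  }

  /** Greedily pack consecutive splits into groups of at most targetSize bytes. */
  public static List<List<Split>> combine(List<Split> splits, long targetSize) {
    List<List<Split>> groups = new ArrayList<>();
    List<Split> current = new ArrayList<>();
    long currentBytes = 0;
    for (Split split : splits) {
      // Close the current group when adding this split would exceed the target.
      if (!current.isEmpty() && currentBytes + split.length > targetSize) {
        groups.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(split);
      currentBytes += split.length;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }

  public static void main(String[] args) {
    // Step 1 (assumed already done): one task per data file, as planFiles produces.
    List<Split> fileTasks = List.of(
        new Split("a.parquet", 0, 250),
        new Split("b.parquet", 0, 100));

    // Step 2: split by a configured split size. Step 3: combine into task groups.
    List<Split> splits = splitFiles(fileTasks, 100);
    List<List<Split>> groups = combine(splits, 200);

    System.out.println(splits.size() + " splits in " + groups.size() + " groups");
  }
}
```

   The point of the sketch is only the ordering: whole-file tasks come first, 
and byte-range splitting and grouping are layered on top afterwards.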
   
   >we need to handle it in rest spec too the split scan task 
   
   I'm not quite sure that we _need_ to, but I do think it could be a useful 
extension to the protocol to let the server send back byte ranges to read 
within a file. I could imagine a server-side footer/metadata cache of some sort 
that lets the server more quickly prune row group ranges that can be skipped.
   
   For this implementation, though, since it's mostly just for testing, I'd 
probably keep it really simple.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

