Re: [PR] Add PrePlanTable and PlanTable Endpoints to open api spec [iceberg]

via GitHub Wed, 24 Jul 2024 10:38:34 -0700


amogh-jahagirdar commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1690083456



##########
open-api/rest-catalog-open-api.yaml:
##########
@@ -3647,6 +3818,176 @@ components:
             type: integer
           description: "List of equality field IDs"
 
+    PreplanTableRequest:

Review Comment:
   I spent some time thinking about this, and ultimately I'm not opposed to 
supporting preplan for incremental scans since the "worst case" planning 
scenarios for incremental can still require a preplan:
   
   My thought process:
   
   The semantic of:
   `planTable(startSnapshot, endSnapshot)`
   A server should return a list of file scan tasks which when the client 
performs reads on those scan tasks, it is the the set of changes after 
startSnapshot up until end snapshot.
   So in the example above, planTable(1,3) should return file scan tasks which 
when the client reads represents the changes between 1, 3. In the simplistic 
case, it can just be 2 file scan tasks, 1 for data file B and another for data 
file C.
   
   Now the main question: Are there benefits preplan offers for incremental 
scans that justify the complexity of spec'ing it out?
   
   I'd say in the average case, incremental scans will yield a small set of 
file scan tasks since either a client is either repeatedly hitting the API with 
a small range of snapshots or even if it's a wide range of snapshots, the 
amount of data is probably small enough ([going back to Ryan's comment that 
"because small dimension-like tables are the most 
common."](https://github.com/apache/iceberg/pull/9695/files#r1687228648)), thus 
the metadata for planning the task itself is probably quite small, that it can 
still be handled in plan.
   
   Now in the worst case, is it possible that a client sends an incremental 
scan request with a huge snapshot range (for the append-only case, it's 
equivalent to just planning the latest snapshot where maybe there wasn't any 
compaction of manifests/expiration) OR is it possible that a incremental scan 
is being performed on a table where there's such a large addition of metadata 
between snapshots that it actually does need to be preplanned, otherwise a 
client times out or a server exhausts resources attempting to do the plan? Yeah 
it's possible and in that case a server could determine that an incremental 
scan is actually quite cumbersome (based on added manifests in a given snapshot 
etc), and redirects the client to do a preplan.
   
   These worst case scenarios are technically possible (though arguably 
unlikely because users will need to run maintenance anyways) but I can see the 
argument that it's better to add this now, so the spec is more flexible for 
such cases rather than have to try and add this to the spec/update the endpoint 
version down the line. I'm not super opposed to supporting incremental scan + 
preplan.
   
   But let me know if these "worst case" scenarios I'm talking about can be 
handled in simpler ways, since we really do want to avoid complicating the spec.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Add PrePlanTable and PlanTable Endpoints to open api spec [iceberg]

Reply via email to