amogh-jahagirdar commented on code in PR #9695: URL: https://github.com/apache/iceberg/pull/9695#discussion_r1690083456
########## open-api/rest-catalog-open-api.yaml: ########## @@ -3647,6 +3818,176 @@ components: type: integer description: "List of equality field IDs" + PreplanTableRequest: Review Comment: I spent some time thinking about this, and ultimately I'm not opposed to supporting preplan for incremental scans since the "worst case" planning scenarios for incremental can still require a preplan: My thought process: The semantic of: `planTable(startSnapshot, endSnapshot)` A server should return a list of file scan tasks which when the client performs reads on those scan tasks, it is the the set of changes after startSnapshot up until end snapshot. So in the example above, planTable(1,3) should return file scan tasks which when the client reads represents the changes between 1, 3. In the simplistic case, it can just be 2 file scan tasks, 1 for data file B and another for data file C. Now the main question: Are there benefits preplan offers for incremental scans that justify the complexity of spec'ing it out? I'd say in the average case, incremental scans will yield a small set of file scan tasks since either a client is either repeatedly hitting the API with a small range of snapshots or even if it's a wide range of snapshots, the amount of data is probably small enough ([going back to Ryan's comment that "because small dimension-like tables are the most common."](https://github.com/apache/iceberg/pull/9695/files#r1687228648)), thus the metadata for planning the task itself is probably quite small, that it can still be handled in plan. Now in the worst case, is it possible that a client sends an incremental scan request with a huge snapshot range (for the append-only case, it's equivalent to just planning the latest snapshot where maybe there wasn't any compaction of manifests/expiration) OR is it possible that a incremental scan is being performed on a table where there's such a large addition of metadata between snapshots that it actually does need to be preplanned, otherwise a client times out or a server exhausts resources attempting to do the plan? Yeah it's possible and in that case a server could determine that an incremental scan is actually quite cumbersome (based on added manifests in a given snapshot etc), and redirects the client to do a preplan. These worst case scenarios are technically possible (though arguably unlikely because users will need to run maintenance anyways) but I can see the argument that it's better to add this now, so the spec is more flexible for such cases rather than have to try and add this to the spec/update the endpoint version down the line. I'm not super opposed to supporting incremental scan + preplan. But let me know if these "worst case" scenarios I'm talking about can be handled in simpler ways, since we really do want to avoid complicating the spec. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org