rdblue commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1687228648


##########
open-api/rest-catalog-open-api.yaml:
##########
@@ -537,6 +537,113 @@ paths:
         5XX:
           $ref: '#/components/responses/ServerErrorResponse'
 
+  /v1/{prefix}/namespaces/{namespace}/tables/{table}/preplan:

Review Comment:
   I've also been thinking about how to handle this, and I think what is more 
important is how we define the default behavior for plan/preplan.
   
   The `plan` endpoint is always required because that's the only way to get 
the list of `FileScanTask` to process. The `preplan` endpoint can be very 
simple -- even as simple as always returning the exact same opaque plan task 
for all requests. In other words, it would make no sense for `preplan` to 
return 501 because it is just as easy to fake a plan task and have the `plan` 
endpoint do all the real work.
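   
   To illustrate how little a service needs to do, here is a minimal sketch of 
a "fake" `preplan`. The function names, arguments, and the `FIXED_PLAN_TASK` 
token are hypothetical and are not taken from the spec; `scan_table` stands in 
for whatever the service uses to produce file scan tasks.

```python
from typing import Callable, List, Optional

# Hypothetical opaque token; clients must treat it as a black box.
FIXED_PLAN_TASK = "whole-scan"


def preplan(table: str, scan_filter: str) -> List[str]:
    """Trivial preplan: ignore the request and return the same plan task."""
    return [FIXED_PLAN_TASK]


def plan(
    table: str,
    scan_filter: str,
    plan_task: Optional[str],
    scan_table: Callable[[str, str], List[str]],
) -> List[str]:
    """Do all the real work here, with or without a plan task."""
    if plan_task not in (None, FIXED_PLAN_TASK):
        raise ValueError(f"unknown plan task: {plan_task}")
    # The fixed token carries no information, so the full scan runs either way.
    return scan_table(table, scan_filter)
```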
   
   My conclusion is that it isn't worth a separate capability for preplanning 
and everything should support it.
   
   That's why I think the more important choice is whether `plan` or `preplan` 
is called first. For default behavior, there are two scenarios:
   1. If the client calls `plan` without calling `preplan`
   2. If the client calls `preplan` and then `plan`
   
   In scenario 1, the client calls `plan` and the service can decide whether to 
redirect it to `preplan` using 521. If the service doesn't use `preplan` or the 
query is small then it returns file scan tasks to the client in a single 
request (fastest). If the service does want to use `preplan` for a large scan 
then it can quickly return 521 to redirect the client to `preplan`. That 
`preplan` fallback case takes at least 3 requests (521 from `plan`, `preplan`, 
and `plan` with a plan task) but it is likely that there are more requests 
because of the size of the scan.
   
   In scenario 2, the client always calls `preplan` and is sent plan tasks, 
then calls `plan` one or more times. There are always at least 2 requests but 
there are probably more for large scans.
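   
   The two client flows can be sketched roughly as follows, counting requests 
along the way. This is only an illustration: the callables and the 
`(status, file_scan_tasks)` response shape are assumptions made for the sketch, 
and 521 is the redirect status discussed in this thread.

```python
from typing import Callable, List, Optional, Tuple

REDIRECT_TO_PREPLAN = 521  # redirect status discussed in this thread

# Hypothetical shapes: plan(plan_task) -> (status, file_scan_tasks)
PlanFn = Callable[[Optional[str]], Tuple[int, List[str]]]
PreplanFn = Callable[[], List[str]]


def _plan_each(plan: PlanFn, plan_tasks: List[str]) -> Tuple[List[str], int]:
    """Call plan once per plan task; return the files and the request count."""
    files: List[str] = []
    for task in plan_tasks:
        _, part = plan(task)
        files.extend(part)
    return files, len(plan_tasks)


def plan_first(plan: PlanFn, preplan: PreplanFn) -> Tuple[List[str], int]:
    """Scenario 1: call plan directly and fall back to preplan on 521."""
    status, files = plan(None)
    if status != REDIRECT_TO_PREPLAN:
        return files, 1  # single request, the fastest path
    plan_tasks = preplan()
    files, plan_calls = _plan_each(plan, plan_tasks)
    return files, 2 + plan_calls  # 521 from plan + preplan + plan calls


def preplan_first(plan: PlanFn, preplan: PreplanFn) -> Tuple[List[str], int]:
    """Scenario 2: always call preplan, then plan once per plan task."""
    plan_tasks = preplan()
    files, plan_calls = _plan_each(plan, plan_tasks)
    return files, 1 + plan_calls
```

   With a service that never redirects and returns one plan task, `plan_first` 
takes 1 request and `preplan_first` takes 2. With a service that redirects and 
returns two plan tasks, `plan_first` takes 4 requests and `preplan_first` 
takes 3.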
   
   I would choose scenario 1 for the default because in the average case I 
think it will perform better. I think most of the time the service will return 
the file scan tasks in a single request because small dimension-like tables are 
the most common. And when there is 1 extra request to plan a large scan, 
performance will be dominated by metadata scanning time anyway.
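   
   As a rough back-of-the-envelope check, the expected request counts can be 
compared like this. The model is an assumption for illustration only: a 
fraction of scans is redirected with 521, each redirected scan needs the same 
number of `plan` calls, and every other scan is answered by a single `plan` 
call.

```python
from typing import Tuple


def expected_requests(
    redirect_fraction: float, plan_calls_large: int = 1
) -> Tuple[float, float]:
    """Return expected requests per scan as (plan_first, preplan_first)."""
    q = redirect_fraction
    # plan first: 1 request for small scans; 521 + preplan + plan calls otherwise
    plan_first = (1 - q) * 1 + q * (2 + plan_calls_large)
    # preplan first: preplan + 1 plan call for small scans; preplan + plan calls otherwise
    preplan_first = (1 - q) * 2 + q * (1 + plan_calls_large)
    return plan_first, preplan_first
```

   Under this model the difference is `2 * redirect_fraction - 1` regardless of 
how many `plan` calls a large scan needs, so calling `plan` first uses fewer 
requests on average whenever fewer than half of the scans are redirected.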
   
   If we do go with a default to call the `plan` endpoint first, then that is 
more of a reason not to add a separate capability for `preplan` because the 
`plan` endpoint can choose to never redirect the client.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

