rdblue commented on code in PR #9695: URL: https://github.com/apache/iceberg/pull/9695#discussion_r1687228648
##########
open-api/rest-catalog-open-api.yaml:
##########
@@ -537,6 +537,113 @@ paths:
         5XX:
           $ref: '#/components/responses/ServerErrorResponse'
+  /v1/{prefix}/namespaces/{namespace}/tables/{table}/preplan:

Review Comment:
I've also been thinking about how to handle this, and I think what is more important is how we define the default behavior for plan/preplan.

The `plan` endpoint is always required because that's the only way to get the list of `FileScanTask` to process. The `preplan` endpoint can be very simple -- even as simple as always returning the exact same opaque plan task for all requests. In other words, it would make no sense for `preplan` to return 501 because it is just as easy to fake a plan task and have the `plan` endpoint do all the real work. My conclusion is that it isn't worth a separate capability for preplanning and everything should support it.

That's why I think the more important choice is whether `plan` or `preplan` is called first. For default behavior, there are two scenarios:
1. If the client calls `plan` without calling `preplan`
2. If the client calls `preplan` and then `plan`

In scenario 1, the client calls `plan` and the service can decide whether to redirect it to `preplan` using 521. If the service doesn't use `preplan` or the query is small, then it returns file scan tasks to the client in a single request (fastest). If the service does want to use `preplan` for a large scan, then it can quickly return 521 to redirect the client to `preplan`. That `preplan` fallback case takes at least 3 requests (521 from `plan`, `preplan`, and `plan` with a plan task), but it is likely that there are more requests because of the size of the scan.

In scenario 2, the client always calls `preplan` and is sent plan tasks, then calls `plan` one or more times. There are always at least 2 requests, but there are probably more for large scans.

I would choose scenario 1 for the default because in the average case I think it will perform better.
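To make the request counts concrete, here is a rough sketch of both client flows against a toy in-memory service. Everything in it is an assumption for illustration: `FakeService`, `redirect_threshold`, `chunk_size`, and the encoding of plan tasks as offset strings are hypothetical and not part of the proposed spec. Only the 521 redirect and the `plan`/`preplan` names come from the discussion above.

```python
from dataclasses import dataclass

REDIRECT_TO_PREPLAN = 521  # status code as described in the comment above


@dataclass
class FakeService:
    """Hypothetical in-memory stand-in for a catalog service (not the real API)."""
    files: list
    redirect_threshold: int = 3  # assumed cutoff for what counts as a "large" scan
    chunk_size: int = 2          # assumed number of files covered by one plan task
    requests: int = 0            # counts every call to plan or preplan

    def plan(self, plan_task=None):
        self.requests += 1
        if plan_task is None:
            # No plan task: answer directly for small scans, redirect large ones.
            if len(self.files) > self.redirect_threshold:
                return REDIRECT_TO_PREPLAN, []
            return 200, list(self.files)
        # With a plan task: return the slice of files that the task covers.
        start = int(plan_task)
        return 200, self.files[start:start + self.chunk_size]

    def preplan(self):
        self.requests += 1
        # Opaque plan tasks; here each one is just a starting offset as a string.
        return 200, [str(i) for i in range(0, len(self.files), self.chunk_size)]


def fetch_all(service, plan_tasks):
    """Call plan once per plan task and collect the file scan tasks."""
    file_scan_tasks = []
    for plan_task in plan_tasks:
        _, chunk = service.plan(plan_task)
        file_scan_tasks.extend(chunk)
    return file_scan_tasks


def plan_first(service):
    """Scenario 1: call plan first and fall back to preplan only on a 521."""
    status, tasks = service.plan()
    if status != REDIRECT_TO_PREPLAN:
        return tasks
    _, plan_tasks = service.preplan()
    return fetch_all(service, plan_tasks)


def preplan_first(service):
    """Scenario 2: always call preplan, then plan for each plan task."""
    _, plan_tasks = service.preplan()
    return fetch_all(service, plan_tasks)
```

In this toy model, a one-file table costs 1 request with `plan_first` and 2 with `preplan_first`, while a five-file table costs 5 requests with `plan_first` (the 521, `preplan`, and three `plan` calls) and 4 with `preplan_first`.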
I think most of the time the service will return the file scan tasks in a single request because small dimension-like tables are the most common. And when there is 1 extra request to plan a large scan, performance will be dominated by metadata scanning time anyway.

If we do go with a default to call the `plan` endpoint first, then that is more of a reason not to add a separate capability for `preplan` because the `plan` endpoint can choose to never redirect the client.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org