rdblue commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1494914340


##########
open-api/rest-catalog-open-api.yaml:
##########
@@ -2800,6 +3053,59 @@ components:
           additionalProperties:
             type: string
 
+    PlanContext:
+      type: object
+      required:
+        - select
+        - options
+      properties:
+        select:
+          description: A list of the selected columns
+          type: array
+          items:
+            type: string
+        project:

Review Comment:
   It looks like this simply copies the Java API. That's not what we want to do 
when designing a REST endpoint because it can end up overly complicated. In the 
Java API we have some options for the caller's convenience, but those don't 
make sense if translated to an API contract like this.
   
   The difference between `select` and `project` is a good example. In Java, 
both are exposed for different use cases. If the caller has a list of columns 
from SQL, we can accept those via `select`. And if the caller has a schema 
object we can accept that through `project`. I'm not sure that we want to do 
either one in the REST API, but I'm confident that we do not want to expose 
multiple. Remember that making this more complicated makes it harder to 
implement correctly!
   
   These endpoints should expose a minimal set of options. I think that should 
be:
   * `select`: a list of string field names to project, using full names joined 
by `.` that can be passed to `Schema.findField(String)`
   * `filter`: an _unbound_ filter expression
   * `case-sensitive`: a boolean indicating whether column selection and 
filtering should be case sensitive
   * `snapshot-id`: an int64 snapshot ID (if `snapshot-range` is not present); 
optional and defaults to the table's current snapshot
   * `snapshot-range`: a JSON list containing exactly 2 int64 snapshot IDs (if 
`snapshot-id` is not present) representing the start (exclusive) and end 
(inclusive) snapshots
   * `stats-fields`: a list of string field names for which stats should be 
included, using full names joined by `.` that can be passed to 
`Schema.findField(String)`; optional and defaults to none
   
   I've made a couple of important choices to come up with this list:
   1. This assumes that the caller has access to the table metadata in order to 
read branches or time travel. The caller is responsible for choosing a snapshot 
based on a requested tag/branch name or as-of timestamp.
   2. This uses unbound filters and field names to make it easier to craft 
requests. The alternative is to use field IDs throughout, but expression 
binding still needs to happen in the REST service (we can't trust expressions 
are bound correctly) and would be harder to read and create requests. Using IDs 
would avoid exposing metadata (field names) in requests, but I think the 
trade-off is worth it.
   3. The `snapshot-range` is opinionated about start and end boundary 
conditions, but these align with normal expectations. Another option is to use 
`start-snapshot-id` and `end-snapshot-id`, which would allow us to omit the end 
and use the current table state. (@jackye1995, what do you think?)
   4. Options have been omitted. Those are used for split planning parameters, 
which are not part of this API. Breaking the files into splits is done on the 
client side.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to