Re: [PR] Add PrePlanTable and PlanTable Endpoints to open api spec [iceberg]

via GitHub Mon, 22 Jul 2024 14:57:06 -0700


rdblue commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1687182174



##########
open-api/rest-catalog-open-api.yaml:
##########
@@ -3642,6 +3781,173 @@ components:
             type: integer
           description: "List of equality field IDs"
 
+    PreplanTableRequest:
+      type: object
+      required:
+        - table-scan-context
+      properties:
+        table-scan-context:
+          $ref: '#/components/schemas/TableScanContext'
+
+    PlanTableRequest:
+      type: object
+      required:
+        - table-scan-context
+      properties:
+        table-scan-context:
+          $ref: '#/components/schemas/TableScanContext'
+        plan-task:
+          $ref: '#/components/schemas/PlanTask'
+        stats-fields:
+          description:
+            A list of fields that the client requests the server to send statistics
+            in each `FileScanTask` returned in the response
+          type: array
+          items:
+            $ref: '#/components/schemas/FieldName'
+
+    TableScanContext:
+      anyOf:
+        - $ref: '#/components/schemas/SnapshotScanContext'
+        - $ref: '#/components/schemas/IncrementalSnapshotScanContext'
+
+    BaseTableScanContext:
+      discriminator:
+        propertyName: table-scan-type
+        mapping:
+          snapshot-scan: '#/components/schemas/SnapshotScanContext'
+          incremental-snapshot-scan: '#/components/schemas/IncrementalSnapshotScanContext'
+      type: object
+      required:
+        - table-scan-type
+      properties:
+        table-scan-type:
+          type: string
+
+    SnapshotScanContext:
+      description: context for scanning data in a specific snapshot
+      type: object
+      allOf:
+        - $ref: '#/components/schemas/BaseTableScanContext'
+      required:
+        - table-scan-type
+      properties:
+        table-scan-type:
+          type: string
+          enum: ["snapshot-scan"]
+        select:
+          $ref: '#/components/schemas/SelectedFieldNames'
+        filter:
+          $ref: '#/components/schemas/Filter'
+        case-sensitive:
+          description: If field selection and filtering should be case sensitive
+          type: boolean
+          default: true
+        snapshot-id:
+          description:
+            The ID of the snapshot to use for the table scan.
+            If not specified, the snapshot at the main branch head will be used.
+          type: integer
+          format: int64
+        use-snapshot-schema:
+          description:
+            If the schema of the specific snapshot should be used instead of the table schema.
+          type: boolean
+          default: false
+
+    IncrementalSnapshotScanContext:
+      description:
+        Context for scanning data appended in a range of snapshots.
+        The scan always follows the schema of the snapshot at the main branch head.
+      type: object
+      allOf:
+        - $ref: '#/components/schemas/BaseTableScanContext'
+      required:
+        - table-scan-type
+        - start-snapshot-id
+      properties:
+        table-scan-type:
+          type: string
+          enum: ["incremental-snapshot-scan"]
+        select:
+          $ref: '#/components/schemas/SelectedFieldNames'
+        filter:
+          $ref: '#/components/schemas/Filter'
+        case-sensitive:
+          description: If field selection and filtering should be case sensitive
+          type: boolean
+          default: true
+        start-snapshot-id:
+          description: The ID of the starting snapshot of the incremental scan
+          type: integer
+          format: int64
+        inclusive-start:
+          description: If the data appended in the start snapshot should be included in the scan
+          type: boolean
+          default: false
+        end-snapshot-id:
+          description:
+            The ID of the inclusive ending snapshot of the incremental scan.
+            If not specified, the snapshot at the main branch head will be used as the end snapshot.
+          type: integer
+          format: int64
+
+    FieldName:

Review Comment:
   @syun64, @jackye1995, @amogh-jahagirdar, flattening with `.` is not a 
problem for Iceberg tables because ambiguous names like `a.b` in the example 
type that Sung posted are not allowed.
   
   This is how `Schema` has worked from very early on ([tests added in 2019](https://github.com/apache/iceberg/commit/8b41ee5e34b5d79e7a04b3c810d7073cfed01a3a)) and why `findField` has always used the name-to-ID index.
   
   The rationale is that there are two ways to handle conflicts like this.
   First, you could use escaping and accept the extra complexity when parsing
   structures. Second, you could disallow ambiguous names so that the flattened
   string always maps to exactly one field. Iceberg chose the second option for
   column names, so there is no need for escaping.
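
To make the second option concrete, here is a minimal sketch in Python (not Iceberg's Java code). The `build_name_index` helper and the `(name, field_id, children)` tuple shape are hypothetical, chosen only for illustration. It flattens nested names with `.` and rejects any schema where two fields flatten to the same string.

```python
def build_name_index(fields, prefix="", index=None):
    """Map flattened field names to IDs, rejecting ambiguous names.

    `fields` is a list of (name, field_id, children) tuples. This shape is
    an assumption for illustration, not Iceberg's actual schema model.
    """
    if index is None:
        index = {}
    for name, field_id, children in fields:
        # Join the parent path and the field name with "." (no escaping).
        full_name = f"{prefix}.{name}" if prefix else name
        if full_name in index:
            # Two different fields flatten to the same string: disallow.
            raise ValueError(f"Ambiguous field name: {full_name}")
        index[full_name] = field_id
        build_name_index(children, full_name, index)
    return index


# Struct `a` with child `b`, next to a top-level column literally named `a.b`.
ambiguous = [("a", 1, [("b", 2, [])]), ("a.b", 3, [])]

# Same struct, but the second top-level column has a non-conflicting name.
unambiguous = [("a", 1, [("b", 2, [])]), ("c", 3, [])]
```

Building the index for `unambiguous` succeeds, while `ambiguous` raises a `ValueError`, so a flattened string such as `a.b` can never refer to two different fields.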
   
   This has really helped simplify a lot of APIs! For instance, 
`Schema.select(String...)` can select multiple columns easily rather than 
needing to have a complicated signature like `Schema.select(List<String>...)`.
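
As a rough illustration of why flat names keep such signatures simple, the sketch below uses a hypothetical Python `select` helper over a plain name-to-ID dictionary. It is not the Iceberg API, and the field names and IDs are made up.

```python
def select(name_index, *names):
    """Return field IDs for the given flattened names (hypothetical helper)."""
    missing = [n for n in names if n not in name_index]
    if missing:
        raise KeyError(f"Unknown fields: {missing}")
    return [name_index[n] for n in names]


# Made-up index: nested fields are addressed by their flattened name.
name_index = {"id": 1, "location": 2, "location.lat": 3, "location.long": 4}

# Each column is one string, so varargs of plain strings is enough.
selected = select(name_index, "id", "location.lat")
```

Each column, nested or not, is a single string, so the caller never has to pass a list of path segments per column.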
   
   A couple more things to clarify:
   
   > Flattening the name actually contrasts our approach documented in Column 
Projection section of the docs, where we note that a name may contain '.' but 
that this refers to a literal name, and not a nested field
   
   This does not actually conflict. That section documents what the field name
   in a column mapping means: the name string identifies an immediate child,
   not a more deeply nested one. This is because we do not index sub-sections
   of these tree structures. Iceberg only indexes fields by name in `Schema`
   and not in each `StructType`, for example.
   
   > I think the reason Namespace has a principle on flat representation is 
because there's a GET endpoint that requires us to use a flat representation.
   
   Actually, I originally advocated for the same approach for namespaces in 
URLs. I think we should have used `.` as the delimiter because it is easier to 
work with. The trade-off is that REST catalogs would not be able to support 
ambiguous names, like `["a", "b"]` in the same catalog as `["a.b"]`.
   
   That's not what the community decided to go forward with, so we have escape 
characters when sending namespaces in URLs. But also note that namespaces in 
the REST spec use a JSON array of strings whenever that is possible.
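
The two namespace representations can be sketched in Python as follows. The helper names are hypothetical, and the choice of the unit separator (`0x1F`) as the delimiter between namespace levels in URLs is an assumption made for this example; check the REST spec for the exact rule.

```python
import json
from urllib.parse import quote

# Assumption for this sketch: namespace levels are joined with the unit
# separator (0x1F) before being percent-encoded into a URL path segment.
SEPARATOR = "\x1f"


def namespace_for_url(levels):
    """Encode a multi-level namespace as a single escaped URL path segment."""
    return quote(SEPARATOR.join(levels), safe="")


def namespace_for_body(levels):
    """Encode a namespace as a JSON array of strings for a request body."""
    return json.dumps(levels)
```

With this encoding, `["a", "b"]` and `["a.b"]` stay distinct in both forms, which is the property that a plain `.` delimiter would give up.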



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


