amogh-jahagirdar commented on code in PR #11180:
URL: https://github.com/apache/iceberg/pull/11180#discussion_r1794452438


##########
core/src/main/java/org/apache/iceberg/GenericDataFile.java:
##########
@@ -26,7 +26,7 @@
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
 import org.apache.iceberg.types.Types;
 
-class GenericDataFile extends BaseFile<DataFile> implements DataFile {
+public class GenericDataFile extends BaseFile<DataFile> implements DataFile {

Review Comment:
   This and GenericDeleteFile shouldn't be made public. Based on discussion with @rahil-c, this is just an artifact of having the parser implementation in the REST package, which can't access this class. I think the real solution is to uplevel the new parser to core and reuse as much of the existing content file parsers as possible.



##########
core/src/main/java/org/apache/iceberg/PartitionData.java:
##########
@@ -72,6 +72,17 @@ public PartitionData(Types.StructType partitionType) {
     this.stringSchema = schema.toString();
   }
 
+  public PartitionData() {
+    // This constructor covers the niche case of REST serialization/deserialization,
+    // where we do not have the partition spec and cannot populate these values.
+    // These values will be refreshed before returning to the client engine.
+    this.partitionType = null;
+    this.size = 0;
+    this.data = new Object[size];
+    this.schema = null;
+    this.stringSchema = null;
+  }

Review Comment:
   Discussed with @rahil-c: technically we could avoid this constructor by passing a null value to the GenericDataFile and rebinding it later, but the bigger question is whether we want the server to just send back a serialized spec.
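
   For context, a rough sketch of the rebinding alternative (purely hypothetical client-side step, assuming the spec can be resolved from table metadata via the file's spec-id):

```java
// Hypothetical sketch: rebind the partition values once the spec is known,
// e.g. after the client resolves it from table metadata by spec-id.
// Assumes 'table' is the loaded Table and 'file' the deserialized content file.
PartitionSpec spec = table.specs().get(file.specId());
PartitionData bound = new PartitionData(spec.partitionType());
for (int pos = 0; pos < bound.size(); pos++) {
  // copy the raw values that were deserialized without a spec
  bound.set(pos, file.partition().get(pos, Object.class));
}
```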



##########
core/src/main/java/org/apache/iceberg/rest/requests/PlanTableScanRequest.java:
##########
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.rest.requests;
+
+import java.util.List;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.rest.RESTRequest;
+
+public class PlanTableScanRequest implements RESTRequest {
+  private Long snapshotId;
+  private List<String> select;
+  private Expression filter;
+  private Boolean caseSensitive;
+  private Boolean useSnapshotSchema;
+  private Long startSnapshotId;
+  private Long endSnapshotId;
+  private List<String> statsFields;
+
+  public Long snapshotId() {
+    return snapshotId;
+  }
+
+  public List<String> select() {
+    return select;
+  }
+
+  public Expression filter() {
+    return filter;
+  }
+
+  public Boolean caseSensitive() {
+    return caseSensitive;
+  }
+
+  public Boolean useSnapshotSchema() {
+    return useSnapshotSchema;
+  }

Review Comment:
   I think these should always be `boolean`. From a model perspective, you'll want these methods to always return a value, especially since we have defaults defined. This will also simplify the parser logic, where you won't have to if/else these cases.
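
   Roughly what I mean, as a sketch (the class name here is just illustrative, not a proposed API):

```java
import java.util.List;
import org.apache.iceberg.expressions.Expression;

// Illustrative sketch: primitive booleans with defaults, so the accessors always
// return a concrete value and the parser needs no null checks for these fields.
class ScanRequestModelSketch {
  private Long snapshotId;                    // nullable, only written when set
  private List<String> select;                // nullable, only written when non-empty
  private Expression filter;                  // nullable, only written when set
  private boolean caseSensitive = true;       // default per spec
  private boolean useSnapshotSchema = false;  // default per spec

  boolean caseSensitive() {
    return caseSensitive;
  }

  boolean useSnapshotSchema() {
    return useSnapshotSchema;
  }
}
```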



##########
core/src/main/java/org/apache/iceberg/rest/requests/PlanTableScanRequestParser.java:
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.rest.requests;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import java.util.List;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.ExpressionParser;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.util.JsonUtil;
+
+public class PlanTableScanRequestParser {
+  private static final String SNAPSHOT_ID = "snapshot-id";
+  private static final String SELECT = "select";
+  private static final String FILTER = "filter";
+  private static final String CASE_SENSITIVE = "case-sensitive";
+  private static final String USE_SNAPSHOT_SCHEMA = "use-snapshot-schema";
+  private static final String START_SNAPSHOT_ID = "start-snapshot-id";
+  private static final String END_SNAPSHOT_ID = "end-snapshot-id";
+  private static final String STATS_FIELDS = "stats-fields";
+
+  private PlanTableScanRequestParser() {}
+
+  public static String toJson(PlanTableScanRequest request) {
+    return toJson(request, false);
+  }
+
+  public static String toJson(PlanTableScanRequest request, boolean pretty) {
+    return JsonUtil.generate(gen -> toJson(request, gen), pretty);
+  }
+
+  public static void toJson(PlanTableScanRequest request, JsonGenerator gen) throws IOException {
+    Preconditions.checkArgument(null != request, "Invalid request: planTableScanRequest null");
+    gen.writeStartObject();
+    if (request.snapshotId() != null) {
+      gen.writeNumberField(SNAPSHOT_ID, request.snapshotId());
+    }
+    if (request.select() != null && !request.select().isEmpty()) {
+      JsonUtil.writeStringArray(SELECT, request.select(), gen);
+    }
+    if (request.filter() != null) {
+      gen.writeStringField(FILTER, ExpressionParser.toJson(request.filter()));
+    }
+    if (request.caseSensitive() != null) {
+      gen.writeBooleanField(CASE_SENSITIVE, request.caseSensitive());
+    } else {
+      gen.writeBooleanField(CASE_SENSITIVE, true);
+    }
+    if (request.useSnapshotSchema() != null) {
+      gen.writeBooleanField(USE_SNAPSHOT_SCHEMA, request.useSnapshotSchema());
+    } else {
+      gen.writeBooleanField(USE_SNAPSHOT_SCHEMA, false);
+    }

Review Comment:
   See my comment above: I think we should write the model in the library in a manner where request.useSnapshotSchema() always returns a concrete boolean value and can't be null.
   
   The spec says it's optional, but that doesn't mean the reference implementation for the spec should enable that behavior for the in-memory representation of these structures. The spec just gives that flexibility to clients and defines what the server should do in case these boolean fields are missing.
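
   Concretely, with non-null boolean accessors the serializer collapses to something like:

```java
// sketch: the defaults live in the model, so these can be written unconditionally
gen.writeBooleanField(CASE_SENSITIVE, request.caseSensitive());
gen.writeBooleanField(USE_SNAPSHOT_SCHEMA, request.useSnapshotSchema());
```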



##########
core/src/main/java/org/apache/iceberg/rest/RESTContentFileParser.java:
##########
@@ -0,0 +1,250 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.rest;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.ContentFile;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.FileContent;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.GenericDataFile;
+import org.apache.iceberg.GenericDeleteFile;
+import org.apache.iceberg.Metrics;
+import org.apache.iceberg.PartitionData;
+import org.apache.iceberg.SingleValueParser;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.util.JsonUtil;
+
+public class RESTContentFileParser {
+  private static final String SPEC_ID = "spec-id";
+  private static final String CONTENT = "content";
+  private static final String FILE_PATH = "file-path";
+  private static final String FILE_FORMAT = "file-format";
+  private static final String PARTITION = "partition";
+  private static final String RECORD_COUNT = "record-count";
+  private static final String FILE_SIZE_IN_BYTES = "file-size-in-bytes";
+  private static final String COLUMN_SIZES = "column-sizes";
+  private static final String VALUE_COUNTS = "value-counts";
+  private static final String NULL_VALUE_COUNTS = "null-value-counts";
+  private static final String NAN_VALUE_COUNTS = "nan-value-counts";
+  private static final String LOWER_BOUNDS = "lower-bounds";
+  private static final String UPPER_BOUNDS = "upper-bounds";
+  private static final String KEY_METADATA = "key-metadata";
+  private static final String SPLIT_OFFSETS = "split-offsets";
+  private static final String EQUALITY_IDS = "equality-ids";
+  private static final String SORT_ORDER_ID = "sort-order-id";
+
+  private RESTContentFileParser() {}
+
+  public static String toJson(ContentFile<?> contentFile) {
+    return JsonUtil.generate(
+        generator -> RESTContentFileParser.toJson(contentFile, generator), false);
+  }
+
+  public static void toJson(ContentFile<?> contentFile, JsonGenerator generator)
+      throws IOException {
+    Preconditions.checkArgument(contentFile != null, "Invalid content file: null");
+    Preconditions.checkArgument(generator != null, "Invalid JSON generator: null");
+
+    generator.writeStartObject();
+
+    generator.writeNumberField(SPEC_ID, contentFile.specId());
+    generator.writeStringField(CONTENT, contentFile.content().name());
+    generator.writeStringField(FILE_PATH, contentFile.path().toString());
+    generator.writeStringField(FILE_FORMAT, contentFile.format().name());
+
+    generator.writeFieldName(PARTITION);
+
+    // TODO: at the time of serialization we don't have the partition spec, just the spec id.
+    // We will need to get the spec from table metadata using the spec id,
+    // or we will need to send the partition spec. Put null here for now until refresh.
+    SingleValueParser.toJson(null, contentFile.partition(), generator);
+
+    generator.writeNumberField(FILE_SIZE_IN_BYTES, contentFile.fileSizeInBytes());
+
+    metricsToJson(contentFile, generator);
+
+    if (contentFile.keyMetadata() != null) {
+      generator.writeFieldName(KEY_METADATA);
+      SingleValueParser.toJson(DataFile.KEY_METADATA.type(), contentFile.keyMetadata(), generator);
+    }
+
+    if (contentFile.splitOffsets() != null) {
+      JsonUtil.writeLongArray(SPLIT_OFFSETS, contentFile.splitOffsets(), generator);
+    }
+
+    if (contentFile.equalityFieldIds() != null) {
+      JsonUtil.writeIntegerArray(EQUALITY_IDS, contentFile.equalityFieldIds(), generator);
+    }
+
+    if (contentFile.sortOrderId() != null) {
+      generator.writeNumberField(SORT_ORDER_ID, contentFile.sortOrderId());
+    }
+
+    generator.writeEndObject();
+  }
+
+  public static ContentFile<?> fromJson(JsonNode jsonNode) {
+    Preconditions.checkArgument(jsonNode != null, "Invalid JSON node for content file: null");
+    Preconditions.checkArgument(
+        jsonNode.isObject(), "Invalid JSON node for content file: non-object (%s)", jsonNode);
+
+    int specId = JsonUtil.getInt(SPEC_ID, jsonNode);
+    FileContent fileContent = FileContent.valueOf(JsonUtil.getString(CONTENT, jsonNode));
+    String filePath = JsonUtil.getString(FILE_PATH, jsonNode);
+    FileFormat fileFormat = FileFormat.fromString(JsonUtil.getString(FILE_FORMAT, jsonNode));
+
+    // TODO: at the time of deserialization we don't have the partition spec, just the spec id.
+    // We will need to get the spec from table metadata using the spec id,
+    // or we will need to send the partition spec; put a null placeholder here for now.

Review Comment:
   Yeah, I think there's a compelling case to update the REST spec so we send back the full serialized spec. If we were worried about payload sizes going over the wire, we could follow the same concept we use for delete file references, where there's a top-level list in which each spec appears only once and the tasks themselves just carry an integer index. But this complicates the protocol.
   
   I think payload size is more of a concern for sending schemas, which I'm pretty against. I feel like resolving schemas is probably worth doing on the client. Besides, clients have full context on what's being projected and can just send the right schema to the file scan task. There's also an argument for the server sending back a schema ID so that the client doesn't have to do the work of resolving a schema to pass to the file scan task, but the schema in FileScanTask is ultimately for any client-side grouping the client wants to do, and we could just use the current schema or the snapshot schema.



##########
core/src/main/java/org/apache/iceberg/rest/requests/PlanTableScanRequestParser.java:
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.rest.requests;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import java.util.List;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.ExpressionParser;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.util.JsonUtil;
+
+public class PlanTableScanRequestParser {
+  private static final String SNAPSHOT_ID = "snapshot-id";
+  private static final String SELECT = "select";
+  private static final String FILTER = "filter";
+  private static final String CASE_SENSITIVE = "case-sensitive";
+  private static final String USE_SNAPSHOT_SCHEMA = "use-snapshot-schema";
+  private static final String START_SNAPSHOT_ID = "start-snapshot-id";
+  private static final String END_SNAPSHOT_ID = "end-snapshot-id";
+  private static final String STATS_FIELDS = "stats-fields";
+
+  private PlanTableScanRequestParser() {}
+
+  public static String toJson(PlanTableScanRequest request) {
+    return toJson(request, false);
+  }
+
+  public static String toJson(PlanTableScanRequest request, boolean pretty) {
+    return JsonUtil.generate(gen -> toJson(request, gen), pretty);
+  }
+
+  public static void toJson(PlanTableScanRequest request, JsonGenerator gen) throws IOException {
+    Preconditions.checkArgument(null != request, "Invalid request: planTableScanRequest null");
+    gen.writeStartObject();
+    if (request.snapshotId() != null) {
+      gen.writeNumberField(SNAPSHOT_ID, request.snapshotId());
+    }
+    if (request.select() != null && !request.select().isEmpty()) {
+      JsonUtil.writeStringArray(SELECT, request.select(), gen);
+    }
+    if (request.filter() != null) {
+      gen.writeStringField(FILTER, ExpressionParser.toJson(request.filter()));
+    }
+    if (request.caseSensitive() != null) {
+      gen.writeBooleanField(CASE_SENSITIVE, request.caseSensitive());
+    } else {
+      gen.writeBooleanField(CASE_SENSITIVE, true);
+    }
+    if (request.useSnapshotSchema() != null) {
+      gen.writeBooleanField(USE_SNAPSHOT_SCHEMA, request.useSnapshotSchema());
+    } else {
+      gen.writeBooleanField(USE_SNAPSHOT_SCHEMA, false);
+    }

Review Comment:
   Also, newlines after the if blocks, please. I think this is a case where it's really helpful for readability.
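
   i.e. something like (same blocks as above, just separated by blank lines):

```java
    if (request.caseSensitive() != null) {
      gen.writeBooleanField(CASE_SENSITIVE, request.caseSensitive());
    } else {
      gen.writeBooleanField(CASE_SENSITIVE, true);
    }

    if (request.useSnapshotSchema() != null) {
      gen.writeBooleanField(USE_SNAPSHOT_SCHEMA, request.useSnapshotSchema());
    } else {
      gen.writeBooleanField(USE_SNAPSHOT_SCHEMA, false);
    }
```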



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
