Re: [PR] Spec, Core: add a new task-type field to task JSON serialization. add data task JSON serialization spec and imp. [iceberg]

via GitHub Fri, 16 Feb 2024 10:02:23 -0800


jackye1995 commented on code in PR #9728:
URL: https://github.com/apache/iceberg/pull/9728#discussion_r1492816128



##########
core/src/main/java/org/apache/iceberg/FileScanTaskParser.java:
##########
@@ -21,116 +21,84 @@
 import com.fasterxml.jackson.core.JsonGenerator;
 import com.fasterxml.jackson.databind.JsonNode;
 import java.io.IOException;
-import org.apache.iceberg.expressions.Expression;
-import org.apache.iceberg.expressions.ExpressionParser;
-import org.apache.iceberg.expressions.Expressions;
-import org.apache.iceberg.expressions.ResidualEvaluator;
 import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
-import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.base.Strings;
 import org.apache.iceberg.util.JsonUtil;
 
 public class FileScanTaskParser {
-  private static final String SCHEMA = "schema";
-  private static final String SPEC = "spec";
-  private static final String DATA_FILE = "data-file";
-  private static final String START = "start";
-  private static final String LENGTH = "length";
-  private static final String DELETE_FILES = "delete-files";
-  private static final String RESIDUAL = "residual-filter";
+  private static final String TASK_TYPE = "task-type";
+
+  private enum TaskType {

Review Comment:
   I feel the Java class names are starting to get confusing, even at the 
interface level. Although `DataTask` extends `FileScanTask`, there is really no 
file that the `DataTask` is scanning. So although `StaticDataTask` and 
`BaseFileScanTask` both implements `FileScanTask`, they are not really 
conceptually similar. The actual relationship is more like `StaticDataTask` 
implements `DataTask`, `BaseFileScanTask` implements `FileScanTask`, but 
`DataTask` and `FileScanTask` have many things in common so we made  `DataTask` 
extends `FileScanTask`.
   
   To resolve this confusing situation and potentially accommodate other types 
of scan task, I am thinking if we should go one more layer above, so we keep 
the file scan task spec as is, and on top of that, have a serialization spec 
for **scan task**, which has types like `file-scan-task`, `data-task`. And we 
then have `ScanTaskParser` that delegates to the existing `FileScanTaskParser` 
or the `DataTaskParser`. What do you think?



##########
format/spec.md:
##########
@@ -1237,17 +1237,36 @@ Content file (data or delete) is serialized as a JSON 
object according to the fo
 | **`equality-ids`**       |`JSON list of int: Field ids used to determine row 
equality in equality delete files`|`[1]`|
 | **`sort-order-id`**      |`JSON int`|`1`|
 
-### File Scan Task Serialization
-
-File scan task is serialized as a JSON object according to the following table.
-
-| Metadata field       |JSON representation|Example|
-|--------------------------|--- |--- |
-| **`schema`**          |`JSON object`|`See above, read schemas instead`|
-| **`spec`**            |`JSON object`|`See above, read partition specs 
instead`|
-| **`data-file`**       |`JSON object`|`See above, read content file instead`|
-| **`delete-files`**    |`JSON list of objects`|`See above, read content file 
instead`|
-| **`residual-filter`** |`JSON object: residual filter 
expression`|`{"type":"eq","term":"id","value":1}`|
+### Task Serialization
+
+There are different task implementations, e.g. `BaseFileScanTask` and 
`StaticDataTask` in Java.
+A `task-type` field is needed to distinguish different task types.
+
+| Metadata field  | JSON representation | Example                              
                                                          |
+|-----------------|---------------------|------------------------------------------------------------------------------------------------|
+| **`task-type`** | `JSON string`       | `file-scan-task`, `data-task`. 
Absence of this field should be interpreted as `file-scan-task` |
+
+`file-scan-task` represents a scan task with a data file and maybe delete 
files.

Review Comment:
   nit: "and maybe delete files" -> "and optional delete files that should be 
applied to the data file"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Spec, Core: add a new task-type field to task JSON serialization. add data task JSON serialization spec and imp. [iceberg]

Reply via email to