jackye1995 commented on code in PR #9728: URL: https://github.com/apache/iceberg/pull/9728#discussion_r1492816128
########## core/src/main/java/org/apache/iceberg/FileScanTaskParser.java: ########## @@ -21,116 +21,84 @@ import com.fasterxml.jackson.core.JsonGenerator; import com.fasterxml.jackson.databind.JsonNode; import java.io.IOException; -import org.apache.iceberg.expressions.Expression; -import org.apache.iceberg.expressions.ExpressionParser; -import org.apache.iceberg.expressions.Expressions; -import org.apache.iceberg.expressions.ResidualEvaluator; import org.apache.iceberg.relocated.com.google.common.base.Preconditions; -import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; +import org.apache.iceberg.relocated.com.google.common.base.Strings; import org.apache.iceberg.util.JsonUtil; public class FileScanTaskParser { - private static final String SCHEMA = "schema"; - private static final String SPEC = "spec"; - private static final String DATA_FILE = "data-file"; - private static final String START = "start"; - private static final String LENGTH = "length"; - private static final String DELETE_FILES = "delete-files"; - private static final String RESIDUAL = "residual-filter"; + private static final String TASK_TYPE = "task-type"; + + private enum TaskType { Review Comment: I feel the Java class names are starting to get confusing, even at the interface level. Although `DataTask` extends `FileScanTask`, there is really no file that the `DataTask` is scanning. So although `StaticDataTask` and `BaseFileScanTask` both implements `FileScanTask`, they are not really conceptually similar. The actual relationship is more like `StaticDataTask` implements `DataTask`, `BaseFileScanTask` implements `FileScanTask`, but `DataTask` and `FileScanTask` have many things in common so we made `DataTask` extends `FileScanTask`. To resolve this confusing situation and potentially accommodate other types of scan task, I am thinking if we should go one more layer above, so we keep the file scan task spec as is, and on top of that, have a serialization spec for **scan task**, which has types like `file-scan-task`, `data-task`. And we then have `ScanTaskParser` that delegates to the existing `FileScanTaskParser` or the `DataTaskParser`. What do you think? ########## format/spec.md: ########## @@ -1237,17 +1237,36 @@ Content file (data or delete) is serialized as a JSON object according to the fo | **`equality-ids`** |`JSON list of int: Field ids used to determine row equality in equality delete files`|`[1]`| | **`sort-order-id`** |`JSON int`|`1`| -### File Scan Task Serialization - -File scan task is serialized as a JSON object according to the following table. - -| Metadata field |JSON representation|Example| -|--------------------------|--- |--- | -| **`schema`** |`JSON object`|`See above, read schemas instead`| -| **`spec`** |`JSON object`|`See above, read partition specs instead`| -| **`data-file`** |`JSON object`|`See above, read content file instead`| -| **`delete-files`** |`JSON list of objects`|`See above, read content file instead`| -| **`residual-filter`** |`JSON object: residual filter expression`|`{"type":"eq","term":"id","value":1}`| +### Task Serialization + +There are different task implementations, e.g. `BaseFileScanTask` and `StaticDataTask` in Java. +A `task-type` field is needed to distinguish different task types. + +| Metadata field | JSON representation | Example | +|-----------------|---------------------|------------------------------------------------------------------------------------------------| +| **`task-type`** | `JSON string` | `file-scan-task`, `data-task`. Absence of this field should be interpreted as `file-scan-task` | + +`file-scan-task` represents a scan task with a data file and maybe delete files. Review Comment: nit: "and maybe delete files" -> "and optional delete files that should be applied to the data file" -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org