aokolnychyi commented on PR #10176: URL: https://github.com/apache/iceberg/pull/10176#issuecomment-2307890941
I took a detailed look at the implementation/spec today. Here are my high-level points: - The local algorithm was the right choice. I am on board with using `ConcurrentHashMap` and coerced partitions as keys. - The logic is split between 3 utility classes that are hard to navigate. - In my view, `Record` is not the best abstraction to expose when computing stats in `core`. - Projecting and getting values from `Record` is very fragile. - It is necessary to have `Record` to use generic writers but `PartitionStatsUtil` is in `core` and is supposed to be helpful for arbitrary implementations, not only those that rely on generic writers. - The `fromManifest` method in `PartitionStatsUtil` does not return a valid `Record` as it doesn’t transform values (it shouldn’t, the writer has to do that). We need a better abstraction in `core`. I'd consider restructuring the code as below (names are suggestions): 1. Add a wrapper class to represent partition stats in `core`. ``` public class PartitionStats { private final int specId; private final StructLike partition; ... private Long lastUpdatedSnapshotId = null; public void liveEntry(ContentFile<?> file, Snapshot snapshot) { // move the existing logic for updating stats from live entires here } public void deletedEntry(Snapshot snapshot) { if (snapshot != null) { // update last modified info to have accurate stats } } } ``` 2. Keep `PartitionStatsUtil` in `core` but change its purpose. It does everything except writing. ``` public class PartitionStatsUtil { public static final int PARTITION_FIELD_ID = 1; public static final String PARTITION_FIELD_NAME = "partition"; public static final NestedField SPEC_ID = required(2, "spec_id", IntegerType.get()); ... public static Schema schema(StructType partitionType) { // return the schema of records in partition stats files } public static Iterable<PartitionStats> computeStats(Table table, Snapshot snapshot) { // compute stats like in this PR using ConcurrentHashMap // don't convert ManifestEntry to PartitionStats, invoke liveEntry/deletedEntry to update stats as needed } public static List<PartitionStats> sortStats(Iterable<PartitionStats> stats, StructType partitionType) { // sort stats per spec if they have to be written (having this separately is optional) } } ``` 3. Add `PartitionStatsWriter` that would wrap `PartitionStats` as `Record` internally and write using generic writers (like in this PR). The writer will be responsible for converting values using `IdentityPartitionConverters`. What do you think, @ajantha-bhat? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org