Re: [PR] Data: Add a util to read write partition stats [iceberg]

via GitHub Fri, 23 Aug 2024 15:37:02 -0700


aokolnychyi commented on PR #10176:
URL: https://github.com/apache/iceberg/pull/10176#issuecomment-2307890941


   I took a detailed look at the implementation/spec today. Here are my 
high-level points:
   
   - The local algorithm was the right choice. I am on board with using 
`ConcurrentHashMap` and coerced partitions as keys.
   - The logic is split between 3 utility classes that are hard to navigate.
   - In my view, `Record` is not the best abstraction to expose when computing 
stats in `core`.
       - Projecting and getting values from `Record` is very fragile.
       - It is necessary to have `Record` to use generic writers but 
`PartitionStatsUtil` is in `core` and is supposed to be helpful for arbitrary 
implementations, not only those that rely on generic writers.
       - The `fromManifest` method in `PartitionStatsUtil` does not return a 
valid `Record` as it doesn’t transform values (it shouldn’t, the writer has to 
do that). We need a better abstraction in `core`.
   
   I'd consider restructuring the code as below (names are suggestions):
   
   1. Add a wrapper class to represent partition stats in `core`.
   
   ```
   public class PartitionStats {
   
     private final int specId;
     private final StructLike partition;
     ...
     private Long lastUpdatedSnapshotId = null;
   
     public void liveEntry(ContentFile<?> file, Snapshot snapshot) {
       // move the existing logic for updating stats from live entires here
     }
   
     public void deletedEntry(Snapshot snapshot) {
       if (snapshot != null) {
         // update last modified info to have accurate stats
       }
     }
   }
   ```
   
   2. Keep `PartitionStatsUtil` in `core` but change its purpose. It does 
everything except writing.
   
   ```
   public class PartitionStatsUtil {
   
     public static final int PARTITION_FIELD_ID = 1;
     public static final String PARTITION_FIELD_NAME = "partition";
     public static final NestedField SPEC_ID = required(2, "spec_id", 
IntegerType.get());
     ...
    
     public static Schema schema(StructType partitionType) {
       // return the schema of records in partition stats files
     }
   
     public static Iterable<PartitionStats> computeStats(Table table, Snapshot 
snapshot) {
       // compute stats like in this PR using ConcurrentHashMap
       // don't convert ManifestEntry to PartitionStats, invoke 
liveEntry/deletedEntry to update stats as needed
     }
   
     public static List<PartitionStats> sortStats(Iterable<PartitionStats> 
stats, StructType partitionType) {
       // sort stats per spec if they have to be written (having this 
separately is optional)
     }
   }
   ```
   
   3. Add `PartitionStatsWriter` that would wrap `PartitionStats` as `Record` 
internally and write using generic writers (like in this PR). The writer will 
be responsible for converting values using `IdentityPartitionConverters`.
   
   What do you think, @ajantha-bhat?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Data: Add a util to read write partition stats [iceberg]

Reply via email to