amogh-jahagirdar commented on code in PR #11040: URL: https://github.com/apache/iceberg/pull/11040#discussion_r1861360140
########## api/src/main/java/org/apache/iceberg/Table.java: ########## @@ -373,4 +374,14 @@ default Snapshot snapshot(String name) { return null; } + + /** + * Returns the statistics file for the given snapshot id, if available. + * + * @return the {@link StatisticsFile} for the given snapshot id, if available. + */ + default Optional<StatisticsFile> statistics(long snapshotId) { Review Comment: Thanks for adding this, I think this API is in the right direction but I wonder if it's taking enough information to be able to resolve the right file. There's an implicit assumption in the current implementation of this API that a Puffin file will have all the blobs needed. I think it's possible that there could multiple puffins, and one puffin has blob type NDV and the other puffin has blob type SomeFutureIndex. both of these puffin files are for snapshot 1. In the current implementation if someone wants the Puffin where there's SomeFutureIndex, but we happen to return the Puffin due to the generic "Find me a puffin with the given snapshot logic" I feel like the API isn't really doing what it needs. Here's the signature I'm thinking at the moment: ``` default StatisticsFile statisticsFor(long snapshotId, String blobType) ``` This will attempt to attempt to find the statistics file produced at the given snapshot which contains the blob type. I think this future proofs the API a bit more and makes it more useful in case a user is really only caring about a particular blob type which may or may not exist in some other file. I also think that in case a statistics file for exactly that snapshot couldn't be found, we should return the latest statistics rather than just return nothing. While it's possible that the statistics are not completely accurate in this approach, I think in the average case data distributions wouldn't change so drastically between snapshots that the statistics would work horribly against the query. It's probably better in the average case to have some statistics. If there's no there's no statistics file containing the desired blob then I think we'd just return null. WDYT? ########## api/src/main/java/org/apache/iceberg/Table.java: ########## @@ -373,4 +374,14 @@ default Snapshot snapshot(String name) { return null; } + + /** + * Returns the statistics file for the given snapshot id, if available. + * + * @return the {@link StatisticsFile} for the given snapshot id, if available. + */ + default Optional<StatisticsFile> statistics(long snapshotId) { Review Comment: Thanks for adding this, I think this API is in the right direction but I wonder if it's taking enough information to be able to resolve the right file. There's an implicit assumption in the current implementation of this API that a Puffin file will have all the blobs needed. I think it's possible that there could multiple puffins, and one puffin has blob type NDV and the other puffin has blob type SomeFutureIndex. both of these puffin files are for snapshot 1. In the current implementation if someone wants the Puffin where there's SomeFutureIndex, but we happen to return the Puffin due to the generic "Find me a puffin with the given snapshot logic" I feel like the API isn't really doing what it needs. Here's the signature I'm thinking at the moment: ``` default StatisticsFile statisticsFor(long snapshotId, String blobType) ``` This will attempt to attempt to find the statistics file produced at the given snapshot which contains the blob type. I think this future proofs the API a bit more and makes it more useful in case a user is really only caring about a particular blob type which may or may not exist in some other Puffin file. I also think that in case a statistics file for exactly that snapshot couldn't be found, we should return the latest statistics rather than just return nothing. While it's possible that the statistics are not completely accurate in this approach, I think in the average case data distributions wouldn't change so drastically between snapshots that the statistics would work horribly against the query. It's probably better in the average case to have some statistics. If there's no there's no statistics file containing the desired blob then I think we'd just return null. WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org