aokolnychyi commented on PR #8123: URL: https://github.com/apache/iceberg/pull/8123#issuecomment-1645972945
@zinking, I know that paper, and a few of its ideas may be applicable to us. At the same time, Iceberg metadata already forms a set of system tables (`data_files`, etc.) that we query in a distributed manner, which is similar to having a separate table for metadata. If I remember correctly, one benefit of BigQuery is that it does not have to bring the results back while doing this distributed planning, but that is functionality the Spark side would have to provide.

In my view, it is unlikely that one would query 20 PB of data in a single job, and when that does happen, spending 30 seconds planning the job is unlikely to be a problem. At this point, we are not storing large blobs in the manifests, so we will come back to that paper when discussing how to support secondary indexes and how to integrate them into planning. That is the point at which the metadata would become too big. Right now, we are talking about 2-4 GB of metadata to cover 20-40 PB of data.
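For context on the system tables mentioned above, Iceberg metadata tables can be queried directly through Spark like any other table, so planning-style questions can be answered with a distributed scan of metadata rather than data. A minimal sketch, assuming a catalog named `prod` and a table `db.sample` (both hypothetical identifiers):

```sql
-- Inspect per-file metadata for an Iceberg table via Spark SQL.
-- `prod` is an assumed catalog name and `db.sample` an assumed table;
-- substitute your own identifiers.
SELECT file_path, record_count, file_size_in_bytes
FROM prod.db.sample.data_files
ORDER BY file_size_in_bytes DESC
LIMIT 10;
```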
