marvinlanhenke commented on code in PR #241: URL: https://github.com/apache/iceberg-rust/pull/241#discussion_r1547190841
##########
crates/iceberg/src/scan.rs:
##########
@@ -155,8 +189,22 @@ impl TableScan {
             .await?;

         // Generate data file stream
-        let mut entries = iter(manifest_list.entries());
-        while let Some(entry) = entries.next().await {
+        for entry in manifest_list.entries() {
+            // If this scan has a filter, check the partition evaluator cache for an existing
+            // PartitionEvaluator that matches this manifest's partition spec ID.
+            // Use one from the cache if there is one. If not, create one, put it in
+            // the cache, and take a reference to it.
+            if let Some(filter) = self.filter.as_ref() {
+                let partition_evaluator = partition_evaluator_cache
+                    .entry(entry.partition_spec_id())
+                    .or_insert_with_key(|key| self.create_partition_evaluator(key, filter));
+
+                // reject any manifest files whose partition values don't match the filter.
+                if !partition_evaluator.filter_manifest_file(entry) {

Review Comment:
   I think we should apply `ManifestEvaluator` independently from `PartitionEvaluator` (see the comment further down, on struct PartitionEvaluator).

   Also, I'm not sure we can simply skip the file with `continue`. In the Python implementation, the Partition- and MetricsEvaluator are applied to each DataFile on multiple threads. So I believe we have to collect all relevant manifest files first, and then apply the Partition- and MetricsEvaluator to the complete (pre-filtered) collection, in order to split the work across multiple threads?
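
   To make the suggestion concrete, here is a minimal sketch of the two-phase shape described above: prune manifests first (reusing one evaluator per partition spec ID), then evaluate the data files of the surviving manifests on separate threads. Every type and function here (`ManifestFile`, `DataFile`, `Filter`, `ManifestEvaluator`, `prune_manifests`, `select_files`) is a hypothetical stand-in using only the standard library, not the iceberg-rust API, and the filter is reduced to a toy `partition_value >= min` predicate.

```rust
use std::collections::HashMap;
use std::thread;

// Hypothetical stand-ins; none of these are the iceberg-rust types.
struct DataFile {
    path: String,
    partition_value: i64,
}

struct ManifestFile {
    partition_spec_id: i32,
    // Largest partition value covered by this manifest (toy summary).
    max_partition_value: i64,
    data_files: Vec<DataFile>,
}

// Toy filter: `partition_value >= min`.
struct Filter {
    min: i64,
}

// Toy manifest-level evaluator, built once per partition spec ID.
struct ManifestEvaluator {
    min: i64,
}

impl ManifestEvaluator {
    fn new(_spec_id: i32, filter: &Filter) -> Self {
        Self { min: filter.min }
    }

    // True if the manifest may contain matching files.
    fn might_match(&self, manifest: &ManifestFile) -> bool {
        manifest.max_partition_value >= self.min
    }
}

// Phase 1: collect the manifests that survive manifest-level pruning,
// caching one evaluator per partition spec ID.
fn prune_manifests<'a>(manifests: &'a [ManifestFile], filter: &Filter) -> Vec<&'a ManifestFile> {
    let mut cache: HashMap<i32, ManifestEvaluator> = HashMap::new();
    let mut kept = Vec::new();
    for manifest in manifests {
        let evaluator = cache
            .entry(manifest.partition_spec_id)
            .or_insert_with_key(|id| ManifestEvaluator::new(*id, filter));
        if evaluator.might_match(manifest) {
            kept.push(manifest);
        }
    }
    kept
}

// Phase 2: evaluate the data files of each surviving manifest on its own
// scoped thread, then join in order so the output order is deterministic.
fn select_files(manifests: &[&ManifestFile], filter: &Filter) -> Vec<String> {
    thread::scope(|s| {
        let handles: Vec<_> = manifests
            .iter()
            .map(|m| {
                s.spawn(move || {
                    m.data_files
                        .iter()
                        .filter(|f| f.partition_value >= filter.min)
                        .map(|f| f.path.clone())
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

   One thread per manifest is only for illustration; a real scan would more likely use a bounded pool or async tasks. The point is that phase 1 yields a complete, pre-filtered collection that phase 2 can partition freely.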