Re: [I] Kafka Connect: Record projection Index out of bounds error [iceberg]
ismailsimsek commented on issue #11099: URL: https://github.com/apache/iceberg/issues/11099#issuecomment-2336596778

Just found the issue: it was caused by how the `GenericAppenderFactory` was created — the full table schema was passed in instead of the key schema. https://github.com/tabular-io/iceberg-kafka-connect/blob/595f835f5d9174e57660b12f407dabc84781e500/kafka-connect/src/test/java/io/tabular/iceberg/connect/data2/IcebergUtil.java#L96-L103
[PR] build(deps): bump github.com/aws/aws-sdk-go-v2/config from 1.27.30 to 1.27.33 [iceberg-go]
dependabot[bot] opened a new pull request, #140: URL: https://github.com/apache/iceberg-go/pull/140

Bumps [github.com/aws/aws-sdk-go-v2/config](https://github.com/aws/aws-sdk-go-v2) from 1.27.30 to 1.27.33.

Commits
- [f1d71c5](https://github.com/aws/aws-sdk-go-v2/commit/f1d71c59a1499981873f70db8c92d07242779670) Release 2024-09-04
- [e4813e1](https://github.com/aws/aws-sdk-go-v2/commit/e4813e114f3c2ce8db07624f656073440adc1140) Regenerated Clients
- [0e8bb90](https://github.com/aws/aws-sdk-go-v2/commit/0e8bb90d8c0d03824fc8bf8656bd2de6346e48f8) Update partitions file
- [6a5875b](https://github.com/aws/aws-sdk-go-v2/commit/6a5875b13d01e8270c1cbdfae0c63ee479386917) Update endpoints model
- [f7030de](https://github.com/aws/aws-sdk-go-v2/commit/f7030de42b2a92ad4f66b153ae4d88ad28a6b2fe) Update API model
- [a2b751d](https://github.com/aws/aws-sdk-go-v2/commit/a2b751d1ba71f59175a41f9cae5f159f1044360f) Release 2024-09-03
- [e22c249](https://github.com/aws/aws-sdk-go-v2/commit/e22c2495bd38232c640776ef3c1a84fb3145a8b9) Regenerated Clients
- [ff0cf6f](https://github.com/aws/aws-sdk-go-v2/commit/ff0cf6fbfa7c7a12388606d5568135f2beae6599) Update API model
- [3120376](https://github.com/aws/aws-sdk-go-v2/commit/3120376762853b3098fda7e9217fb39fe1bf5105) refactoring of buildQuery to accept a list of maintained headers to l… (#2773)
- [4ed838e](https://github.com/aws/aws-sdk-go-v2/commit/4ed838eab2a963cb16301501c8b8c3e29dac4c20) Merge pull request #2768 from bhavya2109sharma/presignedurl-requestpayer-change

Additional commits viewable in [compare view](https://github.com/aws/aws-sdk-go-v2/compare/config/v1.27.30...config/v1.27.33).

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

---
Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
[PR] build(deps): bump github.com/aws/aws-sdk-go-v2 from 1.30.4 to 1.30.5 [iceberg-go]
dependabot[bot] opened a new pull request, #141: URL: https://github.com/apache/iceberg-go/pull/141

Bumps [github.com/aws/aws-sdk-go-v2](https://github.com/aws/aws-sdk-go-v2) from 1.30.4 to 1.30.5.

Commits
- [a2b751d](https://github.com/aws/aws-sdk-go-v2/commit/a2b751d1ba71f59175a41f9cae5f159f1044360f) Release 2024-09-03
- [e22c249](https://github.com/aws/aws-sdk-go-v2/commit/e22c2495bd38232c640776ef3c1a84fb3145a8b9) Regenerated Clients
- [ff0cf6f](https://github.com/aws/aws-sdk-go-v2/commit/ff0cf6fbfa7c7a12388606d5568135f2beae6599) Update API model
- [3120376](https://github.com/aws/aws-sdk-go-v2/commit/3120376762853b3098fda7e9217fb39fe1bf5105) refactoring of buildQuery to accept a list of maintained headers to l… (#2773)
- [4ed838e](https://github.com/aws/aws-sdk-go-v2/commit/4ed838eab2a963cb16301501c8b8c3e29dac4c20) Merge pull request #2768 from bhavya2109sharma/presignedurl-requestpayer-change
- [d4bd42f](https://github.com/aws/aws-sdk-go-v2/commit/d4bd42fdc82c2954f429bd9eeac3096f5590eaac) Merge branch 'main' into presignedurl-requestpayer-change
- [0353706](https://github.com/aws/aws-sdk-go-v2/commit/0353706229a89749afa93324432ece53da37822b) Added Changelog
- [97e2d3f](https://github.com/aws/aws-sdk-go-v2/commit/97e2d3fec04bc8bfd69eaf22ce476b12f8673810) Release 2024-08-30
- [4cca52b](https://github.com/aws/aws-sdk-go-v2/commit/4cca52b4a81df57e91b1e5d0a65fd6df89606d02) Regenerated Clients
- [c8a5146](https://github.com/aws/aws-sdk-go-v2/commit/c8a5146734eb1e6d59acba18b5967c78dc7a5e42) Update endpoints model

Additional commits viewable in [compare view](https://github.com/aws/aws-sdk-go-v2/compare/v1.30.4...v1.30.5).

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. (Dependabot commands and options are the same as listed in #140 above.)
[PR] build(deps): bump github.com/aws/aws-sdk-go-v2/credentials from 1.17.29 to 1.17.32 [iceberg-go]
dependabot[bot] opened a new pull request, #143: URL: https://github.com/apache/iceberg-go/pull/143

Bumps [github.com/aws/aws-sdk-go-v2/credentials](https://github.com/aws/aws-sdk-go-v2) from 1.17.29 to 1.17.32.

Commits
- [f1d71c5](https://github.com/aws/aws-sdk-go-v2/commit/f1d71c59a1499981873f70db8c92d07242779670) Release 2024-09-04
- [e4813e1](https://github.com/aws/aws-sdk-go-v2/commit/e4813e114f3c2ce8db07624f656073440adc1140) Regenerated Clients
- [0e8bb90](https://github.com/aws/aws-sdk-go-v2/commit/0e8bb90d8c0d03824fc8bf8656bd2de6346e48f8) Update partitions file
- [6a5875b](https://github.com/aws/aws-sdk-go-v2/commit/6a5875b13d01e8270c1cbdfae0c63ee479386917) Update endpoints model
- [f7030de](https://github.com/aws/aws-sdk-go-v2/commit/f7030de42b2a92ad4f66b153ae4d88ad28a6b2fe) Update API model
- [a2b751d](https://github.com/aws/aws-sdk-go-v2/commit/a2b751d1ba71f59175a41f9cae5f159f1044360f) Release 2024-09-03
- [e22c249](https://github.com/aws/aws-sdk-go-v2/commit/e22c2495bd38232c640776ef3c1a84fb3145a8b9) Regenerated Clients
- [ff0cf6f](https://github.com/aws/aws-sdk-go-v2/commit/ff0cf6fbfa7c7a12388606d5568135f2beae6599) Update API model
- [3120376](https://github.com/aws/aws-sdk-go-v2/commit/3120376762853b3098fda7e9217fb39fe1bf5105) refactoring of buildQuery to accept a list of maintained headers to l… (#2773)
- [4ed838e](https://github.com/aws/aws-sdk-go-v2/commit/4ed838eab2a963cb16301501c8b8c3e29dac4c20) Merge pull request #2768 from bhavya2109sharma/presignedurl-requestpayer-change

Additional commits viewable in [compare view](https://github.com/aws/aws-sdk-go-v2/compare/credentials/v1.17.29...credentials/v1.17.32).

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. (Dependabot commands and options are the same as listed in #140 above.)
Re: [PR] build(deps): bump github.com/aws/aws-sdk-go-v2/credentials from 1.17.29 to 1.17.30 [iceberg-go]
dependabot[bot] closed pull request #138: build(deps): bump github.com/aws/aws-sdk-go-v2/credentials from 1.17.29 to 1.17.30 URL: https://github.com/apache/iceberg-go/pull/138
Re: [PR] build(deps): bump github.com/aws/aws-sdk-go-v2/credentials from 1.17.29 to 1.17.30 [iceberg-go]
dependabot[bot] commented on PR #138: URL: https://github.com/apache/iceberg-go/pull/138#issuecomment-2336599806 Superseded by #143.
[PR] feat: SortOrder methods should take schema ref if possible [iceberg-rust]
c-thiel opened a new pull request, #613: URL: https://github.com/apache/iceberg-rust/pull/613

Reduces clones. I also added two tests that didn't fail but were missing, IMO.
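A minimal, self-contained sketch of the idea behind this PR — names like `SortOrderBuilder` and `with_schema` are illustrative stand-ins, not the actual iceberg-rust API. Taking `&SchemaRef` lets the builder bump an `Arc` refcount instead of deep-cloning the schema on every bind:

```rust
use std::sync::Arc;

// Stand-ins for the real iceberg-rust types.
struct Schema;
type SchemaRef = Arc<Schema>;

struct SortOrderBuilder {
    schema: Option<SchemaRef>,
}

impl SortOrderBuilder {
    // Before: `fn with_schema(self, schema: Schema)` forced every caller to
    // clone (or give up) the full schema. Borrowing the Arc means binding a
    // sort order only increments a reference count.
    fn with_schema(mut self, schema: &SchemaRef) -> Self {
        self.schema = Some(Arc::clone(schema)); // cheap pointer copy
        self
    }
}

fn main() {
    let schema: SchemaRef = Arc::new(Schema);
    let _builder = SortOrderBuilder { schema: None }.with_schema(&schema);
    // `schema` is still usable here; nothing was deep-cloned or moved.
    let _still_mine = &schema;
}
```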
[PR] Feat: Reassign field ids for schema [iceberg-rust]
c-thiel opened a new pull request, #615: URL: https://github.com/apache/iceberg-rust/pull/615

Required for `TableMetadataBuilder` (https://github.com/apache/iceberg-rust/pull/587). When new TableMetadata is created, field ids should be reassigned to ensure that field numbering is normalized and starts from 0.
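A flat, self-contained sketch of the reassignment described above, using a simplified `NestedField` stand-in; the real implementation would also need to recurse into nested struct/list/map types:

```rust
// Stand-in for iceberg-rust's NestedField; only the fields needed here.
struct NestedField {
    id: i32,
    name: String,
}

// Walk fields in order and hand out fresh sequential ids starting from 0.
fn reassign_field_ids(fields: &mut [NestedField]) -> i32 {
    let mut next_id = 0;
    for field in fields.iter_mut() {
        field.id = next_id;
        next_id += 1;
    }
    next_id // one past the highest assigned id
}

fn main() {
    let mut fields = vec![
        NestedField { id: 17, name: "ts".to_string() },
        NestedField { id: 3, name: "user_id".to_string() },
    ];
    let next = reassign_field_ids(&mut fields);
    // Numbering is normalized regardless of the incoming ids.
    assert_eq!((fields[0].id, fields[1].id, next), (0, 1, 2));
}
```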
Re: [PR] TableMetadataBuilder [iceberg-rust]
c-thiel commented on PR #587: URL: https://github.com/apache/iceberg-rust/pull/587#issuecomment-2336618257

@liurenjie1024 I tried to cut a few things out - but not along the lines of `TableUpdate`. I hope that's OK?
1. https://github.com/apache/iceberg-rust/pull/611
2. https://github.com/apache/iceberg-rust/pull/612
3. https://github.com/apache/iceberg-rust/pull/613
4. https://github.com/apache/iceberg-rust/pull/614
5. https://github.com/apache/iceberg-rust/pull/615

After they are all merged, I'll rebase this PR for the actual builder.
Re: [PR] feat (datafusion integration): convert datafusion expr filters to Iceberg Predicate [iceberg-rust]
a-agmon commented on code in PR #588: URL: https://github.com/apache/iceberg-rust/pull/588#discussion_r1749162764

## crates/integrations/datafusion/src/physical_plan/expr_to_predicate.rs:

```
@@ -0,0 +1,312 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::collections::VecDeque;
+
+use datafusion::common::tree_node::{TreeNodeRecursion, TreeNodeVisitor};
+use datafusion::common::Column;
+use datafusion::error::DataFusionError;
+use datafusion::logical_expr::{Expr, Operator};
+use datafusion::scalar::ScalarValue;
+use iceberg::expr::{Predicate, Reference};
+use iceberg::spec::Datum;
+
+pub struct ExprToPredicateVisitor {
+    stack: VecDeque<Option<Predicate>>,
+}
+impl ExprToPredicateVisitor {
+    /// Create a new predicate conversion visitor.
+    pub fn new() -> Self {
+        Self {
+            stack: VecDeque::new(),
+        }
+    }
+    /// Get the predicate from the stack.
+    pub fn get_predicate(&self) -> Option<Predicate> {
+        self.stack
+            .iter()
+            .filter_map(|opt| opt.clone())
+            .reduce(Predicate::and)
+    }
+
+    /// Convert a column expression to an iceberg predicate.
+    fn convert_column_expr(
+        &self,
+        col: &Column,
+        op: &Operator,
+        lit: &ScalarValue,
+    ) -> Option<Predicate> {
+        let reference = Reference::new(col.name.clone());
+        let datum = scalar_value_to_datum(lit)?;
+        Some(binary_op_to_predicate(reference, op, datum))
+    }
+
+    /// Convert a compound expression to an iceberg predicate.
+    ///
+    /// The strategy is to support the following cases:
+    /// - if it's an AND expression then the result will be the valid predicates, whether there are 2 or just 1
+    /// - if it's an OR expression then a predicate will be returned only if there are 2 valid predicates on both sides
+    fn convert_compound_expr(&self, valid_preds: &[Predicate], op: &Operator) -> Option<Predicate> {
+        let valid_preds_count = valid_preds.len();
+        match (op, valid_preds_count) {
+            (Operator::And, 1) => valid_preds.first().cloned(),
+            (Operator::And, 2) => Some(Predicate::and(
+                valid_preds[0].clone(),
+                valid_preds[1].clone(),
+            )),
+            (Operator::Or, 2) => Some(Predicate::or(
+                valid_preds[0].clone(),
+                valid_preds[1].clone(),
+            )),
+            _ => None,
+        }
+    }
+}
+
+// Implement TreeNodeVisitor for ExprToPredicateVisitor
+impl<'n> TreeNodeVisitor<'n> for ExprToPredicateVisitor {
+    type Node = Expr;
+
+    fn f_down(&mut self, _node: &'n Self::Node) -> Result<TreeNodeRecursion, DataFusionError> {
+        Ok(TreeNodeRecursion::Continue)
+    }
+
+    fn f_up(&mut self, node: &'n Self::Node) -> Result<TreeNodeRecursion, DataFusionError> {
+        if let Expr::BinaryExpr(binary) = node {
+            match (&*binary.left, &binary.op, &*binary.right) {
+                // process simple expressions (involving a column, operator and literal)
+                (Expr::Column(col), op, Expr::Literal(lit)) => {
```

Review Comment: Thanks for the comment @liurenjie1024. I have added support (and a test) for both "directions" of a binary statement: `X > 1` and `1 < X`. But I think there is something more to be clarified: both `x > 1` and `1 < x` arrive as a single binary expr. When an expr like `foo > 1` is processed, there are 3 main leaves to handle:
```
1 - Column(Column { relation: Some(Bare { table: "my_table" }), name: "foo" })
```
This ^ will be ignored, as it's not a binary expr (it's a column expr).
```
2 - Literal(Int64(1))
```
This ^ will also be ignored, as it's not a binary expr (it's a literal expr).
```
3 - BinaryExpr(BinaryExpr { left: Column(Column { relation: Some(Bare { table: "my_table" }), name: "foo" }), op: Gt, right: Literal(Int64(1)) })
```
This binary expr will be captured, deconstructed, and processed by this match arm:
```
match (&*binary.left, &binary.op, &*binary.right) {
    (Expr::Column(col), op, Expr::Literal(lit)) => { ... }
```
Therefore, in order to support `a>6` we don't need to capture `Expr::Column`, only the binary statement that includes it. On the other hand, I don't think we want to be too flexible because
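As a side note for readers following along, here is a dependency-free sketch of the "swap the operator" trick that lets the literal-on-the-left direction (`1 < x`) reuse the column-on-the-left conversion. `Op` is an illustrative stand-in for datafusion's `Operator`; the PR's actual helper may be named and shaped differently:

```rust
// Only the comparison variants relevant to this discussion are shown.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Eq,
    NotEq,
    Lt,
    LtEq,
    Gt,
    GtEq,
}

// `1 < x` means the same as `x > 1`, so mirroring the comparison lets the
// literal-on-the-left arm delegate to the column-on-the-left arm.
// Eq and NotEq are symmetric and pass through unchanged.
fn mirror(op: Op) -> Op {
    match op {
        Op::Lt => Op::Gt,
        Op::LtEq => Op::GtEq,
        Op::Gt => Op::Lt,
        Op::GtEq => Op::LtEq,
        other => other,
    }
}

fn main() {
    assert_eq!(mirror(Op::Lt), Op::Gt); // `1 < x`  ->  `x > 1`
    assert_eq!(mirror(Op::GtEq), Op::LtEq); // `1 >= x` ->  `x <= 1`
}
```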
Re: [PR] feat (datafusion integration): convert datafusion expr filters to Iceberg Predicate [iceberg-rust]
a-agmon commented on code in PR #588: URL: https://github.com/apache/iceberg-rust/pull/588#discussion_r1749163419

## crates/integrations/datafusion/src/physical_plan/expr_to_predicate.rs (same new-file hunk as quoted above; this comment targets the single-predicate AND arm):

```
+            (Operator::And, 1) => valid_preds.first().cloned(),
```

Review Comment: I basically designed the conversion of filters to be greedy, which means we try to convert as many filters as we can in order to scan fewer files. Suppose we have a case like this:
```
SELECT * FROM T WHERE A > 10 AND REGEXP_LIKE(name, '[a-z]A[a-zA-Z]{2}');
```
This will translate into an expr that looks something like this:
```
BinaryExpr(BinaryExpr AND RegExpExpr)
```
After we have processed both leaves (`BinaryExpr` and `RegExpExpr`), we have just one valid expression on the stack, because the `RegExpExpr` is not supported. Then we get to the parent, `BinaryExpr(A AND B)`: we try to get the 2 child expressions from the stack, but in this case we get only one (`A > 10`) because the second one is not supported (`REGEXP_LIKE`). In this case, because it's a logical **AND** expr, we proceed with just the one valid predicate. That's the case you are highlighting - a logical AND expr where just one side is valid, and we proceed with just one predicate. If it was an **OR** expr wrapping both, we would have failed the whole predicate.
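The AND/OR rule described in this comment fits in a few lines. Below is a self-contained sketch of the same strategy, where `Pred` is a stand-in for iceberg's `Predicate` and `None` marks a side that could not be converted; it mirrors the logic, not the PR's exact code:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Pred {
    Leaf(&'static str),
    And(Box<Pred>, Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
}

fn combine(is_and: bool, left: Option<Pred>, right: Option<Pred>) -> Option<Pred> {
    match (is_and, left, right) {
        (true, Some(l), Some(r)) => Some(Pred::And(Box::new(l), Box::new(r))),
        // AND with one unconvertible side: the remaining predicate still
        // selects a superset of the matching files, so pruning stays safe.
        (true, Some(l), None) => Some(l),
        (true, None, Some(r)) => Some(r),
        (false, Some(l), Some(r)) => Some(Pred::Or(Box::new(l), Box::new(r))),
        // OR with a missing side could wrongly prune files, so give up.
        _ => None,
    }
}

fn main() {
    let a = || Some(Pred::Leaf("a > 10"));
    assert_eq!(combine(true, a(), None), a()); // AND keeps the valid half
    assert_eq!(combine(false, a(), None), None); // OR drops the pushdown
}
```

The asymmetry is the whole point: under AND, a partial predicate is a sound over-approximation of the filter, while under OR it is not.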
Re: [PR] feat (datafusion integration): convert datafusion expr filters to Iceberg Predicate [iceberg-rust]
a-agmon commented on PR #588: URL: https://github.com/apache/iceberg-rust/pull/588#issuecomment-2336631735

Thank you very much for the review and comments, @liurenjie1024. I think I have addressed all your comments and questions and would be happy if you could take another look. I just want to make a comment which might explain some of my design choices. I think the main value of converting DataFusion exprs to Iceberg predicates is that by pushing down some predicate to an Iceberg `scan` we can prune data files and have DataFusion scan less data. So, if there is some expr that we can convert to an Iceberg predicate and thereby scan fewer files, we should do it. Therefore I tried to implement this as greedily as possible - converting as many relevant exprs as possible, even when that covers just part of all the filters, and basically leaving DataFusion to process the rest (which can sometimes be done better there).
[PR] chore(deps): Bump crate-ci/typos from 1.24.3 to 1.24.5 [iceberg-rust]
dependabot[bot] opened a new pull request, #616: URL: https://github.com/apache/iceberg-rust/pull/616

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.3 to 1.24.5.

Release notes (sourced from [crate-ci/typos's releases](https://github.com/crate-ci/typos/releases) and [changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)):

v1.24.5 (2024-09-04)
- Features: (action) Support windows

v1.24.4 (2024-09-03)
- Fixes: Offer a correction for grather

Commits
- [945d407](https://github.com/crate-ci/typos/commit/945d407a5fc9097f020969446a16f581612ab4df) chore: Release
- [e972e95](https://github.com/crate-ci/typos/commit/e972e95d223e37b01e2c21a456b1c70e79836a5d) docs: Update changelog
- [fcfade4](https://github.com/crate-ci/typos/commit/fcfade456a71a13e7c3495a3d3055af176e1a55e) Merge pull request #1095 from zyf722/actions-windows
- [264f549](https://github.com/crate-ci/typos/commit/264f549c13a4c604d08542635a846f750d512b6b) feat(action): Add Windows support to actions
- [853bbe8](https://github.com/crate-ci/typos/commit/853bbe8898c2e3ce040ba6fb0b2cbb3d0ababec5) chore: Release
- [b5f5660](https://github.com/crate-ci/typos/commit/b5f56600b2fce35c214fc337a4884c4eafa6407a) docs: Update changelog
- [39f678e](https://github.com/crate-ci/typos/commit/39f678e520527f50a251775ee38731def724d591) Merge pull request #1093 from kachick/grather-suggest-candidates
- [bb6905f](https://github.com/crate-ci/typos/commit/bb6905fdc963da291c50ef79072a8cbccac06e81) Merge pull request #1092 from crate-ci/renovate/embarkstudios-cargo-deny-acti...
- [786c825](https://github.com/crate-ci/typos/commit/786c825f1753f87b02d24770e3f4ec8043d9084a) fix(dict): Add suggestions for the typo "grather"
- [3400581](https://github.com/crate-ci/typos/commit/340058181bc154efb6751badb34d4e40285d654b) chore(deps): Update EmbarkStudios/cargo-deny-action action to v2

See full diff in [compare view](https://github.com/crate-ci/typos/compare/v1.24.3...v1.24.5).

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. (Dependabot commands and options are the same as listed in #140 above.)
[PR] chore(deps): Update arrow-array requirement from 52 to 53 [iceberg-rust]
dependabot[bot] opened a new pull request, #617: URL: https://github.com/apache/iceberg-rust/pull/617

Updates the requirements on [arrow-array](https://github.com/apache/arrow-rs) to permit the latest version.

Changelog sourced from [arrow-array's changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG-old.md):

52.2.0 (2024-07-24) - [Full Changelog](https://github.com/apache/arrow-rs/compare/52.1.0...52.2.0)

Implemented enhancements:
- Faster min/max for string/binary view arrays #6088 [arrow]
- Support casting to/from Utf8View #6076 [arrow]
- Min/max support for String/BinaryViewArray #6052 [arrow]
- Improve performance of constructing ByteViews for small strings #6034 [parquet] [arrow]
- Fast UTF-8 validation when reading StringViewArray from Parquet #5995 [parquet]
- Optimize StringView row decoding #5945 [arrow]
- Implementing deduplicate / intern functionality for StringView #5910 [arrow]
- Add FlightSqlServiceClient::new_from_inner #6003 [arrow] [arrow-flight] (lewiszlw)
- Complete StringViewArray and BinaryViewArray parquet decoder #6004 [parquet] (XiangpengHao)
- Add begin/end_transaction methods in FlightSqlServiceClient #6026 [arrow] [arrow-flight] (lewiszlw)
- Read Parquet statistics as arrow Arrays #6046 [parquet] (efredine)

Fixed bugs:
- Panic in ParquetMetadata::memory_size if no min/max set #6091 [parquet]
- BinaryViewArray doesn't roundtrip a single Some(&[]) through parquet #6086 [parquet]
- Parquet ColumnIndex for null columns is written even when statistics are disabled #6010 [parquet]

Documentation updates:
- Fix typo in GenericByteViewArray documentation #6054 [arrow] (progval)
- Minor: Improve parquet PageIndex documentation #6042 [parquet] (alamb)

Closed issues:
- Potential performance improvements for reading Parquet to StringViewArray/BinaryViewArray #5904 [parquet] [arrow]

Merged pull requests:
- Faster GenericByteView construction #6102 [parquet] [arrow] (XiangpengHao)
- Add benchmark to track byte-view construction performance #6101 [parquet] (XiangpengHao)
- Optimize bool_or using max_boolean #6100 [arrow] (simonvandel)
- Optimize max_boolean by operating
[PR] chore(deps): Update parquet requirement from 52 to 53 [iceberg-rust]
dependabot[bot] opened a new pull request, #619: URL: https://github.com/apache/iceberg-rust/pull/619

Updates the requirements on [parquet](https://github.com/apache/arrow-rs) to permit the latest version.

Changelog sourced from [parquet's changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG-old.md): the quoted 52.2.0 (2024-07-24) entries are the same arrow-rs changelog quoted in #617 above.
Re: [PR] feat (datafusion integration): convert datafusion expr filters to Iceberg Predicate [iceberg-rust]
a-agmon commented on code in PR #588: URL: https://github.com/apache/iceberg-rust/pull/588#discussion_r1749162764

## crates/integrations/datafusion/src/physical_plan/expr_to_predicate.rs:

@@ -0,0 +1,312 @@

```rust
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

use std::collections::VecDeque;

use datafusion::common::tree_node::{TreeNodeRecursion, TreeNodeVisitor};
use datafusion::common::Column;
use datafusion::error::DataFusionError;
use datafusion::logical_expr::{Expr, Operator};
use datafusion::scalar::ScalarValue;
use iceberg::expr::{Predicate, Reference};
use iceberg::spec::Datum;

pub struct ExprToPredicateVisitor {
    stack: VecDeque<Option<Predicate>>,
}

impl ExprToPredicateVisitor {
    /// Create a new predicate conversion visitor.
    pub fn new() -> Self {
        Self {
            stack: VecDeque::new(),
        }
    }

    /// Get the predicate from the stack.
    pub fn get_predicate(&self) -> Option<Predicate> {
        self.stack
            .iter()
            .filter_map(|opt| opt.clone())
            .reduce(Predicate::and)
    }

    /// Convert a column expression to an iceberg predicate.
    fn convert_column_expr(
        &self,
        col: &Column,
        op: &Operator,
        lit: &ScalarValue,
    ) -> Option<Predicate> {
        let reference = Reference::new(col.name.clone());
        let datum = scalar_value_to_datum(lit)?;
        Some(binary_op_to_predicate(reference, op, datum))
    }

    /// Convert a compound expression to an iceberg predicate.
    ///
    /// The strategy is to support the following cases:
    /// - if it's an AND expression, the result will be the valid predicates, whether there are 2 or just 1
    /// - if it's an OR expression, a predicate will be returned only if there are 2 valid predicates on both sides
    fn convert_compound_expr(&self, valid_preds: &[Predicate], op: &Operator) -> Option<Predicate> {
        let valid_preds_count = valid_preds.len();
        match (op, valid_preds_count) {
            (Operator::And, 1) => valid_preds.first().cloned(),
            (Operator::And, 2) => Some(Predicate::and(
                valid_preds[0].clone(),
                valid_preds[1].clone(),
            )),
            (Operator::Or, 2) => Some(Predicate::or(
                valid_preds[0].clone(),
                valid_preds[1].clone(),
            )),
            _ => None,
        }
    }
}

// Implement TreeNodeVisitor for ExprToPredicateVisitor
impl<'n> TreeNodeVisitor<'n> for ExprToPredicateVisitor {
    type Node = Expr;

    fn f_down(&mut self, _node: &'n Self::Node) -> Result<TreeNodeRecursion, DataFusionError> {
        Ok(TreeNodeRecursion::Continue)
    }

    fn f_up(&mut self, node: &'n Self::Node) -> Result<TreeNodeRecursion, DataFusionError> {
        if let Expr::BinaryExpr(binary) = node {
            match (&*binary.left, &binary.op, &*binary.right) {
                // process simple expressions (involving a column, operator and literal)
                (Expr::Column(col), op, Expr::Literal(lit)) => {
```

Review Comment: Thanks for the comment @liurenjie1024. I have added support, and tests, for both "directions" of binary statement: `X > 1` and `1 < X`.
But I think there is something more to be clarified: both `x>1` and `1<x` arrive as binary expressions, but when an expression like `x > 1` is processed there are 3 main leaves to handle:

```
1 - Column(Column { relation: Some(Bare { table: "my_table" }), name: "foo" })
```

This ^ will be ignored as it's not a binary expr (it's a column expr)

```
2 - Literal(Int64(1))
```

This ^ will also be ignored as it's not a binary expr (it's a literal expr)

```
3 - BinaryExpr(BinaryExpr { left: Column(Column { relation: Some(Bare { table: "my_table" }), name: "foo" }), op: Gt, right: Literal(Int64(1)) })
```

This ^ binary expr will be captured, deconstructed, and processed by this match arm

```
match (&*binary.left, &binary.op, &*binary.right) {
    (Expr::Column(col), op, Expr::Literal(lit)) => { ... }
```

Therefore, in order to support both `a>6` and `6<a`, the match also needs to handle the reversed shape, with the literal on the left and the column on the right.
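To make the reversed-operand handling concrete, here is a minimal sketch (not the exact PR code): it reuses the `convert_column_expr` helper shown above and adds a hypothetical `swap_operator` helper that mirrors comparison operators, since `1 < x` must be rewritten column-first as `x > 1`.

```rust
// Sketch only: handling both `col op lit` and `lit op col` inside f_up.
// `swap_operator` is a hypothetical helper, not part of the PR.
fn swap_operator(op: &Operator) -> Operator {
    match op {
        Operator::Gt => Operator::Lt,
        Operator::GtEq => Operator::LtEq,
        Operator::Lt => Operator::Gt,
        Operator::LtEq => Operator::GtEq,
        other => *other, // Eq and NotEq are symmetric; others pass through
    }
}

// Inside f_up, after deconstructing the BinaryExpr:
match (&*binary.left, &binary.op, &*binary.right) {
    // x > 1 : column on the left, operator used as-is
    (Expr::Column(col), op, Expr::Literal(lit)) => {
        let pred = self.convert_column_expr(col, op, lit);
        self.stack.push_back(pred);
    }
    // 1 < x : literal on the left, operator mirrored so it reads column-first
    (Expr::Literal(lit), op, Expr::Column(col)) => {
        let pred = self.convert_column_expr(col, &swap_operator(op), lit);
        self.stack.push_back(pred);
    }
    _ => self.stack.push_back(None),
}
```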
Re: [PR] Iceberg/Comet integration POC [iceberg]
manuzhang commented on PR #9841: URL: https://github.com/apache/iceberg/pull/9841#issuecomment-2336715251

As a newbie to the native land, I'm wondering why we integrate with the Comet parquet reader here. Are there any alternatives? Is there an "official" parquet reader?
Re: [PR] feat: SortOrder methods should take schema ref if possible [iceberg-rust]
Xuanwo commented on code in PR #613: URL: https://github.com/apache/iceberg-rust/pull/613#discussion_r1749260477

## crates/iceberg/src/spec/sort.rs:

```rust
@@ -133,6 +133,14 @@ impl SortOrder {
     pub fn is_unsorted(&self) -> bool {
         self.fields.is_empty()
     }
+
+    /// Set the order id for the sort order
+    pub fn with_order_id(&self, order_id: i64) -> SortOrder {
```

Review Comment: An API named `with_xxx` that takes a ref might be a bit confusing. Is there any context for this?
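For context, here is a minimal sketch contrasting the `&self` signature from the diff with the consuming-builder shape the `with_` prefix usually suggests; the struct body is a simplified stand-in, not the real `SortOrder`:

```rust
// Simplified stand-in for the real SortOrder; only the API shape matters here.
#[derive(Clone, Default)]
pub struct SortOrder {
    order_id: i64,
    fields: Vec<String>, // placeholder for the actual sort fields
}

impl SortOrder {
    // The variant in the diff: borrows self and returns a modified copy.
    pub fn with_order_id_from_ref(&self, order_id: i64) -> SortOrder {
        SortOrder { order_id, ..self.clone() }
    }

    // The shape `with_xxx` conventionally implies: consumes self, so calls
    // chain without an extra clone.
    pub fn with_order_id(mut self, order_id: i64) -> SortOrder {
        self.order_id = order_id;
        self
    }
}

fn main() {
    let order = SortOrder::default().with_order_id(1); // chains naturally
    let copy = order.with_order_id_from_ref(2);        // original stays usable
    assert_eq!(copy.order_id, 2);
    assert_eq!(order.order_id, 1); // not moved by the by-ref variant
}
```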
[PR] fix: SIGSEGV when describe empty table [iceberg-go]
alex-kar opened a new pull request, #145: URL: https://github.com/apache/iceberg-go/pull/145

An empty table does not have a "Current Snapshot", which caused the describe command to fail. Added validation to fall back to the empty string `""`.

Steps to reproduce:
1. Start env: `docker-compose -f dev/docker-compose.yml up -d`
2. Create table `t` in the `public` namespace:
```
curl --location --request POST 'localhost:8181/v1/namespaces/public/tables' \
--header 'Content-Type: application/json' \
--data-raw '{
    "schema": {
        "type": "struct",
        "fields": [
            { "id": 1, "name": "col1", "type": "boolean", "required": true }
        ]
    },
    "name": "t"
}'
```
3. Describe table `t`: `iceberg describe public.t --uri=http://localhost:8181`

Actual result
```
Table format version | 2
Metadata location    | s3://warehouse/public/t/metadata/0-c2e47505-8693-4254-b335-437111d400f0.metadata.json
Table UUID           | 552a9f6b-c96e-4816-957c-160b96d57a62
Last updated         | 1725807593801
Sort Order           | 0: []
Partition Spec       | []

Current Schema, id=0
└──1: col1: required boolean

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x925504]

goroutine 1 [running]:
main.text.DescribeTable({}, 0xc0001740a0)
        /home/alex/projects/opensource/iceberg-go/cmd/iceberg/output.go:101 +0xea4
main.describe({0xb3dfe8, 0xf85660}, {0xb3e0c8, 0xc000362ed0}, {0x7ffe86519d7c, 0x8}, {0xa44687, 0x3})
        /home/alex/projects/opensource/iceberg-go/cmd/iceberg/main.go:252 +0x3db
main.main()
        /home/alex/projects/opensource/iceberg-go/cmd/iceberg/main.go:149 +0x4ef
```

Fixed result
```
Table format version | 2
Metadata location    | s3://warehouse/public/t/metadata/0-c2e47505-8693-4254-b335-437111d400f0.metadata.json
Table UUID           | 552a9f6b-c96e-4816-957c-160b96d57a62
Last updated         | 1725807593801
Sort Order           | 0: []
Partition Spec       | []

Current Schema, id=0
└──1: col1: required boolean

Current Snapshot |
Snapshots

Properties
key | value
---
write.parquet.compression-codec | zstd
```
Re: [PR] feat: partition compatibility [iceberg-rust]
c-thiel commented on code in PR #612: URL: https://github.com/apache/iceberg-rust/pull/612#discussion_r1749274990

## crates/iceberg/src/spec/partition.rs:

```rust
@@ -118,9 +118,63 @@ impl PartitionSpec {
     /// Turn this partition spec into an unbound partition spec.
     ///
     /// The `field_id` is retained as `partition_id` in the unbound partition spec.
-    pub fn to_unbound(self) -> UnboundPartitionSpec {
+    pub fn into_unbound(self) -> UnboundPartitionSpec {
         self.into()
     }
+
+    /// Check if this partition spec is compatible with another partition spec.
+    ///
+    /// Returns true if the partition spec is equal to the other spec with partition field ids ignored and
+    /// spec_id ignored. The following must be identical:
+    /// * The number of fields
+    /// * Field order
+    /// * Field names
+    /// * Source column ids
+    /// * Transforms
+    pub fn compatible_with(&self, other: &UnboundPartitionSpec) -> bool {
```

Review Comment: Sounds good - done in https://github.com/apache/iceberg-rust/pull/612/commits/64303ea21a26708b873a62f54a43b4d1e2265f94
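For illustration, a compact sketch of this kind of compatibility check under the rules listed in the doc comment; the field structs here are simplified stand-ins, not the real iceberg-rust types:

```rust
// Stand-ins for the real types; only name, source id, and transform matter here.
#[derive(PartialEq)]
enum Transform { Identity, Bucket(u32) } // simplified placeholder

struct PartitionField { name: String, source_id: i32, transform: Transform }
struct UnboundPartitionField { name: String, source_id: i32, transform: Transform }

/// Equality with partition field ids and spec_id ignored: same field count,
/// same order, and per position the same name, source column id, and transform.
fn compatible_with(bound: &[PartitionField], other: &[UnboundPartitionField]) -> bool {
    bound.len() == other.len()
        && bound.iter().zip(other.iter()).all(|(a, b)| {
            a.name == b.name && a.source_id == b.source_id && a.transform == b.transform
        })
}
```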
Re: [PR] feat: partition compatibility [iceberg-rust]
Xuanwo merged PR #612: URL: https://github.com/apache/iceberg-rust/pull/612
Re: [PR] Feat: Normalize TableMetadata [iceberg-rust]
Xuanwo commented on code in PR #611: URL: https://github.com/apache/iceberg-rust/pull/611#discussion_r1749292893

## crates/iceberg/src/spec/table_metadata.rs:

```rust
@@ -278,6 +330,229 @@ impl TableMetadata {
         self.snapshots
             .insert(snapshot.snapshot_id(), Arc::new(snapshot));
     }
+
+    /// Normalize this partition spec.
+    ///
+    /// This is an internal method
+    /// meant to be called after constructing table metadata from untrusted sources.
+    /// We run this method after json deserialization.
+    /// All constructors for `TableMetadata` which are part of `iceberg-rust`
+    /// should return normalized `TableMetadata`.
+    ///
+    /// It does:
+    /// * Validate the current schema is set and valid
+    /// * Validate that all refs are valid (snapshot exists)
+    /// * Validate logs are chronological
+    /// * Normalize location (remove trailing slash)
+    /// * Validate that for V1 Metadata the last_sequence_number is 0
+    /// * If the default partition spec is specified but the spec is not present in specs, add it
+    /// * If the default sort order is unsorted but the sort order is not present, add it
+    pub(super) fn try_normalize(&mut self) -> Result<&mut Self> {
+        self.validate_current_schema()?;
+        self.normalize_current_snapshot()?;
+        self.validate_refs()?;
+        self.validate_chronological_snapshot_logs()?;
+        self.validate_chronological_metadata_logs()?;
+        self.location = self.location.trim_end_matches('/').to_string();
+        self.validate_format_version_specifics()?;
+        self.try_normalize_partition_spec()?;
+        self.try_normalize_sort_order()?;
+        Ok(self)
+    }
+
+    fn try_normalize_partition_spec(&mut self) -> Result<()> {
```

Review Comment: Hi, I'm not sure if this makes the logic easier to read. (It works for me.)

```rust
if self.partition_spec_by_id(self.default_spec_id).is_some() {
    return Ok(())
}

if self.default_spec_id != DEFAULT_PARTITION_SPEC_ID {
    return Err(...)
}

let partition_spec = xxx;
self.partition_specs.insert(xxx);

Ok(())
```

The `self.partition_specs.is_empty()` check can be avoided since it should never occur.
Re: [I] Spark Streaming Job with multiple queries MERGE INTO the same target table (Runtime file filtering is not possible) [iceberg]
eric-maynard commented on issue #11094: URL: https://github.com/apache/iceberg/issues/11094#issuecomment-2336763248

To use multiple sessions you would essentially move this part of your code into the `start_streaming_query` function:

```
# Initialize Spark Session
spark = (
    SparkSession.builder
    .appName("streaming merge into")
    .config("spark.sql.shuffle.partitions", "2")
    .config("spark.log.level", "ERROR")
    .getOrCreate()
)
```

You might also want to make some changes such as appending the thread ID to the app name.
[PR] add option to force columns to lowercase [iceberg]
yornstei opened a new pull request, #11100: URL: https://github.com/apache/iceberg/pull/11100

The current connector creates columns with the same casing as the incoming data, even though many catalogs will show the columns as lowercase. This PR adds an option to force columns to lowercase. Refer to this [slack thread](https://apache-iceberg.slack.com/archives/C03SEK9PW0Z/p1725468165932919) for more details. cc @bryanck
Re: [PR] feat: Reassign field ids for schema [iceberg-rust]
Fokko commented on code in PR #615: URL: https://github.com/apache/iceberg-rust/pull/615#discussion_r1749313279

## crates/iceberg/src/spec/schema.rs:

```rust
@@ -86,6 +87,16 @@ impl SchemaBuilder {
         self
     }
+
+    /// Reassign all field-ids (nested) on build.
+    /// If `start_from` is provided, it will start reassigning from that id (inclusive).
+    /// If not provided, it will start from 0.
+    ///
+    /// All specified aliases and identifier fields will be updated to the new field-ids.
+    pub fn with_reassigned_field_ids(mut self, start_from: Option<i32>) -> Self {
+        self.reassign_field_ids_from = Some(start_from.unwrap_or(0));
```

Review Comment: Typically you would pass in `last-column-id` from the table metadata. In the case of a `CREATE TABLE` this will be zero; in the case of a `REPLACE TABLE` it will be `last-column-id`, where all columns get a new ID and you ensure they are not used before.
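A hypothetical usage sketch of what Fokko describes: reassigning ids relative to `last-column-id`. The `with_reassigned_field_ids` name comes from the diff above; the surrounding builder calls and the `fields` variable are illustrative assumptions.

```rust
// Illustrative only. In a CREATE TABLE flow last_column_id starts at 0;
// in a REPLACE TABLE flow it comes from the existing table metadata, so the
// fresh ids never collide with ids used before.
let last_column_id = 17; // e.g. read from the existing TableMetadata
let schema = Schema::builder()
    .with_fields(fields) // `fields` assumed to be in scope
    .with_reassigned_field_ids(Some(last_column_id + 1)) // start past ids in use
    .build()?;
```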
Re: [I] storage/s3: We should respect `client.region` too [iceberg-rust]
jdockerty commented on issue #603: URL: https://github.com/apache/iceberg-rust/issues/603#issuecomment-2336810234

I can do this 👍
Re: [PR] Spark 3.3,3.4: Add back log making it clearer when we are not pushing down filters [iceberg]
github-actions[bot] closed pull request #7713: Spark 3.3,3.4: Add back log making it clearer when we are not pushing down filters URL: https://github.com/apache/iceberg/pull/7713
Re: [PR] Spark 3.4: IcebergSource extends SessionConfigSupport [iceberg]
github-actions[bot] commented on PR #7732: URL: https://github.com/apache/iceberg/pull/7732#issuecomment-2336880892

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
Re: [PR] Spark 3.4: Correct the two-stage parsing strategy of antlr parser [iceberg]
github-actions[bot] commented on PR #7734: URL: https://github.com/apache/iceberg/pull/7734#issuecomment-2336880932

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
Re: [PR] Spark 3.4: Correct the two-stage parsing strategy of antlr parser [iceberg]
github-actions[bot] closed pull request #7734: Spark 3.4: Correct the two-stage parsing strategy of antlr parser URL: https://github.com/apache/iceberg/pull/7734
Re: [PR] Spark 3.4: Allow write mode (copy-on-write/merge-on-read) to be specified in SQLConf [iceberg]
github-actions[bot] closed pull request #7790: Spark 3.4: Allow write mode (copy-on-write/merge-on-read) to be specified in SQLConf URL: https://github.com/apache/iceberg/pull/7790
Re: [PR] Spark 3.4: Allow write mode (copy-on-write/merge-on-read) to be specified in SQLConf [iceberg]
github-actions[bot] commented on PR #7790: URL: https://github.com/apache/iceberg/pull/7790#issuecomment-2336880984

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
Re: [PR] Disable containerReuse when building delete filters [iceberg]
github-actions[bot] commented on PR #7833: URL: https://github.com/apache/iceberg/pull/7833#issuecomment-2336881035

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
Re: [PR] Spark: support use-table-distribution-and-ordering in session conf [iceberg]
github-actions[bot] commented on PR #8164: URL: https://github.com/apache/iceberg/pull/8164#issuecomment-2336881202

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [I] Case sensitivity is not respected when using IcebergGenerics.ScanBuilder [iceberg]
github-actions[bot] commented on issue #8178: URL: https://github.com/apache/iceberg/issues/8178#issuecomment-2336881273

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [I] Support metadata compaction [iceberg-python]
github-actions[bot] closed issue #270: Support metadata compaction URL: https://github.com/apache/iceberg-python/issues/270
Re: [I] Support metadata compaction [iceberg-python]
github-actions[bot] commented on issue #270: URL: https://github.com/apache/iceberg-python/issues/270#issuecomment-2336882412

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Re: [PR] chore(deps): Update arrow-array requirement from 52 to 53 [iceberg-rust]
dependabot[bot] commented on PR #617: URL: https://github.com/apache/iceberg-rust/pull/617#issuecomment-2336926987

OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting `@dependabot ignore this major version` or `@dependabot ignore this minor version`. You can also ignore all major, minor, or patch releases for a dependency by adding an [`ignore` condition](https://docs.github.com/en/code-security/supply-chain-security/configuration-options-for-dependency-updates#ignore) with the desired `update_types` to your config file. If you change your mind, just re-open this PR and I'll resolve any conflicts on it.
Re: [PR] chore(deps): Update arrow-ord requirement from 52 to 53 [iceberg-rust]
dependabot[bot] commented on PR #621: URL: https://github.com/apache/iceberg-rust/pull/621#issuecomment-2336927231

OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting `@dependabot ignore this major version` or `@dependabot ignore this minor version`. You can also ignore all major, minor, or patch releases for a dependency by adding an [`ignore` condition](https://docs.github.com/en/code-security/supply-chain-security/configuration-options-for-dependency-updates#ignore) with the desired `update_types` to your config file. If you change your mind, just re-open this PR and I'll resolve any conflicts on it.
Re: [PR] chore(deps): Update arrow-arith requirement from 52 to 53 [iceberg-rust]
dependabot[bot] commented on PR #620: URL: https://github.com/apache/iceberg-rust/pull/620#issuecomment-2336927697

OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting `@dependabot ignore this major version` or `@dependabot ignore this minor version`. You can also ignore all major, minor, or patch releases for a dependency by adding an [`ignore` condition](https://docs.github.com/en/code-security/supply-chain-security/configuration-options-for-dependency-updates#ignore) with the desired `update_types` to your config file. If you change your mind, just re-open this PR and I'll resolve any conflicts on it.
Re: [I] storage/s3: We should respect `client.region` too [iceberg-rust]
Xuanwo closed issue #603: storage/s3: We should respect `client.region` too URL: https://github.com/apache/iceberg-rust/issues/603
Re: [PR] fix: Correctly calculate highest_field_id in schema [iceberg-rust]
Xuanwo merged PR #590: URL: https://github.com/apache/iceberg-rust/pull/590
Re: [PR] Iceberg/Comet integration POC [iceberg]
huaxingao commented on PR #9841: URL: https://github.com/apache/iceberg/pull/9841#issuecomment-2337062384

@manuzhang We have a native Parquet reader on the Comet side to take advantage of the performance gains on the native side.
Re: [PR] TableMetadataBuilder [iceberg-rust]
liurenjie1024 commented on PR #587: URL: https://github.com/apache/iceberg-rust/pull/587#issuecomment-2337063983

> After we answered those questions, and we still think splitting makes sense, I can try to find time to build stacked-PRs. Maybe just splitting normalization / validation in table_metadata.rs from the actual builder would be a leaner option than splitting every single TableUpdate?

That sounds reasonable to me. If one PR per table update is too much of a burden, could we split them by components, for example sort order, partition spec, and schema changes?
Re: [I] Support relative paths in Table Metadata [iceberg]
lightmelodies commented on issue #1617: URL: https://github.com/apache/iceberg/issues/1617#issuecomment-2337073064

For anyone who is interested: https://github.com/lightmelodies/iceberg-relative-io
Re: [PR] Core: Parallelize manifest writing for many new files [iceberg]
stevenzwu commented on code in PR #11086: URL: https://github.com/apache/iceberg/pull/11086#discussion_r1749541033

## core/src/test/java/org/apache/iceberg/TestFastAppend.java:

```java
@@ -42,6 +42,25 @@ protected static List parameters() {
     return Arrays.asList(1, 2, 3);
   }

+  @TestTemplate
+  public void testAddManyFiles() {
+    assertThat(listManifestFiles()).as("Table should start empty").isEmpty();
+
+    List<DataFile> dataFiles = Lists.newArrayList();
+
+    for (int ordinal = 0; ordinal < 20_000; ordinal++) {
```

Review Comment: I am usually good with package private; we have quite a bit of `@VisibleForTesting` for cases like that. It can be tricky to update the tests if we ever change the `MIN_FILE_GROUP_SIZE` default value to some larger value like 20_000; then this test would no longer cover the intended case.
Re: [PR] Iceberg/Comet integration POC [iceberg]
manuzhang commented on PR #9841: URL: https://github.com/apache/iceberg/pull/9841#issuecomment-2337144632

@huaxingao I know that. What about alternatives? Is the Comet parquet reader the only native parquet reader implementation?
Re: [PR] Spec: Fix rendering of partition stats file spec [iceberg]
nastra merged PR #11068: URL: https://github.com/apache/iceberg/pull/11068