Re: [PR] Flink - Fix incorrect / old row being written into delta files when using upsert mode [iceberg]
pvary commented on code in PR #4364: URL: https://github.com/apache/iceberg/pull/4364#discussion_r1744988793

## flink/v1.14/flink/src/main/java/org/apache/iceberg/flink/sink/BaseDeltaTaskWriter.java:

```diff
@@ -74,7 +80,7 @@ public void write(RowData row) throws IOException {
       case INSERT:
       case UPDATE_AFTER:
         if (upsert) {
-          writer.delete(row);
+          writer.deleteKey(deleteSchemaProjection.wrap(row));
```

Review Comment:
   There is no way to distinguish why the equality delete was written, especially since the equality delete could have been written by other engines.
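To illustrate the change in the diff above, here is a minimal, self-contained sketch of the upsert pattern. The types are hypothetical stand-ins for illustration, not Iceberg's actual writer API; the point is that deleting by the projected key removes any prior version of the row, whereas deleting by the full row only matches a row whose non-key columns are identical.

```java
// Hypothetical stand-in types, for illustration only.
record Row(long id, String value) {}
record Key(long id) {}

interface DeltaWriter {
  void deleteKey(Key key); // equality delete on identifier fields only
  void insert(Row row);
}

class UpsertSketch {
  // On upsert: first delete by the key projection, then write the new row.
  static void upsert(DeltaWriter writer, Row row) {
    writer.deleteKey(new Key(row.id())); // matches any old version with this key
    writer.insert(row);
  }
}
```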
Re: [I] Why call deleteKey for Insert and Update After in Flink BaseDeltaTaskWriter? [iceberg]
SML0127 commented on issue #11081: URL: https://github.com/apache/iceberg/issues/11081#issuecomment-2330951529

@pvary Thanks for the answer, pvary. Now I'm checking some code in ColumnarBatchReader, to see whether it works as follows:
1. apply eq delete files first, from the oldest snapshot
2. then apply the data file and pos delete files
3. repeat from step 1 for the next snapshot

I'm still looking into it, so this could be wrong.

```java
ColumnarBatch loadDataToColumnBatch() {
  int numRowsUndeleted = initRowIdMapping();
  ColumnVector[] arrowColumnVectors = readDataToColumnVectors();

  ColumnarBatch newColumnarBatch = new ColumnarBatch(arrowColumnVectors);
  newColumnarBatch.setNumRows(numRowsUndeleted);

  if (hasEqDeletes()) { // double-check that eq deletes exist?
    applyEqDelete(newColumnarBatch);
  }

  if (hasIsDeletedColumn && rowIdMapping != null) {
    // reset the row id mapping array, so that it doesn't filter out the deleted rows
    for (int i = 0; i < numRowsToRead; i++) {
      rowIdMapping[i] = i;
    }
    newColumnarBatch.setNumRows(numRowsToRead);
  }
  return newColumnarBatch;
}
```
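For readers unfamiliar with the row-id mapping in the snippet above, here is a minimal, self-contained sketch of the idea (an assumption about the mechanism, not the reader's actual code): live row positions are compacted to the front of the mapping array, and the batch's row count is shrunk to match.

```java
import java.util.function.IntPredicate;

class RowIdMappingSketch {
  // Returns the number of live rows; their original positions occupy
  // rowIdMapping[0..live-1] afterwards.
  static int filterDeleted(int[] rowIdMapping, int numRows, IntPredicate isDeleted) {
    int live = 0;
    for (int i = 0; i < numRows; i++) {
      if (!isDeleted.test(i)) {
        rowIdMapping[live++] = i; // keep the position of each surviving row
      }
    }
    return live; // becomes the batch's new row count
  }
}
```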
Re: [I] Flink: multiple sinks for the different iceberg tables in the same job? [iceberg]
chenwyi2 closed issue #11074: Flink: multiple sinks for the different iceberg tables in the same job? URL: https://github.com/apache/iceberg/issues/11074
Re: [I] Flink: multiple sinks for the different iceberg tables in the same job? [iceberg]
chenwyi2 commented on issue #11074: URL: https://github.com/apache/iceberg/issues/11074#issuecomment-2331046697

Yes, that's not an Iceberg problem, but one with our platform, thx
[I] Caused by: java.lang.RuntimeException: org.apache.flink.runtime.JobException: Creating the input splits caused an error: Failed to refresh the table [iceberg]
Littlehhao opened a new issue, #11082: URL: https://github.com/apache/iceberg/issues/11082

### Query engine

_No response_

### Question

Environment: Flink standalone 1.17.1, Hadoop 3.1.0

```sql
select * from sample;
```

```
Caused by: org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster.
	at org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97) ~[flink-dist-1.17.1.jar:1.17.1]
	at java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:1.8.0_422]
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:1.8.0_422]
	at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:1.8.0_422]
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) ~[?:1.8.0_422]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:1.8.0_422]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:1.8.0_422]
	at java.lang.Thread.run(Unknown Source) ~[?:1.8.0_422]
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: org.apache.flink.runtime.JobException: Creating the input splits caused an error: Failed to refresh the table
	at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:1.8.0_422]
	at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:1.8.0_422]
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) ~[?:1.8.0_422]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:1.8.0_422]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:1.8.0_422]
	at java.lang.Thread.run(Unknown Source) ~[?:1.8.0_422]
Caused by: java.lang.RuntimeException: org.apache.flink.runtime.JobException: Creating the input splits caused an error: Failed to refresh the table
	at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:321) ~[flink-dist-1.17.1.jar:1.17.1]
	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:114) ~[flink-dist-1.17.1.jar:1.17.1]
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) ~[?:1.8.0_422]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:1.8.0_422]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:1.8.0_422]
	at java.lang.Thread.run(Unknown Source) ~[?:1.8.0_422]
Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: Failed to refresh the table
```
Re: [PR] Kafka Connect: Include third party licenses and notices in distribution [iceberg]
ajantha-bhat commented on code in PR #10829: URL: https://github.com/apache/iceberg/pull/10829#discussion_r1745249956

## kafka-connect/kafka-connect-runtime/NOTICE:

```
@@ -0,0 +1,1723 @@
+
+Apache Iceberg
+Copyright 2017-2024 The Apache Software Foundation
+
+This product includes software developed at
+The Apache Software Foundation (http://www.apache.org/).
+
+
+This project includes code from Kite, developed at Cloudera, Inc. with
+the following copyright notice:
+
+| Copyright 2013 Cloudera Inc.
+|
+| Licensed under the Apache License, Version 2.0 (the "License");
+| you may not use this file except in compliance with the License.
+| You may obtain a copy of the License at
+|
+|   http://www.apache.org/licenses/LICENSE-2.0
+|
+| Unless required by applicable law or agreed to in writing, software
+| distributed under the License is distributed on an "AS IS" BASIS,
+| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+| See the License for the specific language governing permissions and
+| limitations under the License.
+
+
+This binary artifact contains code from the following projects:
+
+
+Group: org.apache.commons  Name: commons-math3  Version: 3.1.1
+
+Notice: Apache Commons Math
+Copyright 2001-2012 The Apache Software Foundation
+
+This product includes software developed by
+The Apache Software Foundation (http://www.apache.org/).
+
+===
+
+The BracketFinder (package org.apache.commons.math3.optimization.univariate)
```

Review Comment:
   I don't see any direct dependency on this. I am not even sure this is the right way to generate the notice and license, because it looks so different from the Spark runtime, for example. Maybe we first need to document in `contribute.md` how to generate the notice and license for modules (and also mention when we need to add them, e.g. only for runtime jar modules), and then follow that process.
Re: [I] Why call deleteKey for Insert and Update After in Flink BaseDeltaTaskWriter? [iceberg]
SML0127 commented on issue #11081: URL: https://github.com/apache/iceberg/issues/11081#issuecomment-2331421137

@pvary Sorry for asking a vague question. But thanks to your answer, I now understand how data files and delete files work. That is,
- equality delete files are valid only up to the previous snapshot (applied to the previous snapshot)
- pos delete files are valid up to the current snapshot (applied to the current (latest) snapshot)

Please check whether I understand correctly🙏🙏
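In sequence-number terms, the scoping described above can be sketched as follows; this is a paraphrase of the rule (treating snapshot boundaries as data sequence numbers), not code from the issue.

```java
class DeleteScopeSketch {
  // An equality delete applies only to rows written strictly before it.
  static boolean eqDeleteApplies(long deleteSeq, long dataSeq) {
    return dataSeq < deleteSeq;
  }

  // A position delete also applies to rows written in the same commit.
  static boolean posDeleteApplies(long deleteSeq, long dataSeq) {
    return dataSeq <= deleteSeq;
  }
}
```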
Re: [PR] feat: support projection pushdown for datafusion iceberg [iceberg-rust]
FANNG1 commented on code in PR #594: URL: https://github.com/apache/iceberg-rust/pull/594#discussion_r1745529020

## crates/integrations/datafusion/src/physical_plan/scan.rs:

```rust
@@ -138,3 +156,18 @@ async fn get_batch_stream(
     Ok(Box::pin(stream))
 }
+
+fn get_column_names(schema: ArrowSchemaRef, projection: Option<&Vec>) -> Vec {
```

Review Comment:
   done

## crates/integrations/datafusion/src/physical_plan/scan.rs:

```rust
@@ -116,7 +125,11 @@ impl DisplayAs for IcebergTableScan {
         _t: datafusion::physical_plan::DisplayFormatType,
         f: &mut std::fmt::Formatter,
     ) -> std::fmt::Result {
-        write!(f, "IcebergTableScan")
+        write!(
+            f,
+            "IcebergTableScan projection:[{}]",
+            self.projection.join(" ")
```

Review Comment:
   done

## crates/examples/src/datafusion_read_data.rs:

Review Comment:
   done
Re: [PR] feat: support projection pushdown for datafusion iceberg [iceberg-rust]
FANNG1 commented on PR #594: URL: https://github.com/apache/iceberg-rust/pull/594#issuecomment-2331731075

@liurenjie1024, all comments are addressed, please help review again, thx
Re: [PR] Kafka Connect: Docs on configuring the sink [iceberg]
bryanck commented on code in PR #10746: URL: https://github.com/apache/iceberg/pull/10746#discussion_r1745578149

## docs/docs/kafka-connect.md:

```
@@ -0,0 +1,352 @@
```

---
title: "Kafka Connect"
---

# Kafka Connect

[Kafka Connect](https://docs.confluent.io/platform/current/connect/index.html) is a popular framework for moving data in and out of Kafka via connectors. There are many different connectors available, such as the S3 sink for writing data from Kafka to S3 and Debezium source connectors for writing change data capture records from relational databases to Kafka.

It has a straightforward, decentralized, distributed architecture. A cluster consists of a number of worker processes, and a connector runs tasks on these processes to perform the work. Connector deployment is configuration driven, so generally no code needs to be written to run a connector.

## Apache Iceberg Sink Connector

The Apache Iceberg Sink Connector for Kafka Connect is a sink connector for writing data from Kafka into Iceberg tables.

## Features

* Commit coordination for centralized Iceberg commits
* Exactly-once delivery semantics
* Multi-table fan-out
* Automatic table creation and schema evolution
* Field name mapping via Iceberg's column mapping functionality

## Installation

The connector zip archive is created as part of the Iceberg build. You can run the build via:
```bash
./gradlew -xtest -xintegrationTest clean build
```
The zip archive will be found under `./kafka-connect/kafka-connect-runtime/build/distributions`. There is one distribution that bundles the Hive Metastore client and related dependencies, and one that does not. Copy the distribution archive into the Kafka Connect plugins directory on all nodes.

## Requirements

The sink relies on [KIP-447](https://cwiki.apache.org/confluence/display/KAFKA/KIP-447%3A+Producer+scalability+for+exactly+once+semantics) for exactly-once semantics. This requires Kafka 2.5 or later.
## Configuration

| Property                                   | Description                                                                                                       |
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| iceberg.tables                             | Comma-separated list of destination tables                                                                        |
| iceberg.tables.dynamic-enabled             | Set to `true` to route to a table specified in `routeField` instead of using `routeRegex`, default is `false`     |
| iceberg.tables.route-field                 | For multi-table fan-out, the name of the field used to route records to tables                                    |
| iceberg.tables.default-commit-branch       | Default branch for commits, main is used if not specified                                                         |
| iceberg.tables.default-id-columns          | Default comma-separated list of columns that identify a row in tables (primary key)                               |
| iceberg.tables.default-partition-by        | Default comma-separated list of partition field names to use when creating tables                                 |
| iceberg.tables.auto-create-enabled         | Set to `true` to automatically create destination tables, default is `false`                                      |
| iceberg.tables.evolve-schema-enabled       | Set to `true` to add any missing record fields to the table schema, default is `false`                            |
| iceberg.tables.schema-force-optional       | Set to `true` to set columns as optional during table create and evolution, default is `false` to respect schema  |
| iceberg.tables.schema-case-insensitive     | Set to `true` to look up table columns by case-insensitive name, default is `false` for case-sensitive            |
| iceberg.tables.auto-create-props.*         | Properties set on new tables during auto-create                                                                   |
| iceberg.tables.write-props.*               | Properties passed through to Iceberg writer initialization, these take precedence                                 |
| iceberg.table.\<table name\>.commit-branch | Table-specific branch for commits, use `iceberg.tables.default-commit-branch` if not specified                    |
| iceberg.table.\<table name\>.id-columns    | Comma-separated list of columns that identify a row in the table (primary key)                                    |
| iceberg.table.\<table name\>.partition-by  | Comma-separated list of partition fields to use when creating the table                                           |
| iceberg.table.\<table name\>.route-regex   | The regex used to match a record's `routeField` to a table                                                        |
| iceberg.control.topic
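Putting the table above to use, here is a hypothetical worker config sketch as submitted to the Connect REST API. The connector class name, topic, and table names are assumptions for illustration; only `iceberg.*` properties shown in the table (plus standard Kafka Connect fields) appear, and a real deployment also needs catalog properties not covered by the truncated table above.

```json
{
  "name": "events-sink",
  "config": {
    "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
    "topics": "events",
    "iceberg.tables": "default.events",
    "iceberg.tables.auto-create-enabled": "true",
    "iceberg.tables.evolve-schema-enabled": "true",
    "iceberg.tables.default-id-columns": "id"
  }
}
```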
Re: [PR] Kafka Connect: increase timeout for integration test [iceberg]
bryanck commented on PR #11075: URL: https://github.com/apache/iceberg/pull/11075#issuecomment-2331804459

> Can we create a separate CI for kafka connect?

Sure, let me add that to this PR.
Re: [PR] Bump cryptography from 43.0.0 to 43.0.1 [iceberg-python]
Fokko merged PR #1130: URL: https://github.com/apache/iceberg-python/pull/1130
Re: [PR] Bump mkdocs-material from 9.5.33 to 9.5.34 [iceberg-python]
Fokko merged PR #1126: URL: https://github.com/apache/iceberg-python/pull/1126
Re: [PR] Make `commit_table` public [iceberg-python]
Fokko commented on PR #1112: URL: https://github.com/apache/iceberg-python/pull/1112#issuecomment-2331938676

@sungwy No problem at all, I've pulled in latest master 👍
Re: [PR] MR: iceberg storage handler should set common projection pruning config [iceberg]
ludlows closed pull request #10188: MR: iceberg storage handler should set common projection pruning config URL: https://github.com/apache/iceberg/pull/10188
Re: [PR] open-api: Fix compile warnings for testFixtures [iceberg]
danielcweeks merged PR #11071: URL: https://github.com/apache/iceberg/pull/11071
Re: [PR] Spark 3.5: Mandate identifier fields when create_changelog_view for table contain unsortable columns [iceberg]
dramaticlly commented on code in PR #11045: URL: https://github.com/apache/iceberg/pull/11045#discussion_r1745898693

## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/CreateChangelogViewProcedure.java:

```java
@@ -146,10 +147,16 @@ public InternalRow[] call(InternalRow args) {
     Dataset df = loadRows(changelogTableIdent, options(input));
     boolean netChanges = input.asBoolean(NET_CHANGES, false);

+    String[] identifierColumns = identifierColumns(input, tableIdent);
+    Preconditions.checkArgument(
+        identifierColumns.length > 0
+            || Arrays.stream(df.schema().fields())
+                .allMatch(field -> OrderUtils.isOrderable(field.dataType())),
+        "Identifier field is required if table contains unorderable columns");
```

Review Comment:
   Thank you @flyrain and @karuppayya, that's a good point to judge based on the existing error message; added the unorderable column name as suggested.
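For context, a hedged usage sketch of the procedure guarded by this check, assuming a `spark` SparkSession variable; the catalog, table, and column names are placeholders:

```java
// Pass identifier columns explicitly so tables containing unorderable column
// types (e.g. map columns) satisfy the new precondition.
spark.sql(
    "CALL spark_catalog.system.create_changelog_view("
        + "table => 'db.tbl', "
        + "identifier_columns => array('id'))");
```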
Re: [PR] Spark 3.3, 3.4: Parallelize reading files in migrate procedures [iceberg]
amogh-jahagirdar commented on PR #11043: URL: https://github.com/apache/iceberg/pull/11043#issuecomment-2332252827

Thanks @manuzhang!
Re: [PR] Spark 3.3, 3.4: Parallelize reading files in migrate procedures [iceberg]
amogh-jahagirdar merged PR #11043: URL: https://github.com/apache/iceberg/pull/11043
Re: [I] Rust <> Python integration point [iceberg-rust]
kevinjqliu commented on issue #538: URL: https://github.com/apache/iceberg-rust/issues/538#issuecomment-2332259085

Looks like @sungwy already started by exposing Transforms in #556. I'll take a stab at exposing the Catalogs; see https://github.com/apache/iceberg-rust/pull/534#issuecomment-2330489500
Re: [PR] Support python 3.12 [iceberg-python]
Fokko commented on code in PR #1068: URL: https://github.com/apache/iceberg-python/pull/1068#discussion_r1745925508

## pyproject.toml:

```
@@ -611,6 +615,8 @@ filterwarnings = [
     "ignore:unclosed
```
Re: [I] Add support for Python 3.12 [iceberg-python]
kevinjqliu closed issue #28: Add support for Python 3.12 URL: https://github.com/apache/iceberg-python/issues/28
Re: [PR] Kafka Connect: Include third party licenses and notices in distribution [iceberg]
danielcweeks commented on code in PR #10829: URL: https://github.com/apache/iceberg/pull/10829#discussion_r1745961589

## kafka-connect/kafka-connect-runtime/NOTICE:

```
@@ -0,0 +1,1723 @@
+
+Apache Iceberg
+Copyright 2017-2024 The Apache Software Foundation
+
+This product includes software developed at
+The Apache Software Foundation (http://www.apache.org/).
+
+
+This project includes code from Kite, developed at Cloudera, Inc. with
```

Review Comment:
   I'm not sure if this is right. Looking at the [Kite NOTICE](https://github.com/kite-sdk/kite/blob/master/NOTICE.txt), it has entries I don't see here. (I didn't check all dependencies. The first one appears to have an issue, so I want to make sure this was generated correctly.)
Re: [PR] Spark 3.5: Mandate identifier fields when create_changelog_view for table contain unsortable columns [iceberg]
flyrain merged PR #11045: URL: https://github.com/apache/iceberg/pull/11045
Re: [PR] Spark 3.5: Mandate identifier fields when create_changelog_view for table contain unsortable columns [iceberg]
flyrain commented on PR #11045: URL: https://github.com/apache/iceberg/pull/11045#issuecomment-2332359477

Thanks @dramaticlly for the change, and thanks @karuppayya @anigos @huaxingao for the review.
Re: [PR] Kafka Connect: Include third party licenses and notices in distribution [iceberg]
bryanck commented on code in PR #10829: URL: https://github.com/apache/iceberg/pull/10829#discussion_r1745974177

## kafka-connect/kafka-connect-runtime/NOTICE:

```
@@ -0,0 +1,1723 @@
+
+Apache Iceberg
+Copyright 2017-2024 The Apache Software Foundation
+
+This product includes software developed at
+The Apache Software Foundation (http://www.apache.org/).
+
+
+This project includes code from Kite, developed at Cloudera, Inc. with
```

Review Comment:
   I just appended notices to the root project notice. I could remove that if we don't need it. That's true of the cloud bundle notices also.
[PR] Expose `Catalog` trait as python binding [iceberg-rust]
kevinjqliu opened a new pull request, #604: URL: https://github.com/apache/iceberg-rust/pull/604

#538
Re: [I] New PR label [ready for review] [iceberg-python]
sungwy commented on issue #1123: URL: https://github.com/apache/iceberg-python/issues/1123#issuecomment-2332404830

As a counterargument, what's the difference between using this label versus opening a PR in Draft mode?
Re: [I] Cannot commit identity partition on datatypes time,timestamp* using 'fromPartitionString' [iceberg]
mderoy commented on issue #11085: URL: https://github.com/apache/iceberg/issues/11085#issuecomment-2332466946

We'll contribute to this in a few weeks when one of our developers gets back from vacation. https://github.com/apache/iceberg/commit/770f84325f2810bae3b48a6e1983d6a8135cb7bf seems like something that can guide us to an implementation, but if there is any other background information you may have that would be helpful, we'd love to hear it :)
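As a rough illustration of the gap (an assumption about what an implementation would need, not code from the linked commit): Iceberg represents `time` values internally as microseconds from midnight, so a `fromPartitionString`-style conversion would have to parse the string form back into that representation.

```java
import java.time.LocalTime;

class TimePartitionSketch {
  // Hypothetical helper: parse a time partition string such as "14:21:01.919"
  // into Iceberg's internal time representation (microseconds from midnight).
  static long timeStringToMicros(String value) {
    LocalTime t = LocalTime.parse(value);
    return t.toNanoOfDay() / 1_000L;
  }
}
```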
[PR] Remove deprecated `datetime` functions [iceberg-python]
hussein-awala opened a new pull request, #1134: URL: https://github.com/apache/iceberg-python/pull/1134

closes: #1133
Re: [I] Inconsistent row count across versions [iceberg-python]
dev-goyal commented on issue #1132: URL: https://github.com/apache/iceberg-python/issues/1132#issuecomment-2332547741

Thanks @sungwy, that makes sense - I am indeed using MOR (version 2)! Let me know how else I might be able to help.
[PR] Core: Optimize writing metadata for many new files [iceberg]
aokolnychyi opened a new pull request, #11086: URL: https://github.com/apache/iceberg/pull/11086

This PR optimizes writing metadata for many new files, which is helpful during initial table creation and row-level operations that modify many files.

Benchmark results prior to the changes:

```
Benchmark                    (fast)  (numFiles)  Mode  Cnt   Score   Error  Units
AppendBenchmark.appendFiles    true           5    ss    5   1.111 ± 0.035   s/op
AppendBenchmark.appendFiles    true          10    ss    5   2.114 ± 0.049   s/op
AppendBenchmark.appendFiles    true          50    ss    5  10.144 ± 0.182   s/op
AppendBenchmark.appendFiles    true         100    ss    5  20.205 ± 0.388   s/op
AppendBenchmark.appendFiles    true         250    ss    5  51.280 ± 3.610   s/op
AppendBenchmark.appendFiles   false           5    ss    5   1.125 ± 0.084   s/op
AppendBenchmark.appendFiles   false          10    ss    5   2.107 ± 0.095   s/op
AppendBenchmark.appendFiles   false          50    ss    5  10.117 ± 0.398   s/op
AppendBenchmark.appendFiles   false         100    ss    5  20.350 ± 1.046   s/op
AppendBenchmark.appendFiles   false         250    ss    5  48.823 ± 3.604   s/op
```

Benchmark results after the changes:

```
Benchmark                    (fast)  (numFiles)  Mode  Cnt  Score   Error  Units
AppendBenchmark.appendFiles    true           5    ss    5  0.383 ± 0.072   s/op
AppendBenchmark.appendFiles    true          10    ss    5  0.435 ± 0.031   s/op
AppendBenchmark.appendFiles    true          50    ss    5  1.283 ± 0.152   s/op
AppendBenchmark.appendFiles    true         100    ss    5  2.325 ± 0.411   s/op
AppendBenchmark.appendFiles    true         250    ss    5  5.615 ± 1.203   s/op
AppendBenchmark.appendFiles   false           5    ss    5  0.403 ± 0.023   s/op
AppendBenchmark.appendFiles   false          10    ss    5  0.484 ± 0.058   s/op
AppendBenchmark.appendFiles   false          50    ss    5  1.357 ± 0.190   s/op
AppendBenchmark.appendFiles   false         100    ss    5  2.596 ± 0.829   s/op
AppendBenchmark.appendFiles   false         250    ss    5  5.993 ± 1.447   s/op
```
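For reference, the benchmarked operation corresponds to a plain append through Iceberg's public API, sketched below; `table` and the data files are assumed to exist already, and this is the commit whose metadata writing the PR speeds up.

```java
import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;

class AppendSketch {
  // Commit many new data files in a single append (one new snapshot).
  static void appendAll(Table table, Iterable<DataFile> dataFiles) {
    AppendFiles append = table.newAppend();
    for (DataFile file : dataFiles) {
      append.appendFile(file); // registers each file in the new snapshot's manifests
    }
    append.commit(); // writes manifests + manifest list and commits the snapshot
  }
}
```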
Re: [PR] Core: Optimize writing metadata for many new files [iceberg]
aokolnychyi commented on code in PR #11086: URL: https://github.com/apache/iceberg/pull/11086#discussion_r1746178142

## core/src/main/java/org/apache/iceberg/SnapshotProducer.java:

```java
@@ -654,4 +740,38 @@ private static void updateTotal(
       }
     }
   }
+
+  protected static class DeleteFileHolder {
```

Review Comment:
   Simply moved from `MergingSnapshotProducer`.
Re: [PR] Core/RewriteFiles: Duplicate Data Bug - Fixed dropping delete files that are still required [iceberg]
amogh-jahagirdar commented on code in PR #10962: URL: https://github.com/apache/iceberg/pull/10962#discussion_r1746200353

## core/src/test/java/org/apache/iceberg/TestRewriteFiles.java:

```java
@@ -384,6 +386,116 @@ public void testRewriteDataAndAssignOldSequenceNumber() {
     assertThat(listManifestFiles()).hasSize(4);
   }

+  @TestTemplate
+  public void testRewriteDataAndAssignOldSequenceNumbersShouldNotDropDeleteFiles() {
+    assumeThat(formatVersion)
+        .as("Sequence number is only supported in iceberg format v2 or later")
+        .isGreaterThan(1);
+    assertThat(listManifestFiles()).isEmpty();
+
+    commit(table, table.newRowDelta().addRows(FILE_A).addDeletes(FILE_A2_DELETES), branch);
+
+    long firstRewriteSequenceNumber = latestSnapshot(table, branch).sequenceNumber();
+
+    commit(
+        table,
+        table.newRowDelta().addRows(FILE_B).addRows(FILE_B).addDeletes(FILE_B2_DELETES),
+        branch);
+    commit(
+        table,
+        table.newRowDelta().addRows(FILE_B).addRows(FILE_C).addDeletes(FILE_C2_DELETES),
+        branch);
+
+    long secondRewriteSequenceNumber = latestSnapshot(table, branch).sequenceNumber();
+
+    commit(
+        table,
+        table
+            .newRewrite()
+            .addFile(FILE_D)
+            .deleteFile(FILE_B)
+            .deleteFile(FILE_C)
+            .dataSequenceNumber(secondRewriteSequenceNumber),
+        branch);
+
+    TableMetadata base = readMetadata();
+    Snapshot baseSnap = latestSnapshot(base, branch);
+    long baseSnapshotId = baseSnap.snapshotId();
+
+    Comparator sequenceNumberOrdering =
+        new Comparator<>() {
+          @Override
+          public int compare(ManifestFile o1, ManifestFile o2) {
+            return (int) (o1.sequenceNumber() - o2.sequenceNumber());
+          }
+        };
+
+    // FILE_B2_DELETES and FILE_A2_DELETES should not be removed as the rewrite specifies
+    // `firstRewriteSequenceNumber` explicitly, which is the same as that of A2_DELETES
+    // and before B2_DELETES
+
+    // Technically A1_DELETES could be removed since it's an equality delete and should apply on
```

Review Comment:
   Sorry, not sure I really follow the comment? There is no A1_DELETES; there's a FILE_A_DELETES, but that's a positional delete, and I don't see those referenced in the above operations. I think it's true though that we should be able to drop equality deletes older than the minimum sequence number, but that's already happening in the existing MergingSnapshotProducer check, no? I don't think anything needs to distinguish there between equality and positional deletes.

## core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:

```java
@@ -833,7 +833,17 @@ public List apply(TableMetadata base, Snapshot snapshot) {
         filterManager.filterManifests(
             SnapshotUtil.schemaFor(base, targetBranch()),
             snapshot != null ? snapshot.dataManifests(ops.io()) : null);
-    long minDataSequenceNumber =
+
+    long minNewFileSequenceNumber =
+        addedDataFiles().stream()
+            .filter(x -> x.dataSequenceNumber() != null && x.dataSequenceNumber() >= 0)
+            .mapToLong(ContentFile::dataSequenceNumber)
+            .reduce(
+                newDataFilesDataSequenceNumber != null
+                    ? newDataFilesDataSequenceNumber
+                    : base.nextSequenceNumber(),
+                Math::min);
```

Review Comment:
   Do we actually need to iterate through the `addedDataFiles`? If I understood the issue correctly, the problem is that it's possible for a user to commit a rewrite operation and specify an older data sequence number, and the current logic would drop delete files which actually still need to be referenced in the new commit, since it's not considering the specified data file sequence number.

   So I *think* all we would need to do here is keep the existing logic for determining `minDataSequenceNumber` and then also min that with the `newDataFilesDataSequenceNumber` if it's not null:

```java
long minNewDataSequenceNumber = /* existing logic */;
if (newDataFilesDataSequenceNumber != null) {
  minNewDataSequenceNumber = Math.min(minNewDataSequenceNumber, newDataFilesDataSequenceNumber);
}
```
Re: [I] Cannot commit identity partition on datatypes time,timestamp* using 'fromPartitionString' [iceberg]
amogh-jahagirdar commented on issue #11085: URL: https://github.com/apache/iceberg/issues/11085#issuecomment-2332705117

Hey @mderoy, check out https://github.com/apache/iceberg/pull/10820 which tried to address this, but we determined that adding to this Conversions logic is probably not the right way to go, since Iceberg considers partition value to String a one-way conversion. The other way existed for Hive table migration, but again it's probably not good to keep adding cases to that, since the string representation is not standardized.

Before all that, I'm curious what you're trying to do. Is this a Hive table migration to Iceberg? It doesn't sound like it based on

```
when trying to insert into a table partitioned on identity for a time datatype, we get the following error trying to commit Unsupported type for fromPartitionString: time
```

If you're inserting into an Iceberg table, this Conversions logic shouldn't even come into the picture, so I'm a bit confused there...
Re: [I] New PR label [ready for review] [iceberg-python]
kevinjqliu closed issue #1123: New PR label [ready for review] URL: https://github.com/apache/iceberg-python/issues/1123
Re: [I] Remove python 3.8 support [iceberg-python]
kevinjqliu commented on issue #1121: URL: https://github.com/apache/iceberg-python/issues/1121#issuecomment-2332722899

I'll draft it up :) thanks
Re: [PR] Deprecate ADLFS prefix in favor of ADLS [iceberg-python]
kevinjqliu commented on PR #961: URL: https://github.com/apache/iceberg-python/pull/961#issuecomment-2332733081

@ndrluis do you mind rebasing this PR? Looks like it's almost good to go
[PR] Bump pydantic from 2.8.2 to 2.9.0 [iceberg-python]
dependabot[bot] opened a new pull request, #1137: URL: https://github.com/apache/iceberg-python/pull/1137

Bumps [pydantic](https://github.com/pydantic/pydantic) from 2.8.2 to 2.9.0.

Release notes (sourced from [pydantic's releases](https://github.com/pydantic/pydantic/releases)):

**v2.9.0 (2024-09-05)**

The code released in v2.9.0 is practically identical to that of v2.9.0b2. Check out our [blog post](https://pydantic.dev/articles/pydantic-v2-9-release) to learn more about the release highlights!

**What's Changed**

Packaging

* Bump ruff to v0.5.0 and pyright to v1.1.369 by @sydney-runkle in #9801
* Bump pydantic-extra-types to v2.9.0 by @sydney-runkle in #9832
* Support compatibility with pdm v2.18.1 by @Viicos in #10138
* Bump v1 version stub to v1.10.18 by @sydney-runkle in #10214
* Bump pydantic-core to v2.23.2 by @sydney-runkle in #10311

New Features

* Add support for ZoneInfo by @Youssefares in #9896
* Add Config.val_json_bytes by @josh-newman in #9770
* Add DSN for Snowflake by @aditkumar72 in #10128
* Support complex number by @changhc in #9654
* Add support for annotated_types.Not by @aditkumar72 in #10210
* Allow WithJsonSchema to inject $refs w/ http or https links by @dAIsySHEng1 in #9863
* Allow validators to customize validation JSON schema by @Viicos in #10094
* Support parametrized PathLike types by @nix010 in #9764
* Add tagged union serializer that attempts to use str or callable discriminators to select the correct serializer by @sydney-runkle in pydantic/pydantic-core#1397

Changes

* Breaking Change: Merge dict type json_schema_extra by @sydney-runkle in #9792. For more info (how to replicate old behavior) on this change, see [here](https://docs.pydantic.dev/dev/concepts/json_schema/#merging-json_schema_extra)
* Refactor annotation injection for known (often generic) types by @sydney-runkle in #9979
* Move annotation compatibility errors to validation phase by @sydney-runkle in #
* Improve runtime errors for string constraints like pattern for incompatible types by @sydney-runkle in #10158
* Remove 'allOf' JSON schema workarounds by @dpeachey in #10029
* Remove typed_dict_cls data from CoreMetadata by @sydney-runkle in #10180
* Deprecate passing a dict to the Examples class by @Viicos in #10181
* Remove initial_metadata from internal metadata construct by @sydney-runkle in #10194
* Use re.Pattern.search instead of re.Pattern.match for consistency with rust behavior by @tinez in pydantic/pydantic-core#1368
* Show value of wrongly typed data in pydantic-core serialization warning by @BoxyUwU in pydantic/pydantic-core#1377
* Breaking Change: in pydantic-core, change metadata type hint in core schemas from A
[PR] Bump sqlalchemy from 2.0.32 to 2.0.34 [iceberg-python]
dependabot[bot] opened a new pull request, #1138: URL: https://github.com/apache/iceberg-python/pull/1138

Bumps [sqlalchemy](https://github.com/sqlalchemy/sqlalchemy) from 2.0.32 to 2.0.34.

Release notes (sourced from [sqlalchemy's releases](https://github.com/sqlalchemy/sqlalchemy/releases)):

**2.0.34** Released: September 4, 2024

orm

* [orm] [bug] Fixed regression caused by issue #11814 which broke support for certain flavors of PEP 593 Annotated in the type_annotation_map when builtin types such as list, dict were used without an element type. While this is an incomplete style of typing, these types nonetheless previously would be located in the type_annotation_map correctly. References: #11831

sqlite

* [sqlite] [bug] Fixed regression in SQLite reflection caused by #11677 which interfered with reflection for CHECK constraints that were followed by other kinds of constraints within the same table definition. Pull request courtesy Harutaka Kawamura. References: #11832

**2.0.33** Released: September 3, 2024

general

* [general] [change] The pin for setuptools<69.3 in pyproject.toml has been removed. This pin was to prevent a sudden change in setuptools to use PEP 625 from taking place, which would change the file name of SQLAlchemy's source distribution on pypi to be an all lower case name, which is likely to cause problems with various build environments that expected the previous naming style. However, the presence of this pin is holding back environments that otherwise want to use a newer setuptools, so we've decided to move forward with this change, with the assumption that build environments will have largely accommodated the setuptools change by now. References: #11818

orm

* [orm] [bug] [regression] Fixed regression from 1.3 where the column key used for a hybrid property might be populated with that of the underlying column that it returns, for a property that returns an ORM mapped column directly, rather than the key ... (truncated)

Commits: see the full diff in the [compare view](https://github.com/sqlalchemy/sqlalchemy/commits).

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Dependabot commands and options — you can trigger Dependabot actions by commenting on this PR:

- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Re: [I] Support to optimize, analyze tables and expire snapshots, remove orphan files [iceberg-python]
eedduuar commented on issue #31: URL: https://github.com/apache/iceberg-python/issues/31#issuecomment-2332788362

Hello, any progress?
Re: [I] Improve Documentation on getting started with GCS [iceberg]
github-actions[bot] commented on issue #7948: URL: https://github.com/apache/iceberg/issues/7948#issuecomment-2332934209

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [PR] Core: Avoid concurrent commits causing commit failures [iceberg]
github-actions[bot] commented on PR #8001: URL: https://github.com/apache/iceberg/pull/8001#issuecomment-2332935581

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [I] metadata.json delete [iceberg]
github-actions[bot] commented on issue #8007: URL: https://github.com/apache/iceberg/issues/8007#issuecomment-2332935606 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Write ordered by within unique physical partitions folder (exclude hash path). [iceberg]
github-actions[bot] commented on issue #8008: URL: https://github.com/apache/iceberg/issues/8008#issuecomment-2332935627 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] remove_orphan_files throws reached maximum depth exception in AWS EMR-6.11.0 [iceberg]
github-actions[bot] commented on issue #8022: URL: https://github.com/apache/iceberg/issues/8022#issuecomment-2332935648 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] [spark 3.4] skip empty file during table migration, table snapshotting or adding files [iceberg]
github-actions[bot] commented on PR #8040: URL: https://github.com/apache/iceberg/pull/8040#issuecomment-2332935751 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.3: Adding Rebalance operator solving for small files problem [iceberg]
github-actions[bot] commented on PR #8042: URL: https://github.com/apache/iceberg/pull/8042#issuecomment-2332935802 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] AWS: Add retry on UncheckedIOException and max retries for S3FileIO [iceberg]
github-actions[bot] commented on PR #8043: URL: https://github.com/apache/iceberg/pull/8043#issuecomment-2332935830 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Docs: Document accessing instance variables [iceberg]
aokolnychyi opened a new pull request, #11087: URL: https://github.com/apache/iceberg/pull/11087 This PR documents our recommendations for accessing instance variables, to improve the consistency of the code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Pyiceberg StaticTable use the last metadata json URL when the full path is not provided [iceberg]
djouallah closed issue #7979: Pyiceberg StaticTable use the last metadata json URL when the full path is not provided URL: https://github.com/apache/iceberg/issues/7979 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark: Deprecate SparkAppenderFactory [iceberg]
ajantha-bhat commented on PR #11076: URL: https://github.com/apache/iceberg/pull/11076#issuecomment-2332990957 @amoghjaha-db: > Since the class is package private do we want to just remove it upfront I had the same thought initially and checked how we previously handled other classes in this module. Package-private classes like `SparkFileScan` were deprecated first and then removed, so I went ahead and followed the same approach. https://github.com/user-attachments/assets/77ee43f4-aa7c-4d7e-b40e-9296ff21443f > It looks like it's currently used in some JMH benchmarks WritersBenchmark, and TestSparkMergingMetrics would need to be updated to test the SparkFileWriterFactory class instead. As I mentioned in one of the [comments](https://github.com/apache/iceberg/pull/11076#discussion_r1742886689), `SparkFileWriterFactory` doesn't have the metrics-related functions, so we cannot update the test to use it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] add more tests for position deletes [iceberg-python]
sungwy opened a new pull request, #1141: URL: https://github.com/apache/iceberg-python/pull/1141 Investigating #1132 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Core: Add manifestPath API to ContentFile which will return the path to a manifest from which the content file resides in [iceberg]
aokolnychyi commented on code in PR #11044: URL: https://github.com/apache/iceberg/pull/11044#discussion_r1746373228 ## core/src/main/java/org/apache/iceberg/InheritableMetadataFactory.java: ## @@ -46,11 +49,14 @@ static class BaseInheritableMetadata implements InheritableMetadata { private final int specId; private final long snapshotId; private final long sequenceNumber; +private final String manifestPath; -private BaseInheritableMetadata(int specId, long snapshotId, long sequenceNumber) { +private BaseInheritableMetadata( +int specId, long snapshotId, long sequenceNumber, String manifestPath) { Review Comment: Let's do this for now and re-evaluate default values later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] TableMetadataBuilder [iceberg-rust]
liurenjie1024 commented on PR #587: URL: https://github.com/apache/iceberg-rust/pull/587#issuecomment-2333006736 Thanks @c-thiel for this PR. I've skimmed through it and it looks great to me. However, this PR is too large to review (3k lines); would you mind splitting it into smaller ones? For example, we could have one PR for the methods involved in a single [`TableUpdate`](https://github.com/apache/iceberg-rust/blob/9862026b9f3c885a82e7b8b8da414c0c97436537/crates/iceberg/src/catalog/mod.rs#L338) action, with enough tests for it. Also, could the refactoring of `TableMetadataBuilder` into a standalone module be a separate PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
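To make the suggested split concrete, a minimal, hypothetical sketch of what a PR scoped to a single `TableUpdate` action (for example `SetProperties`) might contain: one builder method plus a focused test. The types below are toy stand-ins, not the actual iceberg-rust definitions.
```rust
use std::collections::HashMap;

// Toy stand-ins for the real iceberg-rust types, for illustration only.
#[derive(Default)]
struct TableMetadata {
    properties: HashMap<String, String>,
}

#[derive(Default)]
struct TableMetadataBuilder {
    metadata: TableMetadata,
}

impl TableMetadataBuilder {
    // The method a `TableUpdate::SetProperties`-scoped PR would add.
    fn set_properties(mut self, updates: HashMap<String, String>) -> Self {
        self.metadata.properties.extend(updates);
        self
    }

    fn build(self) -> TableMetadata {
        self.metadata
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn set_properties_overwrites_existing_keys() {
        let metadata = TableMetadataBuilder::default()
            .set_properties(HashMap::from([("owner".into(), "a".into())]))
            .set_properties(HashMap::from([("owner".into(), "b".into())]))
            .build();
        assert_eq!(metadata.properties["owner"], "b");
    }
}
```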
Re: [I] Discussion: Typesafe(r) properties [iceberg-rust]
liurenjie1024 commented on issue #599: URL: https://github.com/apache/iceberg-rust/issues/599#issuecomment-2333037728 Thanks @c-thiel for raising this. I love the idea of type-safe properties and believe it is the right direction to go. I took a look at your reference, but I didn't get a good understanding of what the API would look like. Would you mind showing some code examples of the usage? For example, what would rest/sql catalog properties look like? In fact, we can leave the macros for last, since they are a tool to reduce duplication and maintenance effort. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
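As a rough sketch of one possible shape for such an API (the trait, the `rest.connect-timeout-ms` key, and all other names here are illustrative assumptions, not the actual proposal): each property carries its string key and its parsed value type, and lookups go through the untyped string map catalogs receive today.
```rust
use std::collections::HashMap;

// A typed property knows its string key, its value type, and how to parse
// the raw string; all names here are hypothetical.
trait TypedProperty {
    type Value;
    const KEY: &'static str;
    fn parse(raw: &str) -> Result<Self::Value, String>;
}

// Hypothetical REST catalog property: a connect timeout in milliseconds.
struct ConnectTimeoutMs;

impl TypedProperty for ConnectTimeoutMs {
    type Value = u64;
    const KEY: &'static str = "rest.connect-timeout-ms";
    fn parse(raw: &str) -> Result<u64, String> {
        raw.parse::<u64>()
            .map_err(|e| format!("invalid value for {}: {e}", Self::KEY))
    }
}

// Typed lookup over the untyped property map.
fn get_property<P: TypedProperty>(
    props: &HashMap<String, String>,
) -> Result<Option<P::Value>, String> {
    props.get(P::KEY).map(|raw| P::parse(raw)).transpose()
}

fn main() {
    let props = HashMap::from([("rest.connect-timeout-ms".to_string(), "3000".to_string())]);
    // The caller gets a u64 (or a parse error), not a raw string.
    let timeout = get_property::<ConnectTimeoutMs>(&props);
    assert_eq!(timeout, Ok(Some(3000)));
}
```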
Re: [I] Why call deleteKey for Insert and Update After in Flink BaseDeltaTaskWriter? [iceberg]
SML0127 commented on issue #11081: URL: https://github.com/apache/iceberg/issues/11081#issuecomment-2333065332 @pvary Thank you for your support. It really helped me a lot. Lastly, I have one more question. I am looking at the code in `ColumnarBatchReader.java`. Where can I see that the equality delete file is applied to the previous snapshot and the pos delete file is applied to the current snapshot? 🙇🏻♂️🙇🏻♂️🙇🏻♂️ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
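For reference while reading that code: per the Iceberg spec, the distinction is driven by data sequence numbers rather than an explicit "previous snapshot" flag. A minimal sketch of the two rules (illustrative, not the Java implementation):
```rust
// Equality deletes apply only to rows from data files committed strictly
// before the delete file (smaller data sequence number), i.e. the state as
// of the previous snapshot; position deletes also apply to rows written in
// the same commit (equal sequence number).
fn eq_delete_applies(data_seq: i64, delete_seq: i64) -> bool {
    data_seq < delete_seq
}

fn pos_delete_applies(data_seq: i64, delete_seq: i64) -> bool {
    data_seq <= delete_seq
}

fn main() {
    // A data row and a delete written in the same commit (seq = 4):
    assert!(!eq_delete_applies(4, 4)); // the equality delete does NOT hit it
    assert!(pos_delete_applies(4, 4)); // the position delete DOES hit it
}
```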
Re: [PR] Docs: Document accessing instance variables [iceberg]
manuzhang commented on code in PR #11087: URL: https://github.com/apache/iceberg/pull/11087#discussion_r1746409783 ## site/docs/contribute.md: ## @@ -388,6 +388,34 @@ When passing boolean arguments to existing or external methods, use inline comme dropTable(identifier, purge); ``` + Accessing instance variables Review Comment: Curious whether there is a format rule for this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: implement IcebergTableProviderFactory for datafusion [iceberg-rust]
liurenjie1024 commented on code in PR #600: URL: https://github.com/apache/iceberg-rust/pull/600#discussion_r1746464932 ## crates/integrations/datafusion/src/table/table_provider_factory.rs: ## @@ -0,0 +1,180 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::collections::HashMap;
+use std::sync::Arc;
+
+use async_trait::async_trait;
+use datafusion::catalog::{Session, TableProvider, TableProviderFactory};
+use datafusion::error::Result as DFResult;
+use datafusion::logical_expr::CreateExternalTable;
+use datafusion::sql::TableReference;
+use iceberg::arrow::schema_to_arrow_schema;
+use iceberg::io::FileIO;
+use iceberg::table::StaticTable;
+use iceberg::{Error, ErrorKind, Result, TableIdent};
+
+use super::IcebergTableProvider;
+use crate::to_datafusion_error;
+
+#[derive(Default)]
+#[non_exhaustive]
+pub struct IcebergTableProviderFactory {}
+
+impl IcebergTableProviderFactory {
+    pub fn new() -> Self {
+        Self {}
+    }
+}
+
+#[async_trait]
+impl TableProviderFactory for IcebergTableProviderFactory {
+    async fn create(
+        &self,
+        _state: &dyn Session,
+        cmd: &CreateExternalTable,
+    ) -> DFResult<Arc<dyn TableProvider>> {
+        check_cmd(cmd).map_err(to_datafusion_error)?;
+
+        let table_name = &cmd.name;
+        let metadata_file_path = &cmd.location;
+        let options = &cmd.options;
+
+        let table = create_static_table(table_name, metadata_file_path, options)
+            .await
+            .map_err(to_datafusion_error)?
+            .into_table();
+
+        let schema = schema_to_arrow_schema(table.metadata().current_schema())
+            .map_err(to_datafusion_error)?;
+
+        Ok(Arc::new(IcebergTableProvider::new(table, Arc::new(schema))))
+    }
+}
+
+fn check_cmd(cmd: &CreateExternalTable) -> Result<()> {
+    let CreateExternalTable {
+        schema,
+        table_partition_cols,
+        order_exprs,
+        constraints,
+        column_defaults,
+        ..
+    } = cmd;
+
+    if !schema.fields().is_empty() {
+        return Err(Error::new(
+            ErrorKind::FeatureUnsupported,
+            "Schema specification is not supported",
+        ));
+    }
+
+    if !table_partition_cols.is_empty() {
+        return Err(Error::new(
+            ErrorKind::FeatureUnsupported,
+            "Partition columns cannot be specified",
+        ));
+    }
+
+    if !order_exprs.is_empty() {
+        return Err(Error::new(
+            ErrorKind::FeatureUnsupported,
+            "Ordering clause is not supported",
+        ));
+    }
+
+    if !constraints.is_empty() {
+        return Err(Error::new(
+            ErrorKind::FeatureUnsupported,
+            "Constraints are not supported",
+        ));
+    }
+
+    if !column_defaults.is_empty() {
+        return Err(Error::new(
+            ErrorKind::FeatureUnsupported,
+            "Default values for columns are not supported",
+        ));
+    }
+
+    Ok(())
+}
+
+async fn create_static_table(
+    table_name: &TableReference,
+    metadata_file_path: &str,
+    props: &HashMap<String, String>,
+) -> Result<StaticTable> {
+    let table_ident = TableIdent::from_strs(table_name.to_vec())?;
+    let file_io = FileIO::from_path(metadata_file_path)?
+        .with_props(props)
+        .build()?;
+    StaticTable::from_metadata_file(metadata_file_path, table_ident, file_io).await
+}
+
+#[cfg(test)]
+mod tests {
+
+    use datafusion::catalog::TableProviderFactory;
+    use datafusion::common::{Constraints, DFSchema};
+    use datafusion::logical_expr::CreateExternalTable;
+    use datafusion::prelude::SessionContext;
+    use datafusion::sql::TableReference;
+
+    use super::*;
+
+    #[tokio::test]
+    async fn test_schema_of_created_table() {
+        let factory = IcebergTableProviderFactory::new();
+
+        let metadata_file_path = format!(
+            "{}/testdata/table_metadata/{}",
+            env!("CARGO_MANIFEST_DIR"),
+            "TableMetadataV2.json"
+        );
+
+        let cmd = CreateExternalTable {
+            name: TableReference::partial("static_ns", "static_table"),
+            location: metadata_file_path,
+            schema: Arc::new(DFSchema::empty()),
+            file_type: "iceberg".to_string(),
+            opti
Re: [PR] feat: implement IcebergTableProviderFactory for datafusion [iceberg-rust]
liurenjie1024 commented on code in PR #600: URL: https://github.com/apache/iceberg-rust/pull/600#discussion_r1746466160 ## crates/integrations/datafusion/src/table/table_provider_factory.rs: ## @@ -0,0 +1,180 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::collections::HashMap; +use std::sync::Arc; + +use async_trait::async_trait; +use datafusion::catalog::{Session, TableProvider, TableProviderFactory}; +use datafusion::error::Result as DFResult; +use datafusion::logical_expr::CreateExternalTable; +use datafusion::sql::TableReference; +use iceberg::arrow::schema_to_arrow_schema; +use iceberg::io::FileIO; +use iceberg::table::StaticTable; +use iceberg::{Error, ErrorKind, Result, TableIdent}; + +use super::IcebergTableProvider; +use crate::to_datafusion_error; + +#[derive(Default)] +#[non_exhaustive] +pub struct IcebergTableProviderFactory {} Review Comment: It would be better to provide some doc to explain its usage. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
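As a starting point for such a doc, a hedged sketch of how the factory might be wired into DataFusion. The builder API (`SessionStateBuilder`, `table_factories_mut`) differs across DataFusion versions, and the crate import path here is an assumption:
```rust
use std::sync::Arc;

use datafusion::execution::session_state::SessionStateBuilder;
use datafusion::prelude::SessionContext;
// Assumed import path for the factory added in this PR.
use iceberg_datafusion::IcebergTableProviderFactory;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Register the factory under the file type used in CREATE EXTERNAL TABLE.
    let mut state = SessionStateBuilder::new().with_default_features().build();
    state
        .table_factories_mut()
        .insert("ICEBERG".to_string(), Arc::new(IcebergTableProviderFactory::new()));

    let ctx = SessionContext::new_with_state(state);

    // LOCATION points at an existing Iceberg metadata JSON file; the table
    // is read-only (a StaticTable) from DataFusion's point of view.
    ctx.sql(
        "CREATE EXTERNAL TABLE t STORED AS ICEBERG \
         LOCATION 's3://bucket/warehouse/db/t/metadata/v3.metadata.json'",
    )
    .await?;

    let batches = ctx.sql("SELECT * FROM t LIMIT 10").await?.collect().await?;
    println!("{} batches", batches.len());
    Ok(())
}
```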
Re: [PR] Docs: Document accessing instance variables [iceberg]
aokolnychyi commented on code in PR #11087: URL: https://github.com/apache/iceberg/pull/11087#discussion_r1746503833 ## site/docs/contribute.md: ## @@ -388,6 +388,34 @@ When passing boolean arguments to existing or external methods, use inline comme dropTable(identifier, purge); ``` + Accessing instance variables Review Comment: I wanted to look into ways to automate this after documenting it. I took a look now: there doesn't seem to be an existing Checkstyle rule that would enforce the behavior we are looking for. We could try something with regular expressions, but it may be hard to capture the context. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
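To illustrate why a line-based regex check struggles here, a small hypothetical sketch (using Rust's `regex` crate purely for illustration, and assuming for the example a rule of the form "use `this.` when assigning, bare names when reading"): it can separate same-line writes from reads, but it cannot see scope, shadowing, or statements split across lines.
```rust
use regex::Regex;

fn main() {
    // Any use of an instance variable through `this.`
    let this_use = Regex::new(r"this\.\w+").unwrap();
    // A `this.field = value` assignment (but not `==` comparisons).
    let this_assign = Regex::new(r"this\.\w+\s*=[^=]").unwrap();

    for line in [
        "this.specId = specId;",        // assignment: conventional
        "return this.specId;",          // read through this.: flagged
        "if (this.specId == other) {}", // comparison, not an assignment
    ] {
        let flagged = this_use.is_match(line) && !this_assign.is_match(line);
        println!("{flagged:5} | {line}");
    }
}
```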
Re: [PR] Table Scan: Add Row Selection Filtering [iceberg-rust]
sdd commented on code in PR #565: URL: https://github.com/apache/iceberg-rust/pull/565#discussion_r1746573022 ## crates/iceberg/src/expr/visitors/page_index_evaluator.rs: ## @@ -0,0 +1,1491 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Evaluates predicates against a Parquet Page Index
+
+use std::collections::HashMap;
+
+use fnv::FnvHashSet;
+use ordered_float::OrderedFloat;
+use parquet::arrow::arrow_reader::{RowSelection, RowSelector};
+use parquet::file::metadata::RowGroupMetaData;
+use parquet::file::page_index::index::Index;
+use parquet::format::PageLocation;
+
+use crate::expr::visitors::bound_predicate_visitor::{visit, BoundPredicateVisitor};
+use crate::expr::{BoundPredicate, BoundReference};
+use crate::spec::{Datum, PrimitiveLiteral, PrimitiveType, Schema};
+use crate::{Error, ErrorKind, Result};
+
+type OffsetIndex = Vec<Vec<PageLocation>>;
+
+const IN_PREDICATE_LIMIT: usize = 200;
+
+enum MissingColBehavior {
+    CantMatch,
+    MightMatch,
+}
+
+enum PageNullCount {
+    AllNull,
+    NoneNull,
+    SomeNull,
+    Unknown,
+}
+
+impl PageNullCount {
+    fn from_row_and_null_counts(num_rows: usize, null_count: Option<i64>) -> Self {
+        match (num_rows, null_count) {
+            (x, Some(y)) if x == y as usize => PageNullCount::AllNull,
+            (_, Some(0)) => PageNullCount::NoneNull,
+            (_, Some(_)) => PageNullCount::SomeNull,
+            _ => PageNullCount::Unknown,
+        }
+    }
+}
+
+pub(crate) struct PageIndexEvaluator<'a> {
+    column_index: &'a [Index],
+    offset_index: &'a OffsetIndex,
+    row_group_metadata: &'a RowGroupMetaData,
+    iceberg_field_id_to_parquet_column_index: &'a HashMap<i32, usize>,
+    snapshot_schema: &'a Schema,
+}
+
+impl<'a> PageIndexEvaluator<'a> {
+    pub(crate) fn new(
+        column_index: &'a [Index],
+        offset_index: &'a OffsetIndex,
+        row_group_metadata: &'a RowGroupMetaData,
+        field_id_map: &'a HashMap<i32, usize>,
+        snapshot_schema: &'a Schema,
+    ) -> Self {
+        Self {
+            column_index,
+            offset_index,
+            row_group_metadata,
+            iceberg_field_id_to_parquet_column_index: field_id_map,
+            snapshot_schema,
+        }
+    }
+
+    /// Evaluate this `PageIndexEvaluator`'s filter predicate against a
+    /// specific page's column index entry in a parquet file's page index.
+    /// [`ArrowReader`] uses the resulting [`RowSelection`] to reject
+    /// pages within a parquet file's row group that cannot contain rows
+    /// matching the filter predicate.
+    pub(crate) fn eval(
+        filter: &'a BoundPredicate,
+        column_index: &'a [Index],
+        offset_index: &'a OffsetIndex,
+        row_group_metadata: &'a RowGroupMetaData,
+        field_id_map: &'a HashMap<i32, usize>,
+        snapshot_schema: &'a Schema,
+    ) -> Result<Vec<RowSelector>> {
+        if row_group_metadata.num_rows() == 0 {
+            return Ok(vec![]);
+        }
+
+        let mut evaluator = Self::new(
+            column_index,
+            offset_index,
+            row_group_metadata,
+            field_id_map,
+            snapshot_schema,
+        );
+
+        Ok(visit(&mut evaluator, filter)?.iter().copied().collect())
+    }
+
+    fn select_all_rows(&self) -> Result<RowSelection> {
+        Ok(vec![RowSelector::select(
+            self.row_group_metadata.num_rows() as usize
+        )]
+        .into())
+    }
+
+    fn skip_all_rows(&self) -> Result<RowSelection> {
+        Ok(vec![].into())
+    }
+
+    fn calc_row_selection<F>(
+        &self,
+        field_id: i32,
+        predicate: F,
+        missing_col_behavior: MissingColBehavior,
+    ) -> Result<RowSelection>
+    where
+        F: Fn(Option<Datum>, Option<Datum>, PageNullCount) -> Result<RowSelection>,
+    {
+        let Some(&parquet_column_index) =
+            self.iceberg_field_id_to_parquet_column_index.get(&field_id)
+        else {
+            // if the snapshot's column is not present in the row group,
+            // exit early
+            return match missing_col_behavior {
+                MissingColBehavior::CantMatch => self.skip_all_rows(),
+                MissingColBehavior::MightMatch => self.select_all_rows(),
+            };
+        };
+
+        let Some(field) = self.snapshot_schema.field_by_id(field_id) else
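As background for the evaluator above, a hedged sketch of bookkeeping it depends on: the offset index stores no per-page row counts directly, so the count for a page falls out of consecutive `first_row_index` values, with the row group's total row count bounding the last page. Illustrative only, not the PR's code:
```rust
use parquet::format::PageLocation;

// Row count of page i = first_row_index of page i+1 minus its own
// first_row_index; the row group's total row count closes the last page.
fn page_row_counts(pages: &[PageLocation], row_group_num_rows: i64) -> Vec<i64> {
    (0..pages.len())
        .map(|i| {
            let start = pages[i].first_row_index;
            let end = pages
                .get(i + 1)
                .map(|next| next.first_row_index)
                .unwrap_or(row_group_num_rows);
            end - start
        })
        .collect()
}
```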
Re: [PR] Table Scan: Add Row Selection Filtering [iceberg-rust]
sdd commented on code in PR #565: URL: https://github.com/apache/iceberg-rust/pull/565#discussion_r1746597564 ## crates/iceberg/src/expr/visitors/page_index_evaluator.rs: ## @@ -0,0 +1,1491 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Evaluates predicates against a Parquet Page Index
+
+use std::collections::HashMap;
+
+use fnv::FnvHashSet;
+use ordered_float::OrderedFloat;
+use parquet::arrow::arrow_reader::{RowSelection, RowSelector};
+use parquet::file::metadata::RowGroupMetaData;
+use parquet::file::page_index::index::Index;
+use parquet::format::PageLocation;
+
+use crate::expr::visitors::bound_predicate_visitor::{visit, BoundPredicateVisitor};
+use crate::expr::{BoundPredicate, BoundReference};
+use crate::spec::{Datum, PrimitiveLiteral, PrimitiveType, Schema};
+use crate::{Error, ErrorKind, Result};
+
+type OffsetIndex = Vec<Vec<PageLocation>>;
+
+const IN_PREDICATE_LIMIT: usize = 200;
+
+enum MissingColBehavior {
+    CantMatch,
+    MightMatch,
+}
+
+enum PageNullCount {
+    AllNull,
+    NoneNull,
+    SomeNull,
+    Unknown,
+}
+
+impl PageNullCount {
+    fn from_row_and_null_counts(num_rows: usize, null_count: Option<i64>) -> Self {
+        match (num_rows, null_count) {
+            (x, Some(y)) if x == y as usize => PageNullCount::AllNull,
+            (_, Some(0)) => PageNullCount::NoneNull,
+            (_, Some(_)) => PageNullCount::SomeNull,
+            _ => PageNullCount::Unknown,
+        }
+    }
+}
+
+pub(crate) struct PageIndexEvaluator<'a> {
+    column_index: &'a [Index],
+    offset_index: &'a OffsetIndex,
+    row_group_metadata: &'a RowGroupMetaData,
+    iceberg_field_id_to_parquet_column_index: &'a HashMap<i32, usize>,
+    snapshot_schema: &'a Schema,
+}
+
+impl<'a> PageIndexEvaluator<'a> {
+    pub(crate) fn new(
+        column_index: &'a [Index],
+        offset_index: &'a OffsetIndex,
+        row_group_metadata: &'a RowGroupMetaData,
+        field_id_map: &'a HashMap<i32, usize>,
+        snapshot_schema: &'a Schema,
+    ) -> Self {
+        Self {
+            column_index,
+            offset_index,
+            row_group_metadata,
+            iceberg_field_id_to_parquet_column_index: field_id_map,
+            snapshot_schema,
+        }
+    }
+
+    /// Evaluate this `PageIndexEvaluator`'s filter predicate against a
+    /// specific page's column index entry in a parquet file's page index.
+    /// [`ArrowReader`] uses the resulting [`RowSelection`] to reject
+    /// pages within a parquet file's row group that cannot contain rows
+    /// matching the filter predicate.
+    pub(crate) fn eval(
+        filter: &'a BoundPredicate,
+        column_index: &'a [Index],
+        offset_index: &'a OffsetIndex,
+        row_group_metadata: &'a RowGroupMetaData,
+        field_id_map: &'a HashMap<i32, usize>,
+        snapshot_schema: &'a Schema,
+    ) -> Result<Vec<RowSelector>> {
+        if row_group_metadata.num_rows() == 0 {
+            return Ok(vec![]);
+        }
+
+        let mut evaluator = Self::new(
+            column_index,
+            offset_index,
+            row_group_metadata,
+            field_id_map,
+            snapshot_schema,
+        );
+
+        Ok(visit(&mut evaluator, filter)?.iter().copied().collect())
+    }
+
+    fn select_all_rows(&self) -> Result<RowSelection> {
+        Ok(vec![RowSelector::select(
+            self.row_group_metadata.num_rows() as usize
+        )]
+        .into())
+    }
+
+    fn skip_all_rows(&self) -> Result<RowSelection> {
+        Ok(vec![].into())
Review Comment: Good spot - fixed. I think the original would work, but your suggestion looks more intuitive to anyone reading the code. ## crates/iceberg/src/expr/visitors/page_index_evaluator.rs: ## @@ -0,0 +1,1491 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed t
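To close the loop on the `skip_all_rows` exchange above, a small illustrative sketch of how `RowSelector` runs compose into a `RowSelection` (the types come from the `parquet` crate, matching the imports in the diff):
```rust
use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

fn main() {
    // Selectors are run-length instructions applied in order to the rows of
    // a row group: select the first 100 rows, skip 50, select 50 more.
    let selection: RowSelection = vec![
        RowSelector::select(100),
        RowSelector::skip(50),
        RowSelector::select(50),
    ]
    .into();

    // Mirroring the helpers in the diff: `select_all_rows` emits one big
    // select run, while `skip_all_rows` returns an empty selector list,
    // which the reader treats as selecting nothing.
    let select_all: RowSelection = vec![RowSelector::select(200)].into();
    let skip_all: RowSelection = vec![].into();

    println!("{} rows selected", selection.row_count());
    let _ = (select_all, skip_all);
}
```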