Re: [PR] Flink - Fix incorrect / old row being written into delta files when using upsert mode [iceberg]

2024-09-05 Thread via GitHub


pvary commented on code in PR #4364:
URL: https://github.com/apache/iceberg/pull/4364#discussion_r1744988793


##
flink/v1.14/flink/src/main/java/org/apache/iceberg/flink/sink/BaseDeltaTaskWriter.java:
##
@@ -74,7 +80,7 @@ public void write(RowData row) throws IOException {
   case INSERT:
   case UPDATE_AFTER:
 if (upsert) {
-  writer.delete(row);
+  writer.deleteKey(deleteSchemaProjection.wrap(row));

Review Comment:
   There is no way to distinguish why the equality delete was written, 
especially as the equality delete could have been written by other engines.
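   For intuition, here is a hedged, self-contained sketch (toy types, not the 
Iceberg API) of why the fix switches from `delete(row)` to `deleteKey(...)`: 
deleting by full row only removes a row that matches in every column, while 
deleting by key only needs the equality (key) columns to match.

   ```java
   import java.util.ArrayList;
   import java.util.List;

   // Hypothetical sketch, not the Iceberg API: contrasts delete(row) with
   // deleteKey(key) semantics for an upsert.
   class EqualityDeleteSketch {
     static class Row {
       final int key;
       final String value;
       Row(int key, String value) { this.key = key; this.value = value; }
     }

     // upsert with deleteKey semantics: drop any row with the same key, then insert
     static List<Row> upsertByKey(List<Row> table, Row incoming) {
       List<Row> result = new ArrayList<>(table);
       result.removeIf(r -> r.key == incoming.key);
       result.add(incoming);
       return result;
     }

     // upsert with delete(row) semantics: removes only a row equal in every
     // column, so updating a row whose old value differs leaves the stale row
     static List<Row> upsertByFullRow(List<Row> table, Row incoming) {
       List<Row> result = new ArrayList<>(table);
       result.removeIf(r -> r.key == incoming.key && r.value.equals(incoming.value));
       result.add(incoming);
       return result;
     }
   }
   ```

   With a stale row `(1, "old")` in the table, an upsert of `(1, "new")` keyed 
on the id leaves one row, while the full-row variant leaves two, which is the 
duplicate-row bug the PR fixes.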



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Why call deleteKey for Insert and Update After in Flink BaseDeltaTaskWriter? [iceberg]

2024-09-05 Thread via GitHub


SML0127 commented on issue #11081:
URL: https://github.com/apache/iceberg/issues/11081#issuecomment-2330951529

   @pvary 
   Thanks for the answer, @pvary.
   Now I'm checking some code in ColumnarBatchReader to see whether it works as follows:
   
   1. apply equality delete files first, from the oldest snapshot
   2. then apply the data file and position delete files
   3. repeat from step 1 for the next snapshot
   
   I'm still looking into it, so this could be wrong.
   
   ```java
   ColumnarBatch loadDataToColumnBatch() {
     int numRowsUndeleted = initRowIdMapping();

     ColumnVector[] arrowColumnVectors = readDataToColumnVectors();

     ColumnarBatch newColumnarBatch = new ColumnarBatch(arrowColumnVectors);

     newColumnarBatch.setNumRows(numRowsUndeleted);

     if (hasEqDeletes()) {  // double-check whether there are equality deletes?
       applyEqDelete(newColumnarBatch);
     }

     if (hasIsDeletedColumn && rowIdMapping != null) {
       // reset the row id mapping array, so that it doesn't filter out the deleted rows
       for (int i = 0; i < numRowsToRead; i++) {
         rowIdMapping[i] = i;
       }
       newColumnarBatch.setNumRows(numRowsToRead);
     }

     return newColumnarBatch;
   }
   ```
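   For intuition, a hedged sketch of what `applyEqDelete` does to the row id 
mapping (simplified from the snippet above, not the exact Spark code): rows 
matched by an equality delete are dropped by compacting the mapping in place 
and shrinking the batch's row count.

   ```java
   // Simplified sketch of applyEqDelete's effect on the row id mapping:
   // keep only row ids whose rows are not equality-deleted, compacting the
   // array in place and returning the new row count for setNumRows(...).
   class RowIdMappingSketch {
     static int applyEqDelete(int[] rowIdMapping, int numRows, boolean[] isDeleted) {
       int currentRowId = 0;
       for (int i = 0; i < numRows; i++) {
         int rowId = rowIdMapping[i];
         if (!isDeleted[rowId]) {
           rowIdMapping[currentRowId] = rowId; // keep this undeleted row
           currentRowId++;
         }
       }
       return currentRowId; // the value later passed to setNumRows(...)
     }
   }
   ```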





Re: [I] Flink: multiple sinks for the different iceberg tables in the same job? [iceberg]

2024-09-05 Thread via GitHub


chenwyi2 closed issue #11074: Flink: multiple sinks for the different iceberg 
tables in the same job?
URL: https://github.com/apache/iceberg/issues/11074





Re: [I] Flink: multiple sinks for the different iceberg tables in the same job? [iceberg]

2024-09-05 Thread via GitHub


chenwyi2 commented on issue #11074:
URL: https://github.com/apache/iceberg/issues/11074#issuecomment-2331046697

   Yes, that's not an Iceberg problem but an issue with our platform, thanks.





[I] Caused by: java.lang.RuntimeException: org.apache.flink.runtime.JobException: Creating the input splits caused an error: Failed to refresh the table [iceberg]

2024-09-05 Thread via GitHub


Littlehhao opened a new issue, #11082:
URL: https://github.com/apache/iceberg/issues/11082

   ### Query engine
   
   _No response_
   
   ### Question
   
   environment:
   
   flink-standalone: 1.17.1
   hadoop: 3.1.0
   
   select * from sample;
   
   Caused by: org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster.
   at org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97) ~[flink-dist-1.17.1.jar:1.17.1]
   at java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:1.8.0_422]
   at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:1.8.0_422]
   at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:1.8.0_422]
   at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) ~[?:1.8.0_422]
   at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:1.8.0_422]
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:1.8.0_422]
   at java.lang.Thread.run(Unknown Source) ~[?:1.8.0_422]
   Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: org.apache.flink.runtime.JobException: Creating the input splits caused an error: Failed to refresh the table
   at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:1.8.0_422]
   at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:1.8.0_422]
   at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) ~[?:1.8.0_422]
   at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:1.8.0_422]
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:1.8.0_422]
   at java.lang.Thread.run(Unknown Source) ~[?:1.8.0_422]
   Caused by: java.lang.RuntimeException: org.apache.flink.runtime.JobException: Creating the input splits caused an error: Failed to refresh the table
   at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:321) ~[flink-dist-1.17.1.jar:1.17.1]
   at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:114) ~[flink-dist-1.17.1.jar:1.17.1]
   at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) ~[?:1.8.0_422]
   at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:1.8.0_422]
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:1.8.0_422]
   at java.lang.Thread.run(Unknown Source) ~[?:1.8.0_422]
   Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: Failed to refresh the table





Re: [PR] Kafka Connect: Include third party licenses and notices in distribution [iceberg]

2024-09-05 Thread via GitHub


ajantha-bhat commented on code in PR #10829:
URL: https://github.com/apache/iceberg/pull/10829#discussion_r1745249956


##
kafka-connect/kafka-connect-runtime/NOTICE:
##
@@ -0,0 +1,1723 @@
+
+Apache Iceberg
+Copyright 2017-2024 The Apache Software Foundation
+
+This product includes software developed at
+The Apache Software Foundation (http://www.apache.org/).
+
+
+
+This project includes code from Kite, developed at Cloudera, Inc. with
+the following copyright notice:
+
+| Copyright 2013 Cloudera Inc.
+|
+| Licensed under the Apache License, Version 2.0 (the "License");
+| you may not use this file except in compliance with the License.
+| You may obtain a copy of the License at
+|
+|   http://www.apache.org/licenses/LICENSE-2.0
+|
+| Unless required by applicable law or agreed to in writing, software
+| distributed under the License is distributed on an "AS IS" BASIS,
+| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+| See the License for the specific language governing permissions and
+| limitations under the License.
+
+
+
+This binary artifact contains code from the following projects:
+
+
+
+Group: org.apache.commons  Name: commons-math3  Version: 3.1.1
+
+Notice: Apache Commons Math
+Copyright 2001-2012 The Apache Software Foundation
+
+This product includes software developed by
+The Apache Software Foundation (http://www.apache.org/).
+
+===
+
+The BracketFinder (package org.apache.commons.math3.optimization.univariate)

Review Comment:
   I don't see any direct dependency on this.
   
   I am not even sure this is the right way to generate the NOTICE and LICENSE, 
because it looks very different from the Spark runtime, for example.
   
   Maybe we first need to document in `contribute.md` how to generate the NOTICE 
and LICENSE for modules (and also mention when we need to add them, e.g. only 
for runtime jar modules), and then follow that process.







Re: [I] Why call deleteKey for Insert and Update After in Flink BaseDeltaTaskWriter? [iceberg]

2024-09-05 Thread via GitHub


SML0127 commented on issue #11081:
URL: https://github.com/apache/iceberg/issues/11081#issuecomment-2331421137

   @pvary 
   Sorry for asking a vague question.
   But thanks to your answer, I now understand how data files and delete files work.
   That is,
   - equality delete files are valid only up to the previous snapshot (applied to the previous snapshot)
   - position delete files are valid up to the current snapshot (applied to the current (latest) snapshot)
   
   Please check whether I understand correctly 🙏🙏
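   If it helps, the rule being described can be stated with data sequence 
numbers; the following is a hedged sketch of the applicability rule from the 
Iceberg table spec (simplified — real readers also check partition scope and 
equality field ids), not the reader implementation itself:

   ```java
   // Delete-file applicability by data sequence number (simplified):
   // - an equality delete applies to data files with a STRICTLY SMALLER
   //   sequence number, i.e. rows committed before the delete
   // - a position delete applies to data files with a smaller OR EQUAL
   //   sequence number, so it can target rows written in the same commit
   class DeleteApplicability {
     static boolean eqDeleteApplies(long deleteSeq, long dataSeq) {
       return dataSeq < deleteSeq;
     }

     static boolean posDeleteApplies(long deleteSeq, long dataSeq) {
       return dataSeq <= deleteSeq;
     }
   }
   ```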





Re: [PR] feat: support projection pushdown for datafusion iceberg [iceberg-rust]

2024-09-05 Thread via GitHub


FANNG1 commented on code in PR #594:
URL: https://github.com/apache/iceberg-rust/pull/594#discussion_r1745529020


##
crates/integrations/datafusion/src/physical_plan/scan.rs:
##
@@ -138,3 +156,18 @@ async fn get_batch_stream(
 
 Ok(Box::pin(stream))
 }
+
fn get_column_names(schema: ArrowSchemaRef, projection: Option<&Vec<usize>>) -> Vec<String> {

Review Comment:
   done



##
crates/integrations/datafusion/src/physical_plan/scan.rs:
##
@@ -116,7 +125,11 @@ impl DisplayAs for IcebergTableScan {
 _t: datafusion::physical_plan::DisplayFormatType,
 f: &mut std::fmt::Formatter,
 ) -> std::fmt::Result {
-write!(f, "IcebergTableScan")
+write!(
+f,
+"IcebergTableScan projection:[{}]",
+self.projection.join(" ")

Review Comment:
   done



##
crates/examples/src/datafusion_read_data.rs:
##


Review Comment:
   done






Re: [PR] feat: support projection pushdown for datafusion iceberg [iceberg-rust]

2024-09-05 Thread via GitHub


FANNG1 commented on PR #594:
URL: https://github.com/apache/iceberg-rust/pull/594#issuecomment-2331731075

   @liurenjie1024, all comments are addressed; please help review again, thanks.





Re: [PR] Kafka Connect: Docs on configuring the sink [iceberg]

2024-09-05 Thread via GitHub


bryanck commented on code in PR #10746:
URL: https://github.com/apache/iceberg/pull/10746#discussion_r1745578149


##
docs/docs/kafka-connect.md:
##
@@ -0,0 +1,352 @@
+---
+title: "Kafka Connect"
+---
+
+
+# Kafka Connect
+
+[Kafka Connect](https://docs.confluent.io/platform/current/connect/index.html) 
is a popular framework for moving data
+in and out of Kafka via connectors. There are many different 
connectors available, such as the S3 sink
+for writing data from Kafka to S3 and Debezium source connectors for writing 
change data capture records from relational
+databases to Kafka.
+
+It has a straightforward, decentralized, distributed architecture. A cluster 
consists of a number of worker processes,
+and a connector runs tasks on these processes to perform the work. Connector 
deployment is configuration driven, so
+generally no code needs to be written to run a connector.
+
+## Apache Iceberg Sink Connector
+
+The Apache Iceberg Sink Connector for Kafka Connect is a sink connector for 
writing data from Kafka into Iceberg tables.
+
+## Features
+
+* Commit coordination for centralized Iceberg commits
+* Exactly-once delivery semantics
+* Multi-table fan-out
+* Automatic table creation and schema evolution
+* Field name mapping via Iceberg’s column mapping functionality
+
+## Installation
+
+The connector zip archive is created as part of the Iceberg build. You can run 
the build via:
+```bash
+./gradlew -xtest -xintegrationTest clean build
+```
+The zip archive will be found under 
`./kafka-connect/kafka-connect-runtime/build/distributions`. There is
+one distribution that bundles the Hive Metastore client and related 
dependencies, and one that does not.
+Copy the distribution archive into the Kafka Connect plugins directory on all 
nodes.
+
+## Requirements
+
+The sink relies on 
[KIP-447](https://cwiki.apache.org/confluence/display/KAFKA/KIP-447%3A+Producer+scalability+for+exactly+once+semantics)
+for exactly-once semantics. This requires Kafka 2.5 or later.
+
+## Configuration
+
+| Property                                    | Description                                                                                                   |
+|----------------------------------------------|---------------------------------------------------------------------------------------------------------------|
+| iceberg.tables                               | Comma-separated list of destination tables                                                                    |
+| iceberg.tables.dynamic-enabled               | Set to `true` to route to a table specified in `routeField` instead of using `routeRegex`, default is `false` |
+| iceberg.tables.route-field                   | For multi-table fan-out, the name of the field used to route records to tables                                |
+| iceberg.tables.default-commit-branch         | Default branch for commits, main is used if not specified                                                     |
+| iceberg.tables.default-id-columns            | Default comma-separated list of columns that identify a row in tables (primary key)                           |
+| iceberg.tables.default-partition-by          | Default comma-separated list of partition field names to use when creating tables                             |
+| iceberg.tables.auto-create-enabled           | Set to `true` to automatically create destination tables, default is `false`                                  |
+| iceberg.tables.evolve-schema-enabled         | Set to `true` to add any missing record fields to the table schema, default is `false`                        |
+| iceberg.tables.schema-force-optional         | Set to `true` to set columns as optional during table create and evolution, default is `false` to respect schema |
+| iceberg.tables.schema-case-insensitive       | Set to `true` to look up table columns by case-insensitive name, default is `false` for case-sensitive        |
+| iceberg.tables.auto-create-props.*           | Properties set on new tables during auto-create                                                               |
+| iceberg.tables.write-props.*                 | Properties passed through to Iceberg writer initialization, these take precedence                             |
+| iceberg.table.\<table name\>.commit-branch   | Table-specific branch for commits, use `iceberg.tables.default-commit-branch` if not specified                |
+| iceberg.table.\<table name\>.id-columns      | Comma-separated list of columns that identify a row in the table (primary key)                                |
+| iceberg.table.\<table name\>.partition-by    | Comma-separated list of partition fields to use when creating the table                                       |
+| iceberg.table.\<table name\>.route-regex     | The regex used to match a record's `routeField` to a table                                                    |
+| iceberg.control.topic
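
   For illustration, a minimal, hypothetical sink configuration using a few of 
the properties above might look as follows (the topic, table, and catalog 
values here are examples, not defaults):

   ```json
   {
     "name": "events-sink",
     "config": {
       "connector.class": "org.apache.iceberg.connect.IcebergSinkConnector",
       "topics": "events",
       "iceberg.tables": "db.events",
       "iceberg.tables.auto-create-enabled": "true",
       "iceberg.catalog.type": "rest",
       "iceberg.catalog.uri": "http://localhost:8181"
     }
   }
   ```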

Re: [PR] Kafka Connect: increase timeout for integration test [iceberg]

2024-09-05 Thread via GitHub


bryanck commented on PR #11075:
URL: https://github.com/apache/iceberg/pull/11075#issuecomment-2331804459

   > Can we create a separate CI for kafka connect?
   
   Sure let me add that to this PR.





Re: [PR] Bump cryptography from 43.0.0 to 43.0.1 [iceberg-python]

2024-09-05 Thread via GitHub


Fokko merged PR #1130:
URL: https://github.com/apache/iceberg-python/pull/1130





Re: [PR] Bump mkdocs-material from 9.5.33 to 9.5.34 [iceberg-python]

2024-09-05 Thread via GitHub


Fokko merged PR #1126:
URL: https://github.com/apache/iceberg-python/pull/1126





Re: [PR] Make `commit_table` public [iceberg-python]

2024-09-05 Thread via GitHub


Fokko commented on PR #1112:
URL: https://github.com/apache/iceberg-python/pull/1112#issuecomment-2331938676

   @sungwy No problem at all, I've pulled in latest master 👍 





Re: [PR] MR: iceberg storage handler should set common projection pruning config [iceberg]

2024-09-05 Thread via GitHub


ludlows closed pull request #10188: MR: iceberg storage handler should set 
common projection pruning config
URL: https://github.com/apache/iceberg/pull/10188





Re: [PR] open-api: Fix compile warnings for testFixtures [iceberg]

2024-09-05 Thread via GitHub


danielcweeks merged PR #11071:
URL: https://github.com/apache/iceberg/pull/11071





Re: [PR] Spark 3.5: Mandate identifier fields when create_changelog_view for table contain unsortable columns [iceberg]

2024-09-05 Thread via GitHub


dramaticlly commented on code in PR #11045:
URL: https://github.com/apache/iceberg/pull/11045#discussion_r1745898693


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/CreateChangelogViewProcedure.java:
##
@@ -146,10 +147,16 @@ public InternalRow[] call(InternalRow args) {
 Dataset df = loadRows(changelogTableIdent, options(input));
 
 boolean netChanges = input.asBoolean(NET_CHANGES, false);
+String[] identifierColumns = identifierColumns(input, tableIdent);
+Preconditions.checkArgument(
+identifierColumns.length > 0
+|| Arrays.stream(df.schema().fields())
+.allMatch(field -> OrderUtils.isOrderable(field.dataType())),
+"Identifier field is required if table contains unorderable columns");

Review Comment:
   thank you @flyrain and @karuppayya, that's a good point about relying on the 
existing error message; I added the unorderable column name as suggested.
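   A hedged sketch of the validation being discussed (illustrative names, not 
the actual `CreateChangelogViewProcedure` code): identifier columns become 
mandatory when any column type is unorderable, and the error message names the 
offending columns.

   ```java
   import java.util.List;
   import java.util.Map;
   import java.util.stream.Collectors;

   // Toy model: fieldOrderable maps column name -> whether its type is orderable.
   class ChangelogViewCheckSketch {
     static List<String> unorderableColumns(Map<String, Boolean> fieldOrderable) {
       return fieldOrderable.entrySet().stream()
           .filter(e -> !e.getValue())
           .map(Map.Entry::getKey)
           .sorted()
           .collect(Collectors.toList());
     }

     // Fail fast, naming the unorderable columns, when no identifier columns
     // are supplied for a table that contains them.
     static void check(List<String> identifierColumns, Map<String, Boolean> fieldOrderable) {
       List<String> bad = unorderableColumns(fieldOrderable);
       if (identifierColumns.isEmpty() && !bad.isEmpty()) {
         throw new IllegalArgumentException(
             "Identifier field is required as table contains unorderable columns: " + bad);
       }
     }
   }
   ```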






Re: [PR] Spark 3.3, 3.4: Parallelize reading files in migrate procedures [iceberg]

2024-09-05 Thread via GitHub


amogh-jahagirdar commented on PR #11043:
URL: https://github.com/apache/iceberg/pull/11043#issuecomment-2332252827

   Thanks @manuzhang! 





Re: [PR] Spark 3.3, 3.4: Parallelize reading files in migrate procedures [iceberg]

2024-09-05 Thread via GitHub


amogh-jahagirdar merged PR #11043:
URL: https://github.com/apache/iceberg/pull/11043





Re: [I] Rust <> Python integration point [iceberg-rust]

2024-09-05 Thread via GitHub


kevinjqliu commented on issue #538:
URL: https://github.com/apache/iceberg-rust/issues/538#issuecomment-2332259085

   Looks like @sungwy already started by exposing Transforms in #556 
   
   I'll take a stab at exposing the Catalogs, see 
https://github.com/apache/iceberg-rust/pull/534#issuecomment-2330489500





Re: [PR] Support python 3.12 [iceberg-python]

2024-09-05 Thread via GitHub


Fokko commented on code in PR #1068:
URL: https://github.com/apache/iceberg-python/pull/1068#discussion_r1745925508


##
pyproject.toml:
##
@@ -611,6 +615,8 @@ filterwarnings = [
   "ignore:unclosed 

Re: [I] Add support for Python 3.12 [iceberg-python]

2024-09-05 Thread via GitHub


kevinjqliu closed issue #28: Add support for Python 3.12
URL: https://github.com/apache/iceberg-python/issues/28





Re: [PR] Kafka Connect: Include third party licenses and notices in distribution [iceberg]

2024-09-05 Thread via GitHub


danielcweeks commented on code in PR #10829:
URL: https://github.com/apache/iceberg/pull/10829#discussion_r1745961589


##
kafka-connect/kafka-connect-runtime/NOTICE:
##
@@ -0,0 +1,1723 @@
+
+Apache Iceberg
+Copyright 2017-2024 The Apache Software Foundation
+
+This product includes software developed at
+The Apache Software Foundation (http://www.apache.org/).
+
+
+
+This project includes code from Kite, developed at Cloudera, Inc. with

Review Comment:
   I'm not sure if this is right.  Looking at the [Kite 
NOTICE](https://github.com/kite-sdk/kite/blob/master/NOTICE.txt), it has 
entries I don't see here.  (I didn't check all dependencies.  The first one 
appears to have an issue, so want to make sure this was generated correctly)






Re: [PR] Spark 3.5: Mandate identifier fields when create_changelog_view for table contain unsortable columns [iceberg]

2024-09-05 Thread via GitHub


flyrain merged PR #11045:
URL: https://github.com/apache/iceberg/pull/11045





Re: [PR] Spark 3.5: Mandate identifier fields when create_changelog_view for table contain unsortable columns [iceberg]

2024-09-05 Thread via GitHub


flyrain commented on PR #11045:
URL: https://github.com/apache/iceberg/pull/11045#issuecomment-2332359477

   Thanks @dramaticlly for the change, and thanks @karuppayya @anigos 
@huaxingao for the review.





Re: [PR] Kafka Connect: Include third party licenses and notices in distribution [iceberg]

2024-09-05 Thread via GitHub


bryanck commented on code in PR #10829:
URL: https://github.com/apache/iceberg/pull/10829#discussion_r1745974177


##
kafka-connect/kafka-connect-runtime/NOTICE:
##
@@ -0,0 +1,1723 @@
+
+Apache Iceberg
+Copyright 2017-2024 The Apache Software Foundation
+
+This product includes software developed at
+The Apache Software Foundation (http://www.apache.org/).
+
+
+
+This project includes code from Kite, developed at Cloudera, Inc. with

Review Comment:
   I just appended notices to the root project notice. I could remove that if 
we don't need it. That's true of the cloud bundle notices also.






[PR] Expose `Catalog` trait as python binding [iceberg-rust]

2024-09-05 Thread via GitHub


kevinjqliu opened a new pull request, #604:
URL: https://github.com/apache/iceberg-rust/pull/604

   #538
   
   





Re: [I] New PR label [ready for review] [iceberg-python]

2024-09-05 Thread via GitHub


sungwy commented on issue #1123:
URL: 
https://github.com/apache/iceberg-python/issues/1123#issuecomment-2332404830

   As a counterargument, what's the difference between using this label versus opening a PR in Draft mode?





Re: [I] Cannot commit identity partition on datatypes time,timestamp* using 'fromPartitionString' [iceberg]

2024-09-05 Thread via GitHub


mderoy commented on issue #11085:
URL: https://github.com/apache/iceberg/issues/11085#issuecomment-2332466946

   We'll contribute on this in a few weeks when one of our developers gets back from vacation.
   
   https://github.com/apache/iceberg/commit/770f84325f2810bae3b48a6e1983d6a8135cb7bf seems like something that can guide us to an implementation, but if there is any other background information that would be helpful, we'd love to hear it :) 





[PR] Remove deprecated `datetime` functions [iceberg-python]

2024-09-05 Thread via GitHub


hussein-awala opened a new pull request, #1134:
URL: https://github.com/apache/iceberg-python/pull/1134

   closes: #1133





Re: [I] Inconsistent row count across versions [iceberg-python]

2024-09-05 Thread via GitHub


dev-goyal commented on issue #1132:
URL: 
https://github.com/apache/iceberg-python/issues/1132#issuecomment-2332547741

   Thanks @sungwy, that makes sense - I am indeed using MOR (version 2)! Let me know how else I might be able to help.





[PR] Core: Optimize writing metadata for many new files [iceberg]

2024-09-05 Thread via GitHub


aokolnychyi opened a new pull request, #11086:
URL: https://github.com/apache/iceberg/pull/11086

   This PR optimizes writing metadata for many new files, which is helpful 
during initial table creation and row-level operations that modify many files.
   
   Benchmark results prior to the changes:
   
   ```
   Benchmark                    (fast)  (numFiles)  Mode  Cnt   Score   Error  Units
   AppendBenchmark.appendFiles    true           5    ss    5   1.111 ± 0.035   s/op
   AppendBenchmark.appendFiles    true          10    ss    5   2.114 ± 0.049   s/op
   AppendBenchmark.appendFiles    true          50    ss    5  10.144 ± 0.182   s/op
   AppendBenchmark.appendFiles    true         100    ss    5  20.205 ± 0.388   s/op
   AppendBenchmark.appendFiles    true         250    ss    5  51.280 ± 3.610   s/op
   AppendBenchmark.appendFiles   false           5    ss    5   1.125 ± 0.084   s/op
   AppendBenchmark.appendFiles   false          10    ss    5   2.107 ± 0.095   s/op
   AppendBenchmark.appendFiles   false          50    ss    5  10.117 ± 0.398   s/op
   AppendBenchmark.appendFiles   false         100    ss    5  20.350 ± 1.046   s/op
   AppendBenchmark.appendFiles   false         250    ss    5  48.823 ± 3.604   s/op
   ```
   
   Benchmark results after the changes:
   
   ```
   Benchmark                    (fast)  (numFiles)  Mode  Cnt  Score   Error  Units
   AppendBenchmark.appendFiles    true           5    ss    5  0.383 ± 0.072   s/op
   AppendBenchmark.appendFiles    true          10    ss    5  0.435 ± 0.031   s/op
   AppendBenchmark.appendFiles    true          50    ss    5  1.283 ± 0.152   s/op
   AppendBenchmark.appendFiles    true         100    ss    5  2.325 ± 0.411   s/op
   AppendBenchmark.appendFiles    true         250    ss    5  5.615 ± 1.203   s/op
   AppendBenchmark.appendFiles   false           5    ss    5  0.403 ± 0.023   s/op
   AppendBenchmark.appendFiles   false          10    ss    5  0.484 ± 0.058   s/op
   AppendBenchmark.appendFiles   false          50    ss    5  1.357 ± 0.190   s/op
   AppendBenchmark.appendFiles   false         100    ss    5  2.596 ± 0.829   s/op
   AppendBenchmark.appendFiles   false         250    ss    5  5.993 ± 1.447   s/op
   ```
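   For a quick read of the tables, the implied speedup per row can be computed directly; the numbers below are copied from the `fast=true` rows above.
   
   ```java
   public class AppendSpeedup {
       public static void main(String[] args) {
           // seconds/op for the fast=true rows, before and after the change
           double[] before = {1.111, 2.114, 10.144, 20.205, 51.280};
           double[] after = {0.383, 0.435, 1.283, 2.325, 5.615};
           for (int i = 0; i < before.length; i++) {
               // speedup grows with the number of files, roughly 2.9x to 9.1x
               System.out.printf("speedup: %.1fx%n", before[i] / after[i]);
           }
       }
   }
   ```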





Re: [PR] Core: Optimize writing metadata for many new files [iceberg]

2024-09-05 Thread via GitHub


aokolnychyi commented on code in PR #11086:
URL: https://github.com/apache/iceberg/pull/11086#discussion_r1746178142


##
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##
@@ -654,4 +740,38 @@ private static void updateTotal(
   }
 }
   }
+
+  protected static class DeleteFileHolder {

Review Comment:
   Simply moved from `MergingSnapshotProducer`. 






Re: [PR] Core/RewriteFiles: Duplicate Data Bug - Fixed dropping delete files that are still required [iceberg]

2024-09-05 Thread via GitHub


amogh-jahagirdar commented on code in PR #10962:
URL: https://github.com/apache/iceberg/pull/10962#discussion_r1746200353


##
core/src/test/java/org/apache/iceberg/TestRewriteFiles.java:
##
@@ -384,6 +386,116 @@ public void testRewriteDataAndAssignOldSequenceNumber() {
 assertThat(listManifestFiles()).hasSize(4);
   }
 
+  @TestTemplate
+  public void testRewriteDataAndAssignOldSequenceNumbersShouldNotDropDeleteFiles() {
+    assumeThat(formatVersion)
+        .as("Sequence number is only supported in iceberg format v2 or later")
+        .isGreaterThan(1);
+    assertThat(listManifestFiles()).isEmpty();
+
+    commit(table, table.newRowDelta().addRows(FILE_A).addDeletes(FILE_A2_DELETES), branch);
+
+    long firstRewriteSequenceNumber = latestSnapshot(table, branch).sequenceNumber();
+
+    commit(
+        table,
+        table.newRowDelta().addRows(FILE_B).addRows(FILE_B).addDeletes(FILE_B2_DELETES),
+        branch);
+    commit(
+        table,
+        table.newRowDelta().addRows(FILE_B).addRows(FILE_C).addDeletes(FILE_C2_DELETES),
+        branch);
+
+    long secondRewriteSequenceNumber = latestSnapshot(table, branch).sequenceNumber();
+
+    commit(
+        table,
+        table
+            .newRewrite()
+            .addFile(FILE_D)
+            .deleteFile(FILE_B)
+            .deleteFile(FILE_C)
+            .dataSequenceNumber(secondRewriteSequenceNumber),
+        branch);
+
+    TableMetadata base = readMetadata();
+    Snapshot baseSnap = latestSnapshot(base, branch);
+    long baseSnapshotId = baseSnap.snapshotId();
+
+    Comparator<ManifestFile> sequenceNumberOrdering =
+        new Comparator<>() {
+          @Override
+          public int compare(ManifestFile o1, ManifestFile o2) {
+            return (int) (o1.sequenceNumber() - o2.sequenceNumber());
+          }
+        };
+
+    // FILE_B2_DELETES and FILE_A2_DELETES should not be removed as the rewrite specifies
+    // `firstRewriteSequenceNumber` explicitly which is the same as that of A2_DELETES and
+    // before B2_DELETES
+
+    // Technically A1_DELETES could be removed since it's an equality delete and should apply on

Review Comment:
   Sorry, not sure I really follow the comment? There is no A1_DELETES; there's a FILE_A_DELETES, but that's a positional delete. I don't see those referenced in the above operations.
   
   I think it's true though that we should be able to drop equality deletes older than the minimum sequence number, but isn't that already happening in the existing MergingSnapshotProducer check? I don't think anything needs to distinguish there between equality and positional deletes.
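
   As a side note on the comparator in the quoted test: casting a `long` difference to `int` can overflow and report the wrong order when sequence numbers are far apart, whereas `Long.compare` always yields the correct sign. A minimal, self-contained illustration (not Iceberg code):
   
   ```java
   public class CompareLongs {
       public static void main(String[] args) {
           long small = 1L;
           long large = 1L + Integer.MAX_VALUE + 10L; // gap larger than Integer.MAX_VALUE

           // Subtraction-based compare: the int cast wraps around, so the sign is wrong
           int bySubtraction = (int) (small - large);
           System.out.println(bySubtraction > 0); // true, even though small < large

           // Long.compare yields the correct sign
           int byCompare = Long.compare(small, large);
           System.out.println(byCompare < 0); // true
       }
   }
   ```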



##
core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java:
##
@@ -833,7 +833,17 @@ public List<ManifestFile> apply(TableMetadata base, Snapshot snapshot) {
     filterManager.filterManifests(
         SnapshotUtil.schemaFor(base, targetBranch()),
         snapshot != null ? snapshot.dataManifests(ops.io()) : null);
-    long minDataSequenceNumber =
+
+    long minNewFileSequenceNumber =
+        addedDataFiles().stream()
+            .filter(x -> x.dataSequenceNumber() != null && x.dataSequenceNumber() >= 0)
+            .mapToLong(ContentFile::dataSequenceNumber)
+            .reduce(
+                newDataFilesDataSequenceNumber != null
+                    ? newDataFilesDataSequenceNumber
+                    : base.nextSequenceNumber(),
+                Math::min);

Review Comment:
   Do we actually need to iterate through the `addedDataFiles`? 
   
   If I understood the issue correctly, the problem is that it's possible for a 
user to commit a rewrite operation and specify an older data sequence number, 
and the current logic would drop delete files which actually need to still be 
referenced in the new commit since it's not considering the specified data file 
sequence number.
   
   So I *think* all we would need to do here is keep the existing logic for 
determining minDataSequenceNumber and then also min that with the 
`newDataFilesDataSequenceNumber` if it's not null
   
   ```
   long minNewDataSequenceNumber = /* existing minDataSequenceNumber logic */;
   if (newDataFilesDataSequenceNumber != null) {
     minNewDataSequenceNumber =
         Math.min(minNewDataSequenceNumber, newDataFilesDataSequenceNumber);
   }
   ```
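   
   In isolation, the suggested logic is just a guarded `Math.min`; a small sketch (the field and method names here are placeholders, not the actual `MergingSnapshotProducer` members):
   
   ```java
   public class MinSequenceNumberSketch {
       // Fold an explicitly requested data sequence number (may be null)
       // into the minimum produced by the existing logic.
       static long minDataSequenceNumber(long existingMin, Long requestedSequenceNumber) {
           long min = existingMin;
           if (requestedSequenceNumber != null) {
               min = Math.min(min, requestedSequenceNumber);
           }
           return min;
       }

       public static void main(String[] args) {
           System.out.println(minDataSequenceNumber(7L, null)); // 7
           System.out.println(minDataSequenceNumber(7L, 3L));   // 3: older requested number wins
           System.out.println(minDataSequenceNumber(7L, 9L));   // 7: newer requested number is ignored
       }
   }
   ```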



Re: [I] Cannot commit identity partition on datatypes time,timestamp* using 'fromPartitionString' [iceberg]

2024-09-05 Thread via GitHub


amogh-jahagirdar commented on issue #11085:
URL: https://github.com/apache/iceberg/issues/11085#issuecomment-2332705117

   Hey @mderoy, check out https://github.com/apache/iceberg/pull/10820, which tried to address this, but we determined that adding to this Conversions logic is probably not the right way to go since Iceberg considers partition value to String a one-way conversion. The other direction existed for Hive table migration, but again it's probably not good to keep adding cases to that since the string representation is not standardized.
   
   Before all that, I'm curious what you're trying to do: is this a Hive table migration to Iceberg? It doesn't sound like it based on 
   ```
   when trying to insert into a table partitioned on identity for a time datatype, we get the following error trying to commit
   Unsupported type for fromPartitionString: time
   ```
   If you're inserting into an Iceberg table, this Conversions logic shouldn't even come into the picture, so I'm a bit confused there...
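   
   For context, Iceberg's `time` type is defined as microseconds from midnight, so the missing piece is essentially a micros-to-`LocalTime` round trip. A sketch using only `java.time` (illustrative only; this is not the actual `Conversions` API):
   
   ```java
   import java.time.LocalTime;

   public class TimePartitionSketch {
       public static void main(String[] args) {
           // Iceberg represents time as microseconds from midnight
           long micros = 45_296_000_001L; // 12:34:56.000001

           // Type-to-string direction (what partition paths contain)
           LocalTime time = LocalTime.ofNanoOfDay(micros * 1_000L);
           System.out.println(time); // 12:34:56.000001

           // The reverse direction that fromPartitionString would have to define
           long parsed = LocalTime.parse("12:34:56.000001").toNanoOfDay() / 1_000L;
           System.out.println(parsed == micros); // true
       }
   }
   ```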





Re: [I] New PR label [ready for review] [iceberg-python]

2024-09-05 Thread via GitHub


kevinjqliu closed issue #1123: New PR label [ready for review]
URL: https://github.com/apache/iceberg-python/issues/1123





Re: [I] Remove python 3.8 support [iceberg-python]

2024-09-05 Thread via GitHub


kevinjqliu commented on issue #1121:
URL: 
https://github.com/apache/iceberg-python/issues/1121#issuecomment-2332722899

   I'll draft it up :) thanks





Re: [PR] Deprecate ADLFS prefix in favor of ADLS [iceberg-python]

2024-09-05 Thread via GitHub


kevinjqliu commented on PR #961:
URL: https://github.com/apache/iceberg-python/pull/961#issuecomment-2332733081

   @ndrluis do you mind rebasing this PR? Looks like it's almost good to go.





[PR] Bump pydantic from 2.8.2 to 2.9.0 [iceberg-python]

2024-09-05 Thread via GitHub


dependabot[bot] opened a new pull request, #1137:
URL: https://github.com/apache/iceberg-python/pull/1137

   Bumps [pydantic](https://github.com/pydantic/pydantic) from 2.8.2 to 2.9.0.
   
   Release notes
   Sourced from [pydantic's releases](https://github.com/pydantic/pydantic/releases).
   
   v2.9.0 (2024-09-05)
   The code released in v2.9.0 is practically identical to that of v2.9.0b2.
   Check out the [blog post](https://pydantic.dev/articles/pydantic-v2-9-release) to learn more about the release highlights!
   
   What's Changed
   
   Packaging
   - Bump `ruff` to v0.5.0 and `pyright` to v1.1.369 by @sydney-runkle in #9801
   - Bump `pydantic-extra-types` to v2.9.0 by @sydney-runkle in #9832
   - Support compatibility with pdm v2.18.1 by @Viicos in #10138
   - Bump v1 version stub to v1.10.18 by @sydney-runkle in #10214
   - Bump `pydantic-core` to v2.23.2 by @sydney-runkle in #10311
   
   New Features
   - Add support for `ZoneInfo` by @Youssefares in #9896
   - Add `Config.val_json_bytes` by @josh-newman in #9770
   - Add DSN for Snowflake by @aditkumar72 in #10128
   - Support complex number by @changhc in #9654
   - Add support for `annotated_types.Not` by @aditkumar72 in #10210
   - Allow `WithJsonSchema` to inject `$ref`s w/ `http` or `https` links by @dAIsySHEng1 in #9863
   - Allow validators to customize validation JSON schema by @Viicos in #10094
   - Support parametrized `PathLike` types by @nix010 in #9764
   - Add tagged union serializer that attempts to use `str` or callable discriminators to select the correct serializer by @sydney-runkle in pydantic/pydantic-core#1397
   
   Changes
   - Breaking Change: Merge `dict` type `json_schema_extra` by @sydney-runkle in #9792
     (for more info on this change, including how to replicate the old behavior, see the [docs on merging json_schema_extra](https://docs.pydantic.dev/dev/concepts/json_schema/#merging-json_schema_extra))
   - Refactor annotation injection for known (often generic) types by @sydney-runkle in #
   - Move annotation compatibility errors to validation phase by @sydney-runkle in #
   - Improve runtime errors for string constraints like `pattern` for incompatible types by @sydney-runkle in #10158
   - Remove 'allOf' JSON schema workarounds by @dpeachey in #10029
   - Remove `typed_dict_cls` data from `CoreMetadata` by @sydney-runkle in #10180
   - Deprecate passing a dict to the `Examples` class by @Viicos in #10181
   - Remove `initial_metadata` from internal metadata construct by @sydney-runkle in #10194
   - Use `re.Pattern.search` instead of `re.Pattern.match` for consistency with rust behavior by @tinez in pydantic/pydantic-core#1368
   - Show value of wrongly typed data in `pydantic-core` serialization warning by @BoxyUwU in pydantic/pydantic-core#1377
   - Breaking Change: in `pydantic-core`, change `metadata` type hint in core schemas from A

[PR] Bump sqlalchemy from 2.0.32 to 2.0.34 [iceberg-python]

2024-09-05 Thread via GitHub


dependabot[bot] opened a new pull request, #1138:
URL: https://github.com/apache/iceberg-python/pull/1138

   Bumps [sqlalchemy](https://github.com/sqlalchemy/sqlalchemy) from 2.0.32 to 2.0.34.
   
   Release notes
   Sourced from [sqlalchemy's releases](https://github.com/sqlalchemy/sqlalchemy/releases).
   
   2.0.34
   Released: September 4, 2024
   
   orm
   - [orm] [bug] Fixed regression caused by issue #11814 which broke support for
     certain flavors of PEP 593 `Annotated` in the `type_annotation_map` when
     builtin types such as `list`, `dict` were used without an element type.
     While this is an incomplete style of typing, these types nonetheless
     previously would be located in the `type_annotation_map` correctly.
     References: #11831
   
   sqlite
   - [sqlite] [bug] Fixed regression in SQLite reflection caused by #11677 which
     interfered with reflection for CHECK constraints that were followed
     by other kinds of constraints within the same table definition. Pull
     request courtesy Harutaka Kawamura.
     References: #11832
   
   2.0.33
   Released: September 3, 2024
   
   general
   - [general] [change] The pin for `setuptools<69.3` in pyproject.toml has been removed.
     This pin was to prevent a sudden change in setuptools to use PEP 625
     from taking place, which would change the file name of SQLAlchemy's source
     distribution on pypi to be an all lower case name, which is likely to cause
     problems with various build environments that expected the previous naming
     style. However, the presence of this pin is holding back environments that
     otherwise want to use a newer setuptools, so we've decided to move forward
     with this change, with the assumption that build environments will have
     largely accommodated the setuptools change by now.
     References: #11818
   
   orm
   - [orm] [bug] [regression] Fixed regression from 1.3 where the column key used for a hybrid property
     might be populated with that of the underlying column that it returns, for
     a property that returns an ORM mapped column directly, rather than the key
   
   ... (truncated)
   
   Commits
   - See full diff in [compare view](https://github.com/sqlalchemy/sqlalchemy/commits)
   
   [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=sqlalchemy&package-manager=pip&previous-version=2.0.32&new-version=2.0.34)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   



Re: [I] Support to optimize, analyze tables and expire snapshots, remove orphan files [iceberg-python]

2024-09-05 Thread via GitHub


eedduuar commented on issue #31:
URL: https://github.com/apache/iceberg-python/issues/31#issuecomment-2332788362

   Hello, any progress?





Re: [I] Improve Documentation on getting started with GCS [iceberg]

2024-09-05 Thread via GitHub


github-actions[bot] commented on issue #7948:
URL: https://github.com/apache/iceberg/issues/7948#issuecomment-2332934209

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.





Re: [PR] Core: Avoid concurrent commits causing commit failures [iceberg]

2024-09-05 Thread via GitHub


github-actions[bot] commented on PR #8001:
URL: https://github.com/apache/iceberg/pull/8001#issuecomment-2332935581

   This pull request has been marked as stale due to 30 days of inactivity. It 
will be closed in 1 week if no further activity occurs. If you think that’s 
incorrect or this pull request requires a review, please simply write any 
comment. If closed, you can revive the PR at any time and @mention a reviewer 
or discuss it on the d...@iceberg.apache.org list. Thank you for your 
contributions.





Re: [I] metadata.json delete [iceberg]

2024-09-05 Thread via GitHub


github-actions[bot] commented on issue #8007:
URL: https://github.com/apache/iceberg/issues/8007#issuecomment-2332935606

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.





Re: [I] Write ordered by within unique physical partitions folder (exclude hash path). [iceberg]

2024-09-05 Thread via GitHub


github-actions[bot] commented on issue #8008:
URL: https://github.com/apache/iceberg/issues/8008#issuecomment-2332935627

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.





Re: [I] remove_orphan_files throws reached maximum depth exception in AWS EMR-6.11.0 [iceberg]

2024-09-05 Thread via GitHub


github-actions[bot] commented on issue #8022:
URL: https://github.com/apache/iceberg/issues/8022#issuecomment-2332935648

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.





Re: [PR] [spark 3.4] skip empty file during table migration, table snapshotting or adding files [iceberg]

2024-09-05 Thread via GitHub


github-actions[bot] commented on PR #8040:
URL: https://github.com/apache/iceberg/pull/8040#issuecomment-2332935751

   This pull request has been marked as stale due to 30 days of inactivity. It 
will be closed in 1 week if no further activity occurs. If you think that’s 
incorrect or this pull request requires a review, please simply write any 
comment. If closed, you can revive the PR at any time and @mention a reviewer 
or discuss it on the d...@iceberg.apache.org list. Thank you for your 
contributions.





Re: [PR] Spark 3.3: Adding Rebalance operator solving for small files problem [iceberg]

2024-09-05 Thread via GitHub


github-actions[bot] commented on PR #8042:
URL: https://github.com/apache/iceberg/pull/8042#issuecomment-2332935802

   This pull request has been marked as stale due to 30 days of inactivity. It 
will be closed in 1 week if no further activity occurs. If you think that’s 
incorrect or this pull request requires a review, please simply write any 
comment. If closed, you can revive the PR at any time and @mention a reviewer 
or discuss it on the d...@iceberg.apache.org list. Thank you for your 
contributions.





Re: [PR] AWS: Add retry on UncheckedIOException and max retries for S3FileIO [iceberg]

2024-09-05 Thread via GitHub


github-actions[bot] commented on PR #8043:
URL: https://github.com/apache/iceberg/pull/8043#issuecomment-2332935830

   This pull request has been marked as stale due to 30 days of inactivity. It 
will be closed in 1 week if no further activity occurs. If you think that’s 
incorrect or this pull request requires a review, please simply write any 
comment. If closed, you can revive the PR at any time and @mention a reviewer 
or discuss it on the d...@iceberg.apache.org list. Thank you for your 
contributions.





[PR] Docs: Document accessing instance variables [iceberg]

2024-09-05 Thread via GitHub


aokolnychyi opened a new pull request, #11087:
URL: https://github.com/apache/iceberg/pull/11087

   This PR documents our recommendations for accessing instance variables to 
improve consistency of the code. 





Re: [I] Pyiceberg StaticTable use the last metadata json URL when the full path is not provided [iceberg]

2024-09-05 Thread via GitHub


djouallah closed issue #7979: Pyiceberg StaticTable use the last metadata json 
URL when the full path is not provided
URL: https://github.com/apache/iceberg/issues/7979





Re: [PR] Spark: Deprecate SparkAppenderFactory [iceberg]

2024-09-05 Thread via GitHub


ajantha-bhat commented on PR #11076:
URL: https://github.com/apache/iceberg/pull/11076#issuecomment-2332990957

   @amoghjaha-db:
   
   > Since the class is package private do we want to just remove it upfront
   
   I had the same thought initially and checked how we handled other classes in 
this module previously. Package-private classes like `SparkFileScan` were 
deprecated first and then removed, so I went ahead and followed the same 
approach.
   
   https://github.com/user-attachments/assets/77ee43f4-aa7c-4d7e-b40e-9296ff21443f
   
   > It looks like it's currently used in some JMH benchmarks WritersBenchmark, 
and TestSparkMergingMetrics would need to be updated to test the 
SparkFileWriterFactory class instead.
   
   As I mentioned in one of the 
[comments](https://github.com/apache/iceberg/pull/11076#discussion_r1742886689),
 `SparkFileWriterFactory` doesn't have a metrics-related function, so we cannot 
update the test to use it. 





[PR] add more tests for position deletes [iceberg-python]

2024-09-05 Thread via GitHub


sungwy opened a new pull request, #1141:
URL: https://github.com/apache/iceberg-python/pull/1141

   Investigating #1132 





Re: [PR] API, Core: Add manifestPath API to ContentFile which will return the path to a manifest from which the content file resides in [iceberg]

2024-09-05 Thread via GitHub


aokolnychyi commented on code in PR #11044:
URL: https://github.com/apache/iceberg/pull/11044#discussion_r1746373228


##
core/src/main/java/org/apache/iceberg/InheritableMetadataFactory.java:
##
@@ -46,11 +49,14 @@ static class BaseInheritableMetadata implements 
InheritableMetadata {
 private final int specId;
 private final long snapshotId;
 private final long sequenceNumber;
+private final String manifestPath;
 
-private BaseInheritableMetadata(int specId, long snapshotId, long 
sequenceNumber) {
+private BaseInheritableMetadata(
+int specId, long snapshotId, long sequenceNumber, String manifestPath) 
{

Review Comment:
   Let's do this for now and re-evaluate default values later.






Re: [PR] API, Core: Add manifestPath API to ContentFile which will return the path to a manifest from which the content file resides in [iceberg]

2024-09-05 Thread via GitHub


aokolnychyi commented on code in PR #11044:
URL: https://github.com/apache/iceberg/pull/11044#discussion_r1746373228


##
core/src/main/java/org/apache/iceberg/InheritableMetadataFactory.java:
##
@@ -46,11 +49,14 @@ static class BaseInheritableMetadata implements 
InheritableMetadata {
 private final int specId;
 private final long snapshotId;
 private final long sequenceNumber;
+private final String manifestPath;
 
-private BaseInheritableMetadata(int specId, long snapshotId, long 
sequenceNumber) {
+private BaseInheritableMetadata(
+int specId, long snapshotId, long sequenceNumber, String manifestPath) 
{

Review Comment:
   Let's do this for now and re-evaluate the approach once the default values 
are in.






Re: [PR] TableMetadataBuilder [iceberg-rust]

2024-09-05 Thread via GitHub


liurenjie1024 commented on PR #587:
URL: https://github.com/apache/iceberg-rust/pull/587#issuecomment-2333006736

   Thanks @c-thiel for this PR, I've skimmed through it and it looks great to 
me. However, this PR is too large to review (3k lines); would you mind splitting 
it into smaller ones? For example, we could add one PR for the methods involved 
in one 
[`TableUpdate`](https://github.com/apache/iceberg-rust/blob/9862026b9f3c885a82e7b8b8da414c0c97436537/crates/iceberg/src/catalog/mod.rs#L338)
 action and add enough tests for it. Also, would it be better to move the 
`TableMetadataBuilder` refactoring into a standalone module in its own PR?





Re: [I] Discussion: Typesafe(r) properties [iceberg-rust]

2024-09-05 Thread via GitHub


liurenjie1024 commented on issue #599:
URL: https://github.com/apache/iceberg-rust/issues/599#issuecomment-2333037728

   Thanks @c-thiel for raising this. I love the idea of type-safe properties 
and believe this is the right direction to go. I took a look at your reference, 
but I didn't get a good understanding of what the API would look like. Would you 
mind showing some code examples of the usage? For example, what would the 
rest/sql catalog properties look like? In fact, we can leave the macros for 
last, since they are just a tool to reduce duplication and maintenance effort.
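
   One possible shape for such an API, sketched here as a stdlib-only 
illustration (the names `PropertyKey` and `get`, and the example property name, 
are made up for this sketch and are not a proposal of the actual iceberg-rust 
API): the key carries its name, its default, and its parse logic, so catalog 
code never handles raw strings directly.

   ```rust
   use std::collections::HashMap;
   use std::str::FromStr;

   /// Hypothetical typed property key: the value type and the default
   /// live together with the key's string name.
   struct PropertyKey<T> {
       name: &'static str,
       default: T,
   }

   impl<T: FromStr + Clone> PropertyKey<T> {
       /// Look the key up in a string property map, parsing the raw value
       /// into `T` and falling back to the default when the key is absent.
       fn get(&self, props: &HashMap<String, String>) -> Result<T, String> {
           match props.get(self.name) {
               None => Ok(self.default.clone()),
               Some(raw) => raw
                   .parse::<T>()
                   .map_err(|_| format!("invalid value for {}: {}", self.name, raw)),
           }
       }
   }
   ```

   A macro could then generate one such constant per documented property, which 
is the duplication-reduction step mentioned above.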





Re: [I] Why call deleteKey for Insert and Update After in Flink BaseDeltaTaskWriter? [iceberg]

2024-09-05 Thread via GitHub


SML0127 commented on issue #11081:
URL: https://github.com/apache/iceberg/issues/11081#issuecomment-2333065332

   @pvary 
   Thank you for your support. It really helped me a lot.
   Lastly, I have one more question. I am looking at the code in 
`ColumnarBatchReader.java`. Where can I see that equality delete files are 
applied to previous snapshots while position delete files are applied to the 
current snapshot? 🙇🏻‍♂️🙇🏻‍♂️🙇🏻‍♂️
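
   The ordering asked about here follows from sequence numbers rather than 
from a specific reader code path: per the Iceberg v2 spec, an equality delete 
applies only to data files with a strictly smaller data sequence number (older 
data), while a position delete also applies to data files with an equal 
sequence number (rows written in the same commit). A minimal stdlib-only sketch 
of that rule (hypothetical types, not the actual `ColumnarBatchReader` logic):

   ```rust
   /// Hypothetical delete-file kind; Iceberg tracks content types similarly.
   #[derive(Clone, Copy)]
   enum DeleteKind {
       Equality,
       Position,
   }

   /// Returns true if a delete file with data sequence number `delete_seq`
   /// applies to rows in a data file with sequence number `data_seq`:
   /// equality deletes apply only to strictly older data, while position
   /// deletes also apply to data written at the same sequence number
   /// (e.g. the same Flink upsert checkpoint).
   fn delete_applies(kind: DeleteKind, delete_seq: i64, data_seq: i64) -> bool {
       match kind {
           DeleteKind::Equality => data_seq < delete_seq,
           DeleteKind::Position => data_seq <= delete_seq,
       }
   }
   ```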





Re: [PR] Docs: Document accessing instance variables [iceberg]

2024-09-05 Thread via GitHub


manuzhang commented on code in PR #11087:
URL: https://github.com/apache/iceberg/pull/11087#discussion_r1746409783


##
site/docs/contribute.md:
##
@@ -388,6 +388,34 @@ When passing boolean arguments to existing or external 
methods, use inline comme
   dropTable(identifier, purge);
 ```
 
+ Accessing instance variables

Review Comment:
   Curious whether there is a format rule for this.






Re: [PR] feat: implement IcebergTableProviderFactory for datafusion [iceberg-rust]

2024-09-05 Thread via GitHub


liurenjie1024 commented on code in PR #600:
URL: https://github.com/apache/iceberg-rust/pull/600#discussion_r1746464932


##
crates/integrations/datafusion/src/table/table_provider_factory.rs:
##
@@ -0,0 +1,180 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::collections::HashMap;
+use std::sync::Arc;
+
+use async_trait::async_trait;
+use datafusion::catalog::{Session, TableProvider, TableProviderFactory};
+use datafusion::error::Result as DFResult;
+use datafusion::logical_expr::CreateExternalTable;
+use datafusion::sql::TableReference;
+use iceberg::arrow::schema_to_arrow_schema;
+use iceberg::io::FileIO;
+use iceberg::table::StaticTable;
+use iceberg::{Error, ErrorKind, Result, TableIdent};
+
+use super::IcebergTableProvider;
+use crate::to_datafusion_error;
+
+#[derive(Default)]
+#[non_exhaustive]
+pub struct IcebergTableProviderFactory {}
+
+impl IcebergTableProviderFactory {
+pub fn new() -> Self {
+Self {}
+}
+}
+
+#[async_trait]
+impl TableProviderFactory for IcebergTableProviderFactory {
+async fn create(
+&self,
+_state: &dyn Session,
+cmd: &CreateExternalTable,
+) -> DFResult<Arc<dyn TableProvider>> {
+check_cmd(cmd).map_err(to_datafusion_error)?;
+
+let table_name = &cmd.name;
+let metadata_file_path = &cmd.location;
+let options = &cmd.options;
+
+let table = create_static_table(table_name, metadata_file_path, 
options)
+.await
+.map_err(to_datafusion_error)?
+.into_table();
+
+let schema = schema_to_arrow_schema(table.metadata().current_schema())
+.map_err(to_datafusion_error)?;
+
+Ok(Arc::new(IcebergTableProvider::new(table, Arc::new(schema))))
+}
+}
+
+fn check_cmd(cmd: &CreateExternalTable) -> Result<()> {
+let CreateExternalTable {
+schema,
+table_partition_cols,
+order_exprs,
+constraints,
+column_defaults,
+..
+} = cmd;
+
+if !schema.fields().is_empty() {
+return Err(Error::new(
+ErrorKind::FeatureUnsupported,
+"Schema specification is not supported",
+));
+}
+
+if !table_partition_cols.is_empty() {
+return Err(Error::new(
+ErrorKind::FeatureUnsupported,
+"Partition columns cannot be specified",
+));
+}
+
+if !order_exprs.is_empty() {
+return Err(Error::new(
+ErrorKind::FeatureUnsupported,
+"Ordering clause is not supported",
+));
+}
+
+if !constraints.is_empty() {
+return Err(Error::new(
+ErrorKind::FeatureUnsupported,
+"Constraints are not supported",
+));
+}
+
+if !column_defaults.is_empty() {
+return Err(Error::new(
+ErrorKind::FeatureUnsupported,
+"Default values for columns are not supported",
+));
+}
+
+Ok(())
+}
+
+async fn create_static_table(
+table_name: &TableReference,
+metadata_file_path: &str,
+props: &HashMap<String, String>,
+) -> Result<StaticTable> {
+let table_ident = TableIdent::from_strs(table_name.to_vec())?;
+let file_io = FileIO::from_path(metadata_file_path)?
+.with_props(props)
+.build()?;
+StaticTable::from_metadata_file(metadata_file_path, table_ident, 
file_io).await
+}
+
+#[cfg(test)]
+mod tests {
+
+use datafusion::catalog::TableProviderFactory;
+use datafusion::common::{Constraints, DFSchema};
+use datafusion::logical_expr::CreateExternalTable;
+use datafusion::prelude::SessionContext;
+use datafusion::sql::TableReference;
+
+use super::*;
+
+#[tokio::test]
+async fn test_schema_of_created_table() {
+let factory = IcebergTableProviderFactory::new();
+
+let metadata_file_path = format!(
+"{}/testdata/table_metadata/{}",
+env!("CARGO_MANIFEST_DIR"),
+"TableMetadataV2.json"
+);
+
+let cmd = CreateExternalTable {
+name: TableReference::partial("static_ns", "static_table"),
+location: metadata_file_path,
+schema: Arc::new(DFSchema::empty()),
+file_type: "iceberg".to_string(),
+opti

Re: [PR] feat: implement IcebergTableProviderFactory for datafusion [iceberg-rust]

2024-09-05 Thread via GitHub


liurenjie1024 commented on code in PR #600:
URL: https://github.com/apache/iceberg-rust/pull/600#discussion_r1746466160


##
crates/integrations/datafusion/src/table/table_provider_factory.rs:
##
@@ -0,0 +1,180 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::collections::HashMap;
+use std::sync::Arc;
+
+use async_trait::async_trait;
+use datafusion::catalog::{Session, TableProvider, TableProviderFactory};
+use datafusion::error::Result as DFResult;
+use datafusion::logical_expr::CreateExternalTable;
+use datafusion::sql::TableReference;
+use iceberg::arrow::schema_to_arrow_schema;
+use iceberg::io::FileIO;
+use iceberg::table::StaticTable;
+use iceberg::{Error, ErrorKind, Result, TableIdent};
+
+use super::IcebergTableProvider;
+use crate::to_datafusion_error;
+
+#[derive(Default)]
+#[non_exhaustive]
+pub struct IcebergTableProviderFactory {}

Review Comment:
   It would be better to provide some doc to explain its usage.






Re: [PR] Docs: Document accessing instance variables [iceberg]

2024-09-05 Thread via GitHub


aokolnychyi commented on code in PR #11087:
URL: https://github.com/apache/iceberg/pull/11087#discussion_r1746503833


##
site/docs/contribute.md:
##
@@ -388,6 +388,34 @@ When passing boolean arguments to existing or external 
methods, use inline comme
   dropTable(identifier, purge);
 ```
 
+ Accessing instance variables

Review Comment:
   I wanted to look into ways to automate this after documenting. I took a look 
now. It doesn't seem there is an existing checkstyle rule that would enforce 
the behavior we are looking for. We may try doing something with regular 
expressions but it may be hard to capture the context.






Re: [PR] Table Scan: Add Row Selection Filtering [iceberg-rust]

2024-09-05 Thread via GitHub


sdd commented on code in PR #565:
URL: https://github.com/apache/iceberg-rust/pull/565#discussion_r1746573022


##
crates/iceberg/src/expr/visitors/page_index_evaluator.rs:
##
@@ -0,0 +1,1491 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Evaluates predicates against a Parquet Page Index
+
+use std::collections::HashMap;
+
+use fnv::FnvHashSet;
+use ordered_float::OrderedFloat;
+use parquet::arrow::arrow_reader::{RowSelection, RowSelector};
+use parquet::file::metadata::RowGroupMetaData;
+use parquet::file::page_index::index::Index;
+use parquet::format::PageLocation;
+
+use crate::expr::visitors::bound_predicate_visitor::{visit, 
BoundPredicateVisitor};
+use crate::expr::{BoundPredicate, BoundReference};
+use crate::spec::{Datum, PrimitiveLiteral, PrimitiveType, Schema};
+use crate::{Error, ErrorKind, Result};
+
+type OffsetIndex = Vec<Vec<PageLocation>>;
+
+const IN_PREDICATE_LIMIT: usize = 200;
+
+enum MissingColBehavior {
+CantMatch,
+MightMatch,
+}
+
+enum PageNullCount {
+AllNull,
+NoneNull,
+SomeNull,
+Unknown,
+}
+
+impl PageNullCount {
+fn from_row_and_null_counts(num_rows: usize, null_count: Option<i64>) -> Self {
+match (num_rows, null_count) {
+(x, Some(y)) if x == y as usize => PageNullCount::AllNull,
+(_, Some(0)) => PageNullCount::NoneNull,
+(_, Some(_)) => PageNullCount::SomeNull,
+_ => PageNullCount::Unknown,
+}
+}
+}
+
+pub(crate) struct PageIndexEvaluator<'a> {
+column_index: &'a [Index],
+offset_index: &'a OffsetIndex,
+row_group_metadata: &'a RowGroupMetaData,
+iceberg_field_id_to_parquet_column_index: &'a HashMap<i32, usize>,
+snapshot_schema: &'a Schema,
+}
+
+impl<'a> PageIndexEvaluator<'a> {
+pub(crate) fn new(
+column_index: &'a [Index],
+offset_index: &'a OffsetIndex,
+row_group_metadata: &'a RowGroupMetaData,
+field_id_map: &'a HashMap<i32, usize>,
+snapshot_schema: &'a Schema,
+) -> Self {
+Self {
+column_index,
+offset_index,
+row_group_metadata,
+iceberg_field_id_to_parquet_column_index: field_id_map,
+snapshot_schema,
+}
+}
+
+/// Evaluate this `PageIndexEvaluator`'s filter predicate against a
+/// specific page's column index entry in a parquet file's page index.
+/// [`ArrowReader`] uses the resulting [`RowSelection`] to reject
+/// pages within a parquet file's row group that cannot contain rows
+/// matching the filter predicate.
+pub(crate) fn eval(
+filter: &'a BoundPredicate,
+column_index: &'a [Index],
+offset_index: &'a OffsetIndex,
+row_group_metadata: &'a RowGroupMetaData,
+field_id_map: &'a HashMap<i32, usize>,
+snapshot_schema: &'a Schema,
+) -> Result<Vec<RowSelector>> {
+if row_group_metadata.num_rows() == 0 {
+return Ok(vec![]);
+}
+
+let mut evaluator = Self::new(
+column_index,
+offset_index,
+row_group_metadata,
+field_id_map,
+snapshot_schema,
+);
+
+Ok(visit(&mut evaluator, filter)?.iter().copied().collect())
+}
+
+fn select_all_rows(&self) -> Result<RowSelection> {
+Ok(vec![RowSelector::select(
+self.row_group_metadata.num_rows() as usize
+)]
+.into())
+}
+
+fn skip_all_rows(&self) -> Result<RowSelection> {
+Ok(vec![].into())
+}
+
+fn calc_row_selection<F>(
+&self,
+field_id: i32,
+predicate: F,
+missing_col_behavior: MissingColBehavior,
+) -> Result<RowSelection>
+where
+F: Fn(Option<Datum>, Option<Datum>, PageNullCount) -> Result<bool>,
+{
+let Some(&parquet_column_index) =
+self.iceberg_field_id_to_parquet_column_index.get(&field_id)
+else {
+// if the snapshot's column is not present in the row group,
+// exit early
+return match missing_col_behavior {
+MissingColBehavior::CantMatch => self.skip_all_rows(),
+MissingColBehavior::MightMatch => self.select_all_rows(),
+};
+};
+
+let Some(field) = self.snapshot_schema.field_by_id(field_id) else 

Re: [PR] Table Scan: Add Row Selection Filtering [iceberg-rust]

2024-09-05 Thread via GitHub


sdd commented on code in PR #565:
URL: https://github.com/apache/iceberg-rust/pull/565#discussion_r1746597564


##
crates/iceberg/src/expr/visitors/page_index_evaluator.rs:
##
@@ -0,0 +1,1491 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Evaluates predicates against a Parquet Page Index
+
+use std::collections::HashMap;
+
+use fnv::FnvHashSet;
+use ordered_float::OrderedFloat;
+use parquet::arrow::arrow_reader::{RowSelection, RowSelector};
+use parquet::file::metadata::RowGroupMetaData;
+use parquet::file::page_index::index::Index;
+use parquet::format::PageLocation;
+
+use crate::expr::visitors::bound_predicate_visitor::{visit, 
BoundPredicateVisitor};
+use crate::expr::{BoundPredicate, BoundReference};
+use crate::spec::{Datum, PrimitiveLiteral, PrimitiveType, Schema};
+use crate::{Error, ErrorKind, Result};
+
+type OffsetIndex = Vec<Vec<PageLocation>>;
+
+const IN_PREDICATE_LIMIT: usize = 200;
+
+enum MissingColBehavior {
+CantMatch,
+MightMatch,
+}
+
+enum PageNullCount {
+AllNull,
+NoneNull,
+SomeNull,
+Unknown,
+}
+
+impl PageNullCount {
+fn from_row_and_null_counts(num_rows: usize, null_count: Option<i64>) -> Self {
+match (num_rows, null_count) {
+(x, Some(y)) if x == y as usize => PageNullCount::AllNull,
+(_, Some(0)) => PageNullCount::NoneNull,
+(_, Some(_)) => PageNullCount::SomeNull,
+_ => PageNullCount::Unknown,
+}
+}
+}
+
+pub(crate) struct PageIndexEvaluator<'a> {
+column_index: &'a [Index],
+offset_index: &'a OffsetIndex,
+row_group_metadata: &'a RowGroupMetaData,
+iceberg_field_id_to_parquet_column_index: &'a HashMap<i32, usize>,
+snapshot_schema: &'a Schema,
+}
+
+impl<'a> PageIndexEvaluator<'a> {
+pub(crate) fn new(
+column_index: &'a [Index],
+offset_index: &'a OffsetIndex,
+row_group_metadata: &'a RowGroupMetaData,
+field_id_map: &'a HashMap<i32, usize>,
+snapshot_schema: &'a Schema,
+) -> Self {
+Self {
+column_index,
+offset_index,
+row_group_metadata,
+iceberg_field_id_to_parquet_column_index: field_id_map,
+snapshot_schema,
+}
+}
+
+/// Evaluate this `PageIndexEvaluator`'s filter predicate against a
+/// specific page's column index entry in a parquet file's page index.
+/// [`ArrowReader`] uses the resulting [`RowSelection`] to reject
+/// pages within a parquet file's row group that cannot contain rows
+/// matching the filter predicate.
+pub(crate) fn eval(
+filter: &'a BoundPredicate,
+column_index: &'a [Index],
+offset_index: &'a OffsetIndex,
+row_group_metadata: &'a RowGroupMetaData,
+field_id_map: &'a HashMap<i32, usize>,
+snapshot_schema: &'a Schema,
+) -> Result<Vec<RowSelector>> {
+if row_group_metadata.num_rows() == 0 {
+return Ok(vec![]);
+}
+
+let mut evaluator = Self::new(
+column_index,
+offset_index,
+row_group_metadata,
+field_id_map,
+snapshot_schema,
+);
+
+Ok(visit(&mut evaluator, filter)?.iter().copied().collect())
+}
+
+fn select_all_rows(&self) -> Result<RowSelection> {
+Ok(vec![RowSelector::select(
+self.row_group_metadata.num_rows() as usize
+)]
+.into())
+}
+
+fn skip_all_rows(&self) -> Result<RowSelection> {
+Ok(vec![].into())

Review Comment:
   Good spot - fixed. I think the original would work but your suggestion looks 
more intuitive to anyone reading the code.
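
   For readers following the discussion: a parquet `RowSelection` is a 
run-length list of select/skip spans, so converting an empty vector selects 
nothing (everything is skipped), while selecting every row requires an explicit 
`RowSelector::select(num_rows)` entry. A stdlib-only mini version of that 
encoding (hypothetical types mirroring the parquet crate's names, not the crate 
itself):

   ```rust
   /// Hypothetical mini version of parquet's run-length row selection entry.
   #[derive(Clone, Copy)]
   struct RowSelector {
       row_count: usize,
       skip: bool,
   }

   impl RowSelector {
       fn select(n: usize) -> Self { Self { row_count: n, skip: false } }
       fn skip(n: usize) -> Self { Self { row_count: n, skip: true } }
   }

   /// Number of rows a selection keeps. An empty selection keeps zero rows,
   /// which is why a skip-all helper can simply return an empty list.
   fn selected_count(selection: &[RowSelector]) -> usize {
       selection
           .iter()
           .filter(|s| !s.skip)
           .map(|s| s.row_count)
           .sum()
   }
   ```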



##
crates/iceberg/src/expr/visitors/page_index_evaluator.rs:
##
@@ -0,0 +1,1491 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed t