Re: [PR] AES GCM Stream changes [iceberg]
ggershinsky commented on code in PR #9453: URL: https://github.com/apache/iceberg/pull/9453#discussion_r1447008929 ## core/src/main/java/org/apache/iceberg/encryption/AesGcmOutputStream.java: ## @@ -95,6 +102,10 @@ public void write(byte[] b, int off, int len) throws IOException { @Override public long getPos() throws IOException { Review Comment: The Iceberg manifest writer calls getPos after closing the Avro stream, and a wrong value was returned.
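To make the failure mode concrete, here is a minimal sketch (not the PR's actual fix) of an output stream that tracks the plaintext position in its own counter so that getPos() stays correct after close(), instead of delegating to an underlying stream that is no longer usable:

```java
import java.io.IOException;
import java.io.OutputStream;

// Sketch only: a stream that keeps its own plaintext position counter so
// getPos() remains valid after close(). The real AesGcmOutputStream also
// buffers, encrypts, and writes GCM blocks, which is omitted here.
abstract class PositionTrackingOutputStream extends OutputStream {
  private long plainStreamPosition = 0;

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    encryptAndWrite(b, off, len);
    plainStreamPosition += len; // advance by plaintext bytes written
  }

  public long getPos() {
    // safe to call even after close(): e.g. a manifest writer asking for
    // the final position once the Avro stream has been closed
    return plainStreamPosition;
  }

  // stands in for the cipher and underlying-stream logic
  protected abstract void encryptAndWrite(byte[] b, int off, int len) throws IOException;
}
```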
Re: [PR] API, Core: Fix errorprone warnings [iceberg]
aokolnychyi commented on code in PR #9419: URL: https://github.com/apache/iceberg/pull/9419#discussion_r1447076282 ## api/src/main/java/org/apache/iceberg/util/CharSequenceWrapper.java: ## @@ -44,6 +44,8 @@ public CharSequence get() { } @Override + // Suppressed errorprone warning due to performance reasons. Review Comment: I think this comment can be interpreted in different ways. We are not really disabling it for performance reasons, are we? I wonder whether we can skip it.
Re: [PR] API, Core: Fix errorprone warnings [iceberg]
ajantha-bhat commented on code in PR #9419: URL: https://github.com/apache/iceberg/pull/9419#discussion_r1447134418 ## api/src/main/java/org/apache/iceberg/util/CharSequenceWrapper.java: ## @@ -44,6 +44,8 @@ public CharSequence get() { } @Override + // Suppressed errorprone warning due to performance reasons. Review Comment: True. Fixing it would cause performance issues; suppressing it does not. I removed the comment.
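For context, a hypothetical sketch of the pattern under discussion. The class name is invented, and the suppressed check name is an assumption (errorprone's `UndefinedEquals` flags comparisons of types like `CharSequence` whose `equals` contract is undefined); neither is taken from the PR:

```java
import java.util.Objects;

// Hypothetical illustration: CharSequence does not define an equals()
// contract, so errorprone flags this comparison. The "fix" would be to copy
// both sides into Strings on every call, which costs allocations on a hot
// path, so the warning is suppressed instead, with no extra comment.
class CharSequenceBox {
  private final CharSequence wrapped;

  CharSequenceBox(CharSequence wrapped) {
    this.wrapped = wrapped;
  }

  @Override
  @SuppressWarnings("UndefinedEquals")
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof CharSequenceBox)) {
      return false;
    }
    // flagged by errorprone: equals() on CharSequence is implementation-defined
    return Objects.equals(wrapped, ((CharSequenceBox) other).wrapped);
  }

  @Override
  public int hashCode() {
    return Objects.hashCode(wrapped);
  }
}
```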
Re: [I] Purge support for Iceberg view [iceberg]
ajantha-bhat commented on issue #9433: URL: https://github.com/apache/iceberg/issues/9433#issuecomment-1884524095 @nk1506: Thanks for reporting. I think we should have this; otherwise, view metadata files will never be deleted from storage. I will assign this issue to you.
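A hypothetical sketch of the requested behavior; `ViewPurger` and the list of metadata locations are illustrative, not an existing Iceberg API (only `FileIO.deleteFile` is real):

```java
import java.util.List;
import org.apache.iceberg.io.FileIO;

// Illustration of what a purge could do: after the catalog entry is dropped,
// every metadata file the view ever wrote is removed from storage, so
// nothing is left behind.
class ViewPurger {
  static void purge(FileIO io, List<String> metadataFileLocations) {
    for (String location : metadataFileLocations) {
      io.deleteFile(location);
    }
  }
}
```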
Re: [I] Failed to assign splits due to the serialized split size [iceberg]
javrasya commented on issue #9410: URL: https://github.com/apache/iceberg/issues/9410#issuecomment-1884525429 I couldn't do this @pvary; the split is far ahead, and some time is needed to get there in the application. My local environment is not able to run the app on real data and hit this problematic split in debug mode; it crashes because debug mode is quite resource-consuming. I'm still looking for a way to reproduce and debug it, but I just wanted to let you know the status.
Re: [PR] Spec: add multi-arg transform support [iceberg]
szehon-ho commented on PR #8579: URL: https://github.com/apache/iceberg/pull/8579#issuecomment-1884585967 > WOW, big congrats on the arrival of your newborn. Thank you so much! > I will resume this work once I finish my internal project, in which I'm leveraging bucketing and sorting to support efficient upsert. Sure, understood. Another possibility, if it will take a while, is that I can also help with this PR to move it forward and we can be co-authors.
Re: [I] Discussion: Rethink `PrimitiveLiteral`. [iceberg-rust]
JanKaul commented on issue #159: URL: https://github.com/apache/iceberg-rust/issues/159#issuecomment-1884586498 Following @Fokko's reasoning, Decimal is comparable to TimestampZ, where the timezone is stored in the type. Similarly, the scale of the Decimal is stored in the type. I think it makes sense to think about the use cases for Literal. It is used for partition values and default values; both require only the physical representation. The scale is actually not needed, and Literal(i128) would suffice for these use cases. @liurenjie1024 mentioned error messages as another use case; that's the only time the i128 representation might not be suitable. The question is whether the error messages warrant a more complex implementation. Regarding @Fokko's example: doesn't initially storing the Decimal as a LiteralFloat lose accuracy, because 3.25 is stored as something like 3.2499987? If you then convert it to Decimal, it's inaccurate. Maybe you could use PrimitiveLiteral::String here.
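The split being discussed can be shown in Java terms with BigDecimal, which already separates the physical value (an unscaled integer, what an i128 would hold) from the scale carried by the type; parsing from text keeps 3.25 exact, with no binary float in between:

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalRepresentation {
  public static void main(String[] args) {
    BigDecimal exact = new BigDecimal("3.25"); // parsed from text: exact

    BigInteger unscaled = exact.unscaledValue(); // 325 -> the physical value (an i128 in Rust terms)
    int scale = exact.scale();                   // 2   -> lives in the type, like a timezone in TimestampZ

    // the logical value is fully recoverable from (unscaled, scale)
    BigDecimal roundTrip = new BigDecimal(unscaled, scale);
    System.out.println(unscaled + " * 10^-" + scale + " = " + roundTrip); // 325 * 10^-2 = 3.25
  }
}
```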
[PR] Spark: Fix SparkTable to use name and effective snapshotID for comparing [iceberg]
wooyeong opened a new pull request, #9455: URL: https://github.com/apache/iceberg/pull/9455 Issue: #9450 I've changed SparkTable to use the name and the effective snapshot id for checking equality. With the previous code I mentioned in #9450,

```diff
-return icebergTable.name().equals(that.icebergTable.name());
+return icebergTable.name().equals(that.icebergTable.name())
+    && Objects.equals(branch, that.branch)
+    && Objects.equals(snapshotId, that.snapshotId);
```

the two refs with the same effective snapshot id don't get optimized, as @ajantha-bhat stated.

```sql
SELECT * FROM iceberg_except_test
UNION
SELECT * FROM iceberg_except_test VERSION AS OF '2024-01-01';

== Parsed Logical Plan ==
'Distinct
+- 'Union false, false
   :- 'Project [*]
   :  +- 'UnresolvedRelation [iceberg_except_test], [], false
   +- 'Project [*]
      +- 'RelationTimeTravel 'UnresolvedRelation [iceberg_except_test], [], false, 2024-01-01

== Analyzed Logical Plan ==
id: string, a: string, b: timestamp
Distinct
+- Union false, false
   :- Project [id#30, a#31, b#32]
   :  +- SubqueryAlias local.iceberg_except_test
   :     +- RelationV2[id#30, a#31, b#32] local.iceberg_except_test local.iceberg_except_test
   +- Project [id#33, a#34, b#35]
      +- SubqueryAlias local.iceberg_except_test
         +- RelationV2[id#33, a#34, b#35] local.iceberg_except_test local.iceberg_except_test

== Optimized Logical Plan ==
Aggregate [id#30, a#31, b#32], [id#30, a#31, b#32]
+- Union false, false
   :- RelationV2[id#30, a#31, b#32] local.iceberg_except_test
   +- RelationV2[id#33, a#34, b#35] local.iceberg_except_test
```

With this patch, the same query is optimized as below:

```sql
== Parsed Logical Plan ==
'Distinct
+- 'Union false, false
   :- 'Project [*]
   :  +- 'UnresolvedRelation [iceberg_except_test], [], false
   +- 'Project [*]
      +- 'RelationTimeTravel 'UnresolvedRelation [iceberg_except_test], [], false, 2024-01-01

== Analyzed Logical Plan ==
id: string, a: string, b: timestamp
Distinct
+- Union false, false
   :- Project [id#27, a#28, b#29]
   :  +- SubqueryAlias local.iceberg_except_test
   :     +- RelationV2[id#27, a#28, b#29] local.iceberg_except_test local.iceberg_except_test
   +- Project [id#30, a#31, b#32]
      +- SubqueryAlias local.iceberg_except_test
         +- RelationV2[id#30, a#31, b#32] local.iceberg_except_test local.iceberg_except_test

== Optimized Logical Plan ==
Aggregate [id#27, a#28, b#29], [id#27, a#28, b#29]
+- RelationV2[id#27, a#28, b#29] local.iceberg_except_test
```
Re: [PR] Spark: Fix SparkTable to use name and effective snapshotID for comparing [iceberg]
ajantha-bhat commented on code in PR #9455: URL: https://github.com/apache/iceberg/pull/9455#discussion_r1447345339 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java: ## @@ -405,15 +407,25 @@ public boolean equals(Object other) { return false; } -// use only name in order to correctly invalidate Spark cache +// use name and effective snapshot id to support time travel SparkTable that = (SparkTable) other; -return icebergTable.name().equals(that.icebergTable.name()); +return icebergTable.name().equals(that.icebergTable.name()) +&& Objects.equals(effectiveSnapshotId(), that.effectiveSnapshotId()); } @Override public int hashCode() { -// use only name in order to correctly invalidate Spark cache -return icebergTable.name().hashCode(); +// use name and effective snapshot id to support time travel +return Objects.hash(icebergTable.name(), effectiveSnapshotId()); + } + + public Long effectiveSnapshotId() { Review Comment: should we have an `effectiveSnapshotId` field that is initialized in the constructor, instead of computing it on every equals() and hashCode() call? ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkTable.java: ## @@ -53,4 +53,41 @@ public void testTableEquality() throws NoSuchTableException { assertThat(table1).as("References must be different").isNotSameAs(table2); assertThat(table1).as("Tables must be equivalent").isEqualTo(table2); } + + @TestTemplate + public void testEffectiveSnapshotIdEquality() throws NoSuchTableException { +CatalogManager catalogManager = spark.sessionState().catalogManager(); +TableCatalog catalog = (TableCatalog) catalogManager.catalog(catalogName); +Identifier identifier = Identifier.of(tableIdent.namespace().levels(), tableIdent.name()); + +sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName); + +SparkTable table = (SparkTable) catalog.loadTable(identifier); +final long version1Snapshot = table.effectiveSnapshotId(); +final String version1 = "VERSION_1"; +table.table().manageSnapshots().createTag(version1, version1Snapshot).commit(); + +SparkTable firstSnapshotTable = table.copyWithSnapshotId(version1Snapshot); +SparkTable firstTagTable = table.copyWithBranch(version1); + +sql("UPDATE %s SET data = 'b'", tableName); + +final String version2 = "VERSION_2"; +table.table().manageSnapshots().createTag(version2, table.effectiveSnapshotId()).commit(); + +SparkTable secondTagTable = table.copyWithBranch(version2); + +assertThat(firstSnapshotTable) +.as("References for two different SparkTables must be different") +.isNotSameAs(firstTagTable); +assertThat(firstSnapshotTable) +.as("The different snapshots with same id must be equal") +.isEqualTo(firstTagTable); +assertThat(firstTagTable) +.as("The different snapshots should not match") +.isNotEqualTo(secondTagTable); +assertThat(table) +.as("The SparkTable should points to latest snapshot") +.isEqualTo(secondTagTable); Review Comment: Can we also add a SQL query that queries both tags and validates the results, like you mentioned in the issue?
Re: [PR] Flink: Added error handling and default logic for Flink version detection [iceberg]
gjacoby126 commented on code in PR #9452: URL: https://github.com/apache/iceberg/pull/9452#discussion_r1447388713 ## flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/util/FlinkVersionDetector.java: ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.flink.util; + +import org.apache.flink.streaming.api.datastream.DataStream; + +public class FlinkVersionDetector { + public String version() { +String version = null; +try { + version = getVersionFromJar(); +} catch (Exception e) { + /* we can't detect the exact implementation version from the jar (this can happen if the DataStream class Review Comment: @pvary - Yes, it's a packaging issue. In my use case that reproduced the problem, the two instances of DataStream on the classpath are actually identical, due to unrelocated shading. That normally would cause no issues aside from build warnings. Of course we're fixing that too. Whether or not the packaging problem causes other bugs, though, Iceberg's reaction shouldn't be to throw an NPE and crash the Flink pipeline, which is what currently happens and what this PR fixes.
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1447409989 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java: ## @@ -142,9 +143,15 @@ public FileIO io() { @Override protected void doRefresh() { String metadataLocation = null; +Table table = null; + try { - Table table = metaClients.run(client -> client.getTable(database, tableName)); - HiveOperationsBase.validateTableIsIceberg(table, fullName); + table = metaClients.run(client -> client.getTable(database, tableName)); + HiveOperationsBase.validateTableOrViewIsIceberg(table, fullName); + + if (table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) { Review Comment: IIRC the decision was not to use the HMS tableType for this. Shouldn't we use the `BaseMetastoreTableOperations.TABLE_TYPE_PROP` property instead?
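A minimal sketch of the alternative check, assuming views carry the same `table_type` parameter in HMS that Iceberg tables do (the constants are from `BaseMetastoreTableOperations`; the wrapper class is illustrative):

```java
import org.apache.hadoop.hive.metastore.api.Table;
import org.apache.iceberg.BaseMetastoreTableOperations;

class IcebergTypeCheck {
  // identify an Iceberg object by the table_type parameter stored in HMS,
  // rather than by the HMS TableType (MANAGED_TABLE, VIRTUAL_VIEW, ...)
  static boolean isIcebergObject(Table hmsTable) {
    String tableType =
        hmsTable.getParameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP);
    return BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.equalsIgnoreCase(tableType);
  }
}
```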
Re: [PR] Core: rewrite should drop delete files by data sequence number partition wise [iceberg]
ajantha-bhat commented on code in PR #9454: URL: https://github.com/apache/iceberg/pull/9454#discussion_r1447410277 ## spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDeleteFilesAction.java: ## @@ -0,0 +1,400 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.extensions; + +import static org.apache.iceberg.TableProperties.DROP_PARTITION_DELETE_ENABLED; +import static org.apache.iceberg.types.Types.NestedField.optional; + +import java.io.File; +import java.io.IOException; +import java.io.UncheckedIOException; +import java.util.Collections; +import java.util.List; +import java.util.Map; +import java.util.Random; +import java.util.UUID; +import java.util.stream.Collectors; +import java.util.stream.IntStream; +import org.apache.hadoop.conf.Configuration; +import org.apache.iceberg.BaseTable; +import org.apache.iceberg.DataFile; +import org.apache.iceberg.DeleteFile; +import org.apache.iceberg.FileFormat; +import org.apache.iceberg.FileScanTask; +import org.apache.iceberg.PartitionKey; +import org.apache.iceberg.PartitionSpec; +import org.apache.iceberg.Schema; +import org.apache.iceberg.StructLike; +import org.apache.iceberg.Table; +import org.apache.iceberg.TableOperations; +import org.apache.iceberg.TableProperties; +import org.apache.iceberg.actions.RewriteDataFiles; +import org.apache.iceberg.actions.RewriteDataFiles.Result; +import org.apache.iceberg.actions.SizeBasedFileRewriter; +import org.apache.iceberg.data.GenericAppenderFactory; +import org.apache.iceberg.data.GenericRecord; +import org.apache.iceberg.data.Record; +import org.apache.iceberg.deletes.EqualityDeleteWriter; +import org.apache.iceberg.deletes.PositionDelete; +import org.apache.iceberg.deletes.PositionDeleteWriter; +import org.apache.iceberg.encryption.EncryptedFiles; +import org.apache.iceberg.encryption.EncryptedOutputFile; +import org.apache.iceberg.encryption.EncryptionKeyMetadata; +import org.apache.iceberg.expressions.Expressions; +import org.apache.iceberg.hadoop.HadoopTables; +import org.apache.iceberg.io.OutputFile; +import org.apache.iceberg.io.OutputFileFactory; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; +import org.apache.iceberg.relocated.com.google.common.collect.Iterables; +import org.apache.iceberg.relocated.com.google.common.collect.Lists; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.relocated.com.google.common.collect.Streams; +import org.apache.iceberg.spark.SparkTestBase; +import org.apache.iceberg.spark.actions.SparkActions; +import org.apache.iceberg.spark.source.ThreeColumnRecord; +import org.apache.iceberg.types.Types; +import org.apache.iceberg.util.ArrayUtil; +import 
org.apache.iceberg.util.Pair; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.TemporaryFolder; + +public class TestRewriteDeleteFilesAction extends SparkTestBase { + + private static final int SCALE = 40; + + private static final HadoopTables TABLES = new HadoopTables(new Configuration()); + private static final Schema SCHEMA = + new Schema( + optional(1, "c1", Types.IntegerType.get()), + optional(2, "c2", Types.StringType.get()), + optional(3, "c3", Types.StringType.get())); + + PartitionSpec partitionSpecC1 = PartitionSpec.builderFor(SCHEMA).identity("c1").build(); + + @Rule public TemporaryFolder temp = new TemporaryFolder(); + + private String tableLocation = null; + + @Before + public void setupTableLocation() throws Exception { +File tableDir = temp.newFolder(); +this.tableLocation = tableDir.toURI().toString(); + } + + private Result rewriteTable(Table table) { +return actions() +.rewriteDataFiles(table) +.option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(Long.MAX_VALUE - 1)) +.option(SizeBasedFileRewriter.REWRITE_ALL, "true") +.execute(); + } + + @Test + public void testRewritePartitionDeletesShouldNotRetain() throws IOException { +// TODO: anothe
Re: [PR] Spark: Fix SparkTable to use name and effective snapshotID for comparing [iceberg]
wooyeong commented on code in PR #9455: URL: https://github.com/apache/iceberg/pull/9455#discussion_r1447412492 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java: ## @@ -405,15 +407,25 @@ public boolean equals(Object other) { return false; } -// use only name in order to correctly invalidate Spark cache +// use name and effective snapshot id to support time travel SparkTable that = (SparkTable) other; -return icebergTable.name().equals(that.icebergTable.name()); +return icebergTable.name().equals(that.icebergTable.name()) +&& Objects.equals(effectiveSnapshotId(), that.effectiveSnapshotId()); } @Override public int hashCode() { -// use only name in order to correctly invalidate Spark cache -return icebergTable.name().hashCode(); +// use name and effective snapshot id to support time travel +return Objects.hash(icebergTable.name(), effectiveSnapshotId()); + } + + public Long effectiveSnapshotId() { Review Comment: When you have `snapshotId` or `branch`, you can create it in advance. However, when you have neither, you have to calculate `icebergTable.currentSnapshot()` every time, as it can change. Moreover, `icebergTable.currentSnapshot()` can be null (for example, at [this point](https://github.com/apache/iceberg/blob/4d34398cfd32465222f55df522fcd5a2db59c92c/spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkTable.java#L53)), so we need to null-check manually. I'll update `SparkTable` and the test case a little. ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkTable.java: ## @@ -53,4 +53,41 @@ public void testTableEquality() throws NoSuchTableException { assertThat(table1).as("References must be different").isNotSameAs(table2); assertThat(table1).as("Tables must be equivalent").isEqualTo(table2); } + + @TestTemplate + public void testEffectiveSnapshotIdEquality() throws NoSuchTableException { +CatalogManager catalogManager = spark.sessionState().catalogManager(); +TableCatalog catalog = (TableCatalog) catalogManager.catalog(catalogName); +Identifier identifier = Identifier.of(tableIdent.namespace().levels(), tableIdent.name()); + +sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName); + +SparkTable table = (SparkTable) catalog.loadTable(identifier); +final long version1Snapshot = table.effectiveSnapshotId(); +final String version1 = "VERSION_1"; +table.table().manageSnapshots().createTag(version1, version1Snapshot).commit(); + +SparkTable firstSnapshotTable = table.copyWithSnapshotId(version1Snapshot); +SparkTable firstTagTable = table.copyWithBranch(version1); + +sql("UPDATE %s SET data = 'b'", tableName); + +final String version2 = "VERSION_2"; +table.table().manageSnapshots().createTag(version2, table.effectiveSnapshotId()).commit(); + +SparkTable secondTagTable = table.copyWithBranch(version2); + +assertThat(firstSnapshotTable) +.as("References for two different SparkTables must be different") +.isNotSameAs(firstTagTable); +assertThat(firstSnapshotTable) +.as("The different snapshots with same id must be equal") +.isEqualTo(firstTagTable); +assertThat(firstTagTable) +.as("The different snapshots should not match") +.isNotEqualTo(secondTagTable); +assertThat(table) +.as("The SparkTable should points to latest snapshot") +.isEqualTo(secondTagTable); Review Comment: Great, I'll add some SQL query tests below.
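A sketch of the null-safe accessor described above (an assumed shape, not necessarily the PR's final code): a pinned `snapshotId` wins, then a `branch` is resolved through the table's refs, and only with neither set is the current snapshot re-read on each call, which may be null for an empty table.

```java
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.SnapshotRef;
import org.apache.iceberg.Table;

// Field layout mirrors the SparkTable discussion in this thread but is an
// assumption, not the merged code.
class EffectiveSnapshot {
  private final Table icebergTable;
  private final Long snapshotId; // set when reading VERSION AS OF <id>
  private final String branch;   // set when reading a branch or tag

  EffectiveSnapshot(Table icebergTable, Long snapshotId, String branch) {
    this.icebergTable = icebergTable;
    this.snapshotId = snapshotId;
    this.branch = branch;
  }

  Long effectiveSnapshotId() {
    if (snapshotId != null) {
      return snapshotId; // explicitly pinned via time travel
    }

    if (branch != null) {
      SnapshotRef ref = icebergTable.refs().get(branch); // branch or tag head
      return ref != null ? ref.snapshotId() : null;
    }

    // no pin: recompute on every call, since the current snapshot moves with
    // each commit and is null before the first one
    Snapshot current = icebergTable.currentSnapshot();
    return current != null ? current.snapshotId() : null;
  }
}
```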
Re: [PR] Spark: Fix SparkTable to use name and effective snapshotID for comparing [iceberg]
ajantha-bhat commented on code in PR #9455: URL: https://github.com/apache/iceberg/pull/9455#discussion_r1447415016 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java: ## @@ -405,15 +407,25 @@ public boolean equals(Object other) { return false; } -// use only name in order to correctly invalidate Spark cache +// use name and effective snapshot id to support time travel SparkTable that = (SparkTable) other; -return icebergTable.name().equals(that.icebergTable.name()); +return icebergTable.name().equals(that.icebergTable.name()) +&& Objects.equals(effectiveSnapshotId(), that.effectiveSnapshotId()); } @Override public int hashCode() { -// use only name in order to correctly invalidate Spark cache -return icebergTable.name().hashCode(); +// use name and effective snapshot id to support time travel +return Objects.hash(icebergTable.name(), effectiveSnapshotId()); + } + + public Long effectiveSnapshotId() { Review Comment: Ack
Re: [PR] Spark: Fix SparkTable to use name and effective snapshotID for comparing [iceberg]
wooyeong commented on code in PR #9455: URL: https://github.com/apache/iceberg/pull/9455#discussion_r1447417783 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java: ## @@ -405,15 +407,25 @@ public boolean equals(Object other) { return false; } -// use only name in order to correctly invalidate Spark cache +// use name and effective snapshot id to support time travel SparkTable that = (SparkTable) other; -return icebergTable.name().equals(that.icebergTable.name()); +return icebergTable.name().equals(that.icebergTable.name()) +&& Objects.equals(effectiveSnapshotId(), that.effectiveSnapshotId()); } @Override public int hashCode() { -// use only name in order to correctly invalidate Spark cache -return icebergTable.name().hashCode(); +// use name and effective snapshot id to support time travel +return Objects.hash(icebergTable.name(), effectiveSnapshotId()); + } + + public Long effectiveSnapshotId() { Review Comment: Please note the statements below; the `table`'s `currentSnapshot` changes.

```java
assertThat(table).as("The SparkTable points to latest snapshot").isEqualTo(firstTagTable);
assertThat(table).as("The SparkTable points to latest snapshot").isEqualTo(secondTagTable);
```
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1447424619 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java: ## @@ -0,0 +1,306 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.hive; + +import java.util.Collections; +import java.util.Locale; +import java.util.Map; +import java.util.Objects; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.hive.metastore.IMetaStoreClient; +import org.apache.hadoop.hive.metastore.TableType; +import org.apache.hadoop.hive.metastore.api.InvalidObjectException; +import org.apache.hadoop.hive.metastore.api.NoSuchObjectException; +import org.apache.hadoop.hive.metastore.api.Table; +import org.apache.iceberg.BaseMetastoreCatalog; +import org.apache.iceberg.BaseMetastoreTableOperations; +import org.apache.iceberg.ClientPool; +import org.apache.iceberg.catalog.TableIdentifier; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.exceptions.CommitFailedException; +import org.apache.iceberg.exceptions.CommitStateUnknownException; +import org.apache.iceberg.exceptions.NoSuchIcebergViewException; +import org.apache.iceberg.exceptions.NoSuchViewException; +import org.apache.iceberg.exceptions.ValidationException; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.util.MetastoreOperationsUtil; +import org.apache.iceberg.view.BaseViewOperations; +import org.apache.iceberg.view.ViewMetadata; +import org.apache.thrift.TException; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** Hive implementation of Iceberg ViewOperations. 
*/ +final class HiveViewOperations extends BaseViewOperations implements HiveOperationsBase { + private static final Logger LOG = LoggerFactory.getLogger(HiveViewOperations.class); + + private final String fullName; + private final String database; + private final String viewName; + private final FileIO fileIO; + private final ClientPool metaClients; + private final long maxHiveTablePropertySize; + + HiveViewOperations( + Configuration conf, + ClientPool metaClients, + FileIO fileIO, + String catalogName, + TableIdentifier viewIdentifier) { +String dbName = viewIdentifier.namespace().level(0); +this.metaClients = metaClients; +this.fileIO = fileIO; +this.fullName = BaseMetastoreCatalog.fullTableName(catalogName, viewIdentifier); +this.database = dbName; +this.viewName = viewIdentifier.name(); +this.maxHiveTablePropertySize = +conf.getLong(HIVE_TABLE_PROPERTY_MAX_SIZE, HIVE_TABLE_PROPERTY_MAX_SIZE_DEFAULT); + } + + @Override + public void doRefresh() { +String metadataLocation = null; +Table table; + +try { + table = metaClients.run(client -> client.getTable(database, viewName)); + HiveOperationsBase.validateTableOrViewIsIceberg(table, fullName); + + if (!table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) { +throw new NoSuchObjectException(); + } Review Comment: Why not throw the final exception immediately?
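A sketch of what the reviewer suggests (an assumed shape, not the final PR code): fail at the point of detection with the Iceberg-level exception instead of throwing HMS's `NoSuchObjectException` only to translate it in a catch block further down.

```java
import org.apache.hadoop.hive.metastore.TableType;
import org.apache.hadoop.hive.metastore.api.Table;
import org.apache.iceberg.exceptions.NoSuchViewException;

class ViewTypeCheck {
  static void requireVirtualView(Table table, String database, String viewName) {
    if (!table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) {
      // throw the final exception immediately: no intermediate
      // NoSuchObjectException plus catch/translate round trip
      throw new NoSuchViewException("View does not exist: %s.%s", database, viewName);
    }
  }
}
```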
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1447426135 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java: ## @@ -0,0 +1,287 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.hive; + +import java.util.Collections; +import java.util.Locale; +import java.util.Objects; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.hive.metastore.IMetaStoreClient; +import org.apache.hadoop.hive.metastore.TableType; +import org.apache.hadoop.hive.metastore.api.InvalidObjectException; +import org.apache.hadoop.hive.metastore.api.NoSuchObjectException; +import org.apache.hadoop.hive.metastore.api.Table; +import org.apache.iceberg.BaseMetastoreCatalog; +import org.apache.iceberg.BaseMetastoreTableOperations; +import org.apache.iceberg.ClientPool; +import org.apache.iceberg.catalog.TableIdentifier; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.exceptions.CommitFailedException; +import org.apache.iceberg.exceptions.CommitStateUnknownException; +import org.apache.iceberg.exceptions.NoSuchIcebergViewException; +import org.apache.iceberg.exceptions.NoSuchViewException; +import org.apache.iceberg.exceptions.ValidationException; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.util.MetastoreOperationsUtil; +import org.apache.iceberg.view.BaseViewOperations; +import org.apache.iceberg.view.ViewMetadata; +import org.apache.thrift.TException; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** Hive implementation of Iceberg ViewOperations. 
*/ +final class HiveViewOperations extends BaseViewOperations implements HiveOperationsBase { + private static final Logger LOG = LoggerFactory.getLogger(HiveViewOperations.class); + + private final String fullName; + private final String database; + private final String viewName; + private final FileIO fileIO; + private final ClientPool metaClients; + private final long maxHiveTablePropertySize; + + HiveViewOperations( + Configuration conf, + ClientPool metaClients, + FileIO fileIO, + String catalogName, + TableIdentifier viewIdentifier) { +String dbName = viewIdentifier.namespace().level(0); +this.metaClients = metaClients; +this.fileIO = fileIO; +this.fullName = BaseMetastoreCatalog.fullTableName(catalogName, viewIdentifier); +this.database = dbName; +this.viewName = viewIdentifier.name(); +this.maxHiveTablePropertySize = +conf.getLong(HIVE_TABLE_PROPERTY_MAX_SIZE, HIVE_TABLE_PROPERTY_MAX_SIZE_DEFAULT); + } + + @Override + public void doRefresh() { +String metadataLocation = null; +Table table = null; + +try { + table = metaClients.run(client -> client.getTable(database, viewName)); + HiveOperationsBase.validateTableOrViewIsIceberg(table, fullName); + metadataLocation = + table.getParameters().get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP); + +} catch (NoSuchObjectException e) { + if (currentMetadataLocation() != null) { +throw new NoSuchViewException("View does not exist: %s.%s", database, viewName); + } +} catch (TException e) { + String errMsg = + String.format("Failed to get view info from metastore %s.%s", database, viewName); + throw new RuntimeException(errMsg, e); +} catch (InterruptedException e) { + Thread.currentThread().interrupt(); + throw new RuntimeException("Interrupted during refresh", e); +} + +if (table != null && !table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) { + disableRefresh(); +} else { + refreshFromMetadataLocation(metadataLocation); +} + } + + @SuppressWarnings("checkstyle:CyclomaticComplexity") + @Override + public void doCommit(ViewMetadata base, ViewMetadata metadata) { Review Comment: Why not use a commit lock for view creation and modification? We could have the same concurrency issues here as with tables.
Re: [PR] Hive: Unwrap RuntimeException for Hive InvalidOperationException with rename table [iceberg]
pvary commented on code in PR #9432: URL: https://github.com/apache/iceberg/pull/9432#discussion_r1447431727 ## hive-metastore/src/main/java/org/apache/iceberg/hive/MetastoreUtil.java: ## @@ -72,6 +73,23 @@ public static void alterTable( env.putAll(extraEnv); env.put(StatsSetupConst.DO_NOT_UPDATE_STATS, StatsSetupConst.TRUE); -ALTER_TABLE.invoke(client, databaseName, tblName, table, new EnvironmentContext(env)); +try { + ALTER_TABLE.invoke(client, databaseName, tblName, table, new EnvironmentContext(env)); +} catch (RuntimeException e) { Review Comment: I made a mistake in the above comment: I thought that the root exception is `MetaException`, but in reality it is `TException`. With that corrected, my comment is: - `alterTable` wraps the exceptions in a `RuntimeException` because of the `DynMethods` calls. I think this is a mistake. We should behave the same way as calls like `metaClients.run(client -> client...)` and throw the original `TException` - my opinion here could change if this becomes a breaking change somewhere, but if it does not break code for someone, then we should change it.
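A sketch of the unwrapping being proposed, assuming the DynMethods invocation wraps checked Thrift exceptions in a RuntimeException (the helper name and Runnable stand-in are hypothetical):

```java
import org.apache.thrift.TException;

class AlterTableCall {
  // hypothetical helper standing in for MetastoreUtil.alterTable: rethrow
  // the original TException so callers see the same exception they would
  // get from a direct metaClients.run(client -> ...) call
  static void invokeUnwrapped(Runnable dynMethodInvocation) throws TException {
    try {
      dynMethodInvocation.run();
    } catch (RuntimeException e) {
      if (e.getCause() instanceof TException) {
        throw (TException) e.getCause();
      }
      throw e; // anything else really is a runtime failure
    }
  }
}
```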
Re: [PR] API, Core: Fix errorprone warnings [iceberg]
Fokko merged PR #9419: URL: https://github.com/apache/iceberg/pull/9419
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
Fokko commented on PR #8701: URL: https://github.com/apache/iceberg/pull/8701#issuecomment-1884916386 Since there are no further comments, I'll go ahead and merge this. I would like to express my gratitude to @bryanck for working on this since this will help so many people in the Kafka community to get their data into Iceberg in a fast and reliable way! 🙏 Thanks @ajantha-bhat, @danielcweeks, @rdblue, @jbonofre and @nastra for the review 🚀
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
Fokko merged PR #8701: URL: https://github.com/apache/iceberg/pull/8701
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
jbonofre commented on PR #8701: URL: https://github.com/apache/iceberg/pull/8701#issuecomment-1884919080 @Fokko awesome, thanks!
Re: [I] Support Kafka Connect within Iceberg [iceberg]
Fokko closed issue #4977: Support Kafka Connect within Iceberg URL: https://github.com/apache/iceberg/issues/4977
Re: [I] Support Kafka Connect within Iceberg [iceberg]
Fokko commented on issue #4977: URL: https://github.com/apache/iceberg/issues/4977#issuecomment-1884930862 Hey everyone, the Kafka Connect sink by Tabular has been donated to Iceberg in https://github.com/apache/iceberg/pull/8701. I'll go ahead and close this issue; feel free to open a new one if there are any further questions.
Re: [I] Flink1.12.1 +Iceberg0.12.0 has problems with real-time reading and writing in upsert mode [iceberg]
Fokko closed issue #3277: Flink1.12.1 +Iceberg0.12.0 has problems with real-time reading and writing in upsert mode URL: https://github.com/apache/iceberg/issues/3277
Re: [I] Flink1.12.1 +Iceberg0.12.0 has problems with real-time reading and writing in upsert mode [iceberg]
Fokko commented on issue #3277: URL: https://github.com/apache/iceberg/issues/3277#issuecomment-1884931848 This has been fixed in a later version; I'll go ahead and close this for now.
Re: [I] Snowflake Iceberg Partitioned data read issue [iceberg]
purna344 commented on issue #9404: URL: https://github.com/apache/iceberg/issues/9404#issuecomment-1885023931 If the producers write the data to storage after setting the config value `spark.conf.set("spark.databricks.delta.writePartitionColumnsToParquet", "false")`, then the *.parquet files do not contain the partition-column information, and the partition values are stored only in the file path. It is not possible for us to ask producers not to set this config value in the Spark jobs that publish the data. I heard that the Iceberg format expects the partition values in the parquet file. How do we handle this scenario, and does Iceberg support any config parameter to read the partition values from the folder path? CC: @amogh-jahagirdar
Re: [PR] Spark 3.5: Migrate tests in SQL directory to JUnit5 [iceberg]
nastra merged PR #9401: URL: https://github.com/apache/iceberg/pull/9401
Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]
nastra commented on code in PR #9416: URL: https://github.com/apache/iceberg/pull/9416#discussion_r1447600190 ## core/src/test/java/org/apache/iceberg/TestBase.java: ## @@ -173,7 +173,7 @@ public class TestBase { public TestTables.TestTable table = null; @Parameters(name = "formatVersion = {0}") - protected static List parameters() { + protected static List parameters() { Review Comment: why is this change necessary?
Re: [PR] Flink: Added error handling and default logic for Flink version detection [iceberg]
gjacoby126 commented on code in PR #9452: URL: https://github.com/apache/iceberg/pull/9452#discussion_r1447601555 ## flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/util/FlinkVersionDetector.java: ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.flink.util; + +import org.apache.flink.streaming.api.datastream.DataStream; + +public class FlinkVersionDetector { + public String version() { +String version = null; +try { + version = getVersionFromJar(); +} catch (Exception e) { + /* we can't detect the exact implementation version from the jar (this can happen if the DataStream class Review Comment: Oh, and just FYI, I'm not affiliated with the engineers who posted the original issue this patch fixes; I just came across it while investigating. So this has happened at least twice in real-world use.
Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]
nastra commented on code in PR #9416: URL: https://github.com/apache/iceberg/pull/9416#discussion_r1447603337 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/TestSparkDistributedDataScanDeletes.java: ## @@ -21,42 +21,39 @@ import static org.apache.iceberg.PlanningMode.DISTRIBUTED; import static org.apache.iceberg.PlanningMode.LOCAL; +import java.util.Arrays; +import java.util.List; import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; import org.apache.iceberg.spark.SparkReadConf; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.internal.SQLConf; -import org.junit.AfterClass; -import org.junit.Before; -import org.junit.BeforeClass; -import org.junit.runner.RunWith; -import org.junit.runners.Parameterized; +import org.junit.jupiter.api.AfterAll; +import org.junit.jupiter.api.BeforeAll; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.extension.ExtendWith; -@RunWith(Parameterized.class) +@ExtendWith(ParameterizedTestExtension.class) public class TestSparkDistributedDataScanDeletes extends DeleteFileIndexTestBase> { - @Parameterized.Parameters(name = "dataMode = {0}, deleteMode = {1}") - public static Object[] parameters() { -return new Object[][] { - new Object[] {LOCAL, LOCAL}, - new Object[] {LOCAL, DISTRIBUTED}, - new Object[] {DISTRIBUTED, LOCAL}, - new Object[] {DISTRIBUTED, DISTRIBUTED} -}; + @Parameters(name = "formatVersion = {0}, dataMode = {1}, deleteMode = {2}") + public static List parameters() { Review Comment:

```suggestion
  public static List parameters() {
```
Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]
nastra commented on code in PR #9416: URL: https://github.com/apache/iceberg/pull/9416#discussion_r1447602835 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/SparkDistributedDataScanTestBase.java: ## @@ -21,46 +21,41 @@ import static org.apache.iceberg.PlanningMode.DISTRIBUTED; import static org.apache.iceberg.PlanningMode.LOCAL; +import java.util.Arrays; +import java.util.List; import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; import org.apache.iceberg.spark.SparkReadConf; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.internal.SQLConf; -import org.junit.Before; -import org.junit.runner.RunWith; -import org.junit.runners.Parameterized; -import org.junit.runners.Parameterized.Parameters; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.extension.ExtendWith; -@RunWith(Parameterized.class) +@ExtendWith(ParameterizedTestExtension.class) public abstract class SparkDistributedDataScanTestBase extends DataTableScanTestBase> { @Parameters(name = "formatVersion = {0}, dataMode = {1}, deleteMode = {2}") - public static Object[] parameters() { -return new Object[][] { - new Object[] {1, LOCAL, LOCAL}, - new Object[] {1, LOCAL, DISTRIBUTED}, - new Object[] {1, DISTRIBUTED, LOCAL}, - new Object[] {1, DISTRIBUTED, DISTRIBUTED}, - new Object[] {2, LOCAL, LOCAL}, - new Object[] {2, LOCAL, DISTRIBUTED}, - new Object[] {2, DISTRIBUTED, LOCAL}, - new Object[] {2, DISTRIBUTED, DISTRIBUTED} -}; + public static List parameters() { Review Comment:

```suggestion
  public static List parameters() {
```
Re: [PR] Flink: Added error handling and default logic for Flink version detection [iceberg]
pvary commented on code in PR #9452: URL: https://github.com/apache/iceberg/pull/9452#discussion_r1447604053 ## flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/util/FlinkVersionDetector.java: ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.flink.util; + +import org.apache.flink.streaming.api.datastream.DataStream; + +public class FlinkVersionDetector { + public String version() { +String version = null; +try { + version = getVersionFromJar(); +} catch (Exception e) { + /* we can't detect the exact implementation version from the jar (this can happen if the DataStream class Review Comment: Why did we get the `NullPointerException` in this case? In my experience, if we have the class on the classpath multiple times, Java just uses the first one it finds. Was it because the shading messed up the metadata of the jar? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]
nastra commented on code in PR #9416: URL: https://github.com/apache/iceberg/pull/9416#discussion_r1447604611 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/TestSparkDistributedDataScanReporting.java: ## @@ -21,41 +21,38 @@ import static org.apache.iceberg.PlanningMode.DISTRIBUTED; import static org.apache.iceberg.PlanningMode.LOCAL; +import java.util.Arrays; +import java.util.List; import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; import org.apache.iceberg.spark.SparkReadConf; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.internal.SQLConf; -import org.junit.AfterClass; -import org.junit.BeforeClass; -import org.junit.runner.RunWith; -import org.junit.runners.Parameterized; +import org.junit.jupiter.api.AfterAll; +import org.junit.jupiter.api.BeforeAll; +import org.junit.jupiter.api.extension.ExtendWith; -@RunWith(Parameterized.class) +@ExtendWith(ParameterizedTestExtension.class) public class TestSparkDistributedDataScanReporting extends ScanPlanningAndReportingTestBase> { - @Parameterized.Parameters(name = "dataMode = {0}, deleteMode = {1}") - public static Object[] parameters() { -return new Object[][] { - new Object[] {LOCAL, LOCAL}, - new Object[] {LOCAL, DISTRIBUTED}, - new Object[] {DISTRIBUTED, LOCAL}, - new Object[] {DISTRIBUTED, DISTRIBUTED} -}; + @Parameters(name = "formatVersion = {0}, dataMode = {1}, deleteMode = {2}") + public static List<Object[]> parameters() { Review Comment:
```suggestion
  public static List<Object> parameters() {
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]
nastra commented on code in PR #9416: URL: https://github.com/apache/iceberg/pull/9416#discussion_r1447605228 ## data/src/test/java/org/apache/iceberg/io/TestGenericSortedPosDeleteWriter.java: ## @@ -62,7 +62,7 @@ public class TestGenericSortedPosDeleteWriter extends TestBase { private Record gRecord; @Parameters(name = "formatVersion = {0}, fileFormat = {1}") - public static List<Object> parameters() { + public static List<Object[]> parameters() { Review Comment: this should be reverted -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]
nastra commented on code in PR #9416: URL: https://github.com/apache/iceberg/pull/9416#discussion_r1447607114 ## core/src/test/java/org/apache/iceberg/TestLocalFilterFiles.java: ## @@ -18,20 +18,17 @@ */ package org.apache.iceberg; -import org.junit.runner.RunWith; -import org.junit.runners.Parameterized; +import java.util.Arrays; +import java.util.List; +import org.junit.jupiter.api.extension.ExtendWith; -@RunWith(Parameterized.class) +@ExtendWith(ParameterizedTestExtension.class) public class TestLocalFilterFiles extends FilterFilesTestBase { - @Parameterized.Parameters(name = "formatVersion = {0}") - public static Object[] parameters() { -return new Object[] {1, 2}; - } - - public TestLocalFilterFiles(int formatVersion) { -super(formatVersion); + @Parameters(name = "formatVersion = {0}") + public static List<Object[]> parameters() { Review Comment:
```suggestion
  public static List<Object> parameters() {
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]
nastra commented on code in PR #9416: URL: https://github.com/apache/iceberg/pull/9416#discussion_r1447606490 ## core/src/test/java/org/apache/iceberg/FilterFilesTestBase.java: ## @@ -24,67 +24,67 @@ import java.io.File; import java.io.IOException; import java.nio.ByteBuffer; +import java.nio.file.Files; +import java.nio.file.Path; import java.util.Map; import org.apache.iceberg.expressions.Expressions; import org.apache.iceberg.relocated.com.google.common.collect.Iterables; import org.apache.iceberg.relocated.com.google.common.collect.Maps; import org.apache.iceberg.types.Conversions; import org.apache.iceberg.types.Types; -import org.junit.After; -import org.junit.Before; -import org.junit.Rule; -import org.junit.Test; -import org.junit.rules.TemporaryFolder; +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.TestTemplate; +import org.junit.jupiter.api.extension.ExtendWith; +import org.junit.jupiter.api.io.TempDir; +@ExtendWith(ParameterizedTestExtension.class) public abstract class FilterFilesTestBase< ScanT extends Scan<ScanT, T, G>, T extends ScanTask, G extends ScanTaskGroup<T>> { - public final int formatVersion; - - public FilterFilesTestBase(int formatVersion) { -this.formatVersion = formatVersion; - } + @Parameter(index = 0) + public int formatVersion; Review Comment: does this need to be public or can we switch it to protected? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
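For readers following these migrations, the recurring pattern across the files in this PR looks roughly like the sketch below. The class is invented for illustration, and the package location of `Parameter`, `Parameters`, and `ParameterizedTestExtension` (Iceberg's own test utilities) is assumed from the diffs rather than confirmed by them.

```java
import java.util.Arrays;
import java.util.List;
import org.junit.jupiter.api.TestTemplate;
import org.junit.jupiter.api.extension.ExtendWith;

@ExtendWith(ParameterizedTestExtension.class)
public abstract class ExampleTestBase {

  // One list element per test invocation
  @Parameters(name = "formatVersion = {0}")
  protected static List<Object> parameters() {
    return Arrays.asList(1, 2);
  }

  // Injected by the extension; protected rather than public or private so
  // subclasses can read it without redeclaring the parameter
  @Parameter(index = 0)
  protected int formatVersion;

  // @TestTemplate replaces JUnit 4's @Test for parameterized runs
  @TestTemplate
  public void testFormatSpecificBehavior() {
    // assertions that depend on formatVersion go here
  }
}
```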
Re: [PR] Adding Junit 5 conversion and AssertJ style for TestFlinkCatalogTable… [iceberg]
nastra commented on code in PR #9381: URL: https://github.com/apache/iceberg/pull/9381#discussion_r1447617468 ## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/TestStreamScanSql.java: ## @@ -33,30 +36,27 @@ import org.apache.iceberg.FileFormat; import org.apache.iceberg.Table; import org.apache.iceberg.TestHelpers; -import org.apache.iceberg.catalog.Namespace; import org.apache.iceberg.catalog.TableIdentifier; import org.apache.iceberg.data.GenericAppenderHelper; import org.apache.iceberg.data.GenericRecord; import org.apache.iceberg.data.Record; -import org.apache.iceberg.flink.FlinkCatalogTestBase; +import org.apache.iceberg.flink.CatalogTestBase; import org.apache.iceberg.flink.MiniClusterResource; import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; import org.apache.iceberg.relocated.com.google.common.collect.Lists; import org.assertj.core.api.Assertions; -import org.junit.After; -import org.junit.Assert; -import org.junit.Before; -import org.junit.Test; +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.TestTemplate; +import org.junit.jupiter.api.io.TempDir; -public class TestStreamScanSql extends FlinkCatalogTestBase { +public class TestStreamScanSql extends CatalogTestBase { private static final String TABLE = "test_table"; private static final FileFormat FORMAT = FileFormat.PARQUET; private TableEnvironment tEnv; - public TestStreamScanSql(String catalogName, Namespace baseNamespace) { -super(catalogName, baseNamespace); - } + private @TempDir Path temp; Review Comment: no need to define this, you should be able to use `temporaryDirectory` from the superclass. However, I noticed that you might need to change the visibility of `TestBase.temporaryDirectory` to protected -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
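A minimal sketch of the suggested reuse, with invented class names; the point is only that a protected `@TempDir` field in the base class removes the need for a local one in each subclass.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import org.junit.jupiter.api.io.TempDir;

// In the shared base class: protected, so subclasses can reuse the field
public abstract class ExampleTestBase {
  @TempDir protected Path temporaryDirectory;
}

// In a subclass: no local @TempDir field needed
class ExampleStreamScanTest extends ExampleTestBase {
  void writeSomewhere() throws IOException {
    // create per-test scratch space under the injected temporary directory
    Path warehouse = Files.createTempDirectory(temporaryDirectory, "warehouse");
    // ... write test data under warehouse ...
  }
}
```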
Re: [PR] Adding Junit 5 conversion and AssertJ style for TestFlinkCatalogTable… [iceberg]
nastra commented on code in PR #9381: URL: https://github.com/apache/iceberg/pull/9381#discussion_r1447617985 ## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/TestStreamScanSql.java: ## @@ -127,20 +127,16 @@ private void insertRows(Table table, Row... rows) throws IOException { private void assertRows(List expectedRows, Iterator iterator) { for (Row expectedRow : expectedRows) { - Assert.assertTrue("Should have more records", iterator.hasNext()); - + assertThat(iterator.hasNext()).isTrue(); Review Comment: ```suggestion assertThat(iterator).hasNext(); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Support Kafka Connect within Iceberg [iceberg]
ajantha-bhat commented on issue #4977: URL: https://github.com/apache/iceberg/issues/4977#issuecomment-1885176312 > Hey everyone, the Kafka connect sink by Tabular has been donated to Iceberg in https://github.com/apache/iceberg/pull/8701. I go ahead and close this issue, feel free to open up a new one if there are any further questions. Note: Just the initial PR got merged. There is still a long way to go. But yeah, we can close this issue, noting that it is already supported via https://github.com/tabular-io/iceberg-kafka-connect and is also in the process of being donated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Added error handling and default logic for Flink version detection [iceberg]
gjacoby126 commented on code in PR #9452: URL: https://github.com/apache/iceberg/pull/9452#discussion_r1447629862 ## flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/util/FlinkVersionDetector.java: ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.flink.util; + +import org.apache.flink.streaming.api.datastream.DataStream; + +public class FlinkVersionDetector { + public String version() { +String version = null; +try { + version = getVersionFromJar(); +} catch (Exception e) { + /* we can't detect the exact implementation version from the jar (this can happen if the DataStream class Review Comment: I don't know for sure, but the shading messing up the metadata is my best guess too. In addition to the unit test in the patch, I did confirm that the patch resolves the issue in a real Flink cluster. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
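To make the control flow concrete, here is a minimal sketch of the detection-plus-fallback pattern under discussion. Only the try/catch skeleton appears in the diff above; the `getVersionFromJar` body and the fallback constant below are assumptions, not the PR's actual code.

```java
import org.apache.flink.streaming.api.datastream.DataStream;

public class FlinkVersionDetector {
  // Assumed fallback value; the PR's actual default string may differ
  static final String FLINK_UNKNOWN_VERSION = "FLINK-UNKNOWN-VERSION";

  public String version() {
    String version = null;
    try {
      version = getVersionFromJar();
    } catch (Exception e) {
      // swallow: getPackage() can be null for classes loaded from a shaded
      // or repackaged jar, which is one way to hit a NullPointerException here
    }
    if (version == null) {
      version = FLINK_UNKNOWN_VERSION;
    }
    return version;
  }

  String getVersionFromJar() {
    // Reads Implementation-Version from the jar manifest; returns null when
    // the manifest metadata is missing
    return DataStream.class.getPackage().getImplementationVersion();
  }
}
```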
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
bryanck commented on PR #8701: URL: https://github.com/apache/iceberg/pull/8701#issuecomment-1885241611 Awesome! Thanks all for the feedback and guidance. I'll follow up with PRs for the actual sink portion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] [Bug fix] Ensure string env var s3.connect-timeout is cast to float [iceberg-python]
syun64 opened a new pull request, #259: URL: https://github.com/apache/iceberg-python/pull/259
```
%env PYICEBERG_CATALOG__LACUS__S3.CONNECT-TIMEOUT=60

from pyiceberg.catalog import load_catalog

catalog = load_catalog("test")
tbl = catalog.load_table("test.test")
tbl.scan().to_arrow().to_pandas()
```
Stacktrace:
```
TypeError Traceback (most recent call last)
/tmp/ipykernel_734/2954162570.py in <module>()
      1 tbl = catalog.load_table("test.table")
> 2 tbl.scan().to_arrow().to_pandas()

/layers/com.bloomberg.ds.buildpacks.pip/requirements/lib/python3.10/site-packages/pyiceberg/table/__init__.py in to_arrow(self)
   1290
   1291 return project_table(
-> 1292 self.plan_files(),
   1293 self.table,
   1294 self.row_filter,

/layers/com.bloomberg.ds.buildpacks.pip/requirements/lib/python3.10/site-packages/pyiceberg/table/__init__.py in plan_files(self)
   1232 manifests = [
   1233 manifest_file
-> 1234 for manifest_file in snapshot.manifests(io)
   1235 if manifest_evaluators[manifest_file.partition_spec_id](manifest_file)
   1236 ]

/layers/com.bloomberg.ds.buildpacks.pip/requirements/lib/python3.10/site-packages/pyiceberg/table/snapshots.py in manifests(self, io)
    157 def manifests(self, io: FileIO) -> List[ManifestFile]:
    158 if self.manifest_list is not None:
--> 159 file = io.new_input(self.manifest_list)
    160 return list(read_manifest_list(file))
    161 return []

/layers/com.bloomberg.ds.buildpacks.pip/requirements/lib/python3.10/site-packages/pyiceberg/io/pyarrow.py in new_input(self, location)
    392 scheme, netloc, path = self.parse_location(location)
    393 return PyArrowFile(
--> 394 fs=self.fs_by_scheme(scheme, netloc),
    395 location=location,
    396 path=path,

/layers/com.bloomberg.ds.buildpacks.pip/requirements/lib/python3.10/site-packages/pyiceberg/io/pyarrow.py in _initialize_fs(self, scheme, netloc)
    340 client_kwargs["connect_timeout"] = connect_timeout
    341
--> 342 return S3FileSystem(**client_kwargs)
    343 elif scheme == "hdfs":
    344 from pyarrow.fs import HadoopFileSystem

/layers/com.bloomberg.ds.buildpacks.pip/requirements/lib/python3.10/site-packages/pyarrow/_s3fs.pyx in pyarrow._s3fs.S3FileSystem.__init__()

TypeError: must be real number, not str
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] [Bug fix] Ensure string env var s3.connect-timeout is cast to float [iceberg-python]
Fokko commented on PR #259: URL: https://github.com/apache/iceberg-python/pull/259#issuecomment-1885265016 @syun64 Can you fix the `mypy` violation: ``` pyiceberg/io/pyarrow.py:336: error: Incompatible types in assignment (expression has type "float", target has type "Optional[str]") [assignment] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Spark: Support min/max/count push down for partition columns [iceberg]
xiaoxuandev opened a new pull request, #9457: URL: https://github.com/apache/iceberg/pull/9457 ### Notes Support min/max/count aggregate push down for partition columns - min/max/count aggregate push down does not work if the partition columns aren't present as data columns (the stats won't be present in the avro files), so even after the aggregate has been pushed down to the data source, `AggregateEvaluator` will fail and the query still goes through a full table scan - add support by updating the evaluator based on PartitionData ### Testing Created a hive table: CREATE EXTERNAL TABLE store_sales (id int, data INT) PARTITIONED BY (ss_sold_date_sk INT) then registered it as an Iceberg table. Tested on Spark 3.5; verified count/min/max are successfully pushed down, and simple queries (`select count(ss_sold_date_sk) from store_sales`, `select min(ss_sold_date_sk) from store_sales` and `select max(ss_sold_date_sk) from store_sales`) are sped up by the change -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
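As background for why this works, each Iceberg data file carries its partition tuple in table metadata, so an aggregate over a partition column never needs the column stats inside the files. The following is a minimal hedged sketch of that idea, not the PR's actual `AggregateEvaluator` change; the field position lookup and the int-typed column are assumptions.

```java
import org.apache.iceberg.DataFile;

public class PartitionAggregateSketch {
  // pos is the partition field's position in each file's partition tuple (assumed known)
  static void minMaxCount(Iterable<DataFile> files, int pos) {
    Integer min = null;
    Integer max = null;
    long count = 0;
    for (DataFile file : files) {
      // every row in the file shares this partition value, so no column stats are needed
      Integer value = file.partition().get(pos, Integer.class);
      if (value == null) {
        continue; // a null partition value would force a fallback to a real scan
      }
      count += file.recordCount();
      min = (min == null || value < min) ? value : min;
      max = (max == null || value > max) ? value : max;
    }
    System.out.printf("min=%s, max=%s, count=%d%n", min, max, count);
  }
}
```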
Re: [PR] [Bug fix] Ensure string env var s3.connect-timeout is cast to float [iceberg-python]
Fokko commented on PR #259: URL: https://github.com/apache/iceberg-python/pull/259#issuecomment-1885335933 Thanks for fixing this @syun64 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] [Bug fix] Ensure string env var s3.connect-timeout is cast to float [iceberg-python]
Fokko merged PR #259: URL: https://github.com/apache/iceberg-python/pull/259 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Support partitioned writes [iceberg-python]
jqin61 commented on issue #208: URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1885365903 > In Iceberg it can be that some files are still on an older partitioning, we should make sure that we handle those correctly based on the spec that we provide. It seems Spark's Iceberg support has the following overwrite behaviors under partition spec evolution:
- dynamic overwrite: data files generated from an old partition spec will not be replaced, even if some of their records match the overwriting data
- static overwrite with PARTITION values specified: same as above
- static overwrite without PARTITION values: all data is deleted, regardless of which partition spec it conforms to

As Fokko mentioned, we need to make sure the implementation uses the latest partition spec_id when overwriting partitions, so that data in the old partition spec is not touched. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[I] Data: Errors in some file readers do not report the file in which they failed [iceberg]
RussellSpitzer opened a new issue, #9458: URL: https://github.com/apache/iceberg/issues/9458 ### Feature Request / Improvement There are several places in our code currently where a failure while reading a file will throw an exception, but the exception will not contain any information about which file was being read when the failure occurred. The Avro reader is an example of this. When planning a table scan with a corrupted manifest, the user will end up with an exception like
```java
org.apache.iceberg.exceptions.RuntimeIOException: Failed to read next record
at org.apache.iceberg.avro.AvroIterable$AvroReuseIterator.next(AvroIterable.java:204)
at org.apache.iceberg.io.CloseableIterable$7$1.next(CloseableIterable.java:202)
at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:65)
at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:49)
at org.apache.iceberg.io.CloseableIterable$7$1.hasNext(CloseableIterable.java:197)
at org.apache.iceberg.io.CloseableIterator$2.hasNext(CloseableIterator.java:72)
at org.apache.iceberg.io.ClosingIterator.hasNext(ClosingIterator.java:39)
at scala.collection.convert.JavaCollectionWrappers$JIteratorWrapper.hasNext(JavaCollectionWrappers.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:593)
at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
at
```
This is obviously not very useful, since we do not know which file it is. I think we should go into our AvroIterable (and other file format readers) and make sure that they include the file path in the error. ### Query engine None -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
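A small sketch of the kind of fix being requested, assuming the reader keeps a handle to the InputFile it is iterating; the helper and variable names are illustrative, not AvroIterable's actual structure.

```java
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.iceberg.exceptions.RuntimeIOException;
import org.apache.iceberg.io.InputFile;

public class ReadWithContextSketch {
  // Attach the file location when rethrowing, so a planning failure points
  // at the corrupt manifest instead of just saying the read failed
  static <T> T readNext(DataFileReader<T> reader, T reuse, InputFile file) {
    try {
      return reader.next(reuse);
    } catch (IOException e) {
      throw new RuntimeIOException(e, "Failed to read next record in %s", file.location());
    }
  }
}
```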
Re: [PR] Adding Junit 5 conversion and AssertJ style for TestFlinkCatalogTable… [iceberg]
vinitpatni commented on code in PR #9381: URL: https://github.com/apache/iceberg/pull/9381#discussion_r1447849833 ## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/TestStreamScanSql.java: ## @@ -33,30 +36,27 @@ import org.apache.iceberg.FileFormat; import org.apache.iceberg.Table; import org.apache.iceberg.TestHelpers; -import org.apache.iceberg.catalog.Namespace; import org.apache.iceberg.catalog.TableIdentifier; import org.apache.iceberg.data.GenericAppenderHelper; import org.apache.iceberg.data.GenericRecord; import org.apache.iceberg.data.Record; -import org.apache.iceberg.flink.FlinkCatalogTestBase; +import org.apache.iceberg.flink.CatalogTestBase; import org.apache.iceberg.flink.MiniClusterResource; import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; import org.apache.iceberg.relocated.com.google.common.collect.Lists; import org.assertj.core.api.Assertions; -import org.junit.After; -import org.junit.Assert; -import org.junit.Before; -import org.junit.Test; +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.TestTemplate; +import org.junit.jupiter.api.io.TempDir; -public class TestStreamScanSql extends FlinkCatalogTestBase { +public class TestStreamScanSql extends CatalogTestBase { private static final String TABLE = "test_table"; private static final FileFormat FORMAT = FileFormat.PARQUET; private TableEnvironment tEnv; - public TestStreamScanSql(String catalogName, Namespace baseNamespace) { -super(catalogName, baseNamespace); - } + private @TempDir Path temp; Review Comment: ack -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Adding Junit 5 conversion and AssertJ style for TestFlinkCatalogTable… [iceberg]
vinitpatni commented on code in PR #9381: URL: https://github.com/apache/iceberg/pull/9381#discussion_r1447853965 ## flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/TestStreamScanSql.java: ## @@ -127,20 +127,16 @@ private void insertRows(Table table, Row... rows) throws IOException { private void assertRows(List expectedRows, Iterator iterator) { for (Row expectedRow : expectedRows) { - Assert.assertTrue("Should have more records", iterator.hasNext()); - + assertThat(iterator.hasNext()).isTrue(); Review Comment: ack -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Adding Junit 5 conversion and AssertJ style for TestFlinkCatalogTable… [iceberg]
vinitpatni commented on PR #9381: URL: https://github.com/apache/iceberg/pull/9381#issuecomment-1885563043 - Addressing Review Comments -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] If iceberg's parquet data file contains an array of time type, it cannot be read by spark library even after dropping this column [iceberg]
huan233usc commented on issue #9446: URL: https://github.com/apache/iceberg/issues/9446#issuecomment-1885597211 Hi, if someone could point me to some background on the history of batchedReaderFunc vs ReaderFunc and the related test code paths, I'd be happy to work on the fix for this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.5: Migrate remaining tests in source directory to JUnit5 [iceberg]
nastra commented on code in PR #9380: URL: https://github.com/apache/iceberg/pull/9380#discussion_r1447970015 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSnapshotSelection.java: ## @@ -44,57 +48,58 @@ import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; -import org.assertj.core.api.Assertions; -import org.junit.AfterClass; -import org.junit.Assert; -import org.junit.BeforeClass; -import org.junit.Rule; -import org.junit.Test; -import org.junit.rules.TemporaryFolder; -import org.junit.runner.RunWith; -import org.junit.runners.Parameterized; - -@RunWith(Parameterized.class) +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.TestTemplate; +import org.junit.jupiter.api.extension.ExtendWith; +import org.junit.jupiter.api.io.TempDir; + +@ExtendWith(ParameterizedTestExtension.class) public class TestSnapshotSelection { - @Parameterized.Parameters(name = "planningMode = {0}") + @Parameters(name = "properties = {0}") public static Object[] parameters() { -return new Object[] {LOCAL, DISTRIBUTED}; +return new Object[][] { + { +ImmutableMap.of( +TableProperties.DATA_PLANNING_MODE, LOCAL.modeName(), +TableProperties.DELETE_PLANNING_MODE, LOCAL.modeName()) + }, + { +ImmutableMap.of( +TableProperties.DATA_PLANNING_MODE, DISTRIBUTED.modeName(), +TableProperties.DELETE_PLANNING_MODE, DISTRIBUTED.modeName()) + } +}; } private static final Configuration CONF = new Configuration(); private static final Schema SCHEMA = new Schema( optional(1, "id", Types.IntegerType.get()), optional(2, "data", Types.StringType.get())); - @Rule public TemporaryFolder temp = new TemporaryFolder(); + @TempDir private Path temp; private static SparkSession spark = null; - private final Map properties; - - public TestSnapshotSelection(PlanningMode planningMode) { -this.properties = -ImmutableMap.of( -TableProperties.DATA_PLANNING_MODE, planningMode.modeName(), -TableProperties.DELETE_PLANNING_MODE, planningMode.modeName()); - } + @Parameter(index = 0) + private Map properties; - @BeforeClass - public static void startSpark() { + @BeforeEach + public void startSpark() { Review Comment: previously this was started/stopped at the class-level. Any particular reason why we start/stop this now at the test method level? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.5: Migrate remaining tests in source directory to JUnit5 [iceberg]
nastra commented on code in PR #9380: URL: https://github.com/apache/iceberg/pull/9380#discussion_r1447971372 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestPositionDeletesTable.java: ## @@ -123,15 +126,9 @@ public static Object[][] parameters() { }; } - public TestPositionDeletesTable( - String catalogName, String implementation, Map config, FileFormat format) { -super(catalogName, implementation, config); -this.format = format; - } - - @Rule public TemporaryFolder temp = new TemporaryFolder(); + @TempDir private Path temp; Review Comment: is this needed? I think you should be able to use the one defined in the super class -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] iceberg reports an error after upgrading to 1.4.2 [iceberg]
ZachDischner commented on issue #9018: URL: https://github.com/apache/iceberg/issues/9018#issuecomment-1885807415 I am also seeing this issue. I have existing Iceberg tables for which a large number of Spark SQL queries simply fail once I use more recent libraries. My existing tables were created and updated using Spark on EMR over the past year. I can reproduce the failures only on newer EMR/Iceberg environments, for the same queries that run fine on earlier ones. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/Iceberg-release-history.html
- emr-6.10.0 - Reads and Writes work
- emr-6.10.1 - Reads and Writes work
- ...
- emr-6.14.0 - Reads and Writes work
- emr-6.15.0 - Some reads _don't_ work
- emr-7.0.0 - Some reads _don't_ work

The cutoff appears to be Iceberg version `1.4.0`+. I'm not sure if this helps, since I'm working with an obfuscated example. The failing queries I'm seeing include many CTEs, and the error only appears with a particular one.
```
spark.sql("""
WITH a as (SELECT * FROM table WHERE ),
b as (SELECT * FROM table2 WHERE )
... joins, filters, etc
z as (SELECT * FROM a union b union c join d...)
SELECT * FROM z""")
```
The error manifests at an intermediate CTE. I cannot see anything immediately suspicious about it:
```
m AS (SELECT col1, col2, col3, col4, ... FROM l)
```
Such that `SELECT * FROM l` succeeds, but `SELECT * FROM m` fails. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Build: Bump cython from 3.0.7 to 3.0.8 [iceberg-python]
dependabot[bot] opened a new pull request, #260: URL: https://github.com/apache/iceberg-python/pull/260 Bumps [cython](https://github.com/cython/cython) from 3.0.7 to 3.0.8.

Changelog (sourced from cython's changelog: https://github.com/cython/cython/blob/master/CHANGES.rst):

3.0.8 (2024-01-10) Bugs fixed
- Using const together with defined fused types could fail to compile. (Github issue :issue:5230)
- A "use after free" bug was fixed in parallel sections. (Github issue :issue:5922)
- Several types were not available as cython.* types in pure Python code.
- The generated code is now correct C89 again, removing some C++ style // comments and C99-style declaration-after-code code ordering. This is still relevant for some old C compilers, specifically ones that match old Python 2.7 installations.

Commits
- a1b79a6 Prepare release of 3.0.8. (https://github.com/cython/cython/commit/a1b79a6bc5326406ad73af73f5b41e3bb5f8da6e)
- b9bfa7f Fix parsing of ptrdiff_t in PyrexTypes and add another "all types in Shadow.p... (https://github.com/cython/cython/commit/b9bfa7f0492f4f71af1f034822fd90dd4ed3638e)
- f974ec1 Update changelog. (https://github.com/cython/cython/commit/f974ec15b643dfb6338c0aef90976424d5a6bd2c)
- ffe6fa7 Avoid C99-isms. (https://github.com/cython/cython/commit/ffe6fa7fe47e6b5005c0aad7c6ae7144b3402f33)
- 356495b Avoid C99-ism. (https://github.com/cython/cython/commit/356495be50773262b09e158115aba0afe167cb97)
- b85be7e Avoid C99-ism. (https://github.com/cython/cython/commit/b85be7e838318d251fc3d3fbfdf1a1ecf5a515fd)
- 30a6534 Fix some C99-isms in 3.0.x branch. (https://github.com/cython/cython/commit/30a6534a2279eddfd7a8b75de94897357eeaba2b)
- 9866ce4 Avoid C99-ism. (https://github.com/cython/cython/commit/9866ce478fa76a07babff8a5d9d53a4030eaf9e7)
- 6990d6e Use Py3.6 instead of Py3.7/8/9 for the C89 build since CPython switched to C9... (https://github.com/cython/cython/commit/6990d6e24f89949c4cf32d143d8c20e516d93756)
- d3b92b0 Use Py3.7 instead of Py3.9 for the C89 build since CPython switched to C99 at... (https://github.com/cython/cython/commit/d3b92b0a4207ee070405d6f1ab8a75e95b53c56a)
- Additional commits viewable in compare view: https://github.com/cython/cython/compare/3.0.7...3.0.8

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Dependabot commands and options You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Hive memory issue with reading iceberg v2 from hive [iceberg]
github-actions[bot] commented on issue #6784: URL: https://github.com/apache/iceberg/issues/6784#issuecomment-1885963748 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Support Delta name mapping to Iceberg conversion [iceberg]
github-actions[bot] closed issue #6768: Support Delta name mapping to Iceberg conversion URL: https://github.com/apache/iceberg/issues/6768 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Support Delta name mapping to Iceberg conversion [iceberg]
github-actions[bot] commented on issue #6768: URL: https://github.com/apache/iceberg/issues/6768#issuecomment-1885963774 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Hive memory issue with reading iceberg v2 from hive [iceberg]
github-actions[bot] closed issue #6784: Hive memory issue with reading iceberg v2 from hive URL: https://github.com/apache/iceberg/issues/6784 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Create JUnit5 version of TableTestBase [iceberg]
lisirrx commented on code in PR #9217: URL: https://github.com/apache/iceberg/pull/9217#discussion_r1448172377 ## core/src/test/java/org/apache/iceberg/TestManifestReader.java: ## @@ -32,17 +32,15 @@ import org.apache.iceberg.types.Types; import org.assertj.core.api.Assertions; import org.assertj.core.api.recursive.comparison.RecursiveComparisonConfiguration; -import org.junit.Assert; -import org.junit.Assume; -import org.junit.Test; -import org.junit.runner.RunWith; -import org.junit.runners.Parameterized; - -@RunWith(Parameterized.class) -public class TestManifestReader extends TableTestBase { - @Parameterized.Parameters(name = "formatVersion = {0}") - public static Object[] parameters() { -return new Object[] {1, 2}; +import org.junit.jupiter.api.Assumptions; +import org.junit.jupiter.api.TestTemplate; +import org.junit.jupiter.api.extension.ExtendWith; + +@ExtendWith(ParameterizedTestExtension.class) Review Comment: @nastra I noticed you have merged #9424, so should this PR be closed? BTW, should #9085 be split into multiple tasks? I'd like to work on it, but there are too many tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Discussion: Rethink `PrimitiveLiteral`. [iceberg-rust]
liurenjie1024 commented on issue #159: URL: https://github.com/apache/iceberg-rust/issues/159#issuecomment-1886064501 > @liurenjie1024 mentioned error messages as another use case. That's the only time that the i128 representation might not be suitable. The question is whether the error messages warrant a more complex implementation. > > Regarding @Fokko's example: Doesn't initially storing the Decimal as a LiteralFloat lose accuracy because the 3.25 is stored as something like 3.2499987? If you then convert it to Decimal, it's inaccurate. Maybe you could use PrimitiveLiteral::String here. Error messages are just one example, not all of the use cases. For example, when we convert an unbound expression to a bound expression, how do we know its original scale? A string is enough for storing a decimal, or anything else, but it may be awkward in the API, since with only a string we don't know the original type; e.g. a user may write an unbound expression like `a < "3.23"`, where `a` is a decimal and it's legal to compare it with a string. I do admit that introducing another enum will be difficult to maintain; maybe the solution suggested by @ZENOTME is great:
```
struct Datum {
  typ: PrimitiveType,
  literal: PrimitiveLiteral
}
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] #154 : Add homepage to Cargo.toml [iceberg-rust]
liurenjie1024 commented on PR #160: URL: https://github.com/apache/iceberg-rust/pull/160#issuecomment-1886079899 cc @Fokko PTAL -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] #154 : Add homepage to Cargo.toml [iceberg-rust]
Fokko commented on PR #160: URL: https://github.com/apache/iceberg-rust/pull/160#issuecomment-1886115997 Looks like there are conflicts? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] #154 : Add homepage to Cargo.toml [iceberg-rust]
liurenjie1024 commented on PR #160: URL: https://github.com/apache/iceberg-rust/pull/160#issuecomment-1886155116 cc @hiirrxnn Could you rebase with main branch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark: Fix SparkTable to use name and effective snapshotID for comparing [iceberg]
wooyeong commented on PR #9455: URL: https://github.com/apache/iceberg/pull/9455#issuecomment-1886170959 I need your opinion regarding [failed tests](https://github.com/apache/iceberg/actions/runs/7475896517/job/20353547835?pr=9455). Previously, SparkTable intentionally compared only `name`, according to [the original code's comment](https://github.com/apache/iceberg/blob/53a1c8671dd1b9b93f4a857230008c812d79ddbf/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java#L408), so that the cache could be invalidated when needed, such as by [`rollback_to_snapshot`](https://iceberg.apache.org/docs/1.3.1/spark-procedures/#rollback_to_snapshot) (_This procedure invalidates all cached Spark plans that reference the affected table._). However, since this patch introduced effectiveSnapshotId, the cache can't be invalidated in this case because the altered table will have another snapshot id:
- Cache the table at snapshot 1
- Alter the table, so it now has snapshot 2
- Try to invalidate the cache for the table, but the plans have different snapshot ids (cached plan's id: 1 != new plan's id: 2), so the cache cannot be invalidated.

Therefore, I think it's better to use my previous approach: checking `name`, `branch`, and `snapshotId` altogether. Whenever `snapshotId` or `branch` is supplied, it points to a fixed snapshot of the table, so we don't need to invalidate the cache even if the table is modified. In general cases it acts like before: we can invalidate the cache when necessary. The downside of this approach is that we may lose optimization chances. The optimizer doesn't know that both sub-queries are the same thing, so it will fetch both of them, but that **does not affect** the query result.
```sql
SELECT * FROM iceberg_except_test
UNION
SELECT * FROM iceberg_except_test VERSION AS OF '2024-01-01';

-- optimized plan will be
== Optimized Logical Plan ==
Aggregate [id#30, a#31, b#32], [id#30, a#31, b#32]
+- Union false, false
   :- RelationV2[id#30, a#31, b#32] local.iceberg_except_test
   +- RelationV2[id#33, a#34, b#35] local.iceberg_except_test

-- but the result will be
> 1 b 2024-01-01 00:00:00
```
What do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
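To make the proposed comparison concrete, here is a sketch of the equals logic described above; the field names are assumed from the discussion rather than copied from SparkTable.

```java
import java.util.Objects;

// Inside SparkTable; icebergTable, branch and snapshotId are the assumed
// fields (branch and snapshotId may both be null for the "current" table,
// which is what keeps cache invalidation working in the general case)
@Override
public boolean equals(Object other) {
  if (this == other) {
    return true;
  }
  if (other == null || getClass() != other.getClass()) {
    return false;
  }
  SparkTable that = (SparkTable) other;
  return icebergTable.name().equals(that.icebergTable.name())
      && Objects.equals(branch, that.branch)
      && Objects.equals(snapshotId, that.snapshotId);
}
```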
Re: [PR] #154 : Add homepage to Cargo.toml [iceberg-rust]
hiirrxnn commented on PR #160: URL: https://github.com/apache/iceberg-rust/pull/160#issuecomment-1886208226 But the PR with the main branch has been closed. Should I do it anyway? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] #154 : Add homepage to Cargo.toml [iceberg-rust]
hiirrxnn commented on PR #160: URL: https://github.com/apache/iceberg-rust/pull/160#issuecomment-1886231297 Also, feature already has all the changes of main since it was built on top of it. Or should I rebase main onto feature? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Migrate subclasses of FlinkCatalogTestBase to JUnit5 [iceberg]
vinitpatni commented on PR #9381: URL: https://github.com/apache/iceberg/pull/9381#issuecomment-1886232790 > LGTM, thanks @vinitpatni. Could you also please remove `FlinkCatalogTestBase` as part of this PR as I don't think it's used anymore, since everything was converted. Ack. Removed FlinkCatalogTestBase and its references -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.5: Migrate remaining tests in source directory to JUnit5 [iceberg]
chinmay-bhat commented on code in PR #9380: URL: https://github.com/apache/iceberg/pull/9380#discussion_r1448306197 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSnapshotSelection.java: ## @@ -44,57 +48,58 @@ import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; -import org.assertj.core.api.Assertions; -import org.junit.AfterClass; -import org.junit.Assert; -import org.junit.BeforeClass; -import org.junit.Rule; -import org.junit.Test; -import org.junit.rules.TemporaryFolder; -import org.junit.runner.RunWith; -import org.junit.runners.Parameterized; - -@RunWith(Parameterized.class) +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.TestTemplate; +import org.junit.jupiter.api.extension.ExtendWith; +import org.junit.jupiter.api.io.TempDir; + +@ExtendWith(ParameterizedTestExtension.class) public class TestSnapshotSelection { - @Parameterized.Parameters(name = "planningMode = {0}") + @Parameters(name = "properties = {0}") public static Object[] parameters() { -return new Object[] {LOCAL, DISTRIBUTED}; +return new Object[][] { + { +ImmutableMap.of( +TableProperties.DATA_PLANNING_MODE, LOCAL.modeName(), +TableProperties.DELETE_PLANNING_MODE, LOCAL.modeName()) + }, + { +ImmutableMap.of( +TableProperties.DATA_PLANNING_MODE, DISTRIBUTED.modeName(), +TableProperties.DELETE_PLANNING_MODE, DISTRIBUTED.modeName()) + } +}; } private static final Configuration CONF = new Configuration(); private static final Schema SCHEMA = new Schema( optional(1, "id", Types.IntegerType.get()), optional(2, "data", Types.StringType.get())); - @Rule public TemporaryFolder temp = new TemporaryFolder(); + @TempDir private Path temp; private static SparkSession spark = null; - private final Map properties; - - public TestSnapshotSelection(PlanningMode planningMode) { -this.properties = -ImmutableMap.of( -TableProperties.DATA_PLANNING_MODE, planningMode.modeName(), -TableProperties.DELETE_PLANNING_MODE, planningMode.modeName()); - } + @Parameter(index = 0) + private Map properties; - @BeforeClass - public static void startSpark() { + @BeforeEach + public void startSpark() { Review Comment: no reason, changed it to `BeforeAll` and `AfterAll` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.5: Migrate remaining tests in source directory to JUnit5 [iceberg]
chinmay-bhat commented on code in PR #9380: URL: https://github.com/apache/iceberg/pull/9380#discussion_r1448306351

## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestPositionDeletesTable.java:
## @@ -123,15 +126,9 @@ public static Object[][] parameters() {
     };
   }

-  public TestPositionDeletesTable(
-      String catalogName, String implementation, Map<String, String> config, FileFormat format) {
-    super(catalogName, implementation, config);
-    this.format = format;
-  }
-
-  @Rule public TemporaryFolder temp = new TemporaryFolder();
+  @TempDir private Path temp;

Review Comment: updated
Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]
chinmay-bhat commented on code in PR #9416: URL: https://github.com/apache/iceberg/pull/9416#discussion_r1448313906

## core/src/test/java/org/apache/iceberg/TestBase.java:
## @@ -173,7 +173,7 @@ public class TestBase {

   public TestTables.TestTable table = null;

   @Parameters(name = "formatVersion = {0}")
-  protected static List<Object[]> parameters() {
+  protected static List<Object> parameters() {

Review Comment: Because for some tests, we only send one `Object` in the list, which is satisfied by `List<Object>`. For example, from `DeleteFileIndexTestBase`:
```
@Parameters(name = "formatVersion = {0}")
public static List<Object> parameters() {
  return Arrays.asList(new Object[] {2});
}
```
Having `List<Object[]>` does not allow for the above case, as it always expects a list of `Object` arrays, where the `Object` array's size can't be 1.
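A short illustration (hypothetical class names) of the point above, assuming the extension reflectively invokes the `@Parameters` method and accepts either single values or `Object[]` rows per element:

```
import java.util.Arrays;
import java.util.List;

// One parameter per invocation: each element is the value itself.
class SingleValueParams {
  static List<Object> parameters() {
    return Arrays.asList(2, 3);
  }
}

// Several parameters per invocation: each element is an Object[].
// This still satisfies List<Object>, because an Object[] is an Object.
class MultiValueParams {
  static List<Object> parameters() {
    return Arrays.asList(
        new Object[] {2, "hash"},
        new Object[] {3, "range"});
  }
}

// A List<Object[]> signature would force even the one-parameter cases
// to be wrapped as arrays in every subclass.
```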
Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]
chinmay-bhat commented on code in PR #9416: URL: https://github.com/apache/iceberg/pull/9416#discussion_r1448315751

## core/src/test/java/org/apache/iceberg/FilterFilesTestBase.java:
## @@ -24,67 +24,67 @@
 import java.io.File;
 import java.io.IOException;
 import java.nio.ByteBuffer;
+import java.nio.file.Files;
+import java.nio.file.Path;
 import java.util.Map;
 import org.apache.iceberg.expressions.Expressions;
 import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
 import org.apache.iceberg.relocated.com.google.common.collect.Maps;
 import org.apache.iceberg.types.Conversions;
 import org.apache.iceberg.types.Types;
-import org.junit.After;
-import org.junit.Before;
-import org.junit.Rule;
-import org.junit.Test;
-import org.junit.rules.TemporaryFolder;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.TestTemplate;
+import org.junit.jupiter.api.extension.ExtendWith;
+import org.junit.jupiter.api.io.TempDir;

+@ExtendWith(ParameterizedTestExtension.class)
 public abstract class FilterFilesTestBase<
     ScanT extends Scan<ScanT, T, G>, T extends ScanTask, G extends ScanTaskGroup<T>> {

-  public final int formatVersion;
-
-  public FilterFilesTestBase(int formatVersion) {
-    this.formatVersion = formatVersion;
-  }
+  @Parameter(index = 0)
+  public int formatVersion;

Review Comment: can be switched to protected. updating
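A hedged sketch of the field-injection pattern the migration uses, assuming the `Parameter`/`Parameters` annotations and `ParameterizedTestExtension` from Iceberg core's test utilities (the class body is illustrative):

```
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.Arrays;
import java.util.List;
import org.junit.jupiter.api.TestTemplate;
import org.junit.jupiter.api.extension.ExtendWith;

@ExtendWith(ParameterizedTestExtension.class)
abstract class ExampleVersionedTestBase {
  @Parameters(name = "formatVersion = {0}")
  protected static List<Object> parameters() {
    return Arrays.asList(1, 2);
  }

  // Injected by the extension before each @TestTemplate invocation.
  // Unlike the JUnit4 constructor parameter, the field must be
  // non-final, and protected (not private) so subclasses can read it.
  @Parameter(index = 0)
  protected int formatVersion;

  @TestTemplate
  void usesInjectedVersion() {
    assertTrue(formatVersion > 0);
  }
}
```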
Re: [PR] #154 : Add homepage to Cargo.toml [iceberg-rust]
liurenjie1024 commented on PR #160: URL: https://github.com/apache/iceberg-rust/pull/160#issuecomment-1886347244 > Also, feature anyway has all the changes of main since it has been built on top of it, or should I rebase main with feature? Hi @hiirrxnn, you can learn how to resolve conflicts here: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/addressing-merge-conflicts/resolving-a-merge-conflict-using-the-command-line
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
nk1506 commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1448339394

## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java:
## @@ -142,9 +143,15 @@ public FileIO io() {

   @Override
   protected void doRefresh() {
     String metadataLocation = null;
+    Table table = null;
+
     try {
-      Table table = metaClients.run(client -> client.getTable(database, tableName));
-      HiveOperationsBase.validateTableIsIceberg(table, fullName);
+      table = metaClients.run(client -> client.getTable(database, tableName));
+      HiveOperationsBase.validateTableOrViewIsIceberg(table, fullName);
+
+      if (table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) {

Review Comment: I have segregated the Iceberg [table](https://github.com/apache/iceberg/pull/8907/files#diff-e502621d52f86cf0ec3187dda30ac61f6b76efb7b6276bc8d233ccb2c836fb98R151) and Iceberg [View](https://github.com/apache/iceberg/pull/8907/files#diff-db46657b84d66e084e15f31b8dab21577efb2ae7102863f94c6c9477782de676R83) checks here.
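A rough sketch of the segregated checks being described; `validateTableOrViewIsIceberg` is the method named in the diff, but this body is an illustration of the approach, not the PR's exact code:

```
import org.apache.hadoop.hive.metastore.api.Table;
import org.apache.iceberg.BaseMetastoreTableOperations;
import org.apache.iceberg.exceptions.NoSuchIcebergTableException;

class IcebergTypeCheckSketch {
  // Shared guard: the HMS object must carry the Iceberg table_type
  // property, whether it is a table or a view. Each operations class
  // then separately verifies the Hive TableType it expects
  // (MANAGED/EXTERNAL table vs. VIRTUAL_VIEW).
  static void validateTableOrViewIsIceberg(Table table, String fullName) {
    String type = table.getParameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP);
    NoSuchIcebergTableException.check(
        type != null
            && type.equalsIgnoreCase(BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE),
        "Not an Iceberg table: %s (type=%s)", fullName, type);
  }
}
```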
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
nk1506 commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1448341650

## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java:
## @@ -0,0 +1,287 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.util.Collections;
+import java.util.Locale;
+import java.util.Objects;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.IMetaStoreClient;
+import org.apache.hadoop.hive.metastore.TableType;
+import org.apache.hadoop.hive.metastore.api.InvalidObjectException;
+import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.BaseMetastoreTableOperations;
+import org.apache.iceberg.ClientPool;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.CommitFailedException;
+import org.apache.iceberg.exceptions.CommitStateUnknownException;
+import org.apache.iceberg.exceptions.NoSuchIcebergViewException;
+import org.apache.iceberg.exceptions.NoSuchViewException;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.util.MetastoreOperationsUtil;
+import org.apache.iceberg.view.BaseViewOperations;
+import org.apache.iceberg.view.ViewMetadata;
+import org.apache.thrift.TException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/** Hive implementation of Iceberg ViewOperations. */
+final class HiveViewOperations extends BaseViewOperations implements HiveOperationsBase {
+  private static final Logger LOG = LoggerFactory.getLogger(HiveViewOperations.class);
+
+  private final String fullName;
+  private final String database;
+  private final String viewName;
+  private final FileIO fileIO;
+  private final ClientPool<IMetaStoreClient, TException> metaClients;
+  private final long maxHiveTablePropertySize;
+
+  HiveViewOperations(
+      Configuration conf,
+      ClientPool<IMetaStoreClient, TException> metaClients,
+      FileIO fileIO,
+      String catalogName,
+      TableIdentifier viewIdentifier) {
+    String dbName = viewIdentifier.namespace().level(0);
+    this.metaClients = metaClients;
+    this.fileIO = fileIO;
+    this.fullName = BaseMetastoreCatalog.fullTableName(catalogName, viewIdentifier);
+    this.database = dbName;
+    this.viewName = viewIdentifier.name();
+    this.maxHiveTablePropertySize =
+        conf.getLong(HIVE_TABLE_PROPERTY_MAX_SIZE, HIVE_TABLE_PROPERTY_MAX_SIZE_DEFAULT);
+  }
+
+  @Override
+  public void doRefresh() {
+    String metadataLocation = null;
+    Table table = null;
+
+    try {
+      table = metaClients.run(client -> client.getTable(database, viewName));
+      HiveOperationsBase.validateTableOrViewIsIceberg(table, fullName);
+      metadataLocation =
+          table.getParameters().get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP);
+
+    } catch (NoSuchObjectException e) {
+      if (currentMetadataLocation() != null) {
+        throw new NoSuchViewException("View does not exist: %s.%s", database, viewName);
+      }
+    } catch (TException e) {
+      String errMsg =
+          String.format("Failed to get view info from metastore %s.%s", database, viewName);
+      throw new RuntimeException(errMsg, e);
+    } catch (InterruptedException e) {
+      Thread.currentThread().interrupt();
+      throw new RuntimeException("Interrupted during refresh", e);
+    }
+
+    if (table != null && !table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) {
+      disableRefresh();
+    } else {
+      refreshFromMetadataLocation(metadataLocation);
+    }
+  }
+
+  @SuppressWarnings("checkstyle:CyclomaticComplexity")
+  @Override
+  public void doCommit(ViewMetadata base, ViewMetadata metadata) {

Review Comment: I thought adding a lock for view creation is expensive and unnecessary, since a view is more like a read definition; with a view we are not doing any data write. Please share your thoughts.
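To make the trade-off concrete, a hypothetical lock-free commit for views (the helper names and property keys are illustrative of the approach, not the PR's code): the pointer swap is optimistic, failing if another committer moved the metadata location, whereas the table path additionally holds an HMS lock around the alter call.

```
import java.util.Objects;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;
import org.apache.iceberg.exceptions.CommitFailedException;
import org.apache.thrift.TException;

class LocklessViewCommitSketch {
  // Optimistic, lock-free pointer swap: re-read the current metadata
  // location and fail the commit if someone else moved it since we
  // refreshed. No HMS lock is taken, unlike the table commit path.
  static void commit(
      IMetaStoreClient client, String db, String view, String base, String next)
      throws TException {
    Table hmsView = client.getTable(db, view);
    String current = hmsView.getParameters().get("metadata_location");
    if (!Objects.equals(current, base)) {
      throw new CommitFailedException(
          "Cannot commit %s.%s: base location %s is no longer current (%s)",
          db, view, base, current);
    }
    hmsView.getParameters().put("previous_metadata_location", current);
    hmsView.getParameters().put("metadata_location", next);
    client.alter_table(db, view, hmsView);
  }
}
```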
Re: [PR] Spark: Fix SparkTable to use name and effective snapshotID for comparing [iceberg]
ajantha-bhat commented on PR #9455: URL: https://github.com/apache/iceberg/pull/9455#issuecomment-1886371354 Yeah, I need to think a bit more. Looks like just using `lazyFixedSnapshotId` in `equals` and `hashCode` is enough, instead of the effective snapshot ID (which uses the current snapshot).
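A hedged sketch of the comparison being discussed; the class and field names are illustrative, not the actual `SparkTable` implementation. Comparing by name plus the *pinned* snapshot ID (null when the table follows the current snapshot) keeps equality stable as new snapshots are committed:

```
import java.util.Objects;

class TableHandle {
  private final String name;
  private final Long fixedSnapshotId; // null => follows the current snapshot

  TableHandle(String name, Long fixedSnapshotId) {
    this.name = name;
    this.fixedSnapshotId = fixedSnapshotId;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof TableHandle)) {
      return false;
    }
    TableHandle that = (TableHandle) other;
    // Two handles to the same table are equal only if they pin the
    // same snapshot (or both follow the current one).
    return name.equals(that.name) && Objects.equals(fixedSnapshotId, that.fixedSnapshotId);
  }

  @Override
  public int hashCode() {
    return Objects.hash(name, fixedSnapshotId);
  }
}
```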
Re: [I] Data: Errors in some file readers do not report the file in which they failed [iceberg]
yyy1000 commented on issue #9458: URL: https://github.com/apache/iceberg/issues/9458#issuecomment-1886386123 I'd like to help with it.
Re: [PR] init writer framework [iceberg-rust]
liurenjie1024 commented on code in PR #135: URL: https://github.com/apache/iceberg-rust/pull/135#discussion_r1448357734

## crates/iceberg/src/writer/file_writer/mod.rs:
## @@ -0,0 +1,51 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Iceberg File Writer
+
+use super::{CurrentFileStatus, IcebergWriteResult};
+use crate::Result;
+use arrow_array::RecordBatch;
+use arrow_schema::SchemaRef;
+
+/// File writer builder trait.
+#[async_trait::async_trait]
+pub trait FileWriterBuilder: Send + Clone + 'static {
+    /// The associated file writer type.
+    type R: FileWriter;
+    /// Build file writer.
+    async fn build(self, schema: &SchemaRef) -> Result<Self::R>;
+}
+
+/// File writer focused on writing record batches to different physical file formats (such as parquet, orc).
+#[async_trait::async_trait]
+pub trait FileWriter: Send + 'static + CurrentFileStatus {
+    /// The associated file write result type.
+    type R: FileWriteResult;
+    /// Write record batch to file.
+    async fn write(&mut self, batch: &RecordBatch) -> Result<()>;
+    /// Close file writer.
+    async fn close(self) -> Result<Vec<Self::R>>;
+}
+
+/// File write result.
+pub trait FileWriteResult: Send + 'static {
+    /// The associated iceberg write result type.
+    type R: IcebergWriteResult;
+    /// Convert to iceberg write result.
+    fn to_iceberg_result(self) -> Self::R;

Review Comment: Yes, I also think the `FileWriteResult`/`IcebergWriterResult` is a little too complicated. What `FileWriter` returns is just a partial `DataFile`, so a `DataFileBuilder` would be enough.
Re: [I] If iceberg's parquet data file contains an array of time type, it cannot be read by spark library even after dropping this column [iceberg]
ajantha-bhat commented on issue #9446: URL: https://github.com/apache/iceberg/issues/9446#issuecomment-1886396291 > Hi, if someone could guide me on some information about the history of batchedReaderFunc vs ReaderFunc and some related testing code path, happy to work on the fix for that. I have not worked on that area before, so I would wait for others' answers. From a high-level code review, `batchedReaderFunc` uses a `VectorizedReader` to read a batch (a few rows) of vectors (each column is one vector), whereas `ReaderFunc` looks like a row-based reader that handles one column value at a time.
Re: [I] If iceberg's parquet data file contains an array of time type, it cannot be read by spark library even after dropping this column [iceberg]
ajantha-bhat commented on issue #9446: URL: https://github.com/apache/iceberg/issues/9446#issuecomment-1886398118 If you look at the callers of `Parquet.ReadBuilder.createReaderFunc`, you can find the test cases.
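For orientation, a hedged example of the row-based entry point under discussion; the class and method names in the sketch are illustrative, and the vectorized `createBatchedReaderFunc` path is analogous but engine-specific (e.g. Spark's `VectorizedSparkParquetReaders`, returning columnar batches rather than one record at a time):

```
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.parquet.GenericParquetReaders;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.parquet.Parquet;

class ReaderFuncSketch {
  // Row-based read: createReaderFunc builds a value-reader tree from the
  // Parquet file schema, and the iterable yields one Record per row.
  static CloseableIterable<Record> rowBased(InputFile file, Schema schema) {
    return Parquet.read(file)
        .project(schema)
        .createReaderFunc(fileSchema -> GenericParquetReaders.buildReader(schema, fileSchema))
        .build();
  }
}
```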
Re: [I] Caused by: java.net.SocketException: Connection reset [iceberg]
javrasya commented on issue #9444: URL: https://github.com/apache/iceberg/issues/9444#issuecomment-1886422859 This happens more often when the consumption rate is high, for example when replaying historical messages. When I run it in unbounded streaming mode with the `INCREMENTAL_FROM_EARLIEST_SNAPSHOT` streaming strategy instead of batch mode, the consumption rate inherently drops and the error occurs far less often: my app fails but recovers, continues, and reaches the end. Very slowly, but I will take it. Could this be happening because S3 is throttling, or something similar? Has anyone else observed anything like this before? **Note:** The upstream is committing every minute, which means a new snapshot every minute. That can also lead to too many small files, so this service (the one hitting the error in the original post) might need to pull too many files and eventually hit that connection-reset issue. This is just a theory; I couldn't verify it.
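For reference, a sketch of the streaming setup described above using Flink's `IcebergSource` builder; the table-loading details are assumed and elided, and the wrapper class is hypothetical:

```
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.source.IcebergSource;
import org.apache.iceberg.flink.source.StreamingStartingStrategy;

class StreamingSourceSketch {
  // Builds an unbounded source that tails new snapshots incrementally,
  // starting from the earliest snapshot, instead of scanning the whole
  // table as one large batch read.
  static IcebergSource<RowData> build(TableLoader tableLoader) {
    return IcebergSource.forRowData()
        .tableLoader(tableLoader)
        .streaming(true)
        .streamingStartingStrategy(StreamingStartingStrategy.INCREMENTAL_FROM_EARLIEST_SNAPSHOT)
        .build();
  }
}
```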