Re: [PR] AES GCM Stream changes [iceberg]

2024-01-10 Thread via GitHub


ggershinsky commented on code in PR #9453:
URL: https://github.com/apache/iceberg/pull/9453#discussion_r1447008929


##
core/src/main/java/org/apache/iceberg/encryption/AesGcmOutputStream.java:
##
@@ -95,6 +102,10 @@ public void write(byte[] b, int off, int len) throws IOException {
 
   @Override
   public long getPos() throws IOException {

Review Comment:
   The Iceberg manifest writer calls getPos after closing the Avro stream, and the wrong value was returned.
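
   A minimal sketch of the kind of fix this implies (the class and field names are assumptions for illustration, not the PR's code):

   ```java
   import java.io.IOException;
   import java.io.OutputStream;

   // Keep the plaintext position in a plain field, so a getPos() call made
   // after close() -- as the manifest writer does -- still reports the number
   // of plaintext bytes written instead of a stale buffer-derived value.
   class PositionTrackingOutputStream extends OutputStream {
     private long plainPos = 0;

     @Override
     public void write(int b) throws IOException {
       plainPos++; // encrypt-and-buffer would happen here
     }

     public long getPos() {
       return plainPos; // remains valid even once the stream is closed
     }
   }
   ```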






Re: [PR] API, Core: Fix errorprone warnings [iceberg]

2024-01-10 Thread via GitHub


aokolnychyi commented on code in PR #9419:
URL: https://github.com/apache/iceberg/pull/9419#discussion_r1447076282


##
api/src/main/java/org/apache/iceberg/util/CharSequenceWrapper.java:
##
@@ -44,6 +44,8 @@ public CharSequence get() {
   }
 
   @Override
+  // Suppressed errorprone warning due to performance reasons.

Review Comment:
   I think this comment can be interpreted in different ways. We are not really disabling it for performance reasons, are we? I wonder whether we can skip it.






Re: [PR] API, Core: Fix errorprone warnings [iceberg]

2024-01-10 Thread via GitHub


ajantha-bhat commented on code in PR #9419:
URL: https://github.com/apache/iceberg/pull/9419#discussion_r1447134418


##
api/src/main/java/org/apache/iceberg/util/CharSequenceWrapper.java:
##
@@ -44,6 +44,8 @@ public CharSequence get() {
   }
 
   @Override
+  // Suppressed errorprone warning due to performance reasons.

Review Comment:
   True. Fixing it would cause performance issues, while suppressing it will not. I removed the comment.
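
   For context, a minimal self-contained sketch of the suppressed pattern, assuming the Error Prone check in question is `UndefinedEquals` (which flags equality checks on `CharSequence`):

   ```java
   import java.util.Objects;

   final class CharSequenceWrapperSketch {
     private final CharSequence wrapped;

     CharSequenceWrapperSketch(CharSequence wrapped) {
       this.wrapped = wrapped;
     }

     @Override
     @SuppressWarnings("UndefinedEquals") // assumed check name
     public boolean equals(Object other) {
       if (this == other) {
         return true;
       }
       if (!(other instanceof CharSequenceWrapperSketch)) {
         return false;
       }
       // This is the comparison Error Prone flags: equals() is not well-defined
       // across CharSequence implementations. Materializing toString() on every
       // call would satisfy the check but costs an allocation -- the performance
       // angle discussed above.
       return Objects.equals(wrapped, ((CharSequenceWrapperSketch) other).wrapped);
     }

     @Override
     public int hashCode() {
       return Objects.hashCode(wrapped);
     }
   }
   ```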






Re: [I] Purge support for Iceberg view [iceberg]

2024-01-10 Thread via GitHub


ajantha-bhat commented on issue #9433:
URL: https://github.com/apache/iceberg/issues/9433#issuecomment-1884524095

   @nk1506: Thanks for reporting. I think we should have this; otherwise, view metadata files will never be deleted from storage.
   
   I will assign this issue to you.





Re: [I] Failed to assign splits due to the serialized split size [iceberg]

2024-01-10 Thread via GitHub


javrasya commented on issue #9410:
URL: https://github.com/apache/iceberg/issues/9410#issuecomment-1884525429

   I couldn't do this @pvary; the split is far ahead, and some time is needed to get there in the application. My local environment is not able to run the app on real data and hit this problematic split in debug mode; it crashes because debug mode is quite resource-consuming. I'm still looking for a way to reproduce and debug it, but I just wanted to let you know the status.





Re: [PR] Spec: add multi-arg transform support [iceberg]

2024-01-10 Thread via GitHub


szehon-ho commented on PR #8579:
URL: https://github.com/apache/iceberg/pull/8579#issuecomment-1884585967

   > WOW, big congrats on the arrival of your newborn.
   
   Thank you so much!
   
   > I will resume this work once I finish my internal project, in which I'm leveraging bucketing and sorting to support efficient upsert.
   
   Sure, understood. Another possibility, if it will take a while, is that I can help move this PR forward and we can be co-authors.





Re: [I] Discussion: Rethink `PrimitiveLiteral`. [iceberg-rust]

2024-01-10 Thread via GitHub


JanKaul commented on issue #159:
URL: https://github.com/apache/iceberg-rust/issues/159#issuecomment-1884586498

   Following @Fokko's reasoning, Decimal is comparable to TimestampZ, where the timezone is stored in the type. Similarly, the scale of the Decimal is stored in the type.
   
   I think it makes sense to think about the use cases for Literal. It is used for partition values and default values. Both require only the physical representation: the scale is actually not needed, and Literal(i128) would suffice for these use cases.
   
   @liurenjie1024 mentioned error messages as another use case. That's the only time the i128 representation might not be suitable. The question is whether the error messages warrant a more complex implementation.
   
   Regarding @Fokko's example: doesn't initially storing the Decimal as a LiteralFloat lose accuracy, because 3.25 would be stored as something like 3.2499987? If you then convert it to Decimal, it's inaccurate. Maybe you could use PrimitiveLiteral::String here.
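
   To make the accuracy point concrete, a small Java illustration (a stand-in for the Rust types under discussion; 3.3 is used because, unlike 3.25, it has no exact binary representation):

   ```java
   import java.math.BigDecimal;

   public class DecimalAccuracy {
     public static void main(String[] args) {
       // Going through a binary float exposes the float's approximation:
       System.out.println(new BigDecimal(3.3)); // 3.29999999999999982236431605997495...

       // Parsing the string, or keeping an unscaled integer with the scale
       // carried by the type (the i128-style representation above), stays exact:
       System.out.println(new BigDecimal("3.3"));     // 3.3
       System.out.println(BigDecimal.valueOf(33, 1)); // unscaled 33, scale 1 -> 3.3
     }
   }
   ```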
   
   





[PR] Spark: Fix SparkTable to use name and effective snapshotID for comparing [iceberg]

2024-01-10 Thread via GitHub


wooyeong opened a new pull request, #9455:
URL: https://github.com/apache/iceberg/pull/9455

   Issue: #9450 
   
   I've changed `SparkTable` to use the name and effective snapshot ID for checking equality.
   
   With the previous code I mentioned in #9450,
   ```diff
   -    return icebergTable.name().equals(that.icebergTable.name());
   +    return icebergTable.name().equals(that.icebergTable.name())
   +        && Objects.equals(branch, that.branch)
   +        && Objects.equals(snapshotId, that.snapshotId);
   ```
   
   two refs with the same effective snapshot ID don't get optimized, as @ajantha-bhat stated.
   ```sql
   SELECT * FROM iceberg_except_test
   UNION
   SELECT * FROM iceberg_except_test
   VERSION AS OF '2024-01-01';

   == Parsed Logical Plan ==
   'Distinct
   +- 'Union false, false
      :- 'Project [*]
      :  +- 'UnresolvedRelation [iceberg_except_test], [], false
      +- 'Project [*]
         +- 'RelationTimeTravel 'UnresolvedRelation [iceberg_except_test], [], false, 2024-01-01

   == Analyzed Logical Plan ==
   id: string, a: string, b: timestamp
   Distinct
   +- Union false, false
      :- Project [id#30, a#31, b#32]
      :  +- SubqueryAlias local.iceberg_except_test
      :     +- RelationV2[id#30, a#31, b#32] local.iceberg_except_test local.iceberg_except_test
      +- Project [id#33, a#34, b#35]
         +- SubqueryAlias local.iceberg_except_test
            +- RelationV2[id#33, a#34, b#35] local.iceberg_except_test local.iceberg_except_test

   == Optimized Logical Plan ==
   Aggregate [id#30, a#31, b#32], [id#30, a#31, b#32]
   +- Union false, false
      :- RelationV2[id#30, a#31, b#32] local.iceberg_except_test
      +- RelationV2[id#33, a#34, b#35] local.iceberg_except_test
   ```
   
   With this patch, the same query is optimized as below:
   ```sql
   == Parsed Logical Plan ==
   'Distinct
   +- 'Union false, false
      :- 'Project [*]
      :  +- 'UnresolvedRelation [iceberg_except_test], [], false
      +- 'Project [*]
         +- 'RelationTimeTravel 'UnresolvedRelation [iceberg_except_test], [], false, 2024-01-01

   == Analyzed Logical Plan ==
   id: string, a: string, b: timestamp
   Distinct
   +- Union false, false
      :- Project [id#27, a#28, b#29]
      :  +- SubqueryAlias local.iceberg_except_test
      :     +- RelationV2[id#27, a#28, b#29] local.iceberg_except_test local.iceberg_except_test
      +- Project [id#30, a#31, b#32]
         +- SubqueryAlias local.iceberg_except_test
            +- RelationV2[id#30, a#31, b#32] local.iceberg_except_test local.iceberg_except_test

   == Optimized Logical Plan ==
   Aggregate [id#27, a#28, b#29], [id#27, a#28, b#29]
   +- RelationV2[id#27, a#28, b#29] local.iceberg_except_test
   ```





Re: [PR] Spark: Fix SparkTable to use name and effective snapshotID for comparing [iceberg]

2024-01-10 Thread via GitHub


ajantha-bhat commented on code in PR #9455:
URL: https://github.com/apache/iceberg/pull/9455#discussion_r1447345339


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##
@@ -405,15 +407,25 @@ public boolean equals(Object other) {
   return false;
 }
 
-// use only name in order to correctly invalidate Spark cache
+// use name and effective snapshot id to support time travel
 SparkTable that = (SparkTable) other;
-return icebergTable.name().equals(that.icebergTable.name());
+return icebergTable.name().equals(that.icebergTable.name())
+&& Objects.equals(effectiveSnapshotId(), that.effectiveSnapshotId());
   }
 
   @Override
   public int hashCode() {
-// use only name in order to correctly invalidate Spark cache
-return icebergTable.name().hashCode();
+// use name and effective snapshot id to support time travel
+return Objects.hash(icebergTable.name(), effectiveSnapshotId());
+  }
+
+  public Long effectiveSnapshotId() {

Review Comment:
   Should we keep `effectiveSnapshotId` in a field, initialized in the constructor, instead of computing it on every `equals()` and `hashCode()` call?



##
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkTable.java:
##
@@ -53,4 +53,41 @@ public void testTableEquality() throws NoSuchTableException {
 assertThat(table1).as("References must be different").isNotSameAs(table2);
 assertThat(table1).as("Tables must be equivalent").isEqualTo(table2);
   }
+
+  @TestTemplate
+  public void testEffectiveSnapshotIdEquality() throws NoSuchTableException {
+CatalogManager catalogManager = spark.sessionState().catalogManager();
+TableCatalog catalog = (TableCatalog) catalogManager.catalog(catalogName);
+    Identifier identifier = Identifier.of(tableIdent.namespace().levels(), tableIdent.name());
+
+sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName);
+
+SparkTable table = (SparkTable) catalog.loadTable(identifier);
+final long version1Snapshot = table.effectiveSnapshotId();
+final String version1 = "VERSION_1";
+    table.table().manageSnapshots().createTag(version1, version1Snapshot).commit();
+
+SparkTable firstSnapshotTable = table.copyWithSnapshotId(version1Snapshot);
+SparkTable firstTagTable = table.copyWithBranch(version1);
+
+sql("UPDATE %s SET data = 'b'", tableName);
+
+final String version2 = "VERSION_2";
+    table.table().manageSnapshots().createTag(version2, table.effectiveSnapshotId()).commit();
+
+SparkTable secondTagTable = table.copyWithBranch(version2);
+
+assertThat(firstSnapshotTable)
+.as("References for two different SparkTables must be different")
+.isNotSameAs(firstTagTable);
+assertThat(firstSnapshotTable)
+.as("The different snapshots with same id must be equal")
+.isEqualTo(firstTagTable);
+assertThat(firstTagTable)
+.as("The different snapshots should not match")
+.isNotEqualTo(secondTagTable);
+assertThat(table)
+.as("The SparkTable should points to latest snapshot")
+.isEqualTo(secondTagTable);

Review Comment:
   Can we also add a SQL query that reads both tags and validates the results, like you mentioned in the issue?
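
   For example, something along these lines (a hypothetical follow-up test using the `sql(...)` helper from the test base; the tag names match the test above, and the asserted shape is only illustrative):

   ```java
   @TestTemplate
   public void testTagQueryResults() {
     List<Object[]> v1Rows = sql("SELECT * FROM %s VERSION AS OF 'VERSION_1'", tableName);
     List<Object[]> v2Rows = sql("SELECT * FROM %s VERSION AS OF 'VERSION_2'", tableName);
     // the UPDATE between the two tags must be visible when querying them
     assertThat(v1Rows).isNotEqualTo(v2Rows);
   }
   ```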






Re: [PR] Flink: Added error handling and default logic for Flink version detection [iceberg]

2024-01-10 Thread via GitHub


gjacoby126 commented on code in PR #9452:
URL: https://github.com/apache/iceberg/pull/9452#discussion_r1447388713


##
flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/util/FlinkVersionDetector.java:
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.flink.util;
+
+import org.apache.flink.streaming.api.datastream.DataStream;
+
+public class FlinkVersionDetector {
+  public String version() {
+    String version = null;
+    try {
+      version = getVersionFromJar();
+    } catch (Exception e) {
+      /* we can't detect the exact implementation version from the jar (this can happen if the DataStream class

Review Comment:
   @pvary - Yes, it's a packaging issue. In my use case that reproduced the problem, the two instances of DataStream on the classpath are actually identical, due to unrelocated shading. That would normally cause no issues aside from build warnings. Of course, we're fixing that too.
   
   Whether or not the packaging problem causes other bugs, though, Iceberg's reaction shouldn't be "throw an NPE and crash the Flink pipeline", which is what currently happens and what this PR fixes.
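
   For reference, the fallback behavior under discussion could look roughly like this (a sketch; the default constant and class layout are assumptions, not the PR's exact code):

   ```java
   import org.apache.flink.streaming.api.datastream.DataStream;

   public class SafeFlinkVersionDetector {
     // assumed compile-time default matching the module's Flink version
     static final String DEFAULT_VERSION = "1.18";

     public String version() {
       try {
         // may be null (or throw) when the jar's manifest is missing or shaded
         String detected = DataStream.class.getPackage().getImplementationVersion();
         if (detected != null) {
           return detected;
         }
       } catch (Exception e) {
         // fall through to the default instead of surfacing an NPE
       }
       return DEFAULT_VERSION;
     }
   }
   ```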






Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2024-01-10 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1447409989


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java:
##
@@ -142,9 +143,15 @@ public FileIO io() {
   @Override
   protected void doRefresh() {
 String metadataLocation = null;
+Table table = null;
+
 try {
-      Table table = metaClients.run(client -> client.getTable(database, tableName));
-      HiveOperationsBase.validateTableIsIceberg(table, fullName);
+      table = metaClients.run(client -> client.getTable(database, tableName));
+      HiveOperationsBase.validateTableOrViewIsIceberg(table, fullName);
+
+      if (table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) {

Review Comment:
   IIRC the decision was not to use the HMS tableType for this. Shouldn't we use the `BaseMetastoreTableOperations.TABLE_TYPE_PROP` property instead?
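
   I.e., something like the following sketch (hedged: `TABLE_TYPE_PROP` is the `table_type` parameter, but the exact marker value written for views is an assumption here):

   ```java
   String tableType = table.getParameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP);
   boolean isIcebergView = "ICEBERG-VIEW".equalsIgnoreCase(tableType); // assumed marker value
   ```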






Re: [PR] Core: rewrite should drop delete files by data sequence number partition wise [iceberg]

2024-01-10 Thread via GitHub


ajantha-bhat commented on code in PR #9454:
URL: https://github.com/apache/iceberg/pull/9454#discussion_r1447410277


##
spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDeleteFilesAction.java:
##
@@ -0,0 +1,400 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.extensions;
+
+import static org.apache.iceberg.TableProperties.DROP_PARTITION_DELETE_ENABLED;
+import static org.apache.iceberg.types.Types.NestedField.optional;
+
+import java.io.File;
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.Collections;
+import java.util.List;
+import java.util.Map;
+import java.util.Random;
+import java.util.UUID;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.BaseTable;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.DeleteFile;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.PartitionKey;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.StructLike;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.actions.RewriteDataFiles;
+import org.apache.iceberg.actions.RewriteDataFiles.Result;
+import org.apache.iceberg.actions.SizeBasedFileRewriter;
+import org.apache.iceberg.data.GenericAppenderFactory;
+import org.apache.iceberg.data.GenericRecord;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.deletes.EqualityDeleteWriter;
+import org.apache.iceberg.deletes.PositionDelete;
+import org.apache.iceberg.deletes.PositionDeleteWriter;
+import org.apache.iceberg.encryption.EncryptedFiles;
+import org.apache.iceberg.encryption.EncryptedOutputFile;
+import org.apache.iceberg.encryption.EncryptionKeyMetadata;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.hadoop.HadoopTables;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.io.OutputFileFactory;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.relocated.com.google.common.collect.Streams;
+import org.apache.iceberg.spark.SparkTestBase;
+import org.apache.iceberg.spark.actions.SparkActions;
+import org.apache.iceberg.spark.source.ThreeColumnRecord;
+import org.apache.iceberg.types.Types;
+import org.apache.iceberg.util.ArrayUtil;
+import org.apache.iceberg.util.Pair;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestRewriteDeleteFilesAction extends SparkTestBase {
+
+  private static final int SCALE = 40;
+
+  private static final HadoopTables TABLES = new HadoopTables(new Configuration());
+  private static final Schema SCHEMA =
+  new Schema(
+  optional(1, "c1", Types.IntegerType.get()),
+  optional(2, "c2", Types.StringType.get()),
+  optional(3, "c3", Types.StringType.get()));
+
+  PartitionSpec partitionSpecC1 = PartitionSpec.builderFor(SCHEMA).identity("c1").build();
+
+  @Rule public TemporaryFolder temp = new TemporaryFolder();
+
+  private String tableLocation = null;
+
+  @Before
+  public void setupTableLocation() throws Exception {
+File tableDir = temp.newFolder();
+this.tableLocation = tableDir.toURI().toString();
+  }
+
+  private Result rewriteTable(Table table) {
+return actions()
+.rewriteDataFiles(table)
+        .option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Long.toString(Long.MAX_VALUE - 1))
+.option(SizeBasedFileRewriter.REWRITE_ALL, "true")
+.execute();
+  }
+
+  @Test
+  public void testRewritePartitionDeletesShouldNotRetain() throws IOException {
+// TODO: anothe

Re: [PR] Spark: Fix SparkTable to use name and effective snapshotID for comparing [iceberg]

2024-01-10 Thread via GitHub


wooyeong commented on code in PR #9455:
URL: https://github.com/apache/iceberg/pull/9455#discussion_r1447412492


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##
@@ -405,15 +407,25 @@ public boolean equals(Object other) {
   return false;
 }
 
-// use only name in order to correctly invalidate Spark cache
+// use name and effective snapshot id to support time travel
 SparkTable that = (SparkTable) other;
-return icebergTable.name().equals(that.icebergTable.name());
+return icebergTable.name().equals(that.icebergTable.name())
+&& Objects.equals(effectiveSnapshotId(), that.effectiveSnapshotId());
   }
 
   @Override
   public int hashCode() {
-// use only name in order to correctly invalidate Spark cache
-return icebergTable.name().hashCode();
+// use name and effective snapshot id to support time travel
+return Objects.hash(icebergTable.name(), effectiveSnapshotId());
+  }
+
+  public Long effectiveSnapshotId() {

Review Comment:
   When you have `snapshotId` or `branch`, you can compute it in advance. However, when you have neither, you have to resolve `icebergTable.currentSnapshot()` every time, as it can change.
   
   Moreover, `icebergTable.currentSnapshot()` can be null (for example, at [this point](https://github.com/apache/iceberg/blob/4d34398cfd32465222f55df522fcd5a2db59c92c/spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkTable.java#L53)), so we need to null-check manually.
   
   I'll update `SparkTable` and the test case a little bit.
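
   A hedged sketch of that lazy resolution (field names and the ref lookup are assumptions, not the exact PR code):

   ```java
   public Long effectiveSnapshotId() {
     if (snapshotId != null) {
       return snapshotId; // explicit time travel: fixed at construction
     }
     Snapshot snapshot =
         branch != null ? icebergTable.snapshot(branch) : icebergTable.currentSnapshot();
     // a freshly created table has no snapshot yet, hence the null check
     return snapshot != null ? snapshot.snapshotId() : null;
   }
   ```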



##
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkTable.java:
##
@@ -53,4 +53,41 @@ public void testTableEquality() throws NoSuchTableException {
 assertThat(table1).as("References must be different").isNotSameAs(table2);
 assertThat(table1).as("Tables must be equivalent").isEqualTo(table2);
   }
+
+  @TestTemplate
+  public void testEffectiveSnapshotIdEquality() throws NoSuchTableException {
+CatalogManager catalogManager = spark.sessionState().catalogManager();
+TableCatalog catalog = (TableCatalog) catalogManager.catalog(catalogName);
+    Identifier identifier = Identifier.of(tableIdent.namespace().levels(), tableIdent.name());
+
+sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName);
+
+SparkTable table = (SparkTable) catalog.loadTable(identifier);
+final long version1Snapshot = table.effectiveSnapshotId();
+final String version1 = "VERSION_1";
+    table.table().manageSnapshots().createTag(version1, version1Snapshot).commit();
+
+SparkTable firstSnapshotTable = table.copyWithSnapshotId(version1Snapshot);
+SparkTable firstTagTable = table.copyWithBranch(version1);
+
+sql("UPDATE %s SET data = 'b'", tableName);
+
+final String version2 = "VERSION_2";
+    table.table().manageSnapshots().createTag(version2, table.effectiveSnapshotId()).commit();
+
+SparkTable secondTagTable = table.copyWithBranch(version2);
+
+assertThat(firstSnapshotTable)
+.as("References for two different SparkTables must be different")
+.isNotSameAs(firstTagTable);
+assertThat(firstSnapshotTable)
+.as("The different snapshots with same id must be equal")
+.isEqualTo(firstTagTable);
+assertThat(firstTagTable)
+.as("The different snapshots should not match")
+.isNotEqualTo(secondTagTable);
+assertThat(table)
+.as("The SparkTable should points to latest snapshot")
+.isEqualTo(secondTagTable);

Review Comment:
   Great, I'll add some SQL query tests below.






Re: [PR] Spark: Fix SparkTable to use name and effective snapshotID for comparing [iceberg]

2024-01-10 Thread via GitHub


ajantha-bhat commented on code in PR #9455:
URL: https://github.com/apache/iceberg/pull/9455#discussion_r1447415016


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##
@@ -405,15 +407,25 @@ public boolean equals(Object other) {
   return false;
 }
 
-// use only name in order to correctly invalidate Spark cache
+// use name and effective snapshot id to support time travel
 SparkTable that = (SparkTable) other;
-return icebergTable.name().equals(that.icebergTable.name());
+return icebergTable.name().equals(that.icebergTable.name())
+&& Objects.equals(effectiveSnapshotId(), that.effectiveSnapshotId());
   }
 
   @Override
   public int hashCode() {
-// use only name in order to correctly invalidate Spark cache
-return icebergTable.name().hashCode();
+// use name and effective snapshot id to support time travel
+return Objects.hash(icebergTable.name(), effectiveSnapshotId());
+  }
+
+  public Long effectiveSnapshotId() {

Review Comment:
   Ack






Re: [PR] Spark: Fix SparkTable to use name and effective snapshotID for comparing [iceberg]

2024-01-10 Thread via GitHub


wooyeong commented on code in PR #9455:
URL: https://github.com/apache/iceberg/pull/9455#discussion_r1447417783


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java:
##
@@ -405,15 +407,25 @@ public boolean equals(Object other) {
   return false;
 }
 
-// use only name in order to correctly invalidate Spark cache
+// use name and effective snapshot id to support time travel
 SparkTable that = (SparkTable) other;
-return icebergTable.name().equals(that.icebergTable.name());
+return icebergTable.name().equals(that.icebergTable.name())
+&& Objects.equals(effectiveSnapshotId(), that.effectiveSnapshotId());
   }
 
   @Override
   public int hashCode() {
-// use only name in order to correctly invalidate Spark cache
-return icebergTable.name().hashCode();
+// use name and effective snapshot id to support time travel
+return Objects.hash(icebergTable.name(), effectiveSnapshotId());
+  }
+
+  public Long effectiveSnapshotId() {

Review Comment:
   Please note that between the statements below, the `table`'s `currentSnapshot` changes.
   
   ```java
   assertThat(table).as("The SparkTable points to latest snapshot").isEqualTo(firstTagTable);
   
   assertThat(table).as("The SparkTable points to latest snapshot").isEqualTo(secondTagTable);
   ```






Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2024-01-10 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1447424619


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java:
##
@@ -0,0 +1,306 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.util.Collections;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Objects;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.IMetaStoreClient;
+import org.apache.hadoop.hive.metastore.TableType;
+import org.apache.hadoop.hive.metastore.api.InvalidObjectException;
+import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.BaseMetastoreTableOperations;
+import org.apache.iceberg.ClientPool;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.CommitFailedException;
+import org.apache.iceberg.exceptions.CommitStateUnknownException;
+import org.apache.iceberg.exceptions.NoSuchIcebergViewException;
+import org.apache.iceberg.exceptions.NoSuchViewException;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.util.MetastoreOperationsUtil;
+import org.apache.iceberg.view.BaseViewOperations;
+import org.apache.iceberg.view.ViewMetadata;
+import org.apache.thrift.TException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/** Hive implementation of Iceberg ViewOperations. */
+final class HiveViewOperations extends BaseViewOperations implements HiveOperationsBase {
+  private static final Logger LOG = LoggerFactory.getLogger(HiveViewOperations.class);
+
+  private final String fullName;
+  private final String database;
+  private final String viewName;
+  private final FileIO fileIO;
+  private final ClientPool metaClients;
+  private final long maxHiveTablePropertySize;
+
+  HiveViewOperations(
+  Configuration conf,
+  ClientPool metaClients,
+  FileIO fileIO,
+  String catalogName,
+  TableIdentifier viewIdentifier) {
+String dbName = viewIdentifier.namespace().level(0);
+this.metaClients = metaClients;
+this.fileIO = fileIO;
+    this.fullName = BaseMetastoreCatalog.fullTableName(catalogName, viewIdentifier);
+this.database = dbName;
+this.viewName = viewIdentifier.name();
+    this.maxHiveTablePropertySize =
+        conf.getLong(HIVE_TABLE_PROPERTY_MAX_SIZE, HIVE_TABLE_PROPERTY_MAX_SIZE_DEFAULT);
+  }
+
+  @Override
+  public void doRefresh() {
+    String metadataLocation = null;
+    Table table;
+
+    try {
+      table = metaClients.run(client -> client.getTable(database, viewName));
+      HiveOperationsBase.validateTableOrViewIsIceberg(table, fullName);
+
+      if (!table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) {
+        throw new NoSuchObjectException();
+      }

Review Comment:
   Why not throw the final exception immediately?
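
   I.e., something like this (a sketch of the suggestion as I read it, not the PR's final code):

   ```java
   // Throw the user-facing exception directly instead of a placeholder
   // NoSuchObjectException that only gets translated further down.
   if (!table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) {
     throw new NoSuchViewException("View does not exist: %s.%s", database, viewName);
   }
   ```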






Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2024-01-10 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1447426135


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java:
##
@@ -0,0 +1,287 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.util.Collections;
+import java.util.Locale;
+import java.util.Objects;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.IMetaStoreClient;
+import org.apache.hadoop.hive.metastore.TableType;
+import org.apache.hadoop.hive.metastore.api.InvalidObjectException;
+import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.BaseMetastoreTableOperations;
+import org.apache.iceberg.ClientPool;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.CommitFailedException;
+import org.apache.iceberg.exceptions.CommitStateUnknownException;
+import org.apache.iceberg.exceptions.NoSuchIcebergViewException;
+import org.apache.iceberg.exceptions.NoSuchViewException;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.util.MetastoreOperationsUtil;
+import org.apache.iceberg.view.BaseViewOperations;
+import org.apache.iceberg.view.ViewMetadata;
+import org.apache.thrift.TException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/** Hive implementation of Iceberg ViewOperations. */
+final class HiveViewOperations extends BaseViewOperations implements HiveOperationsBase {
+  private static final Logger LOG = LoggerFactory.getLogger(HiveViewOperations.class);
+
+  private final String fullName;
+  private final String database;
+  private final String viewName;
+  private final FileIO fileIO;
+  private final ClientPool metaClients;
+  private final long maxHiveTablePropertySize;
+
+  HiveViewOperations(
+  Configuration conf,
+  ClientPool metaClients,
+  FileIO fileIO,
+  String catalogName,
+  TableIdentifier viewIdentifier) {
+String dbName = viewIdentifier.namespace().level(0);
+this.metaClients = metaClients;
+this.fileIO = fileIO;
+    this.fullName = BaseMetastoreCatalog.fullTableName(catalogName, viewIdentifier);
+this.database = dbName;
+this.viewName = viewIdentifier.name();
+    this.maxHiveTablePropertySize =
+        conf.getLong(HIVE_TABLE_PROPERTY_MAX_SIZE, HIVE_TABLE_PROPERTY_MAX_SIZE_DEFAULT);
+  }
+
+  @Override
+  public void doRefresh() {
+    String metadataLocation = null;
+    Table table = null;
+
+    try {
+      table = metaClients.run(client -> client.getTable(database, viewName));
+      HiveOperationsBase.validateTableOrViewIsIceberg(table, fullName);
+      metadataLocation =
+          table.getParameters().get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP);
+
+    } catch (NoSuchObjectException e) {
+      if (currentMetadataLocation() != null) {
+        throw new NoSuchViewException("View does not exist: %s.%s", database, viewName);
+      }
+    } catch (TException e) {
+      String errMsg =
+          String.format("Failed to get view info from metastore %s.%s", database, viewName);
+      throw new RuntimeException(errMsg, e);
+    } catch (InterruptedException e) {
+      Thread.currentThread().interrupt();
+      throw new RuntimeException("Interrupted during refresh", e);
+    }
+
+    if (table != null && !table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) {
+      disableRefresh();
+    } else {
+      refreshFromMetadataLocation(metadataLocation);
+    }
+  }
+
+  @SuppressWarnings("checkstyle:CyclomaticComplexity")
+  @Override
+  public void doCommit(ViewMetadata base, ViewMetadata metadata) {

Review Comment:
   Why not use a commit lock for view creation and modification? We could have the same concurrency issues here that we have with tables.
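
   Roughly the shape used on the table side (a hedged sketch; the `HiveLock` usage and `lockObject()` helper are borrowed from `HiveTableOperations`, with an assumed signature):

   ```java
   HiveLock lock = lockObject(metadata.properties()); // assumed helper/signature
   try {
     lock.lock();
     // load the current HMS object, verify the expected metadataLocation,
     // then persist the new one -- so two concurrent commits cannot both win
   } finally {
     lock.unlock();
   }
   ```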




Re: [PR] Hive: Unwrap RuntimeException for Hive InvalidOperationException with rename table [iceberg]

2024-01-10 Thread via GitHub


pvary commented on code in PR #9432:
URL: https://github.com/apache/iceberg/pull/9432#discussion_r1447431727


##
hive-metastore/src/main/java/org/apache/iceberg/hive/MetastoreUtil.java:
##
@@ -72,6 +73,23 @@ public static void alterTable(
 env.putAll(extraEnv);
 env.put(StatsSetupConst.DO_NOT_UPDATE_STATS, StatsSetupConst.TRUE);
 
-    ALTER_TABLE.invoke(client, databaseName, tblName, table, new EnvironmentContext(env));
+    try {
+      ALTER_TABLE.invoke(client, databaseName, tblName, table, new EnvironmentContext(env));
+    } catch (RuntimeException e) {

Review Comment:
   I made a mistake in the above comment: I thought that the root exception is `MetaException`, but in reality it is `TException`.
   
   With that fixed, my comment should correctly read:
   
   -
   
   `alterTable` wraps the exceptions in a `RuntimeException` because of the `DynMethods` calls. I think this is a mistake. We should behave the same way as the `metaClients.run(client -> client...)` style calls and throw the original `TException`. My opinion here could change if this becomes a breaking change somewhere, but if it does not break anyone's code, then we should change it.
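
   A sketch of that unwrapping (hedged; it assumes `alterTable` is changed to declare `throws TException`):

   ```java
   try {
     ALTER_TABLE.invoke(client, databaseName, tblName, table, new EnvironmentContext(env));
   } catch (RuntimeException e) {
     if (e.getCause() instanceof TException) {
       throw (TException) e.getCause(); // surface the original metastore failure
     }
     throw e;
   }
   ```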






Re: [PR] API, Core: Fix errorprone warnings [iceberg]

2024-01-10 Thread via GitHub


Fokko merged PR #9419:
URL: https://github.com/apache/iceberg/pull/9419








Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2024-01-10 Thread via GitHub


Fokko commented on PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#issuecomment-1884916386

   Since there are no further comments, I'll go ahead and merge this. I would like to express my gratitude to @bryanck for working on this, since it will help so many people in the Kafka community get their data into Iceberg in a fast and reliable way! 🙏 Thanks @ajantha-bhat, @danielcweeks, @rdblue, @jbonofre and @nastra for the review 🚀





Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2024-01-10 Thread via GitHub


Fokko merged PR #8701:
URL: https://github.com/apache/iceberg/pull/8701





Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2024-01-10 Thread via GitHub


jbonofre commented on PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#issuecomment-1884919080

   @Fokko awesome, thanks!





Re: [I] Support Kafka Connect within Iceberg [iceberg]

2024-01-10 Thread via GitHub


Fokko closed issue #4977: Support Kafka Connect within Iceberg
URL: https://github.com/apache/iceberg/issues/4977





Re: [I] Support Kafka Connect within Iceberg [iceberg]

2024-01-10 Thread via GitHub


Fokko commented on issue #4977:
URL: https://github.com/apache/iceberg/issues/4977#issuecomment-1884930862

   Hey everyone, the Kafka Connect sink by Tabular has been donated to Iceberg in https://github.com/apache/iceberg/pull/8701. I'll go ahead and close this issue; feel free to open a new one if there are any further questions.





Re: [I] Flink1.12.1 +Iceberg0.12.0 has problems with real-time reading and writing in upsert mode [iceberg]

2024-01-10 Thread via GitHub


Fokko closed issue #3277: Flink1.12.1 +Iceberg0.12.0 has problems with 
real-time reading and writing in upsert mode
URL: https://github.com/apache/iceberg/issues/3277





Re: [I] Flink1.12.1 +Iceberg0.12.0 has problems with real-time reading and writing in upsert mode [iceberg]

2024-01-10 Thread via GitHub


Fokko commented on issue #3277:
URL: https://github.com/apache/iceberg/issues/3277#issuecomment-1884931848

   This has been fixed in a later version, so I'll go ahead and close this for now.





Re: [I] Snowflake Iceberg Partitioned data read issue [iceberg]

2024-01-10 Thread via GitHub


purna344 commented on issue #9404:
URL: https://github.com/apache/iceberg/issues/9404#issuecomment-1885023931

   If the producers write the data to storage with the config value below,
   `spark.conf.set("spark.databricks.delta.writePartitionColumnsToParquet", "false")`
   then the *.parquet files do not contain the partition-column information, and the partition values are stored only in the file path.
   It is not possible for us to ask producers not to set this config value in their Spark jobs when they publish the data.
   I heard that the Iceberg format expects the partition values to be in the Parquet file. How should we handle this scenario, and does Iceberg support any config parameter to read the partition values from the folder path?
   CC: @amogh-jahagirdar





Re: [PR] Spark 3.5: Migrate tests in SQL directory to JUnit5 [iceberg]

2024-01-10 Thread via GitHub


nastra merged PR #9401:
URL: https://github.com/apache/iceberg/pull/9401





Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]

2024-01-10 Thread via GitHub


nastra commented on code in PR #9416:
URL: https://github.com/apache/iceberg/pull/9416#discussion_r1447600190


##
core/src/test/java/org/apache/iceberg/TestBase.java:
##
@@ -173,7 +173,7 @@ public class TestBase {
   public TestTables.TestTable table = null;
 
   @Parameters(name = "formatVersion = {0}")
-  protected static List parameters() {
+  protected static List parameters() {

Review Comment:
   why is this change necessary?






Re: [PR] Flink: Added error handling and default logic for Flink version detection [iceberg]

2024-01-10 Thread via GitHub


gjacoby126 commented on code in PR #9452:
URL: https://github.com/apache/iceberg/pull/9452#discussion_r1447601555


##
flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/util/FlinkVersionDetector.java:
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.flink.util;
+
+import org.apache.flink.streaming.api.datastream.DataStream;
+
+public class FlinkVersionDetector {
+  public String version() {
+    String version = null;
+    try {
+      version = getVersionFromJar();
+    } catch (Exception e) {
+      /* we can't detect the exact implementation version from the jar (this can happen if the DataStream class

Review Comment:
   Oh, and just FYI: I'm not affiliated with the engineers who posted the original issue this patch fixes; I just came across it while investigating. So this has happened at least twice in real-world use.






Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]

2024-01-10 Thread via GitHub


nastra commented on code in PR #9416:
URL: https://github.com/apache/iceberg/pull/9416#discussion_r1447603337


##
spark/v3.5/spark/src/test/java/org/apache/iceberg/TestSparkDistributedDataScanDeletes.java:
##
@@ -21,42 +21,39 @@
 import static org.apache.iceberg.PlanningMode.DISTRIBUTED;
 import static org.apache.iceberg.PlanningMode.LOCAL;
 
+import java.util.Arrays;
+import java.util.List;
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
 import org.apache.iceberg.spark.SparkReadConf;
 import org.apache.spark.sql.SparkSession;
 import org.apache.spark.sql.internal.SQLConf;
-import org.junit.AfterClass;
-import org.junit.Before;
-import org.junit.BeforeClass;
-import org.junit.runner.RunWith;
-import org.junit.runners.Parameterized;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.extension.ExtendWith;
 
-@RunWith(Parameterized.class)
+@ExtendWith(ParameterizedTestExtension.class)
 public class TestSparkDistributedDataScanDeletes
 extends DeleteFileIndexTestBase> {
 
-  @Parameterized.Parameters(name = "dataMode = {0}, deleteMode = {1}")
-  public static Object[] parameters() {
-return new Object[][] {
-  new Object[] {LOCAL, LOCAL},
-  new Object[] {LOCAL, DISTRIBUTED},
-  new Object[] {DISTRIBUTED, LOCAL},
-  new Object[] {DISTRIBUTED, DISTRIBUTED}
-};
+  @Parameters(name = "formatVersion = {0}, dataMode = {1}, deleteMode = {2}")
+  public static List parameters() {

Review Comment:
   ```suggestion
 public static List parameters() {
   ```






Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]

2024-01-10 Thread via GitHub


nastra commented on code in PR #9416:
URL: https://github.com/apache/iceberg/pull/9416#discussion_r1447602835


##
spark/v3.5/spark/src/test/java/org/apache/iceberg/SparkDistributedDataScanTestBase.java:
##
@@ -21,46 +21,41 @@
 import static org.apache.iceberg.PlanningMode.DISTRIBUTED;
 import static org.apache.iceberg.PlanningMode.LOCAL;
 
+import java.util.Arrays;
+import java.util.List;
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
 import org.apache.iceberg.spark.SparkReadConf;
 import org.apache.spark.sql.SparkSession;
 import org.apache.spark.sql.internal.SQLConf;
-import org.junit.Before;
-import org.junit.runner.RunWith;
-import org.junit.runners.Parameterized;
-import org.junit.runners.Parameterized.Parameters;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.extension.ExtendWith;
 
-@RunWith(Parameterized.class)
+@ExtendWith(ParameterizedTestExtension.class)
 public abstract class SparkDistributedDataScanTestBase
 extends DataTableScanTestBase> {
 
   @Parameters(name = "formatVersion = {0}, dataMode = {1}, deleteMode = {2}")
-  public static Object[] parameters() {
-return new Object[][] {
-  new Object[] {1, LOCAL, LOCAL},
-  new Object[] {1, LOCAL, DISTRIBUTED},
-  new Object[] {1, DISTRIBUTED, LOCAL},
-  new Object[] {1, DISTRIBUTED, DISTRIBUTED},
-  new Object[] {2, LOCAL, LOCAL},
-  new Object[] {2, LOCAL, DISTRIBUTED},
-  new Object[] {2, DISTRIBUTED, LOCAL},
-  new Object[] {2, DISTRIBUTED, DISTRIBUTED}
-};
+  public static List parameters() {

Review Comment:
   ```suggestion
 public static List parameters() {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Flink: Added error handling and default logic for Flink version detection [iceberg]

2024-01-10 Thread via GitHub


pvary commented on code in PR #9452:
URL: https://github.com/apache/iceberg/pull/9452#discussion_r1447604053


##
flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/util/FlinkVersionDetector.java:
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.flink.util;
+
+import org.apache.flink.streaming.api.datastream.DataStream;
+
+public class FlinkVersionDetector {
+  public String version() {
+String version = null;
+try {
+  version = getVersionFromJar();
+} catch (Exception e) {
+  /* we can't detect the exact implementation version from the jar (this 
can happen if the DataStream class

Review Comment:
   Why did we get the `NullPointerException` in this case?
   In my experience, if we have the class on the classpath multiple times, Java 
just uses the first one it finds. Was it because the shading messed up the 
metadata of the jar?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]

2024-01-10 Thread via GitHub


nastra commented on code in PR #9416:
URL: https://github.com/apache/iceberg/pull/9416#discussion_r1447604611


##
spark/v3.5/spark/src/test/java/org/apache/iceberg/TestSparkDistributedDataScanReporting.java:
##
@@ -21,41 +21,38 @@
 import static org.apache.iceberg.PlanningMode.DISTRIBUTED;
 import static org.apache.iceberg.PlanningMode.LOCAL;
 
+import java.util.Arrays;
+import java.util.List;
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
 import org.apache.iceberg.spark.SparkReadConf;
 import org.apache.spark.sql.SparkSession;
 import org.apache.spark.sql.internal.SQLConf;
-import org.junit.AfterClass;
-import org.junit.BeforeClass;
-import org.junit.runner.RunWith;
-import org.junit.runners.Parameterized;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.extension.ExtendWith;
 
-@RunWith(Parameterized.class)
+@ExtendWith(ParameterizedTestExtension.class)
 public class TestSparkDistributedDataScanReporting
 extends ScanPlanningAndReportingTestBase> {
 
-  @Parameterized.Parameters(name = "dataMode = {0}, deleteMode = {1}")
-  public static Object[] parameters() {
-return new Object[][] {
-  new Object[] {LOCAL, LOCAL},
-  new Object[] {LOCAL, DISTRIBUTED},
-  new Object[] {DISTRIBUTED, LOCAL},
-  new Object[] {DISTRIBUTED, DISTRIBUTED}
-};
+  @Parameters(name = "formatVersion = {0}, dataMode = {1}, deleteMode = {2}")
+  public static List parameters() {

Review Comment:
   ```suggestion
 public static List parameters() {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]

2024-01-10 Thread via GitHub


nastra commented on code in PR #9416:
URL: https://github.com/apache/iceberg/pull/9416#discussion_r1447605228


##
data/src/test/java/org/apache/iceberg/io/TestGenericSortedPosDeleteWriter.java:
##
@@ -62,7 +62,7 @@ public class TestGenericSortedPosDeleteWriter extends 
TestBase {
   private Record gRecord;
 
   @Parameters(name = "formatVersion = {0}, fileFormat = {1}")
-  public static List parameters() {
+  public static List parameters() {

Review Comment:
   this should be reverted



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]

2024-01-10 Thread via GitHub


nastra commented on code in PR #9416:
URL: https://github.com/apache/iceberg/pull/9416#discussion_r1447607114


##
core/src/test/java/org/apache/iceberg/TestLocalFilterFiles.java:
##
@@ -18,20 +18,17 @@
  */
 package org.apache.iceberg;
 
-import org.junit.runner.RunWith;
-import org.junit.runners.Parameterized;
+import java.util.Arrays;
+import java.util.List;
+import org.junit.jupiter.api.extension.ExtendWith;
 
-@RunWith(Parameterized.class)
+@ExtendWith(ParameterizedTestExtension.class)
 public class TestLocalFilterFiles
 extends FilterFilesTestBase {
 
-  @Parameterized.Parameters(name = "formatVersion = {0}")
-  public static Object[] parameters() {
-return new Object[] {1, 2};
-  }
-
-  public TestLocalFilterFiles(int formatVersion) {
-super(formatVersion);
+  @Parameters(name = "formatVersion = {0}")
+  public static List parameters() {

Review Comment:
   ```suggestion
 public static List parameters() {
   ```
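
   For readers following the migration, a minimal sketch of the overall JUnit5 
pattern these reviews converge on. It assumes Iceberg's test-side helpers 
(`ParameterizedTestExtension`, `@Parameters`, `@Parameter`); the class name, 
field visibility, and assertion are illustrative only:
   
   ```java
   import static org.assertj.core.api.Assertions.assertThat;
   
   import java.util.Arrays;
   import java.util.List;
   import org.junit.jupiter.api.TestTemplate;
   import org.junit.jupiter.api.extension.ExtendWith;
   
   @ExtendWith(ParameterizedTestExtension.class)
   public class ExampleFormatVersionTest {
   
     // one invocation of each @TestTemplate per element returned here
     @Parameters(name = "formatVersion = {0}")
     public static List<Object> parameters() {
       return Arrays.asList(1, 2);
     }
   
     // injected by the extension instead of a constructor argument
     @Parameter(index = 0)
     protected int formatVersion;
   
     @TestTemplate
     public void formatVersionIsSupported() {
       assertThat(formatVersion).isBetween(1, 2);
     }
   }
   ```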



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]

2024-01-10 Thread via GitHub


nastra commented on code in PR #9416:
URL: https://github.com/apache/iceberg/pull/9416#discussion_r1447606490


##
core/src/test/java/org/apache/iceberg/FilterFilesTestBase.java:
##
@@ -24,67 +24,67 @@
 import java.io.File;
 import java.io.IOException;
 import java.nio.ByteBuffer;
+import java.nio.file.Files;
+import java.nio.file.Path;
 import java.util.Map;
 import org.apache.iceberg.expressions.Expressions;
 import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
 import org.apache.iceberg.relocated.com.google.common.collect.Maps;
 import org.apache.iceberg.types.Conversions;
 import org.apache.iceberg.types.Types;
-import org.junit.After;
-import org.junit.Before;
-import org.junit.Rule;
-import org.junit.Test;
-import org.junit.rules.TemporaryFolder;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.TestTemplate;
+import org.junit.jupiter.api.extension.ExtendWith;
+import org.junit.jupiter.api.io.TempDir;
 
+@ExtendWith(ParameterizedTestExtension.class)
 public abstract class FilterFilesTestBase<
 ScanT extends Scan, T extends ScanTask, G extends 
ScanTaskGroup> {
 
-  public final int formatVersion;
-
-  public FilterFilesTestBase(int formatVersion) {
-this.formatVersion = formatVersion;
-  }
+  @Parameter(index = 0)
+  public int formatVersion;

Review Comment:
   does this need to be public or can we switch it to protected?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Adding Junit 5 conversion and AssertJ style for TestFlinkCatalogTable… [iceberg]

2024-01-10 Thread via GitHub


nastra commented on code in PR #9381:
URL: https://github.com/apache/iceberg/pull/9381#discussion_r1447617468


##
flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/TestStreamScanSql.java:
##
@@ -33,30 +36,27 @@
 import org.apache.iceberg.FileFormat;
 import org.apache.iceberg.Table;
 import org.apache.iceberg.TestHelpers;
-import org.apache.iceberg.catalog.Namespace;
 import org.apache.iceberg.catalog.TableIdentifier;
 import org.apache.iceberg.data.GenericAppenderHelper;
 import org.apache.iceberg.data.GenericRecord;
 import org.apache.iceberg.data.Record;
-import org.apache.iceberg.flink.FlinkCatalogTestBase;
+import org.apache.iceberg.flink.CatalogTestBase;
 import org.apache.iceberg.flink.MiniClusterResource;
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 import org.assertj.core.api.Assertions;
-import org.junit.After;
-import org.junit.Assert;
-import org.junit.Before;
-import org.junit.Test;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.TestTemplate;
+import org.junit.jupiter.api.io.TempDir;
 
-public class TestStreamScanSql extends FlinkCatalogTestBase {
+public class TestStreamScanSql extends CatalogTestBase {
   private static final String TABLE = "test_table";
   private static final FileFormat FORMAT = FileFormat.PARQUET;
 
   private TableEnvironment tEnv;
 
-  public TestStreamScanSql(String catalogName, Namespace baseNamespace) {
-super(catalogName, baseNamespace);
-  }
+  private @TempDir Path temp;

Review Comment:
   no need to define this, you should be able to use `temporaryDirectory` from 
the super class. However, I noticed that you might need to change the 
visibility of `TestBase.temporaryDirectory` to protected



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Adding Junit 5 conversion and AssertJ style for TestFlinkCatalogTable… [iceberg]

2024-01-10 Thread via GitHub


nastra commented on code in PR #9381:
URL: https://github.com/apache/iceberg/pull/9381#discussion_r1447617985


##
flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/TestStreamScanSql.java:
##
@@ -127,20 +127,16 @@ private void insertRows(Table table, Row... rows) throws 
IOException {
 
   private void assertRows(List expectedRows, Iterator iterator) {
 for (Row expectedRow : expectedRows) {
-  Assert.assertTrue("Should have more records", iterator.hasNext());
-
+  assertThat(iterator.hasNext()).isTrue();

Review Comment:
   ```suggestion
 assertThat(iterator).hasNext();
   ```
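
   For what it's worth, a small sketch of why the iterator form reads better: 
the boolean form fails with only "expected true but was false", while AssertJ's 
iterator assertion describes the iterator under test. The class name here is 
illustrative:
   
   ```java
   import static org.assertj.core.api.Assertions.assertThat;
   
   import java.util.Arrays;
   import java.util.Iterator;
   
   public class IteratorAssertionSketch {
     public static void main(String[] args) {
       Iterator<String> iterator = Arrays.asList("a", "b").iterator();
   
       // boolean form: a failure message only says the value was false
       assertThat(iterator.hasNext()).isTrue();
   
       // iterator form: a failure message reports on the exhausted iterator
       assertThat(iterator).hasNext();
     }
   }
   ```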



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Support Kafka Connect within Iceberg [iceberg]

2024-01-10 Thread via GitHub


ajantha-bhat commented on issue #4977:
URL: https://github.com/apache/iceberg/issues/4977#issuecomment-1885176312

   > Hey everyone, the Kafka connect sink by Tabular has been donated to 
Iceberg in https://github.com/apache/iceberg/pull/8701. I go ahead and close 
this issue, feel free to open up a new one if there are any further questions.
   
   Note: Just the initial PR got merged; there is still a long way to go. But 
yeah, we can close this issue, noting that it is already supported by 
https://github.com/tabular-io/iceberg-kafka-connect and is also in the process 
of being donated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Flink: Added error handling and default logic for Flink version detection [iceberg]

2024-01-10 Thread via GitHub


gjacoby126 commented on code in PR #9452:
URL: https://github.com/apache/iceberg/pull/9452#discussion_r1447629862


##
flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/util/FlinkVersionDetector.java:
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.flink.util;
+
+import org.apache.flink.streaming.api.datastream.DataStream;
+
+public class FlinkVersionDetector {
+  public String version() {
+String version = null;
+try {
+  version = getVersionFromJar();
+} catch (Exception e) {
+  /* we can't detect the exact implementation version from the jar (this 
can happen if the DataStream class

Review Comment:
   I don't know for sure, but the shading messing up the metadata is my best 
guess too. 
   
   In addition to the unit test in the patch, I did confirm that the patch 
resolves the issue in a real Flink cluster. 
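
   For context, a minimal sketch of the detect-then-fall-back behavior under 
discussion. The fallback constant and class name are assumptions for 
illustration, not the PR's actual code; note that `getPackage()` itself can 
return null when jar metadata is missing, which would explain the NPE:
   
   ```java
   import org.apache.flink.streaming.api.datastream.DataStream;
   
   public class FlinkVersionSketch {
     // assumed fallback; the real patch derives its own default
     private static final String DEFAULT_FLINK_VERSION = "1.18";
   
     public String version() {
       try {
         // jar manifest metadata can be absent after shading
         Package pkg = DataStream.class.getPackage();
         String fromJar = (pkg == null) ? null : pkg.getImplementationVersion();
         if (fromJar != null) {
           return fromJar;
         }
       } catch (RuntimeException e) {
         // can't detect the version from the jar; fall through to the default
       }
       return DEFAULT_FLINK_VERSION;
     }
   }
   ```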



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2024-01-10 Thread via GitHub


bryanck commented on PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#issuecomment-1885241611

   Awesome! Thanks all for the feedback and guidance. I'll follow up with PRs 
for the actual sink portion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] [Bug fix] Ensure string env var s3.connect-timeout is cast to float [iceberg-python]

2024-01-10 Thread via GitHub


syun64 opened a new pull request, #259:
URL: https://github.com/apache/iceberg-python/pull/259

   ```
   %env PYICEBERG_CATALOG__LACUS__S3.CONNECT-TIMEOUT=60
   
   from pyiceberg.catalog import load_catalog
   catalog = load_catalog("test")
   tbl = catalog.load_table("test.test")
   tbl.scan().to_arrow().to_pandas()
   
   ```
   Stacktrace:
   
   ```
   TypeError Traceback (most recent call last)
   /tmp/ipykernel_734/2954162570.py in <module>()
 1 tbl = catalog.load_table("test.table")
   > 2 tbl.scan().to_arrow().to_pandas()
   
   
/layers/com.bloomberg.ds.buildpacks.pip/requirements/lib/python3.10/site-packages/pyiceberg/table/__init__.py
 in to_arrow(self)
  1290 
  1291 return project_table(
   -> 1292 self.plan_files(),
  1293 self.table,
  1294 self.row_filter,
   
   
/layers/com.bloomberg.ds.buildpacks.pip/requirements/lib/python3.10/site-packages/pyiceberg/table/__init__.py
 in plan_files(self)
  1232 manifests = [
  1233 manifest_file
   -> 1234 for manifest_file in snapshot.manifests(io)
  1235 if 
manifest_evaluators[manifest_file.partition_spec_id](manifest_file)
  1236 ]
   
   
/layers/com.bloomberg.ds.buildpacks.pip/requirements/lib/python3.10/site-packages/pyiceberg/table/snapshots.py
 in manifests(self, io)
   157 def manifests(self, io: FileIO) -> List[ManifestFile]:
   158 if self.manifest_list is not None:
   --> 159 file = io.new_input(self.manifest_list)
   160 return list(read_manifest_list(file))
   161 return []
   
   
/layers/com.bloomberg.ds.buildpacks.pip/requirements/lib/python3.10/site-packages/pyiceberg/io/pyarrow.py
 in new_input(self, location)
   392 scheme, netloc, path = self.parse_location(location)
   393 return PyArrowFile(
   --> 394 fs=self.fs_by_scheme(scheme, netloc),
   395 location=location,
   396 path=path,
   
   
/layers/com.bloomberg.ds.buildpacks.pip/requirements/lib/python3.10/site-packages/pyiceberg/io/pyarrow.py
 in _initialize_fs(self, scheme, netloc)
   340 client_kwargs["connect_timeout"] = connect_timeout
   341 
   --> 342 return S3FileSystem(**client_kwargs)
   343 elif scheme == "hdfs":
   344 from pyarrow.fs import HadoopFileSystem
   
   
/layers/com.bloomberg.ds.buildpacks.pip/requirements/lib/python3.10/site-packages/pyarrow/_s3fs.pyx
 in pyarrow._s3fs.S3FileSystem.__init__()
   
   TypeError: must be real number, not str
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] [Bug fix] Ensure string env var s3.connect-timeout is cast to float [iceberg-python]

2024-01-10 Thread via GitHub


Fokko commented on PR #259:
URL: https://github.com/apache/iceberg-python/pull/259#issuecomment-1885265016

   @syun64 Can you fix the `mypy` violation:
   
   ```
   pyiceberg/io/pyarrow.py:336: error: Incompatible types in assignment 
(expression has type "float", target has type "Optional[str]")  [assignment]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Spark: Support min/max/count push down for partition columns [iceberg]

2024-01-10 Thread via GitHub


xiaoxuandev opened a new pull request, #9457:
URL: https://github.com/apache/iceberg/pull/9457

   ### Notes
   Support min/max/count aggregate push down for partition columns
   
   - min/max/count aggregate push down does not work if partition columns 
aren't present as data columns (the stats won't be present in the Avro files), 
so even when the aggregate has been pushed down to the data source, 
`AggregateEvaluator` fails and the query still goes through a full table scan
   - add support by updating the evaluator based on PartitionData
   
   ### Testing
   Created a Hive table:
   CREATE EXTERNAL TABLE store_sales (id int, data INT) PARTITIONED BY 
(ss_sold_date_sk INT)
   then registered it as an Iceberg table.
   
   Tested on Spark 3.5; verified that count/min/max are successfully pushed 
down, and that simple queries (`select count(ss_sold_date_sk) from 
store_sales`, `select min(ss_sold_date_sk) from store_sales` and `select 
max(ss_sold_date_sk) from store_sales`) are sped up by the change
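
   As a rough illustration of the idea (not the PR's code), a min/max/count 
over a partition column can be answered from each file's partition value and 
record count alone, assuming the partition value is never null:
   
   ```java
   import java.util.Arrays;
   import java.util.List;
   
   public class PartitionAggregateSketch {
   
     static class DataFileInfo {
       final int partitionValue; // e.g. ss_sold_date_sk from PartitionData
       final long recordCount;   // from the manifest entry
   
       DataFileInfo(int partitionValue, long recordCount) {
         this.partitionValue = partitionValue;
         this.recordCount = recordCount;
       }
     }
   
     public static void main(String[] args) {
       // stand-ins for per-file metadata; no data files are opened
       List<DataFileInfo> files =
           Arrays.asList(new DataFileInfo(20230101, 100L), new DataFileInfo(20230102, 250L));
   
       long count = files.stream().mapToLong(f -> f.recordCount).sum();
       int min = files.stream().mapToInt(f -> f.partitionValue).min().getAsInt();
       int max = files.stream().mapToInt(f -> f.partitionValue).max().getAsInt();
   
       System.out.printf("count=%d min=%d max=%d%n", count, min, max);
     }
   }
   ```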


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] [Bug fix] Ensure string env var s3.connect-timeout is cast to float [iceberg-python]

2024-01-10 Thread via GitHub


Fokko commented on PR #259:
URL: https://github.com/apache/iceberg-python/pull/259#issuecomment-1885335933

   Thanks for fixing this @syun64 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] [Bug fix] Ensure string env var s3.connect-timeout is cast to float [iceberg-python]

2024-01-10 Thread via GitHub


Fokko merged PR #259:
URL: https://github.com/apache/iceberg-python/pull/259


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Support partitioned writes [iceberg-python]

2024-01-10 Thread via GitHub


jqin61 commented on issue #208:
URL: https://github.com/apache/iceberg-python/issues/208#issuecomment-1885365903

   > In Iceberg it can be that some files are still on an older partitioning, 
we should make sure that we handle those correctly based on the spec that we provide.
   
   It seems Spark's Iceberg support has the following overwrite behaviors under 
partition spec evolution:
   - dynamic overwrite: data files written with an old partition spec will not 
be replaced even if some of their records match the overwriting data
   - static overwrite with PARTITION values specified: same as above
   - static overwrite without PARTITION values: all data is deleted regardless 
of which partition spec it conforms to.
   
   As Fokko mentioned, we need to make sure the implementation uses the latest 
partition spec_id when overwriting partitions so that data written with an old 
partition spec is not touched.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[I] Data: Errors in some file readers do not report the file in which they failed [iceberg]

2024-01-10 Thread via GitHub


RussellSpitzer opened a new issue, #9458:
URL: https://github.com/apache/iceberg/issues/9458

   ### Feature Request / Improvement
   
   There are several places in our code currently where a failure while reading 
a file will throw an exception but the exception will not contain any 
information related to which file was being read during the failure. The Avro 
reader is an example of this.
   
   When planning a table scan with a corrupted manifest the user will end up 
with an exception like
   ```java
   org.apache.iceberg.exceptions.RuntimeIOException: Failed to read next record 
at 
   
org.apache.iceberg.avro.AvroIterable$AvroReuseIterator.next(AvroIterable.java:204)
 at
   org.apache.iceberg.io.CloseableIterable$7$1.next(CloseableIterable.java:202) 
at
   org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:65) at
   org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:49) at
   
org.apache.iceberg.io.CloseableIterable$7$1.hasNext(CloseableIterable.java:197) 
at
   org.apache.iceberg.io.CloseableIterator$2.hasNext(CloseableIterator.java:72) 
at
   org.apache.iceberg.io.ClosingIterator.hasNext(ClosingIterator.java:39) at
   
scala.collection.convert.JavaCollectionWrappers$JIteratorWrapper.hasNext(JavaCollectionWrappers.scala:37)
 at
   scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:593) at
   scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576) at
   
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
 Source) at 
   ``` 
   
   
   This is obviously not very useful since we do not know which file it is. I 
think we should go into our AvroIterable (and the other file format readers) 
and make sure that they include the file path in the error.
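
   For illustration, a hedged sketch of the wrapping this would entail; the 
reader and method names are invented for the example, not Iceberg's actual 
AvroIterable internals:
   
   ```java
   import java.io.IOException;
   import java.io.UncheckedIOException;
   
   public class FileAwareReader {
     private final String path;
   
     public FileAwareReader(String path) {
       this.path = path;
     }
   
     public Object readNext() {
       try {
         return doRead();
       } catch (IOException e) {
         // name the failing file so planning failures are debuggable
         throw new UncheckedIOException("Failed to read next record from " + path, e);
       }
     }
   
     private Object doRead() throws IOException {
       throw new IOException("corrupted block"); // stand-in for a decode failure
     }
   }
   ```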
   
   ### Query engine
   
   None


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Adding Junit 5 conversion and AssertJ style for TestFlinkCatalogTable… [iceberg]

2024-01-10 Thread via GitHub


vinitpatni commented on code in PR #9381:
URL: https://github.com/apache/iceberg/pull/9381#discussion_r1447849833


##
flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/TestStreamScanSql.java:
##
@@ -33,30 +36,27 @@
 import org.apache.iceberg.FileFormat;
 import org.apache.iceberg.Table;
 import org.apache.iceberg.TestHelpers;
-import org.apache.iceberg.catalog.Namespace;
 import org.apache.iceberg.catalog.TableIdentifier;
 import org.apache.iceberg.data.GenericAppenderHelper;
 import org.apache.iceberg.data.GenericRecord;
 import org.apache.iceberg.data.Record;
-import org.apache.iceberg.flink.FlinkCatalogTestBase;
+import org.apache.iceberg.flink.CatalogTestBase;
 import org.apache.iceberg.flink.MiniClusterResource;
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 import org.assertj.core.api.Assertions;
-import org.junit.After;
-import org.junit.Assert;
-import org.junit.Before;
-import org.junit.Test;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.TestTemplate;
+import org.junit.jupiter.api.io.TempDir;
 
-public class TestStreamScanSql extends FlinkCatalogTestBase {
+public class TestStreamScanSql extends CatalogTestBase {
   private static final String TABLE = "test_table";
   private static final FileFormat FORMAT = FileFormat.PARQUET;
 
   private TableEnvironment tEnv;
 
-  public TestStreamScanSql(String catalogName, Namespace baseNamespace) {
-super(catalogName, baseNamespace);
-  }
+  private @TempDir Path temp;

Review Comment:
   ack



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Adding Junit 5 conversion and AssertJ style for TestFlinkCatalogTable… [iceberg]

2024-01-10 Thread via GitHub


vinitpatni commented on code in PR #9381:
URL: https://github.com/apache/iceberg/pull/9381#discussion_r1447853965


##
flink/v1.18/flink/src/test/java/org/apache/iceberg/flink/source/TestStreamScanSql.java:
##
@@ -127,20 +127,16 @@ private void insertRows(Table table, Row... rows) throws 
IOException {
 
   private void assertRows(List expectedRows, Iterator iterator) {
 for (Row expectedRow : expectedRows) {
-  Assert.assertTrue("Should have more records", iterator.hasNext());
-
+  assertThat(iterator.hasNext()).isTrue();

Review Comment:
   ack



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Adding Junit 5 conversion and AssertJ style for TestFlinkCatalogTable… [iceberg]

2024-01-10 Thread via GitHub


vinitpatni commented on PR #9381:
URL: https://github.com/apache/iceberg/pull/9381#issuecomment-1885563043

   - Addressing Review Comments


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] If iceberg's parquet data file contains an array of time type, it cannot be read by spark library even after dropping this column [iceberg]

2024-01-10 Thread via GitHub


huan233usc commented on issue #9446:
URL: https://github.com/apache/iceberg/issues/9446#issuecomment-1885597211

   Hi, if someone could point me to some information about the history of 
batchedReaderFunc vs. ReaderFunc and the related testing code paths, I'd be 
happy to work on the fix for that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Migrate remaining tests in source directory to JUnit5 [iceberg]

2024-01-10 Thread via GitHub


nastra commented on code in PR #9380:
URL: https://github.com/apache/iceberg/pull/9380#discussion_r1447970015


##
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSnapshotSelection.java:
##
@@ -44,57 +48,58 @@
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.RowFactory;
 import org.apache.spark.sql.SparkSession;
-import org.assertj.core.api.Assertions;
-import org.junit.AfterClass;
-import org.junit.Assert;
-import org.junit.BeforeClass;
-import org.junit.Rule;
-import org.junit.Test;
-import org.junit.rules.TemporaryFolder;
-import org.junit.runner.RunWith;
-import org.junit.runners.Parameterized;
-
-@RunWith(Parameterized.class)
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.TestTemplate;
+import org.junit.jupiter.api.extension.ExtendWith;
+import org.junit.jupiter.api.io.TempDir;
+
+@ExtendWith(ParameterizedTestExtension.class)
 public class TestSnapshotSelection {
 
-  @Parameterized.Parameters(name = "planningMode = {0}")
+  @Parameters(name = "properties = {0}")
   public static Object[] parameters() {
-return new Object[] {LOCAL, DISTRIBUTED};
+return new Object[][] {
+  {
+ImmutableMap.of(
+TableProperties.DATA_PLANNING_MODE, LOCAL.modeName(),
+TableProperties.DELETE_PLANNING_MODE, LOCAL.modeName())
+  },
+  {
+ImmutableMap.of(
+TableProperties.DATA_PLANNING_MODE, DISTRIBUTED.modeName(),
+TableProperties.DELETE_PLANNING_MODE, DISTRIBUTED.modeName())
+  }
+};
   }
 
   private static final Configuration CONF = new Configuration();
   private static final Schema SCHEMA =
   new Schema(
   optional(1, "id", Types.IntegerType.get()), optional(2, "data", 
Types.StringType.get()));
 
-  @Rule public TemporaryFolder temp = new TemporaryFolder();
+  @TempDir private Path temp;
 
   private static SparkSession spark = null;
 
-  private final Map properties;
-
-  public TestSnapshotSelection(PlanningMode planningMode) {
-this.properties =
-ImmutableMap.of(
-TableProperties.DATA_PLANNING_MODE, planningMode.modeName(),
-TableProperties.DELETE_PLANNING_MODE, planningMode.modeName());
-  }
+  @Parameter(index = 0)
+  private Map properties;
 
-  @BeforeClass
-  public static void startSpark() {
+  @BeforeEach
+  public void startSpark() {

Review Comment:
   previously this was started/stopped at the class-level. Any particular 
reason why we start/stop this now at the test method level?
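
   For reference, a minimal sketch of the class-level lifecycle being referred 
to: `@BeforeAll`/`@AfterAll` run once per class on static members, while 
`@BeforeEach`/`@AfterEach` run around every test. The shared object below is a 
stand-in for the SparkSession:
   
   ```java
   import org.junit.jupiter.api.AfterAll;
   import org.junit.jupiter.api.BeforeAll;
   
   class ClassLevelLifecycleSketch {
     private static Object sharedSession; // stand-in for the SparkSession
   
     @BeforeAll
     static void startOnce() {
       // runs a single time before any test in the class
       sharedSession = new Object();
     }
   
     @AfterAll
     static void stopOnce() {
       // runs once after the last test, not after every test method
       sharedSession = null;
     }
   }
   ```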



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Migrate remaining tests in source directory to JUnit5 [iceberg]

2024-01-10 Thread via GitHub


nastra commented on code in PR #9380:
URL: https://github.com/apache/iceberg/pull/9380#discussion_r1447971372


##
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestPositionDeletesTable.java:
##
@@ -123,15 +126,9 @@ public static Object[][] parameters() {
 };
   }
 
-  public TestPositionDeletesTable(
-  String catalogName, String implementation, Map config, 
FileFormat format) {
-super(catalogName, implementation, config);
-this.format = format;
-  }
-
-  @Rule public TemporaryFolder temp = new TemporaryFolder();
+  @TempDir private Path temp;

Review Comment:
   is this needed? I think you should be able to use the one defined in the 
super class



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] iceberg reports an error after upgrading to 1.4.2 [iceberg]

2024-01-10 Thread via GitHub


ZachDischner commented on issue #9018:
URL: https://github.com/apache/iceberg/issues/9018#issuecomment-1885807415

   I am also seeing this issue. I have existing Iceberg tables, for which a 
large number of Spark SQL queries simply fail once I use more updated 
libraries. 
   
   My existing tables were created and updated using Spark on EMR over the past 
year. I can reproduce the failure only on modern EMR/Iceberg environments, with 
the same queries that run fine on previous ones.  
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/Iceberg-release-history.html
   
   emr-6.10.0 - Reads and Writes work
   emr-6.10.1 - Reads and Writes work
   ...
   emr-6.14.0 - Reads and Writes work
   emr-6.15.0 - Some reads _don't_ work
   emr-7.0.0 - Some reads _don't_ work
   
   The cutoff appears to be Iceberg version `1.4.0`+. 
   
   I'm not sure if this helps, but working with an obfuscated example: the 
query that fails includes many CTEs, and the error only appears with a 
particular one.
   
   ```
   spark.sql("""
   WITH 
 a as (SELECT * FROM table WHERE ),
 b as (SELECT * FROM table2 WHERE )
... joins, filters, etc
   z as (SELECT * FROM a union b union c join d...)
   SELECT * FROM z"""
   )
   ```
   
   An intermediate CTE is where the error manifests. I cannot see anything 
about it that is immediately suspicious:
   ```
   m AS (SELECT col1, col2, col3, col4, ... FROM l)
   ```
   
   Such that `SELECT * FROM l` succeeds, but `SELECT * FROM m` fails. 
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Build: Bump cython from 3.0.7 to 3.0.8 [iceberg-python]

2024-01-10 Thread via GitHub


dependabot[bot] opened a new pull request, #260:
URL: https://github.com/apache/iceberg-python/pull/260

   Bumps [cython](https://github.com/cython/cython) from 3.0.7 to 3.0.8.
   
   Changelog
   Sourced from cython's changelog 
(https://github.com/cython/cython/blob/master/CHANGES.rst):
   
   3.0.8 (2024-01-10)
   Bugs fixed
   
   - Using const together with defined fused types could fail to compile. 
(Github issue #5230)
   - A "use after free" bug was fixed in parallel sections. (Github issue 
#5922)
   - Several types were not available as cython.* types in pure Python code.
   - The generated code is now correct C89 again, removing some C++ style // 
comments and C99-style declaration-after-code code ordering. This is still 
relevant for some old C compilers, specifically ones that match old Python 2.7 
installations.
   
   Commits
   
   - a1b79a6 Prepare release of 3.0.8. 
(https://github.com/cython/cython/commit/a1b79a6bc5326406ad73af73f5b41e3bb5f8da6e)
   - b9bfa7f Fix parsing of ptrdiff_t in PyrexTypes and add another "all types 
in Shadow.p... 
(https://github.com/cython/cython/commit/b9bfa7f0492f4f71af1f034822fd90dd4ed3638e)
   - f974ec1 Update changelog. 
(https://github.com/cython/cython/commit/f974ec15b643dfb6338c0aef90976424d5a6bd2c)
   - ffe6fa7 Avoid C99-isms. 
(https://github.com/cython/cython/commit/ffe6fa7fe47e6b5005c0aad7c6ae7144b3402f33)
   - 356495b Avoid C99-ism. 
(https://github.com/cython/cython/commit/356495be50773262b09e158115aba0afe167cb97)
   - b85be7e Avoid C99-ism. 
(https://github.com/cython/cython/commit/b85be7e838318d251fc3d3fbfdf1a1ecf5a515fd)
   - 30a6534 Fix some C99-isms in 3.0.x branch. 
(https://github.com/cython/cython/commit/30a6534a2279eddfd7a8b75de94897357eeaba2b)
   - 9866ce4 Avoid C99-ism. 
(https://github.com/cython/cython/commit/9866ce478fa76a07babff8a5d9d53a4030eaf9e7)
   - 6990d6e Use Py3.6 instead of Py3.7/8/9 for the C89 build since CPython 
switched to C9... 
(https://github.com/cython/cython/commit/6990d6e24f89949c4cf32d143d8c20e516d93756)
   - d3b92b0 Use Py3.7 instead of Py3.9 for the C89 build since CPython 
switched to C99 at... 
(https://github.com/cython/cython/commit/d3b92b0a4207ee070405d6f1ab8a75e95b53c56a)
   - Additional commits viewable in the compare view 
(https://github.com/cython/cython/compare/3.0.7...3.0.8)
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=cython&package-manager=pip&previous-version=3.0.7&new-version=3.0.8)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Hive memory issue with reading iceberg v2 from hive [iceberg]

2024-01-10 Thread via GitHub


github-actions[bot] commented on issue #6784:
URL: https://github.com/apache/iceberg/issues/6784#issuecomment-1885963748

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Support Delta name mapping to Iceberg conversion [iceberg]

2024-01-10 Thread via GitHub


github-actions[bot] closed issue #6768: Support Delta name mapping to Iceberg 
conversion
URL: https://github.com/apache/iceberg/issues/6768


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Support Delta name mapping to Iceberg conversion [iceberg]

2024-01-10 Thread via GitHub


github-actions[bot] commented on issue #6768:
URL: https://github.com/apache/iceberg/issues/6768#issuecomment-1885963774

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Hive memory issue with reading iceberg v2 from hive [iceberg]

2024-01-10 Thread via GitHub


github-actions[bot] closed issue #6784: Hive memory issue with reading iceberg 
v2 from hive 
URL: https://github.com/apache/iceberg/issues/6784


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Create JUnit5 version of TableTestBase [iceberg]

2024-01-10 Thread via GitHub


lisirrx commented on code in PR #9217:
URL: https://github.com/apache/iceberg/pull/9217#discussion_r1448172377


##
core/src/test/java/org/apache/iceberg/TestManifestReader.java:
##
@@ -32,17 +32,15 @@
 import org.apache.iceberg.types.Types;
 import org.assertj.core.api.Assertions;
 import 
org.assertj.core.api.recursive.comparison.RecursiveComparisonConfiguration;
-import org.junit.Assert;
-import org.junit.Assume;
-import org.junit.Test;
-import org.junit.runner.RunWith;
-import org.junit.runners.Parameterized;
-
-@RunWith(Parameterized.class)
-public class TestManifestReader extends TableTestBase {
-  @Parameterized.Parameters(name = "formatVersion = {0}")
-  public static Object[] parameters() {
-return new Object[] {1, 2};
+import org.junit.jupiter.api.Assumptions;
+import org.junit.jupiter.api.TestTemplate;
+import org.junit.jupiter.api.extension.ExtendWith;
+
+@ExtendWith(ParameterizedTestExtension.class)

Review Comment:
   @nastra I noticed you have merged #9424, so should this PR be closed?
   BTW, should #9085 be split into multiple tasks? I want to work on it, but 
there are too many tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Discussion: Rethink `PrimitiveLiteral`. [iceberg-rust]

2024-01-10 Thread via GitHub


liurenjie1024 commented on issue #159:
URL: https://github.com/apache/iceberg-rust/issues/159#issuecomment-1886064501

   > @liurenjie1024 mentioned error messages as another use cases. That's the 
only time that the i128 representation might not be suitable. The question is 
whether the error messages warrant a more complex implementation.
   > 
   > Regarding @Fokko's example: Doesn't initially storing the Decimal as a 
LiteralFloat lose accuracy because the 3.25 is stored as something like 
3.2499987. If you then convert it to Decimal, it's inaccurate. Maybe you 
could use PrimitiveLiteral::String here.
   
   The error message is just one example, not the only use case. For example, 
when we convert an unbound expression to a bound expression, how do we know its 
original scale?
   
   A string is enough for storing a decimal, or anything else, but it may be 
weird in the API, since with only a string we don't know the original type; 
e.g. a user may write an unbound expression like `a < "3.23"`, where `a` is a 
decimal and it's legal to compare it with a string.
   
   I do admit that introducing another enum would be difficult to maintain; 
maybe the solution suggested by @ZENOTME is the way to go:
   
   ```
   struct Datum {
 typ: PrimitiveType,
 literal: PrimitiveLiteral
   }
   ```
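
   To make the design point concrete, a rough Java analogue of the `Datum` idea 
(names are illustrative, not iceberg-rust's API): pairing the declared type 
with the raw literal preserves information, such as a decimal's scale, that the 
literal alone loses:
   
   ```java
   import java.math.BigDecimal;
   
   public class Datum {
     enum PrimitiveType { INT, LONG, STRING, DECIMAL }
   
     private final PrimitiveType type;
     private final Object literal; // e.g. unscaled long for DECIMAL
     private final int scale;      // meaningful only for DECIMAL
   
     private Datum(PrimitiveType type, Object literal, int scale) {
       this.type = type;
       this.literal = literal;
       this.scale = scale;
     }
   
     public static Datum decimal(BigDecimal value) {
       // 3.25 is kept exactly as unscaled 325 with scale 2
       return new Datum(
           PrimitiveType.DECIMAL, value.unscaledValue().longValueExact(), value.scale());
     }
   
     @Override
     public String toString() {
       if (type == PrimitiveType.DECIMAL) {
         return BigDecimal.valueOf((Long) literal, scale).toPlainString();
       }
       return String.valueOf(literal);
     }
   }
   ```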
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] #154 : Add homepage to Cargo.toml [iceberg-rust]

2024-01-10 Thread via GitHub


liurenjie1024 commented on PR #160:
URL: https://github.com/apache/iceberg-rust/pull/160#issuecomment-1886079899

   cc @Fokko PTAL


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] #154 : Add homepage to Cargo.toml [iceberg-rust]

2024-01-10 Thread via GitHub


Fokko commented on PR #160:
URL: https://github.com/apache/iceberg-rust/pull/160#issuecomment-1886115997

   Looks like there are conflicts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] #154 : Add homepage to Cargo.toml [iceberg-rust]

2024-01-10 Thread via GitHub


liurenjie1024 commented on PR #160:
URL: https://github.com/apache/iceberg-rust/pull/160#issuecomment-1886155116

   cc @hiirrxnn Could you rebase with main branch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark: Fix SparkTable to use name and effective snapshotID for comparing [iceberg]

2024-01-10 Thread via GitHub


wooyeong commented on PR #9455:
URL: https://github.com/apache/iceberg/pull/9455#issuecomment-1886170959

   I need your opinion regarding [failed 
tests](https://github.com/apache/iceberg/actions/runs/7475896517/job/20353547835?pr=9455).
   
   Previously SparkTable used only `name` intentionally according to [the 
original code's 
comment](https://github.com/apache/iceberg/blob/53a1c8671dd1b9b93f4a857230008c812d79ddbf/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java#L408),
 to invalidate cache when needed such as 
[`rollback_to_snapshot`](https://iceberg.apache.org/docs/1.3.1/spark-procedures/#rollback_to_snapshot)
 (_This procedure invalidates all cached Spark plans that reference the 
affected table._).
   
   However, since this patch introduced effectiveSnapshotId, the cache can't be 
invalidated in this case because the altered table will have another snapshot 
id.
   
   - Cache table at snapshot 1
   - Alter table, now it has snapshot 2
   - Try to invalidate the cache for the table, but the plans have different 
snapshot ids (cached plan's id: 1 != new plan's id: 2), so the cache cannot be 
invalidated.
   
   Therefore, I think it's better to use my previous approach, checking `name`, 
`branch`, and `snapshotId` altogether.
   
   Whenever `snapshotId` or `branch` is supplied, it points to a fixed snapshot 
of the table, so we don't need to invalidate the cache even if the table is 
modified. In the general case, it acts as before: we can invalidate the cache 
when necessary.
   
   The downside of this approach is that we may lose optimization 
opportunities. The optimizer doesn't know that both sub-queries are the same 
thing, so it will fetch both of them, but that **does not affect** the query 
result.
   
   ```sql
   SELECT * FROM iceberg_except_test
   UNION
   SELECT * FROM iceberg_except_test
   VERSION AS OF '2024-01-01';
   
   -- optimized plan will be
   == Optimized Logical Plan ==
   Aggregate [id#30, a#31, b#32], [id#30, a#31, b#32]
   +- Union false, false
  :- RelationV2[id#30, a#31, b#32] local.iceberg_except_test
  +- RelationV2[id#33, a#34, b#35] local.iceberg_except_test
   
   -- but the result will be
   > 1   b   2024-01-01 00:00:00
   ```
   
   What do you think?
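
   A minimal sketch of the name/branch/snapshotId comparison proposed above, 
using a hypothetical TableRef stand-in rather than the real SparkTable:
   
   ```java
   import java.util.Objects;
   
   public class TableRef {
     private final String name;
     private final String branch;   // null when reading the main branch
     private final Long snapshotId; // null when no snapshot is pinned
   
     public TableRef(String name, String branch, Long snapshotId) {
       this.name = name;
       this.branch = branch;
       this.snapshotId = snapshotId;
     }
   
     @Override
     public boolean equals(Object other) {
       if (this == other) {
         return true;
       }
       if (!(other instanceof TableRef)) {
         return false;
       }
       TableRef that = (TableRef) other;
       // equal only when all three identity fields match
       return name.equals(that.name)
           && Objects.equals(branch, that.branch)
           && Objects.equals(snapshotId, that.snapshotId);
     }
   
     @Override
     public int hashCode() {
       return Objects.hash(name, branch, snapshotId);
     }
   }
   ```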


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] #154 : Add homepage to Cargo.toml [iceberg-rust]

2024-01-10 Thread via GitHub


hiirrxnn commented on PR #160:
URL: https://github.com/apache/iceberg-rust/pull/160#issuecomment-1886208226

   But the PR against the main branch has been closed. Should I do it anyway?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] #154 : Add homepage to Cargo.toml [iceberg-rust]

2024-01-10 Thread via GitHub


hiirrxnn commented on PR #160:
URL: https://github.com/apache/iceberg-rust/pull/160#issuecomment-1886231297

   Also, the feature branch already has all the changes from main, since it was built on top of it. Or should I rebase main onto feature?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Flink: Migrate subclasses of FlinkCatalogTestBase to JUnit5 [iceberg]

2024-01-10 Thread via GitHub


vinitpatni commented on PR #9381:
URL: https://github.com/apache/iceberg/pull/9381#issuecomment-1886232790

   > LGTM, thanks @vinitpatni. Could you also please remove 
`FlinkCatalogTestBase` as part of this PR as I don't think it's used anymore, 
since everything was converted.
   
   Ack. Removed `FlinkCatalogTestBase` and its references.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Migrate remaining tests in source directory to JUnit5 [iceberg]

2024-01-10 Thread via GitHub


chinmay-bhat commented on code in PR #9380:
URL: https://github.com/apache/iceberg/pull/9380#discussion_r1448306197


##
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSnapshotSelection.java:
##
@@ -44,57 +48,58 @@
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.RowFactory;
 import org.apache.spark.sql.SparkSession;
-import org.assertj.core.api.Assertions;
-import org.junit.AfterClass;
-import org.junit.Assert;
-import org.junit.BeforeClass;
-import org.junit.Rule;
-import org.junit.Test;
-import org.junit.rules.TemporaryFolder;
-import org.junit.runner.RunWith;
-import org.junit.runners.Parameterized;
-
-@RunWith(Parameterized.class)
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.TestTemplate;
+import org.junit.jupiter.api.extension.ExtendWith;
+import org.junit.jupiter.api.io.TempDir;
+
+@ExtendWith(ParameterizedTestExtension.class)
 public class TestSnapshotSelection {
 
-  @Parameterized.Parameters(name = "planningMode = {0}")
+  @Parameters(name = "properties = {0}")
   public static Object[] parameters() {
-return new Object[] {LOCAL, DISTRIBUTED};
+return new Object[][] {
+  {
+ImmutableMap.of(
+TableProperties.DATA_PLANNING_MODE, LOCAL.modeName(),
+TableProperties.DELETE_PLANNING_MODE, LOCAL.modeName())
+  },
+  {
+ImmutableMap.of(
+TableProperties.DATA_PLANNING_MODE, DISTRIBUTED.modeName(),
+TableProperties.DELETE_PLANNING_MODE, DISTRIBUTED.modeName())
+  }
+};
   }
 
   private static final Configuration CONF = new Configuration();
   private static final Schema SCHEMA =
   new Schema(
   optional(1, "id", Types.IntegerType.get()), optional(2, "data", 
Types.StringType.get()));
 
-  @Rule public TemporaryFolder temp = new TemporaryFolder();
+  @TempDir private Path temp;
 
   private static SparkSession spark = null;
 
-  private final Map<String, String> properties;
-
-  public TestSnapshotSelection(PlanningMode planningMode) {
-this.properties =
-ImmutableMap.of(
-TableProperties.DATA_PLANNING_MODE, planningMode.modeName(),
-TableProperties.DELETE_PLANNING_MODE, planningMode.modeName());
-  }
+  @Parameter(index = 0)
+  private Map<String, String> properties;
 
-  @BeforeClass
-  public static void startSpark() {
+  @BeforeEach
+  public void startSpark() {

Review Comment:
   No particular reason; changed it to `@BeforeAll` and `@AfterAll`.
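   
   For reference, a minimal sketch of the static lifecycle that implies (illustrative, not the exact PR code):
   
   ```java
   // JUnit 5 requires @BeforeAll/@AfterAll methods to be static under the
   // default per-method test instance lifecycle.
   @BeforeAll
   public static void startSpark() {
     spark = SparkSession.builder().master("local[2]").getOrCreate();
   }
   
   @AfterAll
   public static void stopSpark() {
     spark.stop();
     spark = null;
   }
   ```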



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Migrate remaining tests in source directory to JUnit5 [iceberg]

2024-01-10 Thread via GitHub


chinmay-bhat commented on code in PR #9380:
URL: https://github.com/apache/iceberg/pull/9380#discussion_r1448306351


##
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestPositionDeletesTable.java:
##
@@ -123,15 +126,9 @@ public static Object[][] parameters() {
 };
   }
 
-  public TestPositionDeletesTable(
-      String catalogName, String implementation, Map<String, String> config, FileFormat format) {
-super(catalogName, implementation, config);
-this.format = format;
-  }
-
-  @Rule public TemporaryFolder temp = new TemporaryFolder();
+  @TempDir private Path temp;

Review Comment:
   updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]

2024-01-10 Thread via GitHub


chinmay-bhat commented on code in PR #9416:
URL: https://github.com/apache/iceberg/pull/9416#discussion_r1448313906


##
core/src/test/java/org/apache/iceberg/TestBase.java:
##
@@ -173,7 +173,7 @@ public class TestBase {
   public TestTables.TestTable table = null;
 
   @Parameters(name = "formatVersion = {0}")
-  protected static List<Object[]> parameters() {
+  protected static List<Object> parameters() {

Review Comment:
   Because for some tests we only send one plain `Object` per entry in the list, which is satisfied by `List<Object>`. For example, from `DeleteFileIndexTestBase`:
   ```
   @Parameters(name = "formatVersion = {0}")
   public static List<Object> parameters() {
     return Arrays.asList(new Object[] {2});
   }
   ```
   
   Having `List<Object[]>` does not allow for the above case, as it always expects a list of `Object` arrays where the array's size can't be 1.
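   
   For contrast, a sketch of both shapes, each of which satisfies `List<Object>` (illustrative values; the two methods would live in different test classes):
   
   ```java
   // One parameter per run: each list element is the value itself.
   @Parameters(name = "formatVersion = {0}")
   public static List<Object> parameters() {
     return Arrays.asList(1, 2);
   }
   
   // Several parameters per run: each element is an Object[] whose entries
   // bind to @Parameter(index = 0), @Parameter(index = 1), ...
   @Parameters(name = "catalogName = {0}, formatVersion = {1}")
   public static List<Object> parameters() {
     return Arrays.asList(
         new Object[] {"testhive", 2},
         new Object[] {"testhadoop", 2});
   }
   ```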



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Migrate tests that depend on SparkDistributedDataScanTestBase to JUnit5 [iceberg]

2024-01-10 Thread via GitHub


chinmay-bhat commented on code in PR #9416:
URL: https://github.com/apache/iceberg/pull/9416#discussion_r1448315751


##
core/src/test/java/org/apache/iceberg/FilterFilesTestBase.java:
##
@@ -24,67 +24,67 @@
 import java.io.File;
 import java.io.IOException;
 import java.nio.ByteBuffer;
+import java.nio.file.Files;
+import java.nio.file.Path;
 import java.util.Map;
 import org.apache.iceberg.expressions.Expressions;
 import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
 import org.apache.iceberg.relocated.com.google.common.collect.Maps;
 import org.apache.iceberg.types.Conversions;
 import org.apache.iceberg.types.Types;
-import org.junit.After;
-import org.junit.Before;
-import org.junit.Rule;
-import org.junit.Test;
-import org.junit.rules.TemporaryFolder;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.TestTemplate;
+import org.junit.jupiter.api.extension.ExtendWith;
+import org.junit.jupiter.api.io.TempDir;
 
+@ExtendWith(ParameterizedTestExtension.class)
 public abstract class FilterFilesTestBase<
 ScanT extends Scan<ScanT, T, G>, T extends ScanTask, G extends ScanTaskGroup<T>> {
 
-  public final int formatVersion;
-
-  public FilterFilesTestBase(int formatVersion) {
-this.formatVersion = formatVersion;
-  }
+  @Parameter(index = 0)
+  public int formatVersion;

Review Comment:
   It can be switched to `protected`; updating.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] #154 : Add homepage to Cargo.toml [iceberg-rust]

2024-01-10 Thread via GitHub


liurenjie1024 commented on PR #160:
URL: https://github.com/apache/iceberg-rust/pull/160#issuecomment-1886347244

   > Also , feature anyway has all the changes of main since it has been built 
on top of it , or should i rebase main with feature?
   
   Hi, @hiirrxnn You can learn how to resolve conflicts here: 
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/addressing-merge-conflicts/resolving-a-merge-conflict-using-the-command-line


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2024-01-10 Thread via GitHub


nk1506 commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1448339394


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java:
##
@@ -142,9 +143,15 @@ public FileIO io() {
   @Override
   protected void doRefresh() {
 String metadataLocation = null;
+Table table = null;
+
 try {
-  Table table = metaClients.run(client -> client.getTable(database, 
tableName));
-  HiveOperationsBase.validateTableIsIceberg(table, fullName);
+  table = metaClients.run(client -> client.getTable(database, tableName));
+  HiveOperationsBase.validateTableOrViewIsIceberg(table, fullName);
+
+  if 
(table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) {

Review Comment:
   I have separated the Iceberg [table](https://github.com/apache/iceberg/pull/8907/files#diff-e502621d52f86cf0ec3187dda30ac61f6b76efb7b6276bc8d233ccb2c836fb98R151) check and the Iceberg [view](https://github.com/apache/iceberg/pull/8907/files#diff-db46657b84d66e084e15f31b8dab21577efb2ae7102863f94c6c9477782de676R83) check here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2024-01-10 Thread via GitHub


nk1506 commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1448341650


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java:
##
@@ -0,0 +1,287 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.util.Collections;
+import java.util.Locale;
+import java.util.Objects;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.IMetaStoreClient;
+import org.apache.hadoop.hive.metastore.TableType;
+import org.apache.hadoop.hive.metastore.api.InvalidObjectException;
+import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.BaseMetastoreTableOperations;
+import org.apache.iceberg.ClientPool;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.CommitFailedException;
+import org.apache.iceberg.exceptions.CommitStateUnknownException;
+import org.apache.iceberg.exceptions.NoSuchIcebergViewException;
+import org.apache.iceberg.exceptions.NoSuchViewException;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.util.MetastoreOperationsUtil;
+import org.apache.iceberg.view.BaseViewOperations;
+import org.apache.iceberg.view.ViewMetadata;
+import org.apache.thrift.TException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/** Hive implementation of Iceberg ViewOperations. */
+final class HiveViewOperations extends BaseViewOperations implements 
HiveOperationsBase {
+  private static final Logger LOG = 
LoggerFactory.getLogger(HiveViewOperations.class);
+
+  private final String fullName;
+  private final String database;
+  private final String viewName;
+  private final FileIO fileIO;
+  private final ClientPool<IMetaStoreClient, TException> metaClients;
+  private final long maxHiveTablePropertySize;
+
+  HiveViewOperations(
+  Configuration conf,
+      ClientPool<IMetaStoreClient, TException> metaClients,
+  FileIO fileIO,
+  String catalogName,
+  TableIdentifier viewIdentifier) {
+String dbName = viewIdentifier.namespace().level(0);
+this.metaClients = metaClients;
+this.fileIO = fileIO;
+this.fullName = BaseMetastoreCatalog.fullTableName(catalogName, 
viewIdentifier);
+this.database = dbName;
+this.viewName = viewIdentifier.name();
+this.maxHiveTablePropertySize =
+conf.getLong(HIVE_TABLE_PROPERTY_MAX_SIZE, 
HIVE_TABLE_PROPERTY_MAX_SIZE_DEFAULT);
+  }
+
+  @Override
+  public void doRefresh() {
+String metadataLocation = null;
+Table table = null;
+
+try {
+  table = metaClients.run(client -> client.getTable(database, viewName));
+  HiveOperationsBase.validateTableOrViewIsIceberg(table, fullName);
+  metadataLocation =
+  
table.getParameters().get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP);
+
+} catch (NoSuchObjectException e) {
+  if (currentMetadataLocation() != null) {
+throw new NoSuchViewException("View does not exist: %s.%s", database, 
viewName);
+  }
+} catch (TException e) {
+  String errMsg =
+  String.format("Failed to get view info from metastore %s.%s", 
database, viewName);
+  throw new RuntimeException(errMsg, e);
+} catch (InterruptedException e) {
+  Thread.currentThread().interrupt();
+  throw new RuntimeException("Interrupted during refresh", e);
+}
+
+if (table != null && 
!table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) {
+  disableRefresh();
+} else {
+  refreshFromMetadataLocation(metadataLocation);
+}
+  }
+
+  @SuppressWarnings("checkstyle:CyclomaticComplexity")
+  @Override
+  public void doCommit(ViewMetadata base, ViewMetadata metadata) {

Review Comment:
   I thought adding a lock for view creation is expensive and unnecessary, since a view is more like a read-only definition: with views we are not writing any data.
   Please share your thoughts.
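   
   For illustration, a minimal sketch of the lock-free, compare-and-swap style commit this implies (a hypothetical shape, not the PR code; `newMetadataLocation` is the freshly written metadata file, and `Objects` is `java.util.Objects`):
   
   ```java
   // Optimistic commit: only swap in the new metadata location if the metastore
   // still points at the base location this commit started from.
   Table hmsView = metaClients.run(client -> client.getTable(database, viewName));
   String currentLocation =
       hmsView.getParameters().get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP);
   String baseLocation = base != null ? base.metadataFileLocation() : null;
   
   if (!Objects.equals(baseLocation, currentLocation)) {
     throw new CommitFailedException(
         "Base metadata location '%s' is not the current location '%s' for %s.%s",
         baseLocation, currentLocation, database, viewName);
   }
   
   hmsView.getParameters()
       .put(BaseMetastoreTableOperations.METADATA_LOCATION_PROP, newMetadataLocation);
   hmsView.getParameters()
       .put(BaseMetastoreTableOperations.PREVIOUS_METADATA_LOCATION_PROP, currentLocation);
   metaClients.run(
       client -> {
         client.alter_table(database, viewName, hmsView);
         return null;
       });
   ```
   
   The window between the check and `alter_table` is what a metastore lock would close; without it, two concurrent view commits could both pass the check, so this trades strictness for cheaper commits.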



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Spark: Fix SparkTable to use name and effective snapshotID for comparing [iceberg]

2024-01-10 Thread via GitHub


ajantha-bhat commented on PR #9455:
URL: https://github.com/apache/iceberg/pull/9455#issuecomment-1886371354

   Yeah, I need to think a bit more. Looks like just using `lazyFixedSnapshotId` in `equals` and `hashCode` is enough, instead of `effectiveSnapshotId` (which uses the current snapshot).
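   
   A rough sketch of that idea (names are hypothetical; `branchSnapshotId` stands in for however the branch tip would be resolved):
   
   ```java
   // Only a snapshot the user explicitly pinned (VERSION AS OF or a branch)
   // participates in equality; an unpinned table yields null, so advancing the
   // current snapshot never breaks cache invalidation by name.
   private Long lazyFixedSnapshotId() {
     if (snapshotId != null) {
       return snapshotId;                // explicit snapshot pin
     } else if (branch != null) {
       return branchSnapshotId(branch);  // hypothetical branch-tip lookup
     }
     return null;                        // unpinned table
   }
   ```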


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Data: Errors in some file readers do not report the file in which they failed [iceberg]

2024-01-10 Thread via GitHub


yyy1000 commented on issue #9458:
URL: https://github.com/apache/iceberg/issues/9458#issuecomment-1886386123

   I'd like to help with it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] init writer framework [iceberg-rust]

2024-01-10 Thread via GitHub


liurenjie1024 commented on code in PR #135:
URL: https://github.com/apache/iceberg-rust/pull/135#discussion_r1448357734


##
crates/iceberg/src/writer/file_writer/mod.rs:
##
@@ -0,0 +1,51 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Iceberg File Writer
+
+use super::{CurrentFileStatus, IcebergWriteResult};
+use crate::Result;
+use arrow_array::RecordBatch;
+use arrow_schema::SchemaRef;
+
+/// File writer builder trait.
+#[async_trait::async_trait]
+pub trait FileWriterBuilder: Send + Clone + 'static {
+/// The associated file writer type.
+type R: FileWriter;
+/// Build file writer.
+async fn build(self, schema: &SchemaRef) -> Result<Self::R>;
+}
+
+/// File writer focused on writing record batches to different physical file formats (such as Parquet or ORC).
+#[async_trait::async_trait]
+pub trait FileWriter: Send + 'static + CurrentFileStatus {
+/// The associated file write result type.
+type R: FileWriteResult;
+/// Write record batch to file.
+async fn write(&mut self, batch: &RecordBatch) -> Result<()>;
+/// Close file writer.
+async fn close(self) -> Result<Vec<Self::R>>;
+}
+
+/// File write result.
+pub trait FileWriteResult: Send + 'static {
+/// The associated iceberg write result type.
+type R: IcebergWriteResult;
+/// Convert to iceberg write result.
+fn to_iceberg_result(self) -> Self::R;

Review Comment:
   Yes, I also think the `FileWriteResult`/`IcebergWriterResult` pair is a little too complicated. What `FileWriter` returns is just a partial `DataFile`, so a `DataFileBuilder` would be enough.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] If iceberg's parquet data file contains an array of time type, it cannot be read by spark library even after dropping this column [iceberg]

2024-01-10 Thread via GitHub


ajantha-bhat commented on issue #9446:
URL: https://github.com/apache/iceberg/issues/9446#issuecomment-1886396291

   > Hi, if someone could guide me on some information about the history of 
batchedReaderFunc vs ReaderFunc and some related testing code path, happy to 
work on the fix for that.
   
   I haven't worked on that area before, so I'd wait for others' answers.
   From a high-level read of the code, `batchedReaderFunc` uses a `VectorizedReader` to read a batch (a few rows) of vectors (each column is its own vector), whereas `ReaderFunc` looks like a plain reader that materializes one record at a time; see the sketch below.
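   
   For orientation, a rough sketch of how the two hooks are wired on `Parquet.ReadBuilder` (`GenericParquetReaders.buildReader` is the generic row reader factory; `buildVectorizedReader` is a hypothetical stand-in for an engine-specific factory such as the Spark module's vectorized reader builder):
   
   ```java
   // Row-oriented path: createReaderFunc installs a ParquetValueReader that
   // materializes one record at a time.
   CloseableIterable<Record> rows =
       Parquet.read(inputFile)
           .project(schema)
           .createReaderFunc(fileSchema -> GenericParquetReaders.buildReader(schema, fileSchema))
           .build();
   
   // Vectorized path: createBatchedReaderFunc installs a VectorizedReader that
   // returns batches, one value vector per column.
   CloseableIterable<ColumnarBatch> batches =
       Parquet.read(inputFile)
           .project(schema)
           .createBatchedReaderFunc(fileSchema -> buildVectorizedReader(schema, fileSchema))
           .build();
   ```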


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] If iceberg's parquet data file contains an array of time type, it cannot be read by spark library even after dropping this column [iceberg]

2024-01-10 Thread via GitHub


ajantha-bhat commented on issue #9446:
URL: https://github.com/apache/iceberg/issues/9446#issuecomment-1886398118

   If you look at the callers of `Parquet.ReadBuilder.createReaderFunc`, you can find the test cases.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Caused by: java.net.SocketException: Connection reset [iceberg]

2024-01-10 Thread via GitHub


javrasya commented on issue #9444:
URL: https://github.com/apache/iceberg/issues/9444#issuecomment-1886422859

   This happens more often when the consumption rate is high, e.g. when replaying historical messages. When I run it in unbounded streaming mode with the `INCREMENTAL_FROM_EARLIEST_SNAPSHOT` starting strategy instead of batch mode, the consumption rate drops inherently and the error occurs far less often: my app fails but recovers, continues, and reaches the end, very slowly, but I'll take it.
   
   Could this be happening because S3 is throttling, or something similar? Has anyone else observed anything like this before?
   
   **Note:** The upstream is committing every minute, which means we get a new snapshot every minute. That can also lead to too many small files, so this service, which hits the error in the original post, might need to pull too many files and eventually run into that connection reset. This is just a theory; I couldn't verify it.
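   
   For context, a rough sketch of how that strategy is wired up with the Flink `IcebergSource` builder (method names as I recall them from the iceberg-flink module; treat the exact wiring as an assumption):
   
   ```java
   // Unbounded read: start from the earliest snapshot, then follow new commits
   // incrementally instead of scanning the whole table in one batch.
   IcebergSource<RowData> source =
       IcebergSource.forRowData()
           .tableLoader(tableLoader)
           .streaming(true)
           .streamingStartingStrategy(StreamingStartingStrategy.INCREMENTAL_FROM_EARLIEST_SNAPSHOT)
           .monitorInterval(Duration.ofSeconds(60)) // how often to poll for new snapshots
           .build();
   ```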


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org