Re: [I] EMR 6.10.0 Cannot migrate a table from a non-Iceberg Spark Session Catalog. Found spark_catalog [iceberg]

2023-10-27 Thread via GitHub


tomtongue commented on issue #7317:
URL: https://github.com/apache/iceberg/issues/7317#issuecomment-1782418606

   Sorry for jumping in. I investigated the `migrate` query issue for GlueCatalog, so let me share my findings.
   
   ## Result
   Currently, it’s NOT possible to run the `migrate` query for Spark/Hive tables in Glue Data Catalog, because the GlueCatalog client doesn’t currently support renaming tables.
   Let me elaborate below. If I’m wrong, please correct me.
   
   ## Details
   When running the `migrate` query for a Spark/Hive table in Glue Data Catalog, as described above, the `SparkSessionCatalog` configuration should be specified, e.g. `.config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")`.
   
   In this case, the source table in the `migrate` query (e.g. `table => '$db.$table'`) always resolves to `spark_catalog` (if another catalog is specified, the Spark application will fail).
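   For reference, a minimal sketch of a session configured this way might look like the following (the Glue `catalog-impl`, the warehouse path, and the table name are illustrative assumptions, not taken from the report above):
   
   ```java
   // Sketch only: a SparkSession where spark_catalog wraps an Iceberg GlueCatalog.
   // The warehouse path, app name, and table name are placeholders.
   import org.apache.spark.sql.SparkSession;
   
   public class MigrateExample {
     public static void main(String[] args) {
       SparkSession spark =
           SparkSession.builder()
               .appName("iceberg-migrate-example")
               .config("spark.sql.extensions",
                   "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
               .config("spark.sql.catalog.spark_catalog",
                   "org.apache.iceberg.spark.SparkSessionCatalog")
               .config("spark.sql.catalog.spark_catalog.catalog-impl",
                   "org.apache.iceberg.aws.glue.GlueCatalog")
               .config("spark.sql.catalog.spark_catalog.warehouse",
                   "s3://my-bucket/warehouse")
               .getOrCreate();
   
       // The source table always resolves against spark_catalog.
       spark.sql("CALL spark_catalog.system.migrate('db.tbl')").show();
     }
   }
   ```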
   
   Given this, in the current design of `migrate`, the code path always goes through [`SparkSessionCatalog.renameTable`](https://github.com/apache/iceberg/blob/apache-iceberg-1.4.1/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSessionCatalog.java#L293), because by specification `migrate` creates a staging table, renames the source table to keep it as a backup, and then migrates the source table to Iceberg. After the migration, the backup table is dropped or kept based on the `drop_backup` parameter. In the phase that renames the source table to create the backup, `SparkSessionCatalog.renameTable` is called.
   
   `SparkSessionCatalog.renameTable` can delegate to the Iceberg catalog to rename the table in GlueCatalog: the method checks the source table, and if it is an Iceberg table, it calls `IcebergCatalog.renameTable` (GlueCatalogImpl is configured here, so its `renameTable` would be used). However, in this case the source table always belongs to `spark_catalog`, so the code path always goes to `getSessionCatalog().renameTable`, as follows:
   
   
https://github.com/apache/iceberg/blob/apache-iceberg-1.4.1/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSessionCatalog.java#L293
   ```java
   @Override
   public void renameTable(Identifier from, Identifier to)
       throws NoSuchTableException, TableAlreadyExistsException {
     // rename is not supported by HadoopCatalog. to avoid UnsupportedOperationException for session
     // catalog tables,
     // check table existence first to ensure that the table belongs to the Iceberg catalog.
     if (icebergCatalog.tableExists(from)) {
       icebergCatalog.renameTable(from, to);
     } else {
       getSessionCatalog().renameTable(from, to);  // <= THIS PATH
     }
   }
   ```
   
   `getSessionCatalog().renameTable` calls Hive APIs for the table in Glue Data Catalog, so it fails because the rename cannot be performed there.
   
   
   Here’s the detailed call flow (in Iceberg 1.4.1 with Spark 3.5):
   1. https://github.com/apache/iceberg/blob/apache-iceberg-1.4.1/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/MigrateTableProcedure.java#L76 -> calls `MigrateTableSparkAction`.
   2. https://github.com/apache/iceberg/blob/apache-iceberg-1.4.1/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/MigrateTableSparkAction.java#L118 -> the actual migration implementation:
      1. `renameAndBackupSourceTable()` is called to keep the backup.
      2. https://github.com/apache/iceberg/blob/apache-iceberg-1.4.1/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/MigrateTableSparkAction.java#L209 -> in `renameAndBackupSourceTable`, `destCatalog().renameTable(...)` is called. But `destCatalog()` is defined by `this.destCatalog = checkDestinationCatalog(sourceCatalog);` at L66 in the same class.
   3. https://github.com/apache/iceberg/blob/apache-iceberg-1.4.1/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSessionCatalog.java#L293 -> the `destCatalog` is `SparkSessionCatalog` per step 2, so `getSessionCatalog().renameTable` is called.
   
   ## Resolution
   If GlueCatalog’s `renameTable` could be used to create the backup table, it would be possible to run `migrate`. To resolve this, for example, a new option could be added to specify the destination catalog.
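   For illustration, with the `dest_catalog_name` option proposed in PR #8931 (a sketch of the proposal, not a released API; catalog and table names are placeholders), the call could look like this, continuing the session sketch above:
   
   ```java
   // Sketch only: dest_catalog_name is the option proposed in PR #8931 and is
   // not part of a released Iceberg version. Names are placeholders.
   spark.sql(
       "CALL spark_catalog.system.migrate("
           + "table => 'db.tbl', "
           + "drop_backup => false, "
           + "dest_catalog_name => 'glue_catalog')");
   ```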





Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]

2023-10-27 Thread via GitHub


ajantha-bhat commented on code in PR #8918:
URL: https://github.com/apache/iceberg/pull/8918#discussion_r137451


##
hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java:
##
@@ -106,12 +152,12 @@ private Schema getTestSchema() {
   public void testCreateTableBuilder() throws Exception {
 Schema schema = getTestSchema();
 PartitionSpec spec = PartitionSpec.builderFor(schema).bucket("data", 
16).build();
-TableIdentifier tableIdent = TableIdentifier.of(DB_NAME, "tbl");
+TableIdentifier tableIdent = TableIdentifier.of(HIVE_DB_NAME, "tbl");
 String location = temp.resolve("tbl").toString();
 
 try {
   Table table =
-  catalog
+  catalog()

Review Comment:
   If you keep a local variable `HiveCatalog catalog` and initialize it in `before()`, you don't have to change it to a method call everywhere.
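   A minimal sketch of that suggestion (assuming JUnit 5; `newHiveCatalog()` is a hypothetical factory standing in for the real setup code):
   
   ```java
   import org.junit.jupiter.api.BeforeEach;
   
   public class TestHiveCatalog extends CatalogTests<HiveCatalog> {
   
     // A field keeps existing `catalog` references compiling unchanged.
     private HiveCatalog catalog;
   
     @BeforeEach
     public void before() {
       this.catalog = newHiveCatalog(); // hypothetical setup helper
     }
   
     @Override
     protected HiveCatalog catalog() {
       return catalog; // CatalogTests still accesses it through the hook
     }
   }
   ```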



##
hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java:
##
@@ -82,12 +83,20 @@
 import org.apache.iceberg.types.Types;
 import org.apache.iceberg.util.JsonUtil;
 import org.apache.thrift.TException;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;
 import org.junit.jupiter.api.io.TempDir;
 import org.junit.jupiter.params.ParameterizedTest;
 import org.junit.jupiter.params.provider.ValueSource;
 
-public class TestHiveCatalog extends HiveMetastoreTest {
+/**
+ * Run all the tests from abstract of {@link CatalogTests}. Also, a few 
specific tests for HIVE too.
+ * There could be some duplicated tests that are already being covered with 
{@link CatalogTests}
+ * //TODO: remove duplicate tests with {@link CatalogTests}.Also use the 
DB/TABLE/SCHEMA from {@link

Review Comment:
   Can you please check and remove them in this PR itself? I don't think it should be a separate change.



##
hive-metastore/src/test/java/org/apache/iceberg/hive/HiveMetastoreSetup.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
+import org.apache.iceberg.CatalogProperties;
+import org.apache.iceberg.CatalogUtil;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+
+/**
+ * Setup HiveMetastore. It does not create any database. All the tests should 
create a database
+ * accordingly. it should replace the existing setUp class {@link 
HiveMetastoreTest}
+ */
+class HiveMetastoreSetup {

Review Comment:
   We cannot deprecate `HiveMetastoreTest`, as it is used by many other classes. I am not sure what the problem with that class is. The methods we want are static, so we can call those methods in `beforeAll` and `afterAll`, right? I still think there is no need for this class.
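   A sketch of that alternative (assuming the static start/stop helpers on `HiveMetastoreTest` can be reused; the exact method names here are assumptions):
   
   ```java
   import org.junit.jupiter.api.AfterAll;
   import org.junit.jupiter.api.BeforeAll;
   
   public class TestHiveCatalog extends CatalogTests<HiveCatalog> {
   
     @BeforeAll
     public static void startMetastore() throws Exception {
       HiveMetastoreTest.startMetastore(); // assumed existing static helper
     }
   
     @AfterAll
     public static void stopMetastore() throws Exception {
       HiveMetastoreTest.stopMetastore(); // assumed existing static helper
     }
   }
   ```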



##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
##
@@ -261,6 +261,12 @@ public void renameTable(TableIdentifier from, 
TableIdentifier originalTo) {
 } catch (InterruptedException e) {
   Thread.currentThread().interrupt();
   throw new RuntimeException("Interrupted in call to rename", e);
+} catch (RuntimeException e) {

Review Comment:
   So, this change is needed now because `CatalogTests` is expecting that 
particular type of message? 






Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-10-27 Thread via GitHub


ajantha-bhat commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1374227697


##
hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveViewCatalog.java:
##
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import static org.assertj.core.api.Assertions.assertThat;
+import static org.assertj.core.api.Assertions.assertThatThrownBy;
+
+import java.util.Collections;
+import org.apache.iceberg.Transaction;
+import org.apache.iceberg.catalog.Catalog;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.view.ViewCatalogTests;
+import org.assertj.core.api.Assumptions;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+public class TestHiveViewCatalog extends ViewCatalogTests<HiveCatalog> {
+
+  private HiveMetastoreSetup hiveMetastoreSetup;
+
+  @BeforeEach
+  public void before() throws Exception {
+hiveMetastoreSetup = new HiveMetastoreSetup(Collections.emptyMap());
+  }
+
+  @AfterEach
+  public void after() throws Exception {
+hiveMetastoreSetup.stopMetastore();
+  }
+
+  @Override
+  protected HiveCatalog catalog() {
+return hiveMetastoreSetup.catalog;
+  }
+
+  @Override
+  protected Catalog tableCatalog() {
+return hiveMetastoreSetup.catalog;
+  }
+
+  @Override
+  protected boolean requiresNamespaceCreate() {
+return true;
+  }
+
+  // Override few tests which are using AlreadyExistsException instead of 
NoSuchViewException

Review Comment:
   I have some of the same problems for the Nessie catalog extending `ViewCatalogTests`. Should we change `ViewCatalogTests` to accept multiple options like I did for Nessie (https://github.com/apache/iceberg/pull/8909/files#diff-9c8b69f86aec86aea82cd6675c1c24267d9a9d77d7715c2c792934cb954b328bR403-R409), or generalize the code to throw the expected exception?
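   For the "generalize" option, one possible shape is an overridable hook in `ViewCatalogTests`, sketched below (the hook is hypothetical, not an existing method):
   
   ```java
   // Hypothetical hook: subclasses declare which exception their catalog
   // throws in these scenarios, so the shared tests can assert against it.
   protected Class<? extends RuntimeException> expectedViewConflictException() {
     return NoSuchViewException.class; // default; Hive/Nessie tests could override
   }
   ```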



##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
##
@@ -264,6 +279,138 @@ public void renameTable(TableIdentifier from, 
TableIdentifier originalTo) {
 }
   }
 
+  @Override
+  public boolean viewExists(TableIdentifier identifier) {
+return HiveCatalogUtil.isTableWithTypeExists(clients, identifier, 
TableType.VIRTUAL_VIEW);
+  }
+
+  @Override
+  public boolean dropView(TableIdentifier identifier) {
+if (!isValidIdentifier(identifier)) {
+  return false;
+}
+try {
+  String database = identifier.namespace().level(0);
+  String viewName = identifier.name();
+  Table table = clients.run(client -> client.getTable(database, viewName));
+  HiveCatalogUtil.validateTableIsIcebergView(table, fullTableName(name, 
identifier));
+  clients.run(
+  client -> {
+client.dropTable(database, viewName);
+return null;
+  });
+  LOG.info("Dropped View: {}", identifier);
+  return true;
+
+} catch (NoSuchViewException | NoSuchObjectException e) {
+  LOG.info("Skipping drop, View does not exist: {}", identifier, e);
+  return false;
+} catch (TException e) {
+  throw new RuntimeException("Failed to drop " + identifier, e);
+} catch (InterruptedException e) {
+  Thread.currentThread().interrupt();
+  throw new RuntimeException("Interrupted in call to dropView", e);
+}
+  }
+
+  @Override
+  public List<TableIdentifier> listViews(Namespace namespace) {

Review Comment:
   Can we just pass TableType and reduce code duplication like I did for 
Nessie? 
   
https://github.com/apache/iceberg/pull/8909/files#diff-4abd6d90c69587709bc754be9ea305080395bad9ac778176a0e3ab2563ac8464R145
   
   Similar comment for all the view related APIs
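   
   To make that concrete, a rough sketch of such a shared helper (names are illustrative, not the actual `HiveCatalog` code; `listAllHmsTables` and `typeOf` stand in for the real metastore calls):
   
   ```java
   // Hypothetical shared listing helper parameterized by TableType, so that
   // the table and view listing APIs can both delegate to it.
   private List<TableIdentifier> listByType(Namespace namespace, TableType type) {
     return listAllHmsTables(namespace).stream() // hypothetical HMS wrapper
         .filter(name -> typeOf(namespace, name) == type) // hypothetical lookup
         .map(name -> TableIdentifier.of(namespace, name))
         .collect(Collectors.toList());
   }
   ```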
   



[I] Slow RewriteManifests due to Validation of Manifest Entries [iceberg]

2023-10-27 Thread via GitHub


mirageyjd opened a new issue, #8932:
URL: https://github.com/apache/iceberg/issues/8932

   ### Apache Iceberg version
   
   0.13.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   We ran the `BaseRewriteManifestsSparkAction` action on a large table with 7k+ manifests in Spark, and it unexpectedly took more than an hour. The most time-consuming step is validating that each manifest entry in the added manifests has a snapshot id, which is not executed in a distributed manner. Without the validation, the entire action takes less than 2 minutes.
   
   I wonder whether it is necessary to validate the snapshot id of each manifest entry in manifests written by `BaseRewriteManifestsSparkAction`. It would be better if such validation were optional and could be skipped in `BaseRewriteManifestsSparkAction`.
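   For context, a minimal sketch of invoking the action whose validation step is slow (the `spark` session and `table` loading are assumed and elided):
   
   ```java
   // Sketch: the manifest-rewrite action that triggers the validation above.
   RewriteManifests.Result result =
       SparkActions.get(spark)
           .rewriteManifests(table)
           .execute();
   ```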





Re: [PR] Spec: add nanosecond timestamp types [iceberg]

2023-10-27 Thread via GitHub


findepi commented on PR #8683:
URL: https://github.com/apache/iceberg/pull/8683#issuecomment-1782599430

   > Now that we have separate types, we can use the type to carry that 
information, so that you can promote a `long` to `timestamp_ms` or `timestamp` 
and we know how to interpret the value.
   
   That's true as long as type promotion happens only once.
   
   What should happen when `timestamp_ms` is promoted to `timestamp`?
   If it wasn't a `long` before, such promotion should be permitted.
   If it was a `long` before... how do we handle this case?
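   
   To make the ambiguity concrete, a worked example (the value is made up):
   
   ```java
   // The same stored value means different instants depending on whether the
   // column's history was long -> timestamp_ms -> timestamp or not.
   long v = 1_700_000_000_000L;
   // Read as milliseconds since epoch (timestamp_ms): 2023-11-14T22:13:20Z.
   // Read as microseconds since epoch (timestamp):    1970-01-20T16:13:20Z.
   ```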
   
   





Re: [I] Iceberg streaming streaming-skip-overwrite-snapshots SparkMicroBatchStream only skips over one file per trigger [iceberg]

2023-10-27 Thread via GitHub


cccs-jc commented on issue #8902:
URL: https://github.com/apache/iceberg/issues/8902#issuecomment-1782660563

   @singhpk234 do you know why the `existingFilesCount` is added to the count? It seems like it should only add `addedFilesCount`.
   
   
https://github.com/apache/iceberg/blob/e2b56daf35724700a9b57dbeee5fe23f99c592c4/core/src/main/java/org/apache/iceberg/MicroBatches.java#L95
   
   In the test cases there are no snapshots with multiple manifest lists, so I'm not 100% sure whether it's correct or not.





Re: [I] Fix typo in `_primitive_to_phyisical` [iceberg-python]

2023-10-27 Thread via GitHub


whisk commented on issue #107:
URL: https://github.com/apache/iceberg-python/issues/107#issuecomment-1782706542

   Hi @Fokko, please consider PR #108 





Re: [I] Slow RewriteManifests due to Validation of Manifest Entries [iceberg]

2023-10-27 Thread via GitHub


RussellSpitzer commented on issue #8932:
URL: https://github.com/apache/iceberg/issues/8932#issuecomment-1782759994

   Do you have a flame graph or some evidence of this? My gut would say that 
would be a trivial check





Re: [PR] feat: support ser/deser of value [iceberg-rust]

2023-10-27 Thread via GitHub


ZENOTME commented on code in PR #82:
URL: https://github.com/apache/iceberg-rust/pull/82#discussion_r1374447735


##
crates/iceberg/src/avro/schema.rs:
##
@@ -30,6 +32,11 @@ use itertools::{Either, Itertools};
 use serde_json::{Number, Value};
 
 const FILED_ID_PROP: &str = "field-id";
+const UUID_BYTES: usize = 16;
+const UUID_LOGICAL_TYPE: &str = "uuid";
+// # TODO
+// This const may better to maintain in avro-rs.

Review Comment:
   I have added #86 to track them.






Re: [I] Flink: Add support for Flink 1.18 [iceberg]

2023-10-27 Thread via GitHub


YesOrNo828 commented on issue #8930:
URL: https://github.com/apache/iceberg/issues/8930#issuecomment-1782883251

   > then we will have a transitive dependency on Flink 1.18. We can exclude 
the dependency, but how can we make sure that everything works as expected 
without running the tests. 
   
   You're right. There'll be a transitive dependency. If Flink 1.17's APIs are incompatible with Flink 1.18, I think we can copy some tests into the Flink 1.17 module to ensure it works fine, and add end-to-end tests to guarantee compatibility with different Flink versions.
   
   > As an alternative we can move these classes around every time when it is 
needed - but this approach has its own maintenance costs. 
   
   As of Flink 1.18, released a few days ago, the Flink community has externalized the [JDBC](https://github.com/apache/flink-connector-jdbc), [ES](https://github.com/apache/flink-connector-elasticsearch), [Kafka](https://github.com/apache/flink-connector-kafka), [Pulsar](https://github.com/apache/flink-connector-pulsar), [HBase](https://github.com/apache/flink-connector-hbase), and other connectors. That means the Flink API has become more compatible, so I think this maintenance cost between different Flink versions is acceptable.






Re: [PR] Iceberg 1.3.0 jc streaming [iceberg]

2023-10-27 Thread via GitHub


cccs-jc closed pull request #8934: Iceberg 1.3.0 jc streaming
URL: https://github.com/apache/iceberg/pull/8934





Re: [PR] Iceberg 1.3.0 jc streaming [iceberg]

2023-10-27 Thread via GitHub


cccs-jc commented on PR #8934:
URL: https://github.com/apache/iceberg/pull/8934#issuecomment-1782901768

   not ready





[PR] patch: Parquet Column Names with "Special Characters" fix [iceberg-python]

2023-10-27 Thread via GitHub


MarquisC opened a new pull request, #109:
URL: https://github.com/apache/iceberg-python/pull/109

   We're using PyIceberg to read Iceberg tables stored in S3 as Parquet. We have column names in the form of `id:foo` and `diagnostic:bar`, using `:` as a sort of delimiter to help us do some programmatic maintenance on our side.
   
   In Parquet, such column names are silently substituted (in this case `:` -> `_x3A`), and upon attempts at scanning/reading the data, the table schema doesn't match the physical column names for PyArrow.
   
   The first pass is a naive fix for this that I have tested and works, but I'm looking for guidance on where you all want me to put this logic, and I'm happy to add it there instead.





Re: [PR] patch: Parquet Column Names with "Special Characters" fix [iceberg-python]

2023-10-27 Thread via GitHub


mchamberlain-mdsol commented on PR #109:
URL: https://github.com/apache/iceberg-python/pull/109#issuecomment-1783097367

   Exception Example:
   https://github.com/apache/iceberg-python/assets/110425760/74a7f812-333b-45ee-861c-cf581a99f3d3
   





Re: [PR] Spark: Avoid extra copies of manifests while optimizing V2 tables [iceberg]

2023-10-27 Thread via GitHub


singhpk234 commented on code in PR #8928:
URL: https://github.com/apache/iceberg/pull/8928#discussion_r1374737812


##
core/src/test/java/org/apache/iceberg/TestRewriteManifests.java:
##
@@ -443,6 +444,14 @@ public void testBasicManifestReplacement() throws 
IOException {
 List<ManifestFile> manifests = snapshot.allManifests(table.io());
 Assert.assertEquals(3, manifests.size());
 
+if (formatVersion == 1) {
+  
assertThat(manifests.get(0).path()).isNotEqualTo(firstNewManifest.path());
+  
assertThat(manifests.get(1).path()).isNotEqualTo(secondNewManifest.path());
+} else {
+  assertThat(manifests.get(0).path()).isEqualTo(firstNewManifest.path());
+  assertThat(manifests.get(1).path()).isEqualTo(secondNewManifest.path());
+}

Review Comment:
   [minor] A comment might be helpful.
   Shouldn't we also check whether format = 1 and snapshotIdInheritance is enabled?



##
core/src/test/java/org/apache/iceberg/TestRewriteManifests.java:
##
@@ -443,6 +444,14 @@ public void testBasicManifestReplacement() throws 
IOException {
 List<ManifestFile> manifests = snapshot.allManifests(table.io());
 Assert.assertEquals(3, manifests.size());
 
+if (formatVersion == 1) {
+  
assertThat(manifests.get(0).path()).isNotEqualTo(firstNewManifest.path());
+  
assertThat(manifests.get(1).path()).isNotEqualTo(secondNewManifest.path());
+} else {
+  assertThat(manifests.get(0).path()).isEqualTo(firstNewManifest.path());
+  assertThat(manifests.get(1).path()).isEqualTo(secondNewManifest.path());
+}

Review Comment:
   [minor] Shouldn't we also check whether format = 1 and snapshotIdInheritance is enabled?






Re: [PR] Fix Migrate procedure renaming issue for custom catalog [iceberg]

2023-10-27 Thread via GitHub


singhpk234 commented on code in PR #8931:
URL: https://github.com/apache/iceberg/pull/8931#discussion_r1374769692


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/MigrateTableSparkAction.java:
##
@@ -108,6 +109,23 @@ public MigrateTableSparkAction backupTableName(String 
tableName) {
 return this;
   }
 
+  @Override
+  public MigrateTableSparkAction destCatalogName(String catalogName) {
+CatalogManager catalogManager = spark().sessionState().catalogManager();
+
+CatalogPlugin catalogPlugin;
+if (catalogManager.isCatalogRegistered(catalogName)) {
+  catalogPlugin = catalogManager.catalog(catalogName);
+} else {
+  LOG.warn(
+  "{} doesn't exist in SparkSession. " + "Fallback to current 
SparkSession catalog.",
+  catalogName);
+  catalogPlugin = catalogManager.currentCatalog();
+}
+this.destCatalog = checkDestinationCatalog(catalogPlugin);

Review Comment:
   QQ: earlier we used to use the sourceCatalog as the destCatalog too; was this a problem? Can you please add more comments as to why the sourceCatalog was picked as `spark_catalog` rather than `glue_catalog`, since we were calling the migrate procedure from glue_catalog?



##
spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMigrateTableProcedure.java:
##
@@ -142,6 +142,48 @@ public void testMigrateWithBackupTableName() throws 
IOException {
 Assertions.assertThat(spark.catalog().tableExists(dbName + "." + 
backupTableName)).isTrue();
   }
 
+  @Test
+  public void testMigrateWithDestCatalogName() throws IOException {
+Assume.assumeTrue(catalogName.equals("spark_catalog"));
+
+spark
+.conf()
+.set("spark.sql.catalog.spark_catalog", 
"org.apache.iceberg.spark.SparkSessionCatalog");
+
+String location = temp.newFolder().toString();
+sql(
+"CREATE TABLE %s (id bigint NOT NULL, data string) USING parquet 
LOCATION '%s'",
+tableName, location);
+sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName);
+
+Object result =
+scalarSql(
+"CALL %s.system.migrate(table => '%s', drop_backup => false, 
dest_catalog_name => '%s')",
+catalogName, tableName, catalogName);
+Assertions.assertThat(result).isEqualTo(1L);
+Assertions.assertThat(spark.catalog().tableExists(tableName + 
"_BACKUP_")).isTrue();
+  }
+
+  @Test
+  public void testMigrateWithDestCatalogNameWithNonExistingCatalog() throws 
IOException {
+Assume.assumeTrue(catalogName.equals("spark_catalog"));
+
+String destCatalogName = "non_existing_catalog";
+
+String location = temp.newFolder().toString();
+sql(
+"CREATE TABLE %s (id bigint NOT NULL, data string) USING parquet 
LOCATION '%s'",
+tableName, location);
+sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName);
+
+Object result =
+scalarSql(
+"CALL %s.system.migrate(table => '%s', drop_backup => false, 
dest_catalog_name => '%s')",
+catalogName, tableName, destCatalogName);
+Assertions.assertThat(result).isEqualTo(1L);
+Assertions.assertThat(spark.catalog().tableExists(tableName + 
"_BACKUP_")).isTrue();

Review Comment:
   should check the warning log in this case, was wondering if giving incorrect 
destCatalogName should be thrown as Invalid input






Re: [I] Partitioning by Year/Month/Day [iceberg]

2023-10-27 Thread via GitHub


l20DfX35JnKBfRn commented on issue #4129:
URL: https://github.com/apache/iceberg/issues/4129#issuecomment-1783179835

   If anyone is confused about this error on AWS Athena: unlike hive-style file partitioning in S3 such as `{table-s3-location}/year={}/month={}/day={}/`, the equivalent in Iceberg is, e.g.
   ```SQL
   CREATE TABLE documents (
     id string,
     created_at timestamp
     ...
   )
   PARTITIONED BY (
     day(created_at)
   )
   LOCATION 's3://{table-s3-location}/'
   TBLPROPERTIES ('table_type' = 'ICEBERG');
   ```
   
   This will create separate partitions for each day, exactly like Hive.





Re: [PR] Spec: add nanosecond timestamp types [iceberg]

2023-10-27 Thread via GitHub


jacobmarble commented on code in PR #8683:
URL: https://github.com/apache/iceberg/pull/8683#discussion_r1374794900


##
format/spec.md:
##
@@ -187,10 +189,11 @@ A **`map`** is a collection of key-value pairs with a key 
type and a value type.
 Notes:
 
 1. Decimal scale is fixed and cannot be changed by schema evolution. Precision 
can only be widened.
-2. All time and timestamp values are stored with microsecond precision.
-- Timestamps _with time zone_ represent a point in time: values are stored 
as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is 
stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered 
identical).
-- Timestamps _without time zone_ represent a date and time of day 
regardless of zone: the time value is independent of zone adjustments 
(`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). Timestamp 
values are stored as a long that encodes microseconds from the unix epoch.
+2. `time`, `timestamp`, and `timestamptz` values are represented with 
_microsecond precision_. `timestamp_ns` and `timstamptz_ns` values are 
represented with _nanosecond precision_.
+- Timestamp values _with time zone_ represent a point in time: values are 
stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` 
is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are 
considered identical).
+- Timestamp values _without time zone_ represent a date and time of day 
regardless of zone: the time value is independent of zone adjustments 
(`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`).
 3. Character strings must be stored as UTF-8 encoded byte arrays.
+4. `timestamp_ns` and `timstamptz_ns` are only supported in v3 tables.

Review Comment:
   Fixed.






Re: [PR] Spec: add nanosecond timestamp types [iceberg]

2023-10-27 Thread via GitHub


jacobmarble commented on PR #8683:
URL: https://github.com/apache/iceberg/pull/8683#issuecomment-1783184731

   > > Now that we have separate types, we can use the type to carry that 
information, so that you can promote a `long` to `timestamp_ms` or `timestamp` 
and we know how to interpret the value.
   > 
   > That's true as long as type promotion happens only once.
   > 
   > What should happen when `timestamp_ms` is promoted to `timestamp`? If it 
wasn't a `long` before, such promotion should be permitted. If it was a `long` 
before... how do we handle this case?
   
   FWIW, the millisecond timestamp types (`timestamp_ms`, `timestamptz_ms`) 
have been removed from this change.





Re: [PR] Spark: Avoid extra copies of manifests while optimizing V2 tables [iceberg]

2023-10-27 Thread via GitHub


aokolnychyi commented on code in PR #8928:
URL: https://github.com/apache/iceberg/pull/8928#discussion_r1374826073


##
core/src/test/java/org/apache/iceberg/TestRewriteManifests.java:
##
@@ -443,6 +444,14 @@ public void testBasicManifestReplacement() throws 
IOException {
 List<ManifestFile> manifests = snapshot.allManifests(table.io());
 Assert.assertEquals(3, manifests.size());
 
+if (formatVersion == 1) {
+  
assertThat(manifests.get(0).path()).isNotEqualTo(firstNewManifest.path());
+  
assertThat(manifests.get(1).path()).isNotEqualTo(secondNewManifest.path());
+} else {
+  assertThat(manifests.get(0).path()).isEqualTo(firstNewManifest.path());
+  assertThat(manifests.get(1).path()).isEqualTo(secondNewManifest.path());
+}

Review Comment:
   There are actually two different tests. This one, the basic replacement test, covers the default behavior. The one below explicitly validates enabled snapshot ID inheritance.






Re: [I] Fix typo in `_primitive_to_phyisical` [iceberg-python]

2023-10-27 Thread via GitHub


Fokko commented on issue #107:
URL: https://github.com/apache/iceberg-python/issues/107#issuecomment-1783395139

   Nice, thanks @whisk 👍 





Re: [PR] Fixed typos [iceberg-python]

2023-10-27 Thread via GitHub


Fokko merged PR #108:
URL: https://github.com/apache/iceberg-python/pull/108





Re: [PR] patch: Parquet Column Names with "Special Characters" fix [iceberg-python]

2023-10-27 Thread via GitHub


Fokko commented on PR #109:
URL: https://github.com/apache/iceberg-python/pull/109#issuecomment-1783408073

   Thanks for raising this @MarquisC. This looks like https://github.com/apache/iceberg-python/pull/83/; can you check if that also resolves your problem? Otherwise, I think that will be a good place to add it as well, and it also shows how to test this.





Re: [PR] added contributing.md file [iceberg-python]

2023-10-27 Thread via GitHub


Fokko commented on PR #102:
URL: https://github.com/apache/iceberg-python/pull/102#issuecomment-1783409103

   @onemriganka What do you think of pointing to https://py.iceberg.apache.org/contributing/? Otherwise we have to maintain this in two places.





Re: [PR] Spec: add nanosecond timestamp types [iceberg]

2023-10-27 Thread via GitHub


findepi commented on PR #8683:
URL: https://github.com/apache/iceberg/pull/8683#issuecomment-1783460277

   @jacobmarble indeed, thanks!
   Do we expect any type promotions to be allowed for the new types added here?





Re: [PR] Spec: add nanosecond timestamp types [iceberg]

2023-10-27 Thread via GitHub


Fokko commented on code in PR #8683:
URL: https://github.com/apache/iceberg/pull/8683#discussion_r1375028979


##
format/spec.md:
##
@@ -187,10 +189,11 @@ A **`map`** is a collection of key-value pairs with a key 
type and a value type.
 Notes:
 
 1. Decimal scale is fixed and cannot be changed by schema evolution. Precision 
can only be widened.
-2. All time and timestamp values are stored with microsecond precision.
-- Timestamps _with time zone_ represent a point in time: values are stored 
as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is 
stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered 
identical).
-- Timestamps _without time zone_ represent a date and time of day 
regardless of zone: the time value is independent of zone adjustments 
(`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). Timestamp 
values are stored as a long that encodes microseconds from the unix epoch.
+2. `time`, `timestamp`, and `timestamptz` values are represented with 
_microsecond precision_. `timestamp_ns` and `timstamptz_ns` values are 
represented with _nanosecond precision_.
+- Timestamp values _with time zone_ represent a point in time: values are 
stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` 
is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are 
considered identical).
+- Timestamp values _without time zone_ represent a date and time of day 
regardless of zone: the time value is independent of zone adjustments 
(`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`).
 3. Character strings must be stored as UTF-8 encoded byte arrays.
+4. `timestamp_ns` and `timstamptz_ns` are only supported in v3 tables.

Review Comment:
   I like Ryan's suggestion of having a full new column. We can expect more types to be added, so a dedicated column makes sense to me, instead of adding it to the `requirements` column.






Re: [PR] Core: Ignore split offsets array when split offset is past file length [iceberg]

2023-10-27 Thread via GitHub


amogh-jahagirdar commented on PR #8925:
URL: https://github.com/apache/iceberg/pull/8925#issuecomment-1783496227

   > @amogh-jahagirdar, maybe this time we should create a test to validate 
FileScanTask.split when there are bad split offsets? I think that would have 
caught this in the last PR.
   
   Done, added a new test which will exercise both `splitOffsetArray` and 
`splitOffsets`. This will also validate the number of tasks per file





Re: [PR] Core: Ignore split offsets array when split offset is past file length [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8925:
URL: https://github.com/apache/iceberg/pull/8925#discussion_r1375034300


##
core/src/test/java/org/apache/iceberg/TestSplitPlanning.java:
##
@@ -216,6 +217,34 @@ public void testSplitPlanningWithOffsets() {
 "We should get one task per row group", 32, 
Iterables.size(scan.planTasks()));
   }
 
+  @Test
+  public void testSplitPlanningWithCorruptedOffsets() throws IOException {
+List<DataFile> files16Mb = newFilesWithInvalidOffset(16, 16 * 1024 * 1024, 2);
+appendFiles(files16Mb);
+
+// There will be 4 tasks per file, instead of 2 tasks per file, since the 
offsets are invalid
+// and will be ignored.
+TableScan scan =
+table
+.newScan()
+.option(TableProperties.SPLIT_OPEN_FILE_COST, String.valueOf(0))
+.option(TableProperties.SPLIT_SIZE, String.valueOf(4L * 1024 * 
1024));
+
+Assert.assertEquals(

Review Comment:
   minor: for newly added test code it would be better to use AssertJ-style 
assertions to make a future migration away from JUnit4 easier
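   
   For example, the assertion above could be written as (a sketch; it assumes the usual static import of `Assertions.assertThat`):
   
   ```java
   // AssertJ-style equivalent of the JUnit4 assertion in the hunk above.
   assertThat(Iterables.size(scan.planTasks()))
       .as("We should get four tasks per file since the offsets will be ignored")
       .isEqualTo(64);
   ```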
   






Re: [PR] Core: Ignore split offsets array when split offset is past file length [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8925:
URL: https://github.com/apache/iceberg/pull/8925#discussion_r1375034993


##
core/src/test/java/org/apache/iceberg/TestSplitPlanning.java:
##
@@ -216,6 +217,34 @@ public void testSplitPlanningWithOffsets() {
 "We should get one task per row group", 32, 
Iterables.size(scan.planTasks()));
   }
 
+  @Test
+  public void testSplitPlanningWithCorruptedOffsets() throws IOException {
+List<DataFile> files16Mb = newFilesWithInvalidOffset(16, 16 * 1024 * 1024, 2);
+appendFiles(files16Mb);
+
+// There will be 4 tasks per file, instead of 2 tasks per file, since the 
offsets are invalid
+// and will be ignored.
+TableScan scan =
+table
+.newScan()
+.option(TableProperties.SPLIT_OPEN_FILE_COST, String.valueOf(0))
+.option(TableProperties.SPLIT_SIZE, String.valueOf(4L * 1024 * 
1024));
+
+Assert.assertEquals(
+"We should get four tasks per file since the offsets will be ignored",
+64,
+Iterables.size(scan.planTasks()));
+try (CloseableIterable<CombinedScanTask> tasks = scan.planTasks()) {
+  CombinedScanTask task = tasks.iterator().next().asCombinedScanTask();
+  Collection<FileScanTask> files = task.files();
+  for (FileScanTask fileScanTask : files) {
+DataFile dataFile = fileScanTask.file();
+Assert.assertNull(

Review Comment:
   same as above






Re: [PR] Add latest version to menu [iceberg-docs]

2023-10-27 Thread via GitHub


Fokko merged PR #291:
URL: https://github.com/apache/iceberg-docs/pull/291





Re: [PR] Add latest version to menu [iceberg-docs]

2023-10-27 Thread via GitHub


Fokko commented on PR #291:
URL: https://github.com/apache/iceberg-docs/pull/291#issuecomment-1783503359

   Thanks @nastra for fixing this 👍 





Re: [PR] Core: Ignore split offsets array when split offset is past file length [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8925:
URL: https://github.com/apache/iceberg/pull/8925#discussion_r1375037491


##
core/src/test/java/org/apache/iceberg/TestSplitPlanning.java:
##
@@ -216,6 +217,34 @@ public void testSplitPlanningWithOffsets() {
 "We should get one task per row group", 32, 
Iterables.size(scan.planTasks()));
   }
 
+  @Test
+  public void testSplitPlanningWithCorruptedOffsets() throws IOException {

Review Comment:
   unfortunately that test passes for me locally without the new fix in 
`BaseFile`






Re: [PR] Core: Ignore split offsets array when split offset is past file length [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8925:
URL: https://github.com/apache/iceberg/pull/8925#discussion_r1375037491


##
core/src/test/java/org/apache/iceberg/TestSplitPlanning.java:
##
@@ -216,6 +217,34 @@ public void testSplitPlanningWithOffsets() {
 "We should get one task per row group", 32, 
Iterables.size(scan.planTasks()));
   }
 
+  @Test
+  public void testSplitPlanningWithCorruptedOffsets() throws IOException {

Review Comment:
   unfortunately that test passes for me locally without the new fix in `BaseFile` (I would have expected it to fail without the fix)






Re: [PR] Core: Ignore split offsets array when split offset is past file length [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8925:
URL: https://github.com/apache/iceberg/pull/8925#discussion_r1375037491


##
core/src/test/java/org/apache/iceberg/TestSplitPlanning.java:
##
@@ -216,6 +217,34 @@ public void testSplitPlanningWithOffsets() {
 "We should get one task per row group", 32, 
Iterables.size(scan.planTasks()));
   }
 
+  @Test
+  public void testSplitPlanningWithCorruptedOffsets() throws IOException {

Review Comment:
   that test passes for me locally without the new fix in `BaseFile` (I would have expected it to fail without the fix)






Re: [PR] Spec: Clarify missing fields when writing [iceberg]

2023-10-27 Thread via GitHub


Fokko commented on code in PR #8672:
URL: https://github.com/apache/iceberg/pull/8672#discussion_r1375043993


##
format/spec.md:
##
@@ -128,13 +128,13 @@ Tables do not require rename, except for tables that use 
atomic rename to implem
 
  Writer requirements
 
-Some tables in this spec have columns that specify requirements for v1 and v2 
tables. These requirements are intended for writers when adding metadata files 
to a table with the given version.
+Some tables in this spec have columns that specify requirements for v1 and v2 
tables. These requirements are intended for writers when adding metadata 
(including manifests files and manifest lists) files to a table with the given 
version.
 
-| Requirement | Write behavior |
-|-||
-| (blank) | The field should be omitted |
-| _optional_  | The field can be written |
-| _required_  | The field must be written |
+| Requirement | Write behavior  |

Review Comment:
   That's a good point, thanks!






Re: [I] Implementation does not write `schema-id` into Manifest Avro headers [iceberg]

2023-10-27 Thread via GitHub


Fokko commented on issue #8745:
URL: https://github.com/apache/iceberg/issues/8745#issuecomment-1783512462

   @JFinis I think this was in there so the schema didn't need to be 
deserialized. 
   
   > Suggestion: Remove mentioning of field completely from the spec. It's 
redundant and the implementation doesn't seem to write it anyway.
   
   WDYT of creating a PR, and also raising this on the dev list?





[PR] Spark 3.5: Don't cache or reuse manifest entries while rewriting metadata by default [iceberg]

2023-10-27 Thread via GitHub


aokolnychyi opened a new pull request, #8935:
URL: https://github.com/apache/iceberg/pull/8935

   The action for rewriting manifests caches the manifest entry DF or does an 
extra shuffle in order to skip reading the actual manifest files twice. We did 
this assuming it would increase the performance. However, the caching seems to 
perform poorly for larger tables as it requires substantial cluster resources. 
In addition, doing a round-robin repartition is expensive as the entries must 
be written to disk. The extra write is actually more expensive than the extra 
read required for the range-based shuffle of manifest entries.
   
   Therefore, this PR disables caching by default and removes the optional 
round-robin repartition step. Instead, we will read the manifests twice.
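   
   For reference, a sketch of how a caller can already control this explicitly through the `rewrite_manifests` procedure's `use_caching` argument (catalog and table names are placeholders); this PR changes the default behavior:
   
   ```java
   // Sketch: explicitly disabling manifest-entry caching when rewriting
   // manifests; my_catalog and db.sample are placeholders.
   spark.sql(
       "CALL my_catalog.system.rewrite_manifests("
           + "table => 'db.sample', use_caching => false)");
   ```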





[PR] Spark 3.5: Use DataFile constants in SparkDataFile [iceberg]

2023-10-27 Thread via GitHub


aokolnychyi opened a new pull request, #8936:
URL: https://github.com/apache/iceberg/pull/8936

   This PR makes `SparkDataFile` use constants in `DataFile` instead of 
hard-coded values.
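   As a hedged illustration of the intent (the exact call sites live in `SparkDataFile`; this fragment is only indicative, not taken from the PR):
   
   ```java
   // Before: field positions looked up via hard-coded column names
   // int filePathPosition = sparkType.fieldIndex("file_path");
   
   // After: resolve through the DataFile schema constants so the two stay in sync
   int filePathPosition = sparkType.fieldIndex(DataFile.FILE_PATH.name());
   ```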





Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375049907


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitCompletePayload.java:
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import java.util.UUID;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+
+/**
+ * A control event payload for events sent by a coordinator that indicates it 
has completed a commit
+ * cycle. Events with this payload are not consumed by the sink, they * are 
informational and can be

Review Comment:
   ```suggestion
 * cycle. Events with this payload are not consumed by the sink, they are informational and can be
   ```






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375050867


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitCompletePayload.java:
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import java.util.UUID;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+
+/**
+ * A control event payload for events sent by a coordinator that indicates it 
has completed a commit
+ * cycle. Events with this payload are not consumed by the sink, they * are 
informational and can be
+ * used by consumers to trigger downstream processes.
+ */
+public class CommitCompletePayload implements Payload {
+
+  private UUID commitId;
+  private Long vtts;

Review Comment:
   the name is somewhat cryptic, would it make sense to rename this to 
`validThroughTimestamp`?






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375053314


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/Event.java:
##
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.UUID;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.iceberg.avro.AvroEncoderUtil;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+import org.apache.iceberg.data.avro.DecoderResolver;
+
+/**
+ * Class representing all events produced to the control topic. Different 
event types have different
+ * payloads.
+ */
+public class Event implements Element {
+
+  private UUID id;
+  private EventType type;
+  private Long timestamp;
+  private String groupId;
+  private Payload payload;
+  private final Schema avroSchema;
+
+  public static byte[] encode(Event event) {
+    try {
+      return AvroEncoderUtil.encode(event, event.getSchema());
+    } catch (IOException e) {
+      throw new UncheckedIOException(e);
+    }
+  }
+
+  public static Event decode(byte[] bytes) {
+    try {
+      Event event = AvroEncoderUtil.decode(bytes);
+      // clear the cache to avoid memory leak
+      DecoderResolver.clearCache();
+      return event;
+    } catch (IOException e) {
+      throw new UncheckedIOException(e);
+    }
+  }
+
+  // Used by Avro reflection to instantiate this class when reading events
+  public Event(Schema avroSchema) {
+    this.avroSchema = avroSchema;
+  }
+
+  public Event(String groupId, EventType type, Payload payload) {
+    this.id = UUID.randomUUID();
+    this.type = type;
+    this.timestamp = System.currentTimeMillis();
+    this.groupId = groupId;
+    this.payload = payload;
+
+    this.avroSchema =
+        SchemaBuilder.builder()
+            .record(getClass().getName())
+            .fields()
+            .name("id")
+            .prop(AvroSchemaUtil.FIELD_ID_PROP, 1500)
+            .type(UUID_SCHEMA)
+            .noDefault()
+            .name("type")
+            .prop(AvroSchemaUtil.FIELD_ID_PROP, 1501)
+            .type()
+            .intType()
+            .noDefault()
+            .name("timestamp")
+            .prop(AvroSchemaUtil.FIELD_ID_PROP, 1502)
+            .type()
+            .longType()
+            .noDefault()
+            .name("payload")
+            .prop(AvroSchemaUtil.FIELD_ID_PROP, 1503)
+            .type(payload.getSchema())
+            .noDefault()
+            .name("groupId")
+            .prop(AvroSchemaUtil.FIELD_ID_PROP, 1504)
+            .type()
+            .stringType()
+            .noDefault()
+            .endRecord();
+  }
+
+  public UUID id() {
+    return id;
+  }
+
+  public EventType type() {
+    return type;
+  }
+
+  public Long timestamp() {
+    return timestamp;
+  }
+
+  public Payload payload() {
+    return payload;
+  }
+
+  public String groupId() {
+    return groupId;
+  }
+
+  @Override
+  public Schema getSchema() {

Review Comment:
   minor: I believe we typically don't use a `get` prefix in the Iceberg 
codebase
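
   As a usage note, the `encode`/`decode` pair above round-trips an event through the Avro wire format used on the control topic; a small sketch (any concrete `Payload` implementation from this module works here):
   
   ```java
   import org.apache.iceberg.connect.events.Event;
   import org.apache.iceberg.connect.events.EventType;
   import org.apache.iceberg.connect.events.Payload;
   
   public class EventRoundTrip {
     // Serializes an event to bytes and reads it back, as a sink or
     // coordinator would when producing/consuming the control topic.
     static Event roundTrip(String groupId, EventType type, Payload payload) {
       Event event = new Event(groupId, type, payload);
       byte[] wireBytes = Event.encode(event); // Avro-encoded, schema carried alongside
       return Event.decode(wireBytes);         // also clears the decoder cache
     }
   }
   ```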






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


bryanck commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375054000


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/Event.java:
##
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.UUID;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.iceberg.avro.AvroEncoderUtil;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+import org.apache.iceberg.data.avro.DecoderResolver;
+
+/**
+ * Class representing all events produced to the control topic. Different 
event types have different
+ * payloads.
+ */
+public class Event implements Element {
+
+  private UUID id;
+  private EventType type;
+  private Long timestamp;
+  private String groupId;
+  private Payload payload;
+  private final Schema avroSchema;
+
+  public static byte[] encode(Event event) {
+    try {
+      return AvroEncoderUtil.encode(event, event.getSchema());
+    } catch (IOException e) {
+      throw new UncheckedIOException(e);
+    }
+  }
+
+  public static Event decode(byte[] bytes) {
+    try {
+      Event event = AvroEncoderUtil.decode(bytes);
+      // clear the cache to avoid memory leak
+      DecoderResolver.clearCache();
+      return event;
+    } catch (IOException e) {
+      throw new UncheckedIOException(e);
+    }
+  }
+
+  // Used by Avro reflection to instantiate this class when reading events
+  public Event(Schema avroSchema) {
+    this.avroSchema = avroSchema;
+  }
+
+  public Event(String groupId, EventType type, Payload payload) {
+    this.id = UUID.randomUUID();
+    this.type = type;
+    this.timestamp = System.currentTimeMillis();
+    this.groupId = groupId;
+    this.payload = payload;
+
+    this.avroSchema =
+        SchemaBuilder.builder()
+            .record(getClass().getName())
+            .fields()
+            .name("id")
+            .prop(AvroSchemaUtil.FIELD_ID_PROP, 1500)
+            .type(UUID_SCHEMA)
+            .noDefault()
+            .name("type")
+            .prop(AvroSchemaUtil.FIELD_ID_PROP, 1501)
+            .type()
+            .intType()
+            .noDefault()
+            .name("timestamp")
+            .prop(AvroSchemaUtil.FIELD_ID_PROP, 1502)
+            .type()
+            .longType()
+            .noDefault()
+            .name("payload")
+            .prop(AvroSchemaUtil.FIELD_ID_PROP, 1503)
+            .type(payload.getSchema())
+            .noDefault()
+            .name("groupId")
+            .prop(AvroSchemaUtil.FIELD_ID_PROP, 1504)
+            .type()
+            .stringType()
+            .noDefault()
+            .endRecord();
+  }
+
+  public UUID id() {
+    return id;
+  }
+
+  public EventType type() {
+    return type;
+  }
+
+  public Long timestamp() {
+    return timestamp;
+  }
+
+  public Payload payload() {
+    return payload;
+  }
+
+  public String groupId() {
+    return groupId;
+  }
+
+  @Override
+  public Schema getSchema() {

Review Comment:
   I left this one in as it is an override






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375054505


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/TableName.java:
##
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import static java.util.stream.Collectors.toList;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.avro.util.Utf8;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.TableIdentifier;
+
+public class TableName implements Element {

Review Comment:
   would `Identifier` or `FullIdentifier` make sense here?






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375054505


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/TableName.java:
##
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import static java.util.stream.Collectors.toList;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.avro.util.Utf8;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.TableIdentifier;
+
+public class TableName implements Element {

Review Comment:
   nit: would `FullTableIdentifier` / `FullIdentifier` make sense here?






Re: [PR] Core: Enable column statistics filtering after planning [iceberg]

2023-10-27 Thread via GitHub


stevenzwu commented on code in PR #8803:
URL: https://github.com/apache/iceberg/pull/8803#discussion_r1373440573


##
core/src/main/java/org/apache/iceberg/BaseFile.java:
##
@@ -504,6 +508,27 @@ private static Map<Integer, ByteBuffer> toReadableByteBufferMap(Map<Integer, ByteBuffer> map) {
+  private static <TypeT> Map<Integer, TypeT> filterColumnsStats(
+      Map<Integer, TypeT> map, Set<Integer> columnIds) {
+    if (columnIds == null || columnIds.isEmpty()) {
+      return SerializableMap.copyOf(map);
+    }
+
+    if (map == null) {
+      return null;
+    }
+
+    Map<Integer, TypeT> filtered = Maps.newHashMapWithExpectedSize(columnIds.size());
+    for (Integer columnId : columnIds) {
+      TypeT value = map.get(columnId);
+      if (value != null) {

Review Comment:
   thanks for the unit test code. I am still not comfortable making that assumption, as the Iceberg spec clearly says a null stats value is possible. What if the Trino Parquet writer produces min/max stats with a null value?
   ```
   Lower bound for the non-null, non-NaN values in the partition field, or null if all values are null or NaN [2]
   ```
   
   At least, there is a mismatch between the Parquet writer implementation and the spec. Unless we are going to clarify/update the spec, I feel it is safer to do `map.containsKey(key)`.
   
   Maybe @RussellSpitzer and @rdblue can chime in here.
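
   To make the alternative concrete, a hypothetical variant of the quoted loop that uses `containsKey` and therefore preserves entries whose value is explicitly null (sketch only, not the PR's code):
   
   ```java
   // On the reading of the spec where a null lower/upper bound is itself
   // meaningful ("all values are null or NaN"), keep keys mapped to null.
   Map<Integer, TypeT> filtered = Maps.newHashMapWithExpectedSize(columnIds.size());
   for (Integer columnId : columnIds) {
     if (map.containsKey(columnId)) {
       filtered.put(columnId, map.get(columnId));
     }
   }
   ```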






Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8804:
URL: https://github.com/apache/iceberg/pull/8804#discussion_r1375068537


##
aws/src/integration/java/org/apache/iceberg/aws/lakeformation/LakeFormationTestBase.java:
##
@@ -357,8 +360,20 @@ String getRandomTableName() {
     return LF_TEST_TABLE_PREFIX + UUID.randomUUID().toString().replace("-", "");
   }
 
-  private static void waitForIamConsistency() throws Exception {
-    Thread.sleep(IAM_PROPAGATION_DELAY); // sleep to make sure IAM up to date
+  private static void waitForIamConsistency(String roleName, String policyName) {
+    // wait to make sure IAM up to date
+    Awaitility.await()
+        .pollDelay(Duration.ofSeconds(1))
+        .atMost(Duration.ofSeconds(10))
+        .until(

Review Comment:
   this should do the same `untilAsserted()` check that I mentioned earlier






Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8804:
URL: https://github.com/apache/iceberg/pull/8804#discussion_r1375070758


##
aws/src/integration/java/org/apache/iceberg/aws/lakeformation/LakeFormationTestBase.java:
##
@@ -417,7 +432,20 @@ private static void registerResource(String s3Location) {
               .build());
       // when a resource is registered, LF will update SLR with necessary permissions which has a
       // propagation delay
-      waitForIamConsistency();
+      Awaitility.await()
+          .pollDelay(Duration.ofSeconds(1))
+          .atMost(Duration.ofSeconds(10))
+          .ignoreExceptions()
+          .untilAsserted(
+              () ->
+                  Assertions.assertThat(
+                          lakeformation
+                              .describeResource(
+                                  DescribeResourceRequest.builder().resourceArn(arn).build())
+                              .resourceInfo()
+                              .roleArn()
+                              .equalsIgnoreCase(lfRegisterPathRoleArn))

Review Comment:
   this should do `.isEqualToIgnoringCase(lfRegisterPathRoleArn))` rather than 
`isTrue()`
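
   Spelled out, the reviewer's proposal would look roughly like this (a sketch combining both suggestions, not the final diff):
   
   ```java
   Awaitility.await()
       .pollDelay(Duration.ofSeconds(1))
       .atMost(Duration.ofSeconds(10))
       .ignoreExceptions()
       .untilAsserted(
           () ->
               Assertions.assertThat(
                       lakeformation
                           .describeResource(
                               DescribeResourceRequest.builder().resourceArn(arn).build())
                           .resourceInfo()
                           .roleArn())
                   // assert on the value directly instead of a boolean + isTrue()
                   .isEqualToIgnoringCase(lfRegisterPathRoleArn));
   ```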






Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8804:
URL: https://github.com/apache/iceberg/pull/8804#discussion_r1375070935


##
aws/src/integration/java/org/apache/iceberg/aws/lakeformation/TestLakeFormationAwsClientFactory.java:
##
@@ -128,8 +131,18 @@ public void testLakeFormationEnabledGlueCatalog() throws Exception {
                 + glueArnPrefix
                 + ":userDefinedFunction/allowed_*/*\"]}]}")
             .build());
-    waitForIamConsistency();
-
+    Awaitility.await()
+        .pollDelay(Duration.ofSeconds(1))
+        .atMost(Duration.ofSeconds(10))
+        .until(

Review Comment:
   this should be consistent with other checks






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


bryanck commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375071931


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitCompletePayload.java:
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import java.util.UUID;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+
+/**
+ * A control event payload for events sent by a coordinator that indicates it 
has completed a commit
+ * cycle. Events with this payload are not consumed by the sink, they * are 
informational and can be
+ * used by consumers to trigger downstream processes.
+ */
+public class CommitCompletePayload implements Payload {
+
+  private UUID commitId;
+  private Long vtts;

Review Comment:
   Sure, I updated this to `validThroughTs`






Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8804:
URL: https://github.com/apache/iceberg/pull/8804#discussion_r1375072145


##
core/src/test/java/org/apache/iceberg/hadoop/TestHadoopCommits.java:
##
@@ -435,13 +437,11 @@ public void testConcurrentFastAppends(@TempDir File dir) throws Exception {
       for (int numCommittedFiles = 0;
           numCommittedFiles < numberOfCommitedFilesPerThread;
           numCommittedFiles++) {
-        while (barrier.get() < numCommittedFiles * threadsCount) {
-          try {
-            Thread.sleep(10);
-          } catch (InterruptedException e) {
-            throw new RuntimeException(e);
-          }
-        }
+        final int currentFilesCount = numCommittedFiles;
+        Awaitility.await()
+            .pollInterval(Duration.ofMillis(10))
+            .pollInSameThread()

Review Comment:
   then we need to figure out why the test was flaky. Was it occasionally failing? The purpose of using Awaitility is to make tests more robust and less flaky, so we need to understand where the flakiness comes from
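
   For orientation, the truncated replacement above presumably continues with a condition on the barrier counter; a hedged completion (the timeout value is illustrative, not from the PR):
   
   ```java
   final int currentFilesCount = numCommittedFiles;
   Awaitility.await()
       .pollInterval(Duration.ofMillis(10))
       .pollInSameThread()
       .atMost(Duration.ofSeconds(30)) // illustrative upper bound
       .until(() -> barrier.get() >= currentFilesCount * threadsCount);
   ```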






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


bryanck commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375073000


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitCompletePayload.java:
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import java.util.UUID;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+
+/**
+ * A control event payload for events sent by a coordinator that indicates it 
has completed a commit
+ * cycle. Events with this payload are not consumed by the sink, they * are 
informational and can be

Review Comment:
   Thanks for catching this, I fixed this








Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8918:
URL: https://github.com/apache/iceberg/pull/8918#discussion_r1375074020


##
hive-metastore/src/test/java/org/apache/iceberg/hive/HiveMetastoreSetup.java:
##
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
+import org.apache.iceberg.CatalogProperties;
+import org.apache.iceberg.CatalogUtil;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+
+/**
+ * Setup HiveMetastore. It does not create any database. All the tests should 
create a database
+ * accordingly. it should replace the existing setUp class {@link 
HiveMetastoreTest}
+ */
+class HiveMetastoreSetup {

Review Comment:
   the conclusion here is that we should remove this class. As I mentioned in another comment, the alternative would be to have a JUnit5 extension (similar to `NessieJaxRsExtension`) that would do the proper setup/teardown for tests
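
   A minimal sketch of what such a JUnit 5 extension could look like (the extension class is hypothetical; `TestHiveMetastore` is the existing fixture it would wrap, and its exact start/stop signatures should be checked):
   
   ```java
   import org.junit.jupiter.api.extension.AfterAllCallback;
   import org.junit.jupiter.api.extension.BeforeAllCallback;
   import org.junit.jupiter.api.extension.ExtensionContext;
   
   // Hypothetical extension managing the metastore lifecycle per test class,
   // replacing the inheritance-based HiveMetastoreTest setup.
   public class HiveMetastoreExtension implements BeforeAllCallback, AfterAllCallback {
     private TestHiveMetastore metastore;
   
     @Override
     public void beforeAll(ExtensionContext context) {
       this.metastore = new TestHiveMetastore();
       metastore.start();
     }
   
     @Override
     public void afterAll(ExtensionContext context) throws Exception {
       if (metastore != null) {
         metastore.stop();
       }
     }
   
     public TestHiveMetastore metastore() {
       return metastore;
     }
   }
   ```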






Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8918:
URL: https://github.com/apache/iceberg/pull/8918#discussion_r1375074725


##
hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java:
##
@@ -96,6 +105,43 @@ public class TestHiveCatalog extends HiveMetastoreTest {
 
   @TempDir private Path temp;
 
+  private HiveMetastoreSetup hiveMetastoreSetup;
+
+  private static final String HIVE_DB_NAME = "hivedb";

Review Comment:
   why not just call this `DB_NAME` so that you don't have to adjust all of the 
places?






Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8918:
URL: https://github.com/apache/iceberg/pull/8918#discussion_r1375075096


##
hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java:
##
@@ -349,26 +396,31 @@ public void testCreateTableCustomSortOrder() throws Exception {
       assertThat(hmsTableParameters())
           .containsEntry(DEFAULT_SORT_ORDER, SortOrderParser.toJson(table.sortOrder()));
     } finally {
-      catalog.dropTable(tableIdent);
+      catalog().dropTable(tableIdent);
     }
   }
 
+  @Override
+  protected HiveCatalog catalog() {

Review Comment:
   this should be at the top






Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8918:
URL: https://github.com/apache/iceberg/pull/8918#discussion_r1375077284


##
hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java:
##
@@ -850,51 +857,46 @@ private void removeNamespaceOwnershipAndVerify(
     createNamespaceAndVerifyOwnership(
         name, propToCreate, expectedOwnerPostCreate, expectedOwnerTypePostCreate);
 
-    catalog.removeProperties(Namespace.of(name), propToRemove);
+    catalog().removeProperties(Namespace.of(name), propToRemove);
 
-    Database database = metastoreClient.getDatabase(name);
+    Database database = hiveMetastoreSetup.metastoreClient.getDatabase(name);
 
     assertThat(database.getOwnerName()).isEqualTo(expectedOwnerPostRemove);
     assertThat(database.getOwnerType()).isEqualTo(expectedOwnerTypePostRemove);
   }
 
   @Test
-  public void testDropNamespace() throws TException {
+  public void testDropEmptyNamespace() throws TException {

Review Comment:
   can we please leave refactorings out of this PR? This file should only have a minimal set of changes (by adding the setup/teardown stuff)






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


bryanck commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375077725


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/TableName.java:
##
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import static java.util.stream.Collectors.toList;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.avro.util.Utf8;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.TableIdentifier;
+
+public class TableName implements Element {

Review Comment:
   `FullTableName` or `TableIdentity` maybe? Having `Table` in the name would 
be helpful.






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375084999


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/TableName.java:
##
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import static java.util.stream.Collectors.toList;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.avro.util.Utf8;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.TableIdentifier;
+
+public class TableName implements Element {

Review Comment:
   `FullTableName` sounds like a good idea






Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375085697


##
nessie/src/main/java/org/apache/iceberg/nessie/UpdateableReference.java:
##
@@ -62,7 +62,7 @@ public Reference getReference() {
 
   public void checkMutable() {
     Preconditions.checkArgument(
-        mutable, "You can only mutate tables when using a branch without a hash or timestamp.");
+        mutable, "You can only mutate contents when using a branch without a hash or timestamp.");

Review Comment:
   nit: should this say `content` maybe?






Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375086307


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java:
##
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.nessie;
+
+import java.util.Map;
+import java.util.concurrent.atomic.AtomicBoolean;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.NoSuchViewException;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.view.BaseViewOperations;
+import org.apache.iceberg.view.ViewMetadata;
+import org.apache.iceberg.view.ViewMetadataParser;
+import org.projectnessie.client.http.HttpClientException;
+import org.projectnessie.error.NessieBadRequestException;
+import org.projectnessie.error.NessieConflictException;
+import org.projectnessie.error.NessieNotFoundException;
+import org.projectnessie.model.Content;
+import org.projectnessie.model.ContentKey;
+import org.projectnessie.model.IcebergTable;
+import org.projectnessie.model.IcebergView;
+import org.projectnessie.model.Reference;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class NessieViewOperations extends BaseViewOperations {

Review Comment:
   minor: does this need to be public?






Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375087365


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java:
##
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.nessie;
+
+import java.util.Map;
+import java.util.concurrent.atomic.AtomicBoolean;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.NoSuchViewException;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.view.BaseViewOperations;
+import org.apache.iceberg.view.ViewMetadata;
+import org.apache.iceberg.view.ViewMetadataParser;
+import org.projectnessie.client.http.HttpClientException;
+import org.projectnessie.error.NessieBadRequestException;
+import org.projectnessie.error.NessieConflictException;
+import org.projectnessie.error.NessieNotFoundException;
+import org.projectnessie.model.Content;
+import org.projectnessie.model.ContentKey;
+import org.projectnessie.model.IcebergTable;
+import org.projectnessie.model.IcebergView;
+import org.projectnessie.model.Reference;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class NessieViewOperations extends BaseViewOperations {
+
+  private static final Logger LOG = 
LoggerFactory.getLogger(NessieViewOperations.class);
+
+  private final NessieIcebergClient client;
+

Review Comment:
   nit: line can be removed so that all final fields are grouped together






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


bryanck commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375087754


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/TableName.java:
##
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import static java.util.stream.Collectors.toList;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.avro.util.Utf8;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.TableIdentifier;
+
+public class TableName implements Element {

Review Comment:
   I made this change.






Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375088374


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java:
##
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.nessie;
+
+import java.util.Map;
+import java.util.concurrent.atomic.AtomicBoolean;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.NoSuchViewException;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.view.BaseViewOperations;
+import org.apache.iceberg.view.ViewMetadata;
+import org.apache.iceberg.view.ViewMetadataParser;
+import org.projectnessie.client.http.HttpClientException;
+import org.projectnessie.error.NessieBadRequestException;
+import org.projectnessie.error.NessieConflictException;
+import org.projectnessie.error.NessieNotFoundException;
+import org.projectnessie.model.Content;
+import org.projectnessie.model.ContentKey;
+import org.projectnessie.model.IcebergTable;
+import org.projectnessie.model.IcebergView;
+import org.projectnessie.model.Reference;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class NessieViewOperations extends BaseViewOperations {
+
+  private static final Logger LOG = LoggerFactory.getLogger(NessieViewOperations.class);
+
+  private final NessieIcebergClient client;
+
+  private final ContentKey key;
+  private final FileIO fileIO;
+  private final Map<String, String> catalogOptions;
+  private IcebergView icebergView;
+
+  NessieViewOperations(
+      ContentKey key,
+      NessieIcebergClient client,
+      FileIO fileIO,
+      Map<String, String> catalogOptions) {
+    this.key = key;
+    this.client = client;
+    this.fileIO = fileIO;
+    this.catalogOptions = catalogOptions;
+  }
+
+  @Override
+  public void doRefresh() {
+    try {
+      client.refresh();
+    } catch (NessieNotFoundException e) {
+      throw new RuntimeException(
+          String.format(
+              "Failed to refresh as ref '%s' is no longer valid.", client.getRef().getName()),
+          e);
+    }
+    String metadataLocation = null;
+    Reference reference = client.getRef().getReference();
+    try {
+      Content content = client.getApi().getContent().key(key).reference(reference).get().get(key);
+      LOG.debug("Content '{}' at '{}': {}", key, reference, content);
+      if (content == null) {
+        if (currentMetadataLocation() != null) {
+          throw new NoSuchViewException("View does not exist: %s in %s", key, reference);
+        }
+      } else {
+        this.icebergView =
+            content
+                .unwrap(IcebergView.class)
+                .orElseThrow(
+                    () -> {
+                      if (content instanceof IcebergTable) {
+                        return new AlreadyExistsException(
+                            "Table with same name already exists: %s in %s", key, reference);
+                      } else {
+                        return new IllegalStateException(
+                            String.format(
+                                "Cannot refresh Iceberg view: Nessie points to a non-Iceberg object for path: %s.",
+                                key));
+                      }
+                    });
+        metadataLocation = icebergView.getMetadataLocation();
+      }
+    } catch (NessieNotFoundException ex) {
+      if (currentMetadataLocation() != null) {
+        throw new NoSuchViewException("View does not exist: %s in %s", key, reference);
+      }
+    }
+    refreshFromMetadataLocation(metadataLocation, null, 2, l -> loadViewMetadata(l, reference));
+  }
+
+  private ViewMetadata loadViewMetadata(String metadataLocation, Reference reference) {
+    ViewMetadata metadata = ViewMetadataParser.read(io().newInputFile(metadataLocation));
+    Map<String, String> newProperties = Maps.newHashMap(metadata.properties());
+    newProperties.put(NessieTableOperations.NESSIE_COMMIT_ID_PROPERTY, reference.getHash());
+
+    return ViewMetadata.buildFrom(
+            ViewMetadata.buildFrom(metadata).setProperties(newProperties).build())

Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375089472


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java:
##
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.nessie;
+
+import java.util.Map;
+import java.util.concurrent.atomic.AtomicBoolean;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.NoSuchViewException;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.view.BaseViewOperations;
+import org.apache.iceberg.view.ViewMetadata;
+import org.apache.iceberg.view.ViewMetadataParser;
+import org.projectnessie.client.http.HttpClientException;
+import org.projectnessie.error.NessieBadRequestException;
+import org.projectnessie.error.NessieConflictException;
+import org.projectnessie.error.NessieNotFoundException;
+import org.projectnessie.model.Content;
+import org.projectnessie.model.ContentKey;
+import org.projectnessie.model.IcebergTable;
+import org.projectnessie.model.IcebergView;
+import org.projectnessie.model.Reference;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class NessieViewOperations extends BaseViewOperations {
+
+  private static final Logger LOG = LoggerFactory.getLogger(NessieViewOperations.class);
+
+  private final NessieIcebergClient client;
+
+  private final ContentKey key;
+  private final FileIO fileIO;
+  private final Map<String, String> catalogOptions;
+  private IcebergView icebergView;
+
+  NessieViewOperations(
+      ContentKey key,
+      NessieIcebergClient client,
+      FileIO fileIO,
+      Map<String, String> catalogOptions) {
+    this.key = key;
+    this.client = client;
+    this.fileIO = fileIO;
+    this.catalogOptions = catalogOptions;
+  }
+
+  @Override
+  public void doRefresh() {
+    try {
+      client.refresh();
+    } catch (NessieNotFoundException e) {
+      throw new RuntimeException(
+          String.format(
+              "Failed to refresh as ref '%s' is no longer valid.", client.getRef().getName()),
+          e);
+    }
+    String metadataLocation = null;
+    Reference reference = client.getRef().getReference();
+    try {
+      Content content = client.getApi().getContent().key(key).reference(reference).get().get(key);
+      LOG.debug("Content '{}' at '{}': {}", key, reference, content);
+      if (content == null) {
+        if (currentMetadataLocation() != null) {
+          throw new NoSuchViewException("View does not exist: %s in %s", key, reference);
+        }
+      } else {
+        this.icebergView =
+            content
+                .unwrap(IcebergView.class)
+                .orElseThrow(
+                    () -> {
+                      if (content instanceof IcebergTable) {
+                        return new AlreadyExistsException(
+                            "Table with same name already exists: %s in %s", key, reference);
+                      } else {
+                        return new IllegalStateException(
+                            String.format(
+                                "Cannot refresh Iceberg view: Nessie points to a non-Iceberg object for path: %s.",
+                                key));
+                      }
+                    });
+        metadataLocation = icebergView.getMetadataLocation();
+      }
+    } catch (NessieNotFoundException ex) {
+      if (currentMetadataLocation() != null) {
+        throw new NoSuchViewException("View does not exist: %s in %s", key, reference);
+      }
+    }
+    refreshFromMetadataLocation(metadataLocation, null, 2, l -> loadViewMetadata(l, reference));
+  }
+
+  private ViewMetadata loadViewMetadata(String metadataLocation, Reference reference) {
+    ViewMetadata metadata = ViewMetadataParser.read(io().newInputFile(metadataLocation));
+    Map<String, String> newProperties = Maps.newHashMap(metadata.properties());
+    newProperties.put(NessieTableOperations.NESSIE_COMMIT_ID_PROPERTY, reference.getHash());
+
+    return ViewMetadata.buildFrom(
+            ViewMetadata.buildFrom(metadata).setProperties(newProperties).build())

Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375090864


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java:
##
@@ -347,4 +348,54 @@ private TableIdentifier identifierWithoutTableReference(
   protected Map<String, String> properties() {
     return catalogOptions;
   }
+
+  @Override
+  protected ViewOperations newViewOps(TableIdentifier identifier) {
+    TableReference tr = parseTableReference(identifier);
+    return new NessieViewOperations(
+        ContentKey.of(
+            org.projectnessie.model.Namespace.of(identifier.namespace().levels()), tr.getName()),
+        client.withReference(tr.getReference(), tr.getHash()),
+        fileIO,
+        catalogOptions);
+  }
+
+  @Override
+  public List<TableIdentifier> listViews(Namespace namespace) {
+    return client.listViews(namespace);
+  }
+
+  @Override
+  public boolean dropView(TableIdentifier identifier) {
+    TableReference tableReference = parseTableReference(identifier);
+    return client
+        .withReference(tableReference.getReference(), tableReference.getHash())
+        .dropView(identifierWithoutTableReference(identifier, tableReference), false);
+  }
+
+  @Override
+  public void renameView(TableIdentifier from, TableIdentifier to) {
+    TableReference fromTableReference = parseTableReference(from);
+    TableReference toTableReference = parseTableReference(to);
+    String fromReference =
+        fromTableReference.hasReference()
+            ? fromTableReference.getReference()
+            : client.getRef().getName();
+    String toReference =
+        toTableReference.hasReference()
+            ? toTableReference.getReference()
+            : client.getRef().getName();
+    Preconditions.checkArgument(
+        fromReference.equalsIgnoreCase(toReference),
+        "from: %s and to: %s reference name must be same",

Review Comment:
   I think it would be good to make this error msg clearer
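   
   For example (hypothetical wording, just to illustrate the direction; only `fromReference` and `toReference` come from the quoted diff):
   ```java
   Preconditions.checkArgument(
       fromReference.equalsIgnoreCase(toReference),
       "Cannot rename view across references: source reference '%s' must match target reference '%s'",
       fromReference,
       toReference);
   ```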






Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375094428


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java:
##
@@ -347,4 +348,54 @@ private TableIdentifier identifierWithoutTableReference(
   protected Map<String, String> properties() {
     return catalogOptions;
   }
+
+  @Override
+  protected ViewOperations newViewOps(TableIdentifier identifier) {
+    TableReference tr = parseTableReference(identifier);
+    return new NessieViewOperations(
+        ContentKey.of(
+            org.projectnessie.model.Namespace.of(identifier.namespace().levels()), tr.getName()),
+        client.withReference(tr.getReference(), tr.getHash()),
+        fileIO,
+        catalogOptions);
+  }
+
+  @Override
+  public List<TableIdentifier> listViews(Namespace namespace) {
+    return client.listViews(namespace);
+  }
+
+  @Override
+  public boolean dropView(TableIdentifier identifier) {
+    TableReference tableReference = parseTableReference(identifier);
+    return client
+        .withReference(tableReference.getReference(), tableReference.getHash())
+        .dropView(identifierWithoutTableReference(identifier, tableReference), false);
+  }
+
+  @Override
+  public void renameView(TableIdentifier from, TableIdentifier to) {
+    TableReference fromTableReference = parseTableReference(from);
+    TableReference toTableReference = parseTableReference(to);
+    String fromReference =
+        fromTableReference.hasReference()
+            ? fromTableReference.getReference()
+            : client.getRef().getName();
+    String toReference =
+        toTableReference.hasReference()
+            ? toTableReference.getReference()
+            : client.getRef().getName();
+    Preconditions.checkArgument(
+        fromReference.equalsIgnoreCase(toReference),
+        "from: %s and to: %s reference name must be same",
+        fromReference,
+        toReference);
+
+    client
+        .withReference(fromTableReference.getReference(), fromTableReference.getHash())
+        .renameView(
+            identifierWithoutTableReference(from, fromTableReference),
+            NessieUtil.removeCatalogName(

Review Comment:
   any particular reason why `NessieUtil.removeCatalogName(..)` is only called 
on the `to` ref and not on the `from` ref?






Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375094889


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java:
##
@@ -136,15 +143,23 @@ private UpdateableReference loadReference(String requestedRef, String hash) {
   }
 
   public List<TableIdentifier> listTables(Namespace namespace) {
+    return listContents(namespace, Content.Type.ICEBERG_TABLE);
+  }
+
+  public List<TableIdentifier> listViews(Namespace namespace) {
+    return listContents(namespace, Content.Type.ICEBERG_VIEW);
+  }
+
+  private List<TableIdentifier> listContents(Namespace namespace, Content.Type type) {

Review Comment:
   ```suggestion
  private List<TableIdentifier> listContent(Namespace namespace, Content.Type type) {
   ```






Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375097325


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java:
##
@@ -355,21 +384,34 @@ public void renameTable(TableIdentifier from, TableIdentifier to) {
       // and removed by another.
       throw new RuntimeException(
           String.format(
-              "Cannot rename table '%s' to '%s': " + "ref '%s' no longer exists.",
-              from.name(), to.name(), getRef().getName()),
+              "Cannot rename %s '%s' to '%s': ref '%s' no longer exists.",
+              contentType, from, to, getRef().getName()),
           e);
     } catch (BaseNessieClientServerException e) {
       throw new CommitFailedException(
           e,
-          "Cannot rename table '%s' to '%s': " + "the current reference is not up to date.",
-          from.name(),
-          to.name());
+          "Cannot rename %s '%s' to '%s': the current reference is not up to date.",
+          contentType,
+          from,
+          to);
     } catch (HttpClientException ex) {
       // Intentionally catch all nessie-client-exceptions here and not just the "timeout" variant
       // to catch all kinds of network errors (e.g. connection reset). Network code implementation
       // details and all kinds of network devices can induce unexpected behavior. So better be
       // safe than sorry.
       throw new CommitStateUnknownException(ex);
+    } catch (NessieBadRequestException ex) {
+      if (ex.getMessage().contains("already exists with content ID")) {

Review Comment:
   same question: Is this an error msg that can be relied upon?






Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375097118


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java:
##
@@ -355,21 +384,34 @@ public void renameTable(TableIdentifier from, TableIdentifier to) {
       // and removed by another.
       throw new RuntimeException(
           String.format(
-              "Cannot rename table '%s' to '%s': " + "ref '%s' no longer exists.",
-              from.name(), to.name(), getRef().getName()),
+              "Cannot rename %s '%s' to '%s': ref '%s' no longer exists.",
+              contentType, from, to, getRef().getName()),
           e);
     } catch (BaseNessieClientServerException e) {
       throw new CommitFailedException(
           e,
-          "Cannot rename table '%s' to '%s': " + "the current reference is not up to date.",
-          from.name(),
-          to.name());
+          "Cannot rename %s '%s' to '%s': the current reference is not up to date.",
+          contentType,
+          from,
+          to);
     } catch (HttpClientException ex) {
       // Intentionally catch all nessie-client-exceptions here and not just the "timeout" variant
       // to catch all kinds of network errors (e.g. connection reset). Network code implementation
       // details and all kinds of network devices can induce unexpected behavior. So better be
       // safe than sorry.
       throw new CommitStateUnknownException(ex);
+    } catch (NessieBadRequestException ex) {
+      if (ex.getMessage().contains("already exists with content ID")) {
+        if (isView) {

Review Comment:
   I think something like this would be better here:
   ```
   String errorMsg = isView ? "." : "..";
   throw new AlreadyExistsException(errorMsg);
   ```
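   
   Concretely (hypothetical messages, filling in the placeholders only for illustration; `isView` and `to` are taken from the quoted diff):
   ```java
   String errorMsg =
       isView
           ? String.format("View already exists: %s", to)
           : String.format("Table already exists: %s", to);
   throw new AlreadyExistsException(errorMsg);
   ```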






Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375097962


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java:
##
@@ -378,27 +420,63 @@ public void renameTable(TableIdentifier from, TableIdentifier to) {
     // behavior. So better be safe than sorry.
   }
 
+  private void validateContentForRename(
+      TableIdentifier from,
+      TableIdentifier to,
+      boolean isView,
+      IcebergContent existingFromContent) {
+    if (existingFromContent == null) {
+      if (isView) {
+        throw new NoSuchViewException("View does not exist: %s", from);
+      } else {
+        throw new NoSuchTableException("Table does not exist: %s", from);
+      }
+    }
+
+    IcebergContent existingToContent = isView ? view(to) : table(to);
+    if (existingToContent != null) {
+      if (isView) {
+        throw new AlreadyExistsException("Cannot rename %s to %s. View already exists", from, to);
+      } else {
+        throw new AlreadyExistsException("Cannot rename %s to %s. Table already exists", from, to);
+      }
+    }
+  }
+
   public boolean dropTable(TableIdentifier identifier, boolean purge) {
+    return dropContent(identifier, purge, false);
+  }
+
+  public boolean dropView(TableIdentifier identifier, boolean purge) {
+    return dropContent(identifier, purge, true);
+  }
+
+  private boolean dropContent(TableIdentifier identifier, boolean purge, boolean isView) {

Review Comment:
   rather than passing a boolean flag, maybe just pass the content type?
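   
   A minimal sketch of that shape (hypothetical, not the committed code), reusing Nessie's `Content.Type` so the call sites document themselves:
   ```java
   public boolean dropTable(TableIdentifier identifier, boolean purge) {
     return dropContent(identifier, purge, Content.Type.ICEBERG_TABLE);
   }

   public boolean dropView(TableIdentifier identifier, boolean purge) {
     return dropContent(identifier, purge, Content.Type.ICEBERG_VIEW);
   }

   private boolean dropContent(TableIdentifier identifier, boolean purge, Content.Type type) {
     // the content type drives both the Nessie lookup and the error wording,
     // so no opaque boolean flag needs to be threaded through
     boolean isView = type == Content.Type.ICEBERG_VIEW;
     // ... existing drop logic, unchanged apart from using 'type'/'isView'
     return true;
   }
   ```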






Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375098386


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java:
##
@@ -540,4 +630,72 @@ public void close() {
       api.close();
     }
   }
+
+  public void commitView(
+      ViewMetadata base,
+      ViewMetadata metadata,
+      String newMetadataLocation,
+      String contentId,
+      ContentKey key)
+      throws NessieConflictException, NessieNotFoundException {
+    UpdateableReference updateableReference = getRef();
+
+    updateableReference.checkMutable();
+
+    Branch current = (Branch) updateableReference.getReference();
+    Branch expectedHead = current;
+    if (base != null) {
+      String metadataCommitId =
+          base.properties()
+              .getOrDefault(
+                  NessieTableOperations.NESSIE_COMMIT_ID_PROPERTY, expectedHead.getHash());
+      if (metadataCommitId != null) {
+        expectedHead = Branch.of(expectedHead.getName(), metadataCommitId);
+      }
+    }
+
+    long versionId = metadata.currentVersion().versionId();
+
+    ImmutableIcebergView.Builder newViewBuilder = ImmutableIcebergView.builder();
+    // Directly casting to `SQLViewRepresentation` as only SQL type exist in
+    // `ViewRepresentation.Type`.
+    // Assuming only one engine's dialect will be used, Nessie IcebergView currently holds one
+    // representation.
+    // View loaded from catalog will have all the representation as it parses the view metadata
+    // file.
+    SQLViewRepresentation sqlViewRepresentation =
+        (SQLViewRepresentation) metadata.currentVersion().representations().get(0);

Review Comment:
   it seems weird to only store one and not all of the representations






Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-10-27 Thread via GitHub


nastra commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375098623


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java:
##
@@ -135,71 +135,26 @@ protected void doCommit(TableMetadata base, TableMetadata metadata) {
     boolean newTable = base == null;
     String newMetadataLocation = writeNewMetadataIfRequired(newTable, metadata);
 
-    String refName = client.refName();
-    boolean failure = false;
+    AtomicBoolean failure = new AtomicBoolean(false);

Review Comment:
   why does this need to be `AtomicBoolean`?
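   
   (For context, the usual reason for this pattern, assuming the flag is now mutated inside a lambda passed to the commit logic: Java lambdas can only capture effectively final locals, so a plain `boolean` cannot be reassigned inside one, while a mutable holder such as `AtomicBoolean` can. A minimal standalone illustration:)
   ```java
   import java.util.concurrent.atomic.AtomicBoolean;

   public class CaptureExample {
     public static void main(String[] args) {
       // A plain local boolean reassigned inside the lambda would not compile,
       // because lambdas may only capture effectively final locals. A mutable
       // holder works: the reference stays final, its value does not.
       AtomicBoolean failure = new AtomicBoolean(false);
       Runnable commitAttempt = () -> failure.set(true);
       commitAttempt.run();
       System.out.println("failure = " + failure.get()); // prints: failure = true
     }
   }
   ```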






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


danielcweeks commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375098695


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitReadyPayload.java:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import java.util.List;
+import java.util.UUID;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+
+/**
+ * A control event payload for events sent by a worker that indicates it has finished sending all
+ * data for a commit request.
+ */
+public class CommitReadyPayload implements Payload {
+
+  private UUID commitId;
+  private List<TopicPartitionOffset> assignments;
+  private final Schema avroSchema;
+
+  private static final Schema AVRO_SCHEMA =
+  SchemaBuilder.builder()
+  .record(CommitReadyPayload.class.getName())
+  .fields()
+  .name("commitId")
+  .prop(AvroSchemaUtil.FIELD_ID_PROP, 1100)

Review Comment:
   I notice that we're using a wide range of field identifier values for the 
same fields.  Is this because we're trying to deconflict with other 
Payloads/Elements (e.g. 1100, 1200, 1300)?  I can see the need if the avro 
schema embeds another element (like `TableName`).  Just wondering if there's 
anything we would really need to be concerned about by having a more 
explicit/consistent mapping (e.g. `FIELD_ID_PROP = 1000` everywhere).






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


bryanck commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375099924


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitReadyPayload.java:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import java.util.List;
+import java.util.UUID;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+
+/**
+ * A control event payload for events sent by a worker that indicates it has finished sending all
+ * data for a commit request.
+ */
+public class CommitReadyPayload implements Payload {
+
+  private UUID commitId;
+  private List<TopicPartitionOffset> assignments;
+  private final Schema avroSchema;
+
+  private static final Schema AVRO_SCHEMA =
+  SchemaBuilder.builder()
+  .record(CommitReadyPayload.class.getName())
+  .fields()
+  .name("commitId")
+  .prop(AvroSchemaUtil.FIELD_ID_PROP, 1100)

Review Comment:
   I originally had it that way, with the same prop ID everywhere, as it isn't 
used for anything. I changed this to have unique field IDs to address one 
comment that having real field IDs would be helpful. I'm OK either way.






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


danielcweeks commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375100608


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/TableName.java:
##
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import static java.util.stream.Collectors.toList;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.avro.util.Utf8;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.TableIdentifier;
+
+public class TableName implements Element {

Review Comment:
   I'm a little late to the party here, but I like `TableReference`, which I don't believe is currently used anywhere.  Also, I think we might want to include the catalog name as well.  I feel that could be especially helpful with the CommitCompletePayload, since it can help disambiguate where commits are coming from if there are multiple catalogs.
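   
   For illustration, a minimal sketch of such an element (hypothetical shape only; the Avro schema plumbing that `Element` requires is omitted, and the field and method names are illustrative):
   ```java
   import java.util.List;
   import org.apache.iceberg.catalog.Namespace;
   import org.apache.iceberg.catalog.TableIdentifier;

   public class TableReference implements Element {
     private String catalog;          // catalog name, to disambiguate multi-catalog deployments
     private List<String> namespace;  // namespace levels of the table
     private String name;             // table name

     public TableIdentifier toIdentifier() {
       // reconstruct the Iceberg identifier from the stored parts
       return TableIdentifier.of(Namespace.of(namespace.toArray(new String[0])), name);
     }
   }
   ```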






Re: [PR] Spark 3.5: Honor Spark conf spark.sql.files.maxPartitionBytes in read split [iceberg]

2023-10-27 Thread via GitHub


jzhuge commented on PR #8922:
URL: https://github.com/apache/iceberg/pull/8922#issuecomment-1783584994

   The PR is ready for review.
   
   If approved, we will follow up with doc update and backports to 3.4, 3.3, 
etc.





Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


bryanck commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375104748


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/TableName.java:
##
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import static java.util.stream.Collectors.toList;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.avro.util.Utf8;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.TableIdentifier;
+
+public class TableName implements Element {

Review Comment:
   Yes, `TableReference` is great, I made this change. Currently the sink 
(which hasn't been submitted) only supports a single catalog, but I went ahead 
and made this change also.






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


danielcweeks commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375105601


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitCompletePayload.java:
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import java.util.UUID;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+
+/**
+ * A control event payload for events sent by a coordinator that indicates it has completed a commit
+ * cycle. Events with this payload are not consumed by the sink, they are informational and can be
+ * used by consumers to trigger downstream processes.
+ */
+public class CommitCompletePayload implements Payload {
+
+  private UUID commitId;
+  private Long validThroughTs;
+  private final Schema avroSchema;

Review Comment:
   Should we include any additional information in this (like a list of `TableName`s that were committed)?  I think for this to be useful for downstream consumers, it has surprisingly little info.  This might also be helpful because not all tables may get commits, right?
   
   Other information that could be useful (if available) would be stats (e.g. how many records were added, how many tables were updated).  We can always add more to this, so I think the list of tables would be a good starting point.






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


bryanck commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375106832


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitCompletePayload.java:
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import java.util.UUID;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+
+/**
+ * A control event payload for events sent by a coordinator that indicates it has completed a commit
+ * cycle. Events with this payload are not consumed by the sink, they are informational and can be
+ * used by consumers to trigger downstream processes.
+ */
+public class CommitCompletePayload implements Payload {
+
+  private UUID commitId;
+  private Long validThroughTs;
+  private final Schema avroSchema;

Review Comment:
   There is a separate message that is sent for each individual table commit; this message is sent after all of the individual table commit messages have been sent, as a sort of terminator.






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


danielcweeks commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375106869


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitCompletePayload.java:
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import java.util.UUID;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+
+/**
+ * A control event payload for events sent by a coordinator that indicates it has completed a commit
+ * cycle. Events with this payload are not consumed by the sink, they are informational and can be
+ * used by consumers to trigger downstream processes.
+ */
+public class CommitCompletePayload implements Payload {
+
+  private UUID commitId;
+  private Long validThroughTs;
+  private final Schema avroSchema;

Review Comment:
   Never mind, it looks like that info is part of `CommitTablePayload`.






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


danielcweeks commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375108167


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/TableName.java:
##
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import static java.util.stream.Collectors.toList;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.avro.util.Utf8;
+import org.apache.iceberg.catalog.Namespace;
+import org.apache.iceberg.catalog.TableIdentifier;
+
+public class TableName implements Element {

Review Comment:
   I was just thinking that there may be multiple sinks running for different catalogs, and if someone wants to process those events across multiple topics, it really helps to have the additional context.  Thanks!






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


danielcweeks commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375108603


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitReadyPayload.java:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import java.util.List;
+import java.util.UUID;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+
+/**
+ * A control event payload for events sent by a worker that indicates it has 
finished sending all
+ * data for a commit request.
+ */
+public class CommitReadyPayload implements Payload {
+
+  private UUID commitId;
+  private List assignments;
+  private final Schema avroSchema;
+
+  private static final Schema AVRO_SCHEMA =
+  SchemaBuilder.builder()
+  .record(CommitReadyPayload.class.getName())
+  .fields()
+  .name("commitId")
+  .prop(AvroSchemaUtil.FIELD_ID_PROP, 1100)

Review Comment:
   I'm fine with it, just wasn't sure if there was a reason behind the 
numbering.






Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


danielcweeks commented on PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#issuecomment-1783596341

   I'm a +1 on moving forward with this.  I think there might still be an open 
question about Iceberg/Avro Schema definitions, but I'm fine with either 
resolution.





Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]

2023-10-27 Thread via GitHub


rdblue commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375115365


##
kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitCompletePayload.java:
##
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.connect.events;
+
+import java.util.UUID;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaBuilder;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+
+/**
+ * A control event payload for events sent by a coordinator that indicates it has completed a commit
+ * cycle. Events with this payload are not consumed by the sink, they are informational and can be
+ * used by consumers to trigger downstream processes.
+ */
+public class CommitCompletePayload implements Payload {
+
+  private UUID commitId;
+  private Long validThroughTs;
+  private final Schema avroSchema;
+
+  private static final Schema AVRO_SCHEMA =
+  SchemaBuilder.builder()
+  .record(CommitCompletePayload.class.getName())
+  .fields()
+  .name("commitId")
+  .prop(AvroSchemaUtil.FIELD_ID_PROP, 1000)

Review Comment:
   ID 1000 is used for partition fields. Any time those are embedded in a `data_file`, the partition should use those IDs. Might want to start these values at 10,000 instead.
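   
   For illustration (a hypothetical declaration, not the schema in this PR), starting the control-event field IDs at 10,000 keeps them clear of that reserved range:
   ```java
   private static final Schema AVRO_SCHEMA =
       SchemaBuilder.builder()
           .record(CommitCompletePayload.class.getName())
           .fields()
           .name("commitId")
           .prop(AvroSchemaUtil.FIELD_ID_PROP, 10_000)
           // the real payload uses a UUID type; a string keeps the sketch simple
           .type().stringType().noDefault()
           .endRecord();
   ```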






Re: [PR] Core: Ignore split offsets array when split offset is past file length [iceberg]

2023-10-27 Thread via GitHub


amogh-jahagirdar commented on code in PR #8925:
URL: https://github.com/apache/iceberg/pull/8925#discussion_r1375117854


##
core/src/test/java/org/apache/iceberg/TestSplitPlanning.java:
##
@@ -216,6 +217,34 @@ public void testSplitPlanningWithOffsets() {
 "We should get one task per row group", 32, 
Iterables.size(scan.planTasks()));
   }
 
+  @Test
+  public void testSplitPlanningWithCorruptedOffsets() throws IOException {

Review Comment:
   Discussed offline: this test exercises the path but has poor test hardness since the offsets end up being null anyway.
   For proper test hardness we can go through TableScanUtil.planTaskGroups, which will call FileScanTask.split and exercise the split-offset array path. The new test will fail without the fix since the number of tasks will be different (without the fix there will be 2 tasks, 1 per split offset, but the last task is a bad task; with the fix, the split size of 1 byte will be used for the 10-byte file, and there will be 10 tasks).






Re: [PR] Core: Ignore split offsets array when split offset is past file length [iceberg]

2023-10-27 Thread via GitHub


amogh-jahagirdar commented on code in PR #8925:
URL: https://github.com/apache/iceberg/pull/8925#discussion_r1375118142


##
core/src/test/java/org/apache/iceberg/TestSplitPlanning.java:
##
@@ -216,6 +217,34 @@ public void testSplitPlanningWithOffsets() {
         "We should get one task per row group", 32, Iterables.size(scan.planTasks()));
   }
 
+  @Test
+  public void testSplitPlanningWithCorruptedOffsets() throws IOException {
+    List<DataFile> files16Mb = newFilesWithInvalidOffset(16, 16 * 1024 * 1024, 2);
+    appendFiles(files16Mb);
+
+    // There will be 4 tasks per file, instead of 2 tasks per file, since the offsets are invalid
+    // and will be ignored.
+    TableScan scan =
+        table
+            .newScan()
+            .option(TableProperties.SPLIT_OPEN_FILE_COST, String.valueOf(0))
+            .option(TableProperties.SPLIT_SIZE, String.valueOf(4L * 1024 * 1024));
+
+    Assert.assertEquals(

Review Comment:
   Done, this test is no longer applicable, but the new test is using AssertJ.






Re: [PR] Spec: add nanosecond timestamp types [iceberg]

2023-10-27 Thread via GitHub


jacobmarble commented on code in PR #8683:
URL: https://github.com/apache/iceberg/pull/8683#discussion_r1375124724


##
format/spec.md:
##
@@ -187,10 +189,11 @@ A **`map`** is a collection of key-value pairs with a key type and a value type.
 Notes:
 
 1. Decimal scale is fixed and cannot be changed by schema evolution. Precision can only be widened.
-2. All time and timestamp values are stored with microsecond precision.
-    - Timestamps _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical).
-    - Timestamps _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). Timestamp values are stored as a long that encodes microseconds from the unix epoch.
+2. `time`, `timestamp`, and `timestamptz` values are represented with _microsecond precision_. `timestamp_ns` and `timstamptz_ns` values are represented with _nanosecond precision_.
+    - Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical).
+    - Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`).
 3. Character strings must be stored as UTF-8 encoded byte arrays.
+4. `timestamp_ns` and `timstamptz_ns` are only supported in v3 tables.

Review Comment:
   Ah, I missed that detail. How's this?






Re: [PR] Spec: add nanosecond timestamp types [iceberg]

2023-10-27 Thread via GitHub


jacobmarble commented on code in PR #8683:
URL: https://github.com/apache/iceberg/pull/8683#discussion_r1375124724


##
format/spec.md:
##
@@ -187,10 +189,11 @@ A **`map`** is a collection of key-value pairs with a key type and a value type.
 Notes:
 
 1. Decimal scale is fixed and cannot be changed by schema evolution. Precision can only be widened.
-2. All time and timestamp values are stored with microsecond precision.
-    - Timestamps _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical).
-    - Timestamps _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). Timestamp values are stored as a long that encodes microseconds from the unix epoch.
+2. `time`, `timestamp`, and `timestamptz` values are represented with _microsecond precision_. `timestamp_ns` and `timstamptz_ns` values are represented with _nanosecond precision_.
+    - Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical).
+    - Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`).
 3. Character strings must be stored as UTF-8 encoded byte arrays.
+4. `timestamp_ns` and `timstamptz_ns` are only supported in v3 tables.

Review Comment:
   Ah, I missed that detail. How's this? https://github.com/apache/iceberg/pull/8683/commits/e81df5543b82ac30dd4bb7f38b22e58f43f71a51






Re: [I] HadoopCatalog can't list toplevel tables [iceberg]

2023-10-27 Thread via GitHub


github-actions[bot] commented on issue #7130:
URL: https://github.com/apache/iceberg/issues/7130#issuecomment-1783628604

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'





Re: [I] HadoopCatalog can't list toplevel tables [iceberg]

2023-10-27 Thread via GitHub


github-actions[bot] closed issue #7130: HadoopCatalog can't list toplevel tables
URL: https://github.com/apache/iceberg/issues/7130





Re: [I] Unable to use GlueCatalog in flink environments without hadoop [iceberg]

2023-10-27 Thread via GitHub


rajcoolguy commented on issue #3044:
URL: https://github.com/apache/iceberg/issues/3044#issuecomment-1783634807

   @mgmarino - how did you make this work in KDA? Given the lacking documentation, I keep landing in linkage errors due to the Hadoop jars, even though I have shaded and relocated the Hadoop classes.





Re: [PR] patch: Parquet Column Names with "Special Characters" fix [iceberg-python]

2023-10-27 Thread via GitHub


MarquisC commented on PR #109:
URL: https://github.com/apache/iceberg-python/pull/109#issuecomment-1783661783

   @Fokko thanks for the guidance and reference, I'll get things working with 
CI in a bit. Would I be able to have this assigned to me?





Re: [PR] Core: Use avro compression properties from table properties when writing manifests and manifest lists [iceberg]

2023-10-27 Thread via GitHub


wypoon commented on code in PR #6799:
URL: https://github.com/apache/iceberg/pull/6799#discussion_r1375148065


##
core/src/main/java/org/apache/iceberg/ManifestFiles.java:
##
@@ -157,11 +157,34 @@ public static ManifestWriter<DataFile> write(PartitionSpec spec, OutputFile outp
    */
   public static ManifestWriter<DataFile> write(
       int formatVersion, PartitionSpec spec, OutputFile outputFile, Long snapshotId) {
+    return write(formatVersion, spec, outputFile, snapshotId, null, null);
+  }
+
+  /**
+   * Create a new {@link ManifestWriter} for the given format version.
+   *
+   * @param formatVersion a target format version
+   * @param spec a {@link PartitionSpec}
+   * @param outputFile an {@link OutputFile} where the manifest will be written
+   * @param snapshotId a snapshot ID for the manifest entries, or null for an inherited ID
+   * @param compressionCodec compression codec for the manifest file
+   * @param compressionLevel compression level of the compressionCodec
+   * @return a manifest writer
+   */
+  public static ManifestWriter<DataFile> write(

Review Comment:
   @sumeetgajjar originally used a `Map` parameter for this, and @rdblue 
commented that "we don't want to pass a map of properties around. That's 
exposing too much where it doesn't need to be, and people tend to misuse 
generic arguments like this."
   What I propose to do, then, is to introduce a `ManifestWriter.Options` 
class and use that here (instead of a `Map`). I'll also introduce a 
`ManifestListWriter.Options` class and use that in `ManifestLists.write`. 
These `Options` classes define which additional parameters are applicable and 
may be set. If additional parameters are needed in the future, they can be 
added to these `Options` classes.
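   For illustration only, here is a minimal sketch of what such a 
`ManifestWriter.Options` class could look like. The class name matches the 
proposal above, but the fields, builder-style setters, and accessors are my 
assumptions, not a final API:
   ```java
// Hypothetical sketch of an Options class nested inside ManifestWriter;
// field names and setters are illustrative assumptions only.
public static class Options {
  private String compressionCodec; // null: fall back to the table property
  private Integer compressionLevel; // null: use the codec's default

  public Options compressionCodec(String codec) {
    this.compressionCodec = codec;
    return this;
  }

  public Options compressionLevel(Integer level) {
    this.compressionLevel = level;
    return this;
  }

  String compressionCodec() {
    return compressionCodec;
  }

  Integer compressionLevel() {
    return compressionLevel;
  }
}
   ```
   A caller could then pass something like `new 
ManifestWriter.Options().compressionCodec("zstd").compressionLevel(3)` instead 
of a loosely typed `Map`, which keeps the set of accepted parameters explicit 
and easy to extend.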






Re: [PR] added contributing.md file [iceberg-python]

2023-10-27 Thread via GitHub


onemriganka commented on PR #102:
URL: https://github.com/apache/iceberg-python/pull/102#issuecomment-1783691176

   > @onemriganka What do you think of pointing to 
https://py.iceberg.apache.org/contributing/? Otherwise we have to maintain this 
at two places
   
   Yes, https://py.iceberg.apache.org/contributing/ is great. But I think 
beginners usually only look at files like README.md on GitHub, so having a 
CONTRIBUTING.md file in the repo makes it easier for them to understand the 
project and contribute. Professionals check a project's website, but many 
beginners do not.





Re: [PR] Core: Enable column statistics filtering after planning [iceberg]

2023-10-27 Thread via GitHub


stevenzwu commented on code in PR #8803:
URL: https://github.com/apache/iceberg/pull/8803#discussion_r1375163433


##########
core/src/main/java/org/apache/iceberg/BaseFile.java:
##########
@@ -504,6 +508,27 @@ private static Map<Integer, ByteBuffer> toReadableByteBufferMap(Map<Integer, By
+  private static <TypeT> Map<Integer, TypeT> filterColumnsStats(
+      Map<Integer, TypeT> map, Set<Integer> columnIds) {
+    if (columnIds == null || columnIds.isEmpty()) {
+      return SerializableMap.copyOf(map);
+    }
+
+    if (map == null) {
+      return null;
+    }
+
+    Map<Integer, TypeT> filtered = Maps.newHashMapWithExpectedSize(columnIds.size());
+    for (Integer columnId : columnIds) {
+      TypeT value = map.get(columnId);
+      if (value != null) {

Review Comment:
   Thanks. Double fetching is probably OK, since `columnsToKeepStats` is 
likely not big and a `Map` get is O(1) anyway.
   
   The alternative is to start a discussion and revisit the spec or the 
Parquet implementation.






Re: [PR] Spec: add nanosecond timestamp types [iceberg]

2023-10-27 Thread via GitHub


jacobmarble commented on PR #8683:
URL: https://github.com/apache/iceberg/pull/8683#issuecomment-1783700784

   > @jacobmarble indeed, thanks! do we expect any type promotions being 
allowed around the new types being added here?
   
   @findepi that came up in the last community meeting. We agreed to scope 
this PR and #8657 to just adding the type, and to treat type promotion as a 
separate issue.





Re: [I] Flink: Add support for Flink 1.18 [iceberg]

2023-10-27 Thread via GitHub


YesOrNo828 commented on issue #8930:
URL: https://github.com/apache/iceberg/issues/8930#issuecomment-1783712015

   The list of supported Flink versions for each connector:
   https://flink.apache.org/downloads/#apache-flink-connectors
   





Re: [I] Spark write abort result in table miss metadata location file [iceberg]

2023-10-27 Thread via GitHub


dyno commented on issue #8927:
URL: https://github.com/apache/iceberg/issues/8927#issuecomment-1783720979

   Probably the error message refers only to the data files and manifest 
files, not the metadata location file? I am sure the metadata location file 
was gone, and the S3 access log shows that the deletion was initiated by the 
same cluster at around the same time.





Re: [I] Flink: Add support for Flink 1.18 [iceberg]

2023-10-27 Thread via GitHub


pvary commented on issue #8930:
URL: https://github.com/apache/iceberg/issues/8930#issuecomment-1783722052

   > > then we will have a transitive dependency on Flink 1.18. We can exclude 
the dependency, but how can we make sure that everything works as expected 
without running the tests.
   > 
   > You're right. There'll be a transitive dependency. If Flink1.17's APIs are 
incompatible with Flink1.18, I think we can copy some tests into the Flink1.17 
module to ensure it works fine and add end-to-end tests to guarantee 
compatibility with different Flink versions
   
   I think the issue here is that we cannot know in advance which dependencies 
need to be copied over.
   
   I tried to upgrade to Flink 1.18 yesterday and found that there is a Flink 
change which causes the TestFlinkMetaDataTable#testAllFilesPartitioned test to 
fail. See: 
https://apache-flink.slack.com/archives/C03G7LJTS2G/p1698402761715879?thread_ts=1698402761.715879&cid=C03G7LJTS2G
   
   One can argue that these issues are found at upgrade time, but if we add 
features between migrations, then we will only find them if we run all of the 
tests.
   
   So I think running all of the tests on every supported Flink version is a 
must.
   
   > > As an alternative we can move these classes around every time when it is 
needed - but this approach has its own maintenance costs.
   > 
   > As of Flink 1.18, released a few days ago, the Flink community has 
externalized the [JDBC](https://github.com/apache/flink-connector-jdbc), 
[ES](https://github.com/apache/flink-connector-elasticsearch), 
[Kafka](https://github.com/apache/flink-connector-kafka), 
[pulsar](https://github.com/apache/flink-connector-pulsar), 
[HBase](https://github.com/apache/flink-connector-hbase), and so on connectors. 
That means the Flink API has become more compatible. So I think this 
maintenance cost between different Flink versions is acceptable.
   
   I am currently working on extending the Flink Sink API to provide features 
needed for Iceberg. See:
   - 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-371%3A+Provide+initialization+context+for+Committer+creation+in+TwoPhaseCommittingSink
   - 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-372%3A+Allow+TwoPhaseCommittingSink+WithPreCommitTopology+to+alter+the+type+of+the+Committable
   
   I would like to use them ASAP, and this will cause differences between the 
1.18/1.19 versions of the Iceberg connector. If we have a flink-common module, 
then until 1.18 this code should live in common, after 1.19 it should move to 
the version-specific code, and after we drop support for 1.18 it will move 
back to the common code again. We would also need to come up with appropriate 
abstractions to separate the changing code from the fixed code. These issues 
might be even more pronounced when Flink 2.0 comes out, for which the 
discussion has already started.
   
   IMHO these are more complex tasks than simply backporting the changes to 
the other branches. Maybe that is why the Iceberg-Spark code is handled in the 
same way.
   
   
   
   





Re: [I] Spark write abort result in table miss metadata location file [iceberg]

2023-10-27 Thread via GitHub


dyno commented on issue #8927:
URL: https://github.com/apache/iceberg/issues/8927#issuecomment-1783724655

   Or maybe the main problem is that Iceberg should not update the metadata 
location in the Hive metastore before the write has actually completed. The 
symptom is that the write failed but the metadata location was updated, so 
even if the metadata location file were not deleted, the table would not be in 
a consistent state.
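   To make the ordering concern concrete, here is a minimal sketch of a 
"pointer swap last" commit protocol, where the metastore pointer is only 
updated after the new metadata file is fully written, with a check-and-set 
guarding against concurrent committers. This is illustrative only: the 
`MetadataPointerStore` interface and all names are assumptions and do not 
mirror Iceberg's actual `HiveTableOperations` code.
   ```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative "pointer swap last" commit sketch; all names here are
// assumptions, not Iceberg's actual HiveTableOperations implementation.
class PointerCommitSketch {
  interface MetadataPointerStore {
    String getMetadataLocation(String table);

    void setMetadataLocation(String table, String location);
  }

  private final ReentrantLock lock = new ReentrantLock();
  private final MetadataPointerStore store;

  PointerCommitSketch(MetadataPointerStore store) {
    this.store = store;
  }

  // newLocation must already be fully written and durable before this is
  // called; the pointer swap below is the only step that makes it visible.
  void commit(String table, String expectedLocation, String newLocation) {
    lock.lock();
    try {
      String current = store.getMetadataLocation(table);
      // Check-and-set: refuse to commit over a concurrent update.
      if (!current.equals(expectedLocation)) {
        throw new IllegalStateException(
            "Concurrent update: expected " + expectedLocation + ", found " + current);
      }
      store.setMetadataLocation(table, newLocation); // the commit point
    } finally {
      lock.unlock();
    }
  }
}
   ```
   Under this ordering, an aborted write leaves the pointer at the old 
metadata file, so a failed job cannot leave the table referencing a metadata 
location that never became durable.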

