Re: [I] EMR 6.10.0 Cannot migrate a table from a non-Iceberg Spark Session Catalog. Found spark_catalog [iceberg]
tomtongue commented on issue #7317: URL: https://github.com/apache/iceberg/issues/7317#issuecomment-1782418606 Sorry for jumping in. I personally investigated the migrate query issue for GlueCatalog, so let me share my findings. ## Result Currently, it is NOT possible to run a `migrate` query for Spark/Hive tables in Glue Data Catalog. The reason is that the GlueCatalog client does not currently support renaming tables. Let me elaborate below. If I'm wrong, please correct me. ## Details When running the `migrate` query for a Spark/Hive table in Glue Data Catalog, as described above, the `SparkSessionCatalog` configuration should be specified like `.config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")`. In this case, the source table in the `migrate` query, like `table => '$db.$table'`, is always resolved in `spark_catalog` (if another catalog is specified, the Spark application will fail). Because of this, in the current design of `migrate`, the code path always goes through [`SparkSessionCatalog.renameTable`](https://github.com/apache/iceberg/blob/apache-iceberg-1.4.1/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSessionCatalog.java#L293): per its specification, `migrate` creates a staging table, renames the source table to keep it as a backup, and then migrates the source table to Iceberg. After the migration, the backup table is dropped or kept depending on the `drop_backup` parameter. When renaming the source table to keep the backup, `SparkSessionCatalog.renameTable` is called. `SparkSessionCatalog.renameTable` can delegate to the Iceberg catalog to rename a table in GlueCatalog: the method checks the source table, and if it is an Iceberg table, it calls `IcebergCatalog.renameTable` (the GlueCatalog implementation is configured as the Iceberg catalog here, so its `renameTable` would be used).
However, in this case, the source table always belongs to `spark_catalog`, so the code path always goes to `getSessionCatalog().renameTable`, as follows: https://github.com/apache/iceberg/blob/apache-iceberg-1.4.1/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSessionCatalog.java#L293

```java
@Override
public void renameTable(Identifier from, Identifier to)
    throws NoSuchTableException, TableAlreadyExistsException {
  // rename is not supported by HadoopCatalog. to avoid UnsupportedOperationException for session
  // catalog tables,
  // check table existence first to ensure that the table belongs to the Iceberg catalog.
  if (icebergCatalog.tableExists(from)) {
    icebergCatalog.renameTable(from, to);
  } else {
    getSessionCatalog().renameTable(from, to); // <= THIS PATH
  }
}
```

`getSessionCatalog().renameTable` calls Hive APIs for the table in Glue Data Catalog, so it fails with a rename error. Here's the detailed call flow (in Iceberg 1.4.1 with Spark 3.5):

1. https://github.com/apache/iceberg/blob/apache-iceberg-1.4.1/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/MigrateTableProcedure.java#L76 -> calls `MigrateTableSparkAction`
2. https://github.com/apache/iceberg/blob/apache-iceberg-1.4.1/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/MigrateTableSparkAction.java#L118 -> the actual migration implementation
   1. `renameAndBackupSourceTable()` is called to keep the backup
   2. https://github.com/apache/iceberg/blob/apache-iceberg-1.4.1/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/MigrateTableSparkAction.java#L209 -> `renameAndBackupSourceTable` calls `destCatalog().renameTable(...)`. But `destCatalog()` is defined by `this.destCatalog = checkDestinationCatalog(sourceCatalog);` at L66 in the same class.
3. https://github.com/apache/iceberg/blob/apache-iceberg-1.4.1/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSessionCatalog.java#L293 -> the `destCatalog` is the `SparkSessionCatalog` from step 2, so `getSessionCatalog().renameTable` is called.

## Resolution

If GlueCatalog's `renameTable` can be used to keep the backup table, it is possible to run `migrate`. To resolve this, for example, a new option could be added to specify the destination catalog. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
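The dispatch described above can be sketched as a minimal, self-contained simulation. The class and method names here are hypothetical Python stand-ins for the Java code, not real Iceberg APIs:

```python
# Simulation of the SparkSessionCatalog.renameTable dispatch shown above.
# All names are illustrative; this is not the actual Iceberg implementation.

class IcebergGlueCatalog:
    """Plays the role of the configured Iceberg catalog (GlueCatalog)."""

    def __init__(self, tables):
        self.tables = set(tables)  # identifiers the Iceberg catalog knows about

    def table_exists(self, ident):
        return ident in self.tables

    def rename_table(self, src, dst):
        self.tables.remove(src)
        self.tables.add(dst)


class HiveSessionCatalog:
    """Plays the role of getSessionCatalog(): rename goes through Hive APIs,
    which fail against Glue Data Catalog."""

    def rename_table(self, src, dst):
        raise RuntimeError("rename is not supported for Glue Data Catalog tables")


def rename(iceberg_catalog, session_catalog, src, dst):
    # Same branch as SparkSessionCatalog.renameTable: the source table of a
    # migrate is not an Iceberg table yet, so the session-catalog path is taken.
    if iceberg_catalog.table_exists(src):
        iceberg_catalog.rename_table(src, dst)
    else:
        session_catalog.rename_table(src, dst)  # <= path taken during migrate


iceberg = IcebergGlueCatalog(tables=[])  # the source Spark/Hive table is not Iceberg
session = HiveSessionCatalog()
try:
    rename(iceberg, session, "db.tbl", "db.tbl_BACKUP_")
except RuntimeError as e:
    print(f"migrate backup step fails: {e}")
```

Because the source table never exists in the Iceberg catalog at backup time, the else-branch is always taken, which is exactly the failure reported in this issue.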
Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]
ajantha-bhat commented on code in PR #8918: URL: https://github.com/apache/iceberg/pull/8918#discussion_r137451 ## hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java: ## @@ -106,12 +152,12 @@ private Schema getTestSchema() { public void testCreateTableBuilder() throws Exception { Schema schema = getTestSchema(); PartitionSpec spec = PartitionSpec.builderFor(schema).bucket("data", 16).build(); -TableIdentifier tableIdent = TableIdentifier.of(DB_NAME, "tbl"); +TableIdentifier tableIdent = TableIdentifier.of(HIVE_DB_NAME, "tbl"); String location = temp.resolve("tbl").toString(); try { Table table = - catalog + catalog() Review Comment: If you keep a local variable `HiveCatalog catalog` and initialize it in `before()`, you don't have to change to a method call everywhere. ## hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java: ## @@ -82,12 +83,20 @@ import org.apache.iceberg.types.Types; import org.apache.iceberg.util.JsonUtil; import org.apache.thrift.TException; +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.BeforeEach; import org.junit.jupiter.api.Test; import org.junit.jupiter.api.io.TempDir; import org.junit.jupiter.params.ParameterizedTest; import org.junit.jupiter.params.provider.ValueSource; -public class TestHiveCatalog extends HiveMetastoreTest { +/** + * Run all the tests from abstract of {@link CatalogTests}. Also, a few specific tests for HIVE too. + * There could be some duplicated tests that are already being covered with {@link CatalogTests} + * //TODO: remove duplicate tests with {@link CatalogTests}.Also use the DB/TABLE/SCHEMA from {@link Review Comment: Can you please check and remove in this PR only? I don't think it should be a separate change. ## hive-metastore/src/test/java/org/apache/iceberg/hive/HiveMetastoreSetup.java: ## @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements.
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.hive; + +import java.util.Map; +import java.util.concurrent.TimeUnit; +import org.apache.hadoop.hive.conf.HiveConf; +import org.apache.hadoop.hive.metastore.HiveMetaStoreClient; +import org.apache.iceberg.CatalogProperties; +import org.apache.iceberg.CatalogUtil; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; + +/** + * Setup HiveMetastore. It does not create any database. All the tests should create a database + * accordingly. it should replace the existing setUp class {@link HiveMetastoreTest} + */ +class HiveMetastoreSetup { Review Comment: We cannot deprecate `HiveMetastoreTest` as it is used by many other classes. I am not sure what the problem with that class is. The methods we want are static, so we can call all those methods in beforeAll and afterAll, right? I still think there is no need for this class.
## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java: ## @@ -261,6 +261,12 @@ public void renameTable(TableIdentifier from, TableIdentifier originalTo) { } catch (InterruptedException e) { Thread.currentThread().interrupt(); throw new RuntimeException("Interrupted in call to rename", e); +} catch (RuntimeException e) { Review Comment: So, this change is needed now because `CatalogTests` is expecting that particular type of message?
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
ajantha-bhat commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1374227697 ## hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveViewCatalog.java: ## @@ -0,0 +1,197 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package org.apache.iceberg.hive; + +import static org.assertj.core.api.Assertions.assertThat; +import static org.assertj.core.api.Assertions.assertThatThrownBy; + +import java.util.Collections; +import org.apache.iceberg.Transaction; +import org.apache.iceberg.catalog.Catalog; +import org.apache.iceberg.catalog.TableIdentifier; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.view.ViewCatalogTests; +import org.assertj.core.api.Assumptions; +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; + +public class TestHiveViewCatalog extends ViewCatalogTests { + + private HiveMetastoreSetup hiveMetastoreSetup; + + @BeforeEach + public void before() throws Exception { +hiveMetastoreSetup = new HiveMetastoreSetup(Collections.emptyMap()); + } + + @AfterEach + public void after() throws Exception { +hiveMetastoreSetup.stopMetastore(); + } + + @Override + protected HiveCatalog catalog() { +return hiveMetastoreSetup.catalog; + } + + @Override + protected Catalog tableCatalog() { +return hiveMetastoreSetup.catalog; + } + + @Override + protected boolean requiresNamespaceCreate() { +return true; + } + + // Override few tests which are using AlreadyExistsException instead of NoSuchViewException Review Comment: I have some of this problem for Nessie catalog extending `ViewCatalogTests`. Should we change ViewCatalogTests to accept multiple options like I did for Nessie (https://github.com/apache/iceberg/pull/8909/files#diff-9c8b69f86aec86aea82cd6675c1c24267d9a9d77d7715c2c792934cb954b328bR403-R409) or generalize the code to throw expected exception? 
## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java: ## @@ -264,6 +279,138 @@ public void renameTable(TableIdentifier from, TableIdentifier originalTo) { } } + @Override + public boolean viewExists(TableIdentifier identifier) { +return HiveCatalogUtil.isTableWithTypeExists(clients, identifier, TableType.VIRTUAL_VIEW); + } + + @Override + public boolean dropView(TableIdentifier identifier) { +if (!isValidIdentifier(identifier)) { + return false; +} +try { + String database = identifier.namespace().level(0); + String viewName = identifier.name(); + Table table = clients.run(client -> client.getTable(database, viewName)); + HiveCatalogUtil.validateTableIsIcebergView(table, fullTableName(name, identifier)); + clients.run( + client -> { +client.dropTable(database, viewName); +return null; + }); + LOG.info("Dropped View: {}", identifier); + return true; + +} catch (NoSuchViewException | NoSuchObjectException e) { + LOG.info("Skipping drop, View does not exist: {}", identifier, e); + return false; +} catch (TException e) { + throw new RuntimeException("Failed to drop " + identifier, e); +} catch (InterruptedException e) { + Thread.currentThread().interrupt(); + throw new RuntimeException("Interrupted in call to dropView", e); +} + } + + @Override + public List listViews(Namespace namespace) { Review Comment: Can we just pass TableType and reduce code duplication like I did for Nessie? https://github.com/apache/iceberg/pull/8909/files#diff-4abd6d90c69587709bc754be9ea305080395bad9ac778176a0e3ab2563ac8464R145 Similar comment for all the view related APIs ## hive-metastore/src/test/java/org/apache/iceberg/hive/HiveMetastoreSetup.java: ## @@ -0,0 +1,73 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "
[I] Slow RewriteManifests due to Validation of Manifest Entries [iceberg]
mirageyjd opened a new issue, #8932: URL: https://github.com/apache/iceberg/issues/8932 ### Apache Iceberg version 0.13.1 ### Query engine Spark ### Please describe the bug 🐞 We ran the `BaseRewriteManifestsSparkAction` action on a large table with 7k+ manifests in Spark, and it unexpectedly took more than an hour. The most time-consuming step is validating that each manifest entry in the added manifests has a snapshot ID, which is not executed in a distributed manner. Without the validation, the entire action takes less than 2 minutes. I wonder whether it is necessary to validate the snapshot ID of each manifest entry in manifests written by `BaseRewriteManifestsSparkAction`. It would be better if such validation were optional and could be skipped in `BaseRewriteManifestsSparkAction`.
Re: [PR] Spec: add nanosecond timestamp types [iceberg]
findepi commented on PR #8683: URL: https://github.com/apache/iceberg/pull/8683#issuecomment-1782599430 > Now that we have separate types, we can use the type to carry that information, so that you can promote a `long` to `timestamp_ms` or `timestamp` and we know how to interpret the value. That's true as long as type promotion happens only once. What should happen when `timestamp_ms` is promoted to `timestamp`? If it wasn't a `long` before, such promotion should be permitted. If it was a `long` before... how do we handle this case?
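A small illustration of why the stored value's unit must travel with the type (hypothetical epoch value, stdlib only):

```python
# The same stored long denotes different instants depending on whether it is
# interpreted as milliseconds or microseconds from the Unix epoch. If a column
# went long -> timestamp_ms -> timestamp, readers must still know the original
# unit to interpret values in old data files correctly.
from datetime import datetime, timezone

raw = 1_700_000_000_000  # a long column value; the unit is ambiguous on its own

as_millis = datetime.fromtimestamp(raw / 1_000, tz=timezone.utc)      # read as ms
as_micros = datetime.fromtimestamp(raw / 1_000_000, tz=timezone.utc)  # read as us

print(as_millis.isoformat(), as_micros.isoformat())
```

The two readings land decades apart, which is the ambiguity the promotion rules have to resolve.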
Re: [I] Iceberg streaming streaming-skip-overwrite-snapshots SparkMicroBatchStream only skips over one file per trigger [iceberg]
cccs-jc commented on issue #8902: URL: https://github.com/apache/iceberg/issues/8902#issuecomment-1782660563 @singhpk234 do you know why `existingFilesCount` is added to the count? It seems like it should only add `addedFilesCount`. https://github.com/apache/iceberg/blob/e2b56daf35724700a9b57dbeee5fe23f99c592c4/core/src/main/java/org/apache/iceberg/MicroBatches.java#L95 In the test cases there is no snapshot with multiple manifests, so I'm not 100% sure whether it's correct or not.
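To make the question concrete, here is a toy tally (made-up numbers) contrasting the two ways of sizing a micro-batch:

```python
# Toy manifests with made-up counts. The linked MicroBatches code sums both
# added and existing entries; the question above is whether only added
# entries should count toward the batch size.
manifests = [
    {"addedFilesCount": 3, "existingFilesCount": 5},
    {"addedFilesCount": 2, "existingFilesCount": 0},
]

with_existing = sum(m["addedFilesCount"] + m["existingFilesCount"] for m in manifests)
added_only = sum(m["addedFilesCount"] for m in manifests)

print(with_existing, added_only)  # the two sizings diverge once files are rewritten
```

The two totals only differ for snapshots whose manifests carry existing (carried-over) entries, which matches the observation that the single-manifest test cases cannot distinguish them.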
Re: [I] Fix typo in `_primitive_to_phyisical` [iceberg-python]
whisk commented on issue #107: URL: https://github.com/apache/iceberg-python/issues/107#issuecomment-1782706542 Hi @Fokko, please consider PR #108
Re: [I] Slow RewriteManifests due to Validation of Manifest Entries [iceberg]
RussellSpitzer commented on issue #8932: URL: https://github.com/apache/iceberg/issues/8932#issuecomment-1782759994 Do you have a flame graph or some other evidence of this? My gut says that would be a trivial check.
Re: [PR] feat: support ser/deser of value [iceberg-rust]
ZENOTME commented on code in PR #82: URL: https://github.com/apache/iceberg-rust/pull/82#discussion_r1374447735 ## crates/iceberg/src/avro/schema.rs: ## @@ -30,6 +32,11 @@ use itertools::{Either, Itertools}; use serde_json::{Number, Value}; const FILED_ID_PROP: &str = "field-id"; +const UUID_BYTES: usize = 16; +const UUID_LOGICAL_TYPE: &str = "uuid"; +// # TODO +// This const may better to maintain in avro-rs. Review Comment: I have added #86 to track them.
Re: [I] Flink: Add support for Flink 1.18 [iceberg]
YesOrNo828 commented on issue #8930: URL: https://github.com/apache/iceberg/issues/8930#issuecomment-1782883251 > then we will have a transitive dependency on Flink 1.18. We can exclude the dependency, but how can we make sure that everything works as expected without running the tests. You're right, there'll be a transitive dependency. If Flink 1.17's APIs are incompatible with Flink 1.18, I think we can copy some tests into the Flink 1.17 module to ensure it works fine, and add end-to-end tests to guarantee compatibility with different Flink versions. > As an alternative we can move these classes around every time when it is needed - but this approach has its own maintenance costs. As of Flink 1.18, released a few days ago, the Flink community has externalized the [JDBC](https://github.com/apache/flink-connector-jdbc), [ES](https://github.com/apache/flink-connector-elasticsearch), [Kafka](https://github.com/apache/flink-connector-kafka), [Pulsar](https://github.com/apache/flink-connector-pulsar), [HBase](https://github.com/apache/flink-connector-hbase), and other connectors. That means the Flink API has become more stable across versions. So I think this maintenance cost between different Flink versions is acceptable.
Re: [PR] Iceberg 1.3.0 jc streaming [iceberg]
cccs-jc closed pull request #8934: Iceberg 1.3.0 jc streaming URL: https://github.com/apache/iceberg/pull/8934
Re: [PR] Iceberg 1.3.0 jc streaming [iceberg]
cccs-jc commented on PR #8934: URL: https://github.com/apache/iceberg/pull/8934#issuecomment-1782901768 not ready
[PR] patch: Parquet Column Names with "Special Characters" fix [iceberg-python]
MarquisC opened a new pull request, #109: URL: https://github.com/apache/iceberg-python/pull/109 We're using PyIceberg to read Iceberg tables stored in S3 as Parquet. We have column names in the form of `id:foo` and `diagnostic:bar`, using `:` as a sort of delimiter to help us do some programmatic maintenance on our side. In Parquet, these column names are silently substituted, in this case `:` -> `_x3A`, and upon attempting to scan/read the data, the schema of the table doesn't match the physical column names for PyArrow. This first pass is a naive fix that I have tested and which works, but I'm looking for guidance on where you all want me to put this logic, and I'm happy to add it there instead.
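A rough sketch of the substitution described above (the `_xNN` escape format follows the report; `sanitize`/`desanitize` are hypothetical helper names, not PyIceberg APIs):

```python
import re

def sanitize(name: str) -> str:
    # Replace characters that are not alphanumeric or '_' with an _xNN hex
    # escape, e.g. ':' (0x3A) -> '_x3A', as described in the PR.
    return "".join(
        ch if (ch.isalnum() or ch == "_") else f"_x{ord(ch):02X}" for ch in name
    )

def desanitize(name: str) -> str:
    # Inverse mapping. Note it would also rewrite a literal '_x3A' that was
    # never an escape, which is why a schema-driven mapping is safer than
    # string rewriting alone.
    return re.sub(r"_x([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), name)

print(sanitize("id:foo"), desanitize(sanitize("diagnostic:bar")))
```

Matching the table schema against the physical Parquet names then amounts to comparing `sanitize(logical_name)` with the column name stored in the file.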
Re: [PR] patch: Parquet Column Names with "Special Characters" fix [iceberg-python]
mchamberlain-mdsol commented on PR #109: URL: https://github.com/apache/iceberg-python/pull/109#issuecomment-1783097367 Exception example (screenshot): https://github.com/apache/iceberg-python/assets/110425760/74a7f812-333b-45ee-861c-cf581a99f3d3
Re: [PR] Spark: Avoid extra copies of manifests while optimizing V2 tables [iceberg]
singhpk234 commented on code in PR #8928: URL: https://github.com/apache/iceberg/pull/8928#discussion_r1374737812 ## core/src/test/java/org/apache/iceberg/TestRewriteManifests.java: ## @@ -443,6 +444,14 @@ public void testBasicManifestReplacement() throws IOException { List manifests = snapshot.allManifests(table.io()); Assert.assertEquals(3, manifests.size()); +if (formatVersion == 1) { + assertThat(manifests.get(0).path()).isNotEqualTo(firstNewManifest.path()); + assertThat(manifests.get(1).path()).isNotEqualTo(secondNewManifest.path()); +} else { + assertThat(manifests.get(0).path()).isEqualTo(firstNewManifest.path()); + assertThat(manifests.get(1).path()).isEqualTo(secondNewManifest.path()); +} Review Comment: [minor] A comment might be helpful. Shouldn't we check whether format = 1 and snapshot ID inheritance is enabled?
Re: [PR] Fix Migrate procedure renaming issue for custom catalog [iceberg]
singhpk234 commented on code in PR #8931: URL: https://github.com/apache/iceberg/pull/8931#discussion_r1374769692 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/MigrateTableSparkAction.java: ## @@ -108,6 +109,23 @@ public MigrateTableSparkAction backupTableName(String tableName) { return this; } + @Override + public MigrateTableSparkAction destCatalogName(String catalogName) { +CatalogManager catalogManager = spark().sessionState().catalogManager(); + +CatalogPlugin catalogPlugin; +if (catalogManager.isCatalogRegistered(catalogName)) { + catalogPlugin = catalogManager.catalog(catalogName); +} else { + LOG.warn( + "{} doesn't exist in SparkSession. " + "Fallback to current SparkSession catalog.", + catalogName); + catalogPlugin = catalogManager.currentCatalog(); +} +this.destCatalog = checkDestinationCatalog(catalogPlugin); Review Comment: QQ: earlier we used sourceCatalog as destCatalog too; was this a problem? Can you please add more comments on why sourceCatalog was picked as `spark_catalog` rather than `glue_catalog`, since we were calling the migrate procedure from glue_catalog? ## spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMigrateTableProcedure.java: ## @@ -142,6 +142,48 @@ public void testMigrateWithBackupTableName() throws IOException { Assertions.assertThat(spark.catalog().tableExists(dbName + "." + backupTableName)).isTrue(); } + @Test + public void testMigrateWithDestCatalogName() throws IOException { +Assume.assumeTrue(catalogName.equals("spark_catalog")); + +spark +.conf() +.set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog"); + +String location = temp.newFolder().toString(); +sql( +"CREATE TABLE %s (id bigint NOT NULL, data string) USING parquet LOCATION '%s'", +tableName, location); +sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName); + +Object result = +scalarSql( +"CALL %s.system.migrate(table => '%s', drop_backup => false, dest_catalog_name => '%s')", +catalogName, tableName, catalogName); +Assertions.assertThat(result).isEqualTo(1L); +Assertions.assertThat(spark.catalog().tableExists(tableName + "_BACKUP_")).isTrue(); + } + + @Test + public void testMigrateWithDestCatalogNameWithNonExistingCatalog() throws IOException { +Assume.assumeTrue(catalogName.equals("spark_catalog")); + +String destCatalogName = "non_existing_catalog"; + +String location = temp.newFolder().toString(); +sql( +"CREATE TABLE %s (id bigint NOT NULL, data string) USING parquet LOCATION '%s'", +tableName, location); +sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName); + +Object result = +scalarSql( +"CALL %s.system.migrate(table => '%s', drop_backup => false, dest_catalog_name => '%s')", +catalogName, tableName, destCatalogName); +Assertions.assertThat(result).isEqualTo(1L); +Assertions.assertThat(spark.catalog().tableExists(tableName + "_BACKUP_")).isTrue(); Review Comment: We should check the warning log in this case; I was wondering whether an incorrect destCatalogName should be rejected as invalid input.
Re: [I] Partitioning by Year/Month/Day [iceberg]
l20DfX35JnKBfRn commented on issue #4129: URL: https://github.com/apache/iceberg/issues/4129#issuecomment-1783179835 If anyone is confused about this on AWS Athena: unlike hive-style file partitioning in S3, such as `{table-s3-location}/year={}/month={}/day={}/`, the equivalent in Iceberg is, e.g.:

```SQL
CREATE TABLE documents (
  id string,
  created_at timestamp
  ...
)
PARTITIONED BY (
  day(created_at)
)
LOCATION 's3://{table-s3-location}/'
TBLPROPERTIES ('table_type' = 'ICEBERG');
```

This will create separate partitions for each day, exactly like Hive.
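For intuition on the `day(created_at)` clause above: Iceberg's hidden partitioning derives the partition value from the column itself, so no extra year/month/day columns are needed. Per the Iceberg spec, the day transform yields days from the Unix epoch; a stdlib-only sketch:

```python
from datetime import date

def day_transform(d: date) -> int:
    # Iceberg's day() partition transform, expressed as days since 1970-01-01.
    return (d - date(1970, 1, 1)).days

# The partition value is computed from created_at at write time; queries that
# filter on created_at are pruned to matching day partitions automatically.
print(day_transform(date(2017, 11, 16)))
```

This is why the Iceberg DDL looks like hive-style daily partitioning to readers while storing only the timestamp column in the schema.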
Re: [PR] Spec: add nanosecond timestamp types [iceberg]
jacobmarble commented on code in PR #8683: URL: https://github.com/apache/iceberg/pull/8683#discussion_r1374794900 ## format/spec.md: ## @@ -187,10 +189,11 @@ A **`map`** is a collection of key-value pairs with a key type and a value type. Notes: 1. Decimal scale is fixed and cannot be changed by schema evolution. Precision can only be widened. -2. All time and timestamp values are stored with microsecond precision. -- Timestamps _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical). -- Timestamps _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). Timestamp values are stored as a long that encodes microseconds from the unix epoch. +2. `time`, `timestamp`, and `timestamptz` values are represented with _microsecond precision_. `timestamp_ns` and `timstamptz_ns` values are represented with _nanosecond precision_. +- Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical). +- Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). 3. Character strings must be stored as UTF-8 encoded byte arrays. +4. `timestamp_ns` and `timstamptz_ns` are only supported in v3 tables. Review Comment: Fixed.
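Editor's aside: the microsecond vs nanosecond encodings discussed in this hunk can be made concrete with a short `java.time` sketch (plain stdlib, not Iceberg API). Per the spec text quoted above, both encodings are a long counted from the unix epoch; the microsecond form simply truncates sub-microsecond detail:

```java
import java.time.Instant;

public class TimestampPrecision {
    // timestamp / timestamptz: long encoding microseconds from the unix epoch
    static long toMicros(Instant t) {
        return t.getEpochSecond() * 1_000_000L + t.getNano() / 1_000L;
    }

    // timestamp_ns / timestamptz_ns (v3 only): long encoding nanoseconds from the unix epoch
    static long toNanos(Instant t) {
        return t.getEpochSecond() * 1_000_000_000L + t.getNano();
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2017-11-17T01:10:34.123456789Z");
        // The microsecond encoding drops the trailing 789 nanoseconds
        System.out.println(toMicros(t));
        System.out.println(toNanos(t));
    }
}
```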
Re: [PR] Spec: add nanosecond timestamp types [iceberg]
jacobmarble commented on PR #8683: URL: https://github.com/apache/iceberg/pull/8683#issuecomment-1783184731 > > Now that we have separate types, we can use the type to carry that information, so that you can promote a `long` to `timestamp_ms` or `timestamp` and we know how to interpret the value. > > That's true as long as there is only type promotion only once. > > What should happen when `timestamp_ms` is promoted to `timestamp`? If it wasn't a `long` before, such promotion should be permitted. If it was a `long` before... how do we handle this case? FWIW, the millisecond timestamp types (`timestamp_ms`, `timestamptz_ms`) have been removed from this change.
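Editor's aside: the interpretation hazard behind this promotion question can be shown with a small sketch (illustrative only, not Iceberg code; `timestamp_ms` was removed from the PR as noted above). The same stored long decodes to wildly different instants depending on the unit assumed after promotion:

```java
import java.time.Instant;

public class LongPromotion {
    public static void main(String[] args) {
        // One stored long, read under two different unit assumptions.
        long stored = 1_510_881_034_123L;

        // Interpreted as milliseconds from the epoch (a long promoted to a
        // hypothetical millisecond timestamp type)
        Instant asMillis = Instant.ofEpochMilli(stored);

        // Interpreted as microseconds from the epoch (a long promoted to
        // `timestamp`) -- a completely different point in time
        Instant asMicros = Instant.EPOCH.plusNanos(stored * 1_000L);

        System.out.println(asMillis); // a date in 2017
        System.out.println(asMicros); // a date in January 1970
    }
}
```

This is why the promotion rules must pin down how an original `long` value is to be interpreted.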
Re: [PR] Spark: Avoid extra copies of manifests while optimizing V2 tables [iceberg]
aokolnychyi commented on code in PR #8928: URL: https://github.com/apache/iceberg/pull/8928#discussion_r1374826073 ## core/src/test/java/org/apache/iceberg/TestRewriteManifests.java: ## @@ -443,6 +444,14 @@ public void testBasicManifestReplacement() throws IOException { List manifests = snapshot.allManifests(table.io()); Assert.assertEquals(3, manifests.size()); +if (formatVersion == 1) { + assertThat(manifests.get(0).path()).isNotEqualTo(firstNewManifest.path()); + assertThat(manifests.get(1).path()).isNotEqualTo(secondNewManifest.path()); +} else { + assertThat(manifests.get(0).path()).isEqualTo(firstNewManifest.path()); + assertThat(manifests.get(1).path()).isEqualTo(secondNewManifest.path()); +} Review Comment: There are actually two different tests. This one is called basicReplacement and tests the default behavior. The one below explicitly validates enabled snapshot ID inheritance.
Re: [I] Fix typo in `_primitive_to_phyisical` [iceberg-python]
Fokko commented on issue #107: URL: https://github.com/apache/iceberg-python/issues/107#issuecomment-1783395139 Nice, thanks @whisk 👍
Re: [PR] Fixed typos [iceberg-python]
Fokko merged PR #108: URL: https://github.com/apache/iceberg-python/pull/108
Re: [PR] patch: Parquet Column Names with "Special Characters" fix [iceberg-python]
Fokko commented on PR #109: URL: https://github.com/apache/iceberg-python/pull/109#issuecomment-1783408073 Thanks for raising this @MarquisC. This looks like https://github.com/apache/iceberg-python/pull/83/, can you check if that also resolves your problem? Otherwise, I think this would be a good place to add it as well. That PR also shows how to test this.
Re: [PR] added contributing.md file [iceberg-python]
Fokko commented on PR #102: URL: https://github.com/apache/iceberg-python/pull/102#issuecomment-1783409103 @onemriganka What do you think of pointing to https://py.iceberg.apache.org/contributing/? Otherwise we have to maintain this in two places.
Re: [PR] Spec: add nanosecond timestamp types [iceberg]
findepi commented on PR #8683: URL: https://github.com/apache/iceberg/pull/8683#issuecomment-1783460277 @jacobmarble indeed, thanks! Do we expect any type promotions to be allowed around the new types added here?
Re: [PR] Spec: add nanosecond timestamp types [iceberg]
Fokko commented on code in PR #8683: URL: https://github.com/apache/iceberg/pull/8683#discussion_r1375028979 ## format/spec.md: ## @@ -187,10 +189,11 @@ (same spec.md hunk as quoted above) +4. `timestamp_ns` and `timstamptz_ns` are only supported in v3 tables. Review Comment: I like Ryan's suggestion of having a full new column. We can expect more types to be added, so a column would make sense to me, instead of adding it to the `requirements` column.
Re: [PR] Core: Ignore split offsets array when split offset is past file length [iceberg]
amogh-jahagirdar commented on PR #8925: URL: https://github.com/apache/iceberg/pull/8925#issuecomment-1783496227 > @amogh-jahagirdar, maybe this time we should create a test to validate FileScanTask.split when there are bad split offsets? I think that would have caught this in the last PR. Done, added a new test which will exercise both `splitOffsetArray` and `splitOffsets`. This will also validate the number of tasks per file
Re: [PR] Core: Ignore split offsets array when split offset is past file length [iceberg]
nastra commented on code in PR #8925: URL: https://github.com/apache/iceberg/pull/8925#discussion_r1375034300 ## core/src/test/java/org/apache/iceberg/TestSplitPlanning.java: ## @@ -216,6 +217,34 @@ public void testSplitPlanningWithOffsets() { "We should get one task per row group", 32, Iterables.size(scan.planTasks())); } + @Test + public void testSplitPlanningWithCorruptedOffsets() throws IOException { +List files16Mb = newFilesWithInvalidOffset(16, 16 * 1024 * 1024, 2); +appendFiles(files16Mb); + +// There will be 4 tasks per file, instead of 2 tasks per file, since the offsets are invalid +// and will be ignored. +TableScan scan = +table +.newScan() +.option(TableProperties.SPLIT_OPEN_FILE_COST, String.valueOf(0)) +.option(TableProperties.SPLIT_SIZE, String.valueOf(4L * 1024 * 1024)); + +Assert.assertEquals( Review Comment: minor: for newly added test code it would be better to use AssertJ-style assertions to make a future migration away from JUnit4 easier
Re: [PR] Core: Ignore split offsets array when split offset is past file length [iceberg]
nastra commented on code in PR #8925: URL: https://github.com/apache/iceberg/pull/8925#discussion_r1375034993 ## core/src/test/java/org/apache/iceberg/TestSplitPlanning.java: ## @@ -216,6 +217,34 @@ public void testSplitPlanningWithOffsets() { "We should get one task per row group", 32, Iterables.size(scan.planTasks())); } + @Test + public void testSplitPlanningWithCorruptedOffsets() throws IOException { +List files16Mb = newFilesWithInvalidOffset(16, 16 * 1024 * 1024, 2); +appendFiles(files16Mb); + +// There will be 4 tasks per file, instead of 2 tasks per file, since the offsets are invalid +// and will be ignored. +TableScan scan = +table +.newScan() +.option(TableProperties.SPLIT_OPEN_FILE_COST, String.valueOf(0)) +.option(TableProperties.SPLIT_SIZE, String.valueOf(4L * 1024 * 1024)); + +Assert.assertEquals( +"We should get four tasks per file since the offsets will be ignored", +64, +Iterables.size(scan.planTasks())); +try (CloseableIterable tasks = scan.planTasks()) { + CombinedScanTask task = tasks.iterator().next().asCombinedScanTask(); + Collection files = task.files(); + for (FileScanTask fileScanTask : files) { +DataFile dataFile = fileScanTask.file(); +Assert.assertNull( Review Comment: same as above
Re: [PR] Add latest version to menu [iceberg-docs]
Fokko merged PR #291: URL: https://github.com/apache/iceberg-docs/pull/291
Re: [PR] Add latest version to menu [iceberg-docs]
Fokko commented on PR #291: URL: https://github.com/apache/iceberg-docs/pull/291#issuecomment-1783503359 Thanks @nastra for fixing this 👍
Re: [PR] Core: Ignore split offsets array when split offset is past file length [iceberg]
nastra commented on code in PR #8925: URL: https://github.com/apache/iceberg/pull/8925#discussion_r1375037491 ## core/src/test/java/org/apache/iceberg/TestSplitPlanning.java: ## @@ -216,6 +217,34 @@ public void testSplitPlanningWithOffsets() { "We should get one task per row group", 32, Iterables.size(scan.planTasks())); } + @Test + public void testSplitPlanningWithCorruptedOffsets() throws IOException { Review Comment: that test passes for me locally without the new fix in `BaseFile` (I would have expected it to fail without the fix)
Re: [PR] Spec: Clarify missing fields when writing [iceberg]
Fokko commented on code in PR #8672: URL: https://github.com/apache/iceberg/pull/8672#discussion_r1375043993 ## format/spec.md: ## @@ -128,13 +128,13 @@ Tables do not require rename, except for tables that use atomic rename to implem Writer requirements -Some tables in this spec have columns that specify requirements for v1 and v2 tables. These requirements are intended for writers when adding metadata files to a table with the given version. +Some tables in this spec have columns that specify requirements for v1 and v2 tables. These requirements are intended for writers when adding metadata (including manifests files and manifest lists) files to a table with the given version. -| Requirement | Write behavior | |-|| | (blank) | The field should be omitted | | _optional_ | The field can be written | | _required_ | The field must be written | +| Requirement | Write behavior | Review Comment: That's a good point, thanks!
Re: [I] Implementation does not write `schema-id` into Manifest Avro headers [iceberg]
Fokko commented on issue #8745: URL: https://github.com/apache/iceberg/issues/8745#issuecomment-1783512462 @JFinis I think this was in there so the schema didn't need to be deserialized. > Suggestion: Remove mentioning of field completely from the spec. It's redundant and the implementation doesn't seem to write it anyway. WDYT of creating a PR, and also raising this on the dev list?
[PR] Spark 3.5: Don't cache or reuse manifest entries while rewriting metadata by default [iceberg]
aokolnychyi opened a new pull request, #8935: URL: https://github.com/apache/iceberg/pull/8935 The action for rewriting manifests caches the manifest entry DF or does an extra shuffle in order to skip reading the actual manifest files twice. We did this assuming it would improve performance. However, the caching seems to perform poorly for larger tables as it requires substantial cluster resources. In addition, doing a round-robin repartition is expensive as the entries must be written to disk. The extra write is actually more expensive than the extra read required for the range-based shuffle of manifest entries. Therefore, this PR disables caching by default and removes the optional round-robin repartition step. Instead, we will read the manifests twice.
[PR] Spark 3.5: Use DataFile constants in SparkDataFile [iceberg]
aokolnychyi opened a new pull request, #8936: URL: https://github.com/apache/iceberg/pull/8936 This PR makes `SparkDataFile` use constants in `DataFile` instead of hard-coded values.
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
nastra commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375049907 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitCompletePayload.java: ## @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import java.util.UUID; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.iceberg.avro.AvroSchemaUtil; + +/** + * A control event payload for events sent by a coordinator that indicates it has completed a commit + * cycle. Events with this payload are not consumed by the sink, they * are informational and can be Review Comment: ```suggestion * cycle. Events with this payload are not consumed by the sink, they are informational and can be ```
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
nastra commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375050867 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitCompletePayload.java: ## @@ -0,0 +1,107 @@ +(Apache license header elided; identical to the one quoted above) +package org.apache.iceberg.connect.events; + +import java.util.UUID; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.iceberg.avro.AvroSchemaUtil; + +/** + * A control event payload for events sent by a coordinator that indicates it has completed a commit + * cycle. Events with this payload are not consumed by the sink, they * are informational and can be + * used by consumers to trigger downstream processes. + */ +public class CommitCompletePayload implements Payload { + + private UUID commitId; + private Long vtts; Review Comment: the name is somewhat cryptic, would it make sense to rename this to `validThroughTimestamp`?
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
nastra commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375053314 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/Event.java: ## @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import java.io.IOException; +import java.io.UncheckedIOException; +import java.util.UUID; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.iceberg.avro.AvroEncoderUtil; +import org.apache.iceberg.avro.AvroSchemaUtil; +import org.apache.iceberg.data.avro.DecoderResolver; + +/** + * Class representing all events produced to the control topic. Different event types have different + * payloads. 
+ */ +public class Event implements Element { + + private UUID id; + private EventType type; + private Long timestamp; + private String groupId; + private Payload payload; + private final Schema avroSchema; + + public static byte[] encode(Event event) { +try { + return AvroEncoderUtil.encode(event, event.getSchema()); +} catch (IOException e) { + throw new UncheckedIOException(e); +} + } + + public static Event decode(byte[] bytes) { +try { + Event event = AvroEncoderUtil.decode(bytes); + // clear the cache to avoid memory leak + DecoderResolver.clearCache(); + return event; +} catch (IOException e) { + throw new UncheckedIOException(e); +} + } + + // Used by Avro reflection to instantiate this class when reading events + public Event(Schema avroSchema) { +this.avroSchema = avroSchema; + } + + public Event(String groupId, EventType type, Payload payload) { +this.id = UUID.randomUUID(); +this.type = type; +this.timestamp = System.currentTimeMillis(); +this.groupId = groupId; +this.payload = payload; + +this.avroSchema = +SchemaBuilder.builder() +.record(getClass().getName()) +.fields() +.name("id") +.prop(AvroSchemaUtil.FIELD_ID_PROP, 1500) +.type(UUID_SCHEMA) +.noDefault() +.name("type") +.prop(AvroSchemaUtil.FIELD_ID_PROP, 1501) +.type() +.intType() +.noDefault() +.name("timestamp") +.prop(AvroSchemaUtil.FIELD_ID_PROP, 1502) +.type() +.longType() +.noDefault() +.name("payload") +.prop(AvroSchemaUtil.FIELD_ID_PROP, 1503) +.type(payload.getSchema()) +.noDefault() +.name("groupId") +.prop(AvroSchemaUtil.FIELD_ID_PROP, 1504) +.type() +.stringType() +.noDefault() +.endRecord(); + } + + public UUID id() { +return id; + } + + public EventType type() { +return type; + } + + public Long timestamp() { +return timestamp; + } + + public Payload payload() { +return payload; + } + + public String groupId() { +return groupId; + } + + @Override + public Schema getSchema() { Review Comment: minor: I believe we typically don't use a `get` prefix in the Iceberg codebase
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
bryanck commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375054000 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/Event.java: ## @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import java.io.IOException; +import java.io.UncheckedIOException; +import java.util.UUID; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.iceberg.avro.AvroEncoderUtil; +import org.apache.iceberg.avro.AvroSchemaUtil; +import org.apache.iceberg.data.avro.DecoderResolver; + +/** + * Class representing all events produced to the control topic. Different event types have different + * payloads. 
+ */ +public class Event implements Element { + + private UUID id; + private EventType type; + private Long timestamp; + private String groupId; + private Payload payload; + private final Schema avroSchema; + + public static byte[] encode(Event event) { +try { + return AvroEncoderUtil.encode(event, event.getSchema()); +} catch (IOException e) { + throw new UncheckedIOException(e); +} + } + + public static Event decode(byte[] bytes) { +try { + Event event = AvroEncoderUtil.decode(bytes); + // clear the cache to avoid memory leak + DecoderResolver.clearCache(); + return event; +} catch (IOException e) { + throw new UncheckedIOException(e); +} + } + + // Used by Avro reflection to instantiate this class when reading events + public Event(Schema avroSchema) { +this.avroSchema = avroSchema; + } + + public Event(String groupId, EventType type, Payload payload) { +this.id = UUID.randomUUID(); +this.type = type; +this.timestamp = System.currentTimeMillis(); +this.groupId = groupId; +this.payload = payload; + +this.avroSchema = +SchemaBuilder.builder() +.record(getClass().getName()) +.fields() +.name("id") +.prop(AvroSchemaUtil.FIELD_ID_PROP, 1500) +.type(UUID_SCHEMA) +.noDefault() +.name("type") +.prop(AvroSchemaUtil.FIELD_ID_PROP, 1501) +.type() +.intType() +.noDefault() +.name("timestamp") +.prop(AvroSchemaUtil.FIELD_ID_PROP, 1502) +.type() +.longType() +.noDefault() +.name("payload") +.prop(AvroSchemaUtil.FIELD_ID_PROP, 1503) +.type(payload.getSchema()) +.noDefault() +.name("groupId") +.prop(AvroSchemaUtil.FIELD_ID_PROP, 1504) +.type() +.stringType() +.noDefault() +.endRecord(); + } + + public UUID id() { +return id; + } + + public EventType type() { +return type; + } + + public Long timestamp() { +return timestamp; + } + + public Payload payload() { +return payload; + } + + public String groupId() { +return groupId; + } + + @Override + public Schema getSchema() { Review Comment: I left this one in as it is an override -- This is an automated message from the Apache Git 
Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
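The `encode`/`decode` helpers quoted above follow a common Java pattern: checked `IOException`s from the serialization layer are rethrown as `UncheckedIOException` so callers do not need `throws` clauses. A minimal self-contained sketch of the same pattern, using JDK serialization as a stand-in for `AvroEncoderUtil` (the `Note` class and helper names here are hypothetical, not part of the PR):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class EncodeDemo {
  // Hypothetical payload type standing in for Event
  public static class Note implements Serializable {
    final String text;

    Note(String text) {
      this.text = text;
    }
  }

  // Checked IOException is wrapped in UncheckedIOException, mirroring Event.encode
  public static byte[] encode(Note note) {
    try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bos)) {
      out.writeObject(note);
      out.flush();
      return bos.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static Note decode(byte[] bytes) {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return (Note) in.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Note roundTripped = decode(encode(new Note("hello")));
    System.out.println(roundTripped.text);
  }
}
```

The same shape appears in the PR's `Event` class: callers get a byte-array round trip without checked exceptions leaking into the Kafka Connect call sites.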
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
nastra commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375054505 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/TableName.java: ## @@ -0,0 +1,108 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import static java.util.stream.Collectors.toList; + +import java.util.Arrays; +import java.util.List; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.avro.util.Utf8; +import org.apache.iceberg.catalog.Namespace; +import org.apache.iceberg.catalog.TableIdentifier; + +public class TableName implements Element { Review Comment: would `Identifier` or `FullIdentifier` make sense here?
Re: [PR] Core: Enable column statistics filtering after planning [iceberg]
stevenzwu commented on code in PR #8803: URL: https://github.com/apache/iceberg/pull/8803#discussion_r1373440573

## core/src/main/java/org/apache/iceberg/BaseFile.java:
## @@ -504,6 +508,27 @@ private static Map<Integer, ByteBuffer> toReadableByteBufferMap(Map<Integer, ByteBuffer> map) {
+  private static <TypeT> Map<Integer, TypeT> filterColumnsStats(
+      Map<Integer, TypeT> map, Set<Integer> columnIds) {
+    if (columnIds == null || columnIds.isEmpty()) {
+      return SerializableMap.copyOf(map);
+    }
+
+    if (map == null) {
+      return null;
+    }
+
+    Map<Integer, TypeT> filtered = Maps.newHashMapWithExpectedSize(columnIds.size());
+    for (Integer columnId : columnIds) {
+      TypeT value = map.get(columnId);
+      if (value != null) {

Review Comment: thanks for the unit test code. I am still not comfortable making this assumption, as the Iceberg spec clearly says a null stats value is possible. What if the Trino parquet writer produces min/max stats with a null value?
```
Lower bound for the non-null, non-NaN values in the partition field, or null if all values are null or NaN [2]
```
At least, there is a mismatch between the Parquet writer implementation and the spec. Unless we are going to clarify/update the spec, I feel it is safer to do `map.containsKey(key)`. Maybe @RussellSpitzer and @rdblue can chime in here.
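The distinction stevenzwu raises — `map.get(key) != null` versus `map.containsKey(key)` — matters exactly when a column id is mapped to an explicit null stats value. A self-contained sketch of the two filtering behaviors (plain `HashMap` only; the Iceberg `SerializableMap`/`Maps` helpers are not used here):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class StatsFilterDemo {
  // Drops entries whose value is null, even when the key is present in the map
  public static <T> Map<Integer, T> filterByGet(Map<Integer, T> map, Set<Integer> ids) {
    Map<Integer, T> filtered = new HashMap<>();
    for (Integer id : ids) {
      T value = map.get(id);
      if (value != null) {
        filtered.put(id, value);
      }
    }
    return filtered;
  }

  // Preserves key -> null mappings, matching the "null stats are possible" reading of the spec
  public static <T> Map<Integer, T> filterByContainsKey(Map<Integer, T> map, Set<Integer> ids) {
    Map<Integer, T> filtered = new HashMap<>();
    for (Integer id : ids) {
      if (map.containsKey(id)) {
        filtered.put(id, map.get(id));
      }
    }
    return filtered;
  }

  public static void main(String[] args) {
    Map<Integer, String> bounds = new HashMap<>();
    bounds.put(1, "min=0");
    bounds.put(2, null); // e.g. all values null or NaN for this column
    Set<Integer> wanted = Set.of(1, 2);
    System.out.println(filterByGet(bounds, wanted).size()); // 1: column 2 silently dropped
    System.out.println(filterByContainsKey(bounds, wanted).size()); // 2: null bound retained
  }
}
```

The second variant is the behavior the review comment argues for: a reader can still distinguish "no stats recorded for this column" from "stats recorded, and the bound is null".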
Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]
nastra commented on code in PR #8804: URL: https://github.com/apache/iceberg/pull/8804#discussion_r1375068537

## aws/src/integration/java/org/apache/iceberg/aws/lakeformation/LakeFormationTestBase.java:
## @@ -357,8 +360,20 @@ String getRandomTableName() {
     return LF_TEST_TABLE_PREFIX + UUID.randomUUID().toString().replace("-", "");
   }

-  private static void waitForIamConsistency() throws Exception {
-    Thread.sleep(IAM_PROPAGATION_DELAY); // sleep to make sure IAM up to date
+  private static void waitForIamConsistency(String roleName, String policyName) {
+    // wait to make sure IAM up to date
+    Awaitility.await()
+        .pollDelay(Duration.ofSeconds(1))
+        .atMost(Duration.ofSeconds(10))
+        .until(

Review Comment: this should do the same `untilAsserted()` check that I mentioned earlier
Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]
nastra commented on code in PR #8804: URL: https://github.com/apache/iceberg/pull/8804#discussion_r1375070758

## aws/src/integration/java/org/apache/iceberg/aws/lakeformation/LakeFormationTestBase.java:
## @@ -417,7 +432,20 @@ private static void registerResource(String s3Location) {
             .build());
     // when a resource is registered, LF will update SLR with necessary permissions which has a
     // propagation delay
-    waitForIamConsistency();
+    Awaitility.await()
+        .pollDelay(Duration.ofSeconds(1))
+        .atMost(Duration.ofSeconds(10))
+        .ignoreExceptions()
+        .untilAsserted(
+            () ->
+                Assertions.assertThat(
+                        lakeformation
+                            .describeResource(
+                                DescribeResourceRequest.builder().resourceArn(arn).build())
+                            .resourceInfo()
+                            .roleArn()
+                            .equalsIgnoreCase(lfRegisterPathRoleArn))

Review Comment: this should do `.isEqualToIgnoringCase(lfRegisterPathRoleArn)` rather than `isTrue()`
Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]
nastra commented on code in PR #8804: URL: https://github.com/apache/iceberg/pull/8804#discussion_r1375070935

## aws/src/integration/java/org/apache/iceberg/aws/lakeformation/TestLakeFormationAwsClientFactory.java:
## @@ -128,8 +131,18 @@ public void testLakeFormationEnabledGlueCatalog() throws Exception {
                 + glueArnPrefix
                 + ":userDefinedFunction/allowed_*/*\"]}]}")
             .build());
-    waitForIamConsistency();
-
+    Awaitility.await()
+        .pollDelay(Duration.ofSeconds(1))
+        .atMost(Duration.ofSeconds(10))
+        .until(

Review Comment: this should be consistent with the other checks
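The `until(...)` versus `untilAsserted(...)` suggestion in the comments above boils down to polling an assertion until it passes or a deadline expires; on timeout the caller sees the failing assertion's message rather than a generic "condition not met". A dependency-free sketch of that mechanism (the helper name `awaitUntilAsserted` is hypothetical; real test code would use Awaitility itself):

```java
import java.time.Duration;

public class AwaitDemo {
  // Minimal stand-in for Awaitility's untilAsserted: re-run the assertion
  // until it stops throwing, or rethrow its last failure once the deadline passes.
  public static void awaitUntilAsserted(Duration pollInterval, Duration atMost, Runnable assertion) {
    long deadline = System.nanoTime() + atMost.toNanos();
    while (true) {
      try {
        assertion.run();
        return; // assertion passed
      } catch (AssertionError e) {
        if (System.nanoTime() >= deadline) {
          throw e; // surface the descriptive assertion failure, not a bare timeout
        }
        try {
          Thread.sleep(pollInterval.toMillis());
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("interrupted while waiting", ie);
        }
      }
    }
  }

  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    // The condition becomes true roughly 200ms in; the poll loop retries until then.
    awaitUntilAsserted(
        Duration.ofMillis(20),
        Duration.ofSeconds(5),
        () -> {
          if (System.currentTimeMillis() - start < 200) {
            throw new AssertionError("not consistent yet");
          }
        });
    System.out.println("condition met");
  }
}
```

This is why the reviewer prefers `untilAsserted(() -> assertThat(...).isEqualToIgnoringCase(...))` over `until(...)` returning a boolean: the eventual failure message names the mismatched value.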
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
bryanck commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375071931 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitCompletePayload.java: ## @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import java.util.UUID; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.iceberg.avro.AvroSchemaUtil; + +/** + * A control event payload for events sent by a coordinator that indicates it has completed a commit + * cycle. Events with this payload are not consumed by the sink, they are informational and can be + * used by consumers to trigger downstream processes. + */ +public class CommitCompletePayload implements Payload { + + private UUID commitId; + private Long vtts; Review Comment: Sure, I updated this to `validThroughTs`
Re: [PR] Build: Replace Thread.Sleep() usage with org.Awaitility from Tests. [iceberg]
nastra commented on code in PR #8804: URL: https://github.com/apache/iceberg/pull/8804#discussion_r1375072145

## core/src/test/java/org/apache/iceberg/hadoop/TestHadoopCommits.java:
## @@ -435,13 +437,11 @@ public void testConcurrentFastAppends(@TempDir File dir) throws Exception {
     for (int numCommittedFiles = 0;
         numCommittedFiles < numberOfCommitedFilesPerThread;
         numCommittedFiles++) {
-      while (barrier.get() < numCommittedFiles * threadsCount) {
-        try {
-          Thread.sleep(10);
-        } catch (InterruptedException e) {
-          throw new RuntimeException(e);
-        }
-      }
+      final int currentFilesCount = numCommittedFiles;
+      Awaitility.await()
+          .pollInterval(Duration.ofMillis(10))
+          .pollInSameThread()

Review Comment: then we need to figure out why the test was flaky. Was it occasionally failing? The purpose of using Awaitility is to make tests more robust and less flaky, so we need to figure out why this is flaky
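For context on what this test is waiting for: each writer thread spins until an `AtomicInteger` barrier shows that every thread has finished the previous round, which is also why the wait must poll in the worker thread itself — the condition only advances via those same threads. A self-contained sketch of that barrier pattern (class name, counts, and the `commits` stand-in are illustrative, not the test's actual values):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BarrierDemo {
  public static int run(int threads, int rounds) {
    AtomicInteger barrier = new AtomicInteger(0);
    AtomicInteger commits = new AtomicInteger(0);
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      pool.submit(
          () -> {
            for (int round = 0; round < rounds; round++) {
              // Spin until all threads have completed the previous round;
              // the wait happens in the worker thread itself, which is the
              // reason the Awaitility port needs pollInSameThread().
              while (barrier.get() < round * threads) {
                Thread.onSpinWait();
              }
              commits.incrementAndGet(); // stand-in for the actual commit work
              barrier.incrementAndGet();
            }
          });
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("workers did not finish");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return commits.get();
  }

  public static void main(String[] args) {
    System.out.println(run(2, 3)); // 2 threads x 3 rounds of lock-step commits
  }
}
```

If such a test flakes, the suspect is usually the deadline on the wait or the barrier arithmetic, not the polling mechanism, which is the point of the review comment.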
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
bryanck commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375073000 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitCompletePayload.java: ## @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import java.util.UUID; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.iceberg.avro.AvroSchemaUtil; + +/** + * A control event payload for events sent by a coordinator that indicates it has completed a commit + * cycle. Events with this payload are not consumed by the sink, they are informational and can be Review Comment: Thanks for catching this, I fixed this
Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]
nastra commented on code in PR #8918: URL: https://github.com/apache/iceberg/pull/8918#discussion_r1375074020 ## hive-metastore/src/test/java/org/apache/iceberg/hive/HiveMetastoreSetup.java: ## @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.hive; + +import java.util.Map; +import java.util.concurrent.TimeUnit; +import org.apache.hadoop.hive.conf.HiveConf; +import org.apache.hadoop.hive.metastore.HiveMetaStoreClient; +import org.apache.iceberg.CatalogProperties; +import org.apache.iceberg.CatalogUtil; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; + +/** + * Setup HiveMetastore. It does not create any database. All the tests should create a database + * accordingly. It should replace the existing setUp class {@link HiveMetastoreTest} + */ +class HiveMetastoreSetup { Review Comment: the conclusion here is that we should remove this class. As I mentioned in another comment, the alternative would be to have a JUnit5 extension (similar to `NessieJaxRsExtension`) that would do the proper setup/teardown for tests
Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]
nastra commented on code in PR #8918: URL: https://github.com/apache/iceberg/pull/8918#discussion_r1375074725 ## hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java: ## @@ -96,6 +105,43 @@ public class TestHiveCatalog extends HiveMetastoreTest { @TempDir private Path temp; + private HiveMetastoreSetup hiveMetastoreSetup; + + private static final String HIVE_DB_NAME = "hivedb"; Review Comment: why not just call this `DB_NAME` so that you don't have to adjust all of the places?
Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]
nastra commented on code in PR #8918: URL: https://github.com/apache/iceberg/pull/8918#discussion_r1375075096 ## hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java: ## @@ -349,26 +396,31 @@ public void testCreateTableCustomSortOrder() throws Exception { assertThat(hmsTableParameters()) .containsEntry(DEFAULT_SORT_ORDER, SortOrderParser.toJson(table.sortOrder())); } finally { - catalog.dropTable(tableIdent); + catalog().dropTable(tableIdent); } } + @Override + protected HiveCatalog catalog() { Review Comment: this should be at the top
Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]
nastra commented on code in PR #8918: URL: https://github.com/apache/iceberg/pull/8918#discussion_r1375077284 ## hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java: ## @@ -850,51 +857,46 @@ private void removeNamespaceOwnershipAndVerify( createNamespaceAndVerifyOwnership( name, propToCreate, expectedOwnerPostCreate, expectedOwnerTypePostCreate); -catalog.removeProperties(Namespace.of(name), propToRemove); +catalog().removeProperties(Namespace.of(name), propToRemove); -Database database = metastoreClient.getDatabase(name); +Database database = hiveMetastoreSetup.metastoreClient.getDatabase(name); assertThat(database.getOwnerName()).isEqualTo(expectedOwnerPostRemove); assertThat(database.getOwnerType()).isEqualTo(expectedOwnerTypePostRemove); } @Test - public void testDropNamespace() throws TException { + public void testDropEmptyNamespace() throws TException { Review Comment: can we please leave refactorings out of this PR? This file should only have a minimal set of changes (by adding the setup/teardown stuff)
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
bryanck commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375077725 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/TableName.java: ## @@ -0,0 +1,108 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import static java.util.stream.Collectors.toList; + +import java.util.Arrays; +import java.util.List; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.avro.util.Utf8; +import org.apache.iceberg.catalog.Namespace; +import org.apache.iceberg.catalog.TableIdentifier; + +public class TableName implements Element { Review Comment: `FullTableName` or `TableIdentity` maybe? Having `Table` in the name would be helpful.
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
nastra commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375084999 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/TableName.java: ## @@ -0,0 +1,108 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import static java.util.stream.Collectors.toList; + +import java.util.Arrays; +import java.util.List; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.avro.util.Utf8; +import org.apache.iceberg.catalog.Namespace; +import org.apache.iceberg.catalog.TableIdentifier; + +public class TableName implements Element { Review Comment: `FullTableName` sounds like a good idea
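For intuition on what the class being renamed carries, here is a minimal value-class sketch of a "full table name": a catalog name plus a dotted identifier split into namespace parts and a table name. This is an illustration only, not the PR's actual class (which is an Avro-serializable `Element` built on Iceberg's `TableIdentifier`); the class and method names below are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public final class FullTableNameSketch {
  private final String catalog;
  private final List<String> namespace;
  private final String name;

  public FullTableNameSketch(String catalog, List<String> namespace, String name) {
    this.catalog = Objects.requireNonNull(catalog);
    this.namespace = List.copyOf(namespace);
    this.name = Objects.requireNonNull(name);
  }

  // Parse "db.schema.table" into namespace=[db, schema], name=table
  public static FullTableNameSketch of(String catalog, String dottedIdentifier) {
    String[] parts = dottedIdentifier.split("\\.");
    return new FullTableNameSketch(
        catalog, Arrays.asList(parts).subList(0, parts.length - 1), parts[parts.length - 1]);
  }

  public String catalog() {
    return catalog;
  }

  public List<String> namespace() {
    return namespace;
  }

  public String name() {
    return name;
  }

  @Override
  public String toString() {
    return catalog + "." + String.join(".", namespace) + "." + name;
  }

  public static void main(String[] args) {
    FullTableNameSketch id = FullTableNameSketch.of("glue", "db.events");
    System.out.println(id); // glue.db.events
  }
}
```

Naming it `FullTableName` (as the thread settles on) makes it clear the value is catalog-qualified, unlike Iceberg's plain `TableIdentifier`.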
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
nastra commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375085697 ## nessie/src/main/java/org/apache/iceberg/nessie/UpdateableReference.java: ## @@ -62,7 +62,7 @@ public Reference getReference() { public void checkMutable() { Preconditions.checkArgument( -mutable, "You can only mutate tables when using a branch without a hash or timestamp."); +mutable, "You can only mutate contents when using a branch without a hash or timestamp."); Review Comment: nit: should this say `content` maybe?
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
nastra commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375086307 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java: ## @@ -0,0 +1,153 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package org.apache.iceberg.nessie; + +import java.util.Map; +import java.util.concurrent.atomic.AtomicBoolean; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.exceptions.NoSuchViewException; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.view.BaseViewOperations; +import org.apache.iceberg.view.ViewMetadata; +import org.apache.iceberg.view.ViewMetadataParser; +import org.projectnessie.client.http.HttpClientException; +import org.projectnessie.error.NessieBadRequestException; +import org.projectnessie.error.NessieConflictException; +import org.projectnessie.error.NessieNotFoundException; +import org.projectnessie.model.Content; +import org.projectnessie.model.ContentKey; +import org.projectnessie.model.IcebergTable; +import org.projectnessie.model.IcebergView; +import org.projectnessie.model.Reference; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class NessieViewOperations extends BaseViewOperations { Review Comment: minor: does this need to be public?
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
nastra commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375087365 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java: ## @@ -0,0 +1,153 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package org.apache.iceberg.nessie; + +import java.util.Map; +import java.util.concurrent.atomic.AtomicBoolean; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.exceptions.NoSuchViewException; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.view.BaseViewOperations; +import org.apache.iceberg.view.ViewMetadata; +import org.apache.iceberg.view.ViewMetadataParser; +import org.projectnessie.client.http.HttpClientException; +import org.projectnessie.error.NessieBadRequestException; +import org.projectnessie.error.NessieConflictException; +import org.projectnessie.error.NessieNotFoundException; +import org.projectnessie.model.Content; +import org.projectnessie.model.ContentKey; +import org.projectnessie.model.IcebergTable; +import org.projectnessie.model.IcebergView; +import org.projectnessie.model.Reference; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class NessieViewOperations extends BaseViewOperations { + + private static final Logger LOG = LoggerFactory.getLogger(NessieViewOperations.class); + + private final NessieIcebergClient client; + Review Comment: nit: line can be removed so that all final fields are grouped together
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
bryanck commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375087754 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/TableName.java: ## @@ -0,0 +1,108 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import static java.util.stream.Collectors.toList; + +import java.util.Arrays; +import java.util.List; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.avro.util.Utf8; +import org.apache.iceberg.catalog.Namespace; +import org.apache.iceberg.catalog.TableIdentifier; + +public class TableName implements Element { Review Comment: I made this change.
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
nastra commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375088374 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java: ## @@ -0,0 +1,153 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package org.apache.iceberg.nessie; + +import java.util.Map; +import java.util.concurrent.atomic.AtomicBoolean; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.exceptions.NoSuchViewException; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.view.BaseViewOperations; +import org.apache.iceberg.view.ViewMetadata; +import org.apache.iceberg.view.ViewMetadataParser; +import org.projectnessie.client.http.HttpClientException; +import org.projectnessie.error.NessieBadRequestException; +import org.projectnessie.error.NessieConflictException; +import org.projectnessie.error.NessieNotFoundException; +import org.projectnessie.model.Content; +import org.projectnessie.model.ContentKey; +import org.projectnessie.model.IcebergTable; +import org.projectnessie.model.IcebergView; +import org.projectnessie.model.Reference; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class NessieViewOperations extends BaseViewOperations { + + private static final Logger LOG = LoggerFactory.getLogger(NessieViewOperations.class); + + private final NessieIcebergClient client; + + private final ContentKey key; + private final FileIO fileIO; + private final Map<String, String> catalogOptions; + private IcebergView icebergView; + + NessieViewOperations( + ContentKey key, + NessieIcebergClient client, + FileIO fileIO, + Map<String, String> catalogOptions) { +this.key = key; +this.client = client; +this.fileIO = fileIO; +this.catalogOptions = catalogOptions; + } + + @Override + public void doRefresh() { +try { + client.refresh(); +} catch (NessieNotFoundException e) { + throw new RuntimeException( + String.format( + "Failed to refresh as ref '%s' is no longer valid.", client.getRef().getName()), + e); +} +String metadataLocation = null; +Reference reference = client.getRef().getReference(); +try { + Content content =
client.getApi().getContent().key(key).reference(reference).get().get(key); + LOG.debug("Content '{}' at '{}': {}", key, reference, content); + if (content == null) { +if (currentMetadataLocation() != null) { + throw new NoSuchViewException("View does not exist: %s in %s", key, reference); +} + } else { +this.icebergView = +content +.unwrap(IcebergView.class) +.orElseThrow( +() -> { + if (content instanceof IcebergTable) { +return new AlreadyExistsException( +"Table with same name already exists: %s in %s", key, reference); + } else { +return new IllegalStateException( +String.format( +"Cannot refresh Iceberg view: Nessie points to a non-Iceberg object for path: %s.", +key)); + } +}); +metadataLocation = icebergView.getMetadataLocation(); + } +} catch (NessieNotFoundException ex) { + if (currentMetadataLocation() != null) { +throw new NoSuchViewException("View does not exist: %s in %s", key, reference); + } +} +refreshFromMetadataLocation(metadataLocation, null, 2, l -> loadViewMetadata(l, reference)); + } + + private ViewMetadata loadViewMetadata(String metadataLocation, Reference reference) { +ViewMetadata metadata = ViewMetadataParser.read(io().newInputFile(metadataLocation)); +Map newProperties = Maps.newHashMap(metadata.properties()); +newProperties.put(NessieTableOperations.NESSIE_COMMIT_ID_PROPERTY, reference.getHash()); + +return ViewMetadata.buildFrom( + ViewMetadata.buildFrom(metadata).setProperties(newProperties).build()) +
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
nastra commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375089472 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java: ## @@ -0,0 +1,153 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package org.apache.iceberg.nessie; + +import java.util.Map; +import java.util.concurrent.atomic.AtomicBoolean; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.exceptions.NoSuchViewException; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.view.BaseViewOperations; +import org.apache.iceberg.view.ViewMetadata; +import org.apache.iceberg.view.ViewMetadataParser; +import org.projectnessie.client.http.HttpClientException; +import org.projectnessie.error.NessieBadRequestException; +import org.projectnessie.error.NessieConflictException; +import org.projectnessie.error.NessieNotFoundException; +import org.projectnessie.model.Content; +import org.projectnessie.model.ContentKey; +import org.projectnessie.model.IcebergTable; +import org.projectnessie.model.IcebergView; +import org.projectnessie.model.Reference; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class NessieViewOperations extends BaseViewOperations { + + private static final Logger LOG = LoggerFactory.getLogger(NessieViewOperations.class); + + private final NessieIcebergClient client; + + private final ContentKey key; + private final FileIO fileIO; + private final Map<String, String> catalogOptions; + private IcebergView icebergView; + + NessieViewOperations( + ContentKey key, + NessieIcebergClient client, + FileIO fileIO, + Map<String, String> catalogOptions) { +this.key = key; +this.client = client; +this.fileIO = fileIO; +this.catalogOptions = catalogOptions; + } + + @Override + public void doRefresh() { +try { + client.refresh(); +} catch (NessieNotFoundException e) { + throw new RuntimeException( + String.format( + "Failed to refresh as ref '%s' is no longer valid.", client.getRef().getName()), + e); +} +String metadataLocation = null; +Reference reference = client.getRef().getReference(); +try { + Content content =
client.getApi().getContent().key(key).reference(reference).get().get(key); + LOG.debug("Content '{}' at '{}': {}", key, reference, content); + if (content == null) { +if (currentMetadataLocation() != null) { + throw new NoSuchViewException("View does not exist: %s in %s", key, reference); +} + } else { +this.icebergView = +content +.unwrap(IcebergView.class) +.orElseThrow( +() -> { + if (content instanceof IcebergTable) { +return new AlreadyExistsException( +"Table with same name already exists: %s in %s", key, reference); + } else { +return new IllegalStateException( +String.format( +"Cannot refresh Iceberg view: Nessie points to a non-Iceberg object for path: %s.", +key)); + } +}); +metadataLocation = icebergView.getMetadataLocation(); + } +} catch (NessieNotFoundException ex) { + if (currentMetadataLocation() != null) { +throw new NoSuchViewException("View does not exist: %s in %s", key, reference); + } +} +refreshFromMetadataLocation(metadataLocation, null, 2, l -> loadViewMetadata(l, reference)); + } + + private ViewMetadata loadViewMetadata(String metadataLocation, Reference reference) { +ViewMetadata metadata = ViewMetadataParser.read(io().newInputFile(metadataLocation)); +Map newProperties = Maps.newHashMap(metadata.properties()); +newProperties.put(NessieTableOperations.NESSIE_COMMIT_ID_PROPERTY, reference.getHash()); + +return ViewMetadata.buildFrom( + ViewMetadata.buildFrom(metadata).setProperties(newProperties).build()) +
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
nastra commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375090864 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java: ## @@ -347,4 +348,54 @@ private TableIdentifier identifierWithoutTableReference( protected Map<String, String> properties() { return catalogOptions; } + + @Override + protected ViewOperations newViewOps(TableIdentifier identifier) { +TableReference tr = parseTableReference(identifier); +return new NessieViewOperations( +ContentKey.of( + org.projectnessie.model.Namespace.of(identifier.namespace().levels()), tr.getName()), +client.withReference(tr.getReference(), tr.getHash()), +fileIO, +catalogOptions); + } + + @Override + public List<TableIdentifier> listViews(Namespace namespace) { +return client.listViews(namespace); + } + + @Override + public boolean dropView(TableIdentifier identifier) { +TableReference tableReference = parseTableReference(identifier); +return client +.withReference(tableReference.getReference(), tableReference.getHash()) +.dropView(identifierWithoutTableReference(identifier, tableReference), false); + } + + @Override + public void renameView(TableIdentifier from, TableIdentifier to) { +TableReference fromTableReference = parseTableReference(from); +TableReference toTableReference = parseTableReference(to); +String fromReference = +fromTableReference.hasReference() +? fromTableReference.getReference() +: client.getRef().getName(); +String toReference = +toTableReference.hasReference() +? toTableReference.getReference() +: client.getRef().getName(); +Preconditions.checkArgument( +fromReference.equalsIgnoreCase(toReference), +"from: %s and to: %s reference name must be same", Review Comment: I think it would be good to make this error msg clearer
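A clearer message would spell out which references were compared and why they must match. A minimal plain-Java sketch of one possible wording (the wording is my own suggestion, not the PR's final text, and `checkSameReference` is a hypothetical stand-in for Guava's `Preconditions.checkArgument`):

```java
// Stand-in for Guava's Preconditions.checkArgument; illustrates a more
// explicit message for the cross-reference rename check.
public class RenameRefCheck {
    static void checkSameReference(String fromReference, String toReference) {
        if (!fromReference.equalsIgnoreCase(toReference)) {
            throw new IllegalArgumentException(String.format(
                "Cannot rename a view across references: source ref '%s' must match target ref '%s'",
                fromReference, toReference));
        }
    }

    public static void main(String[] args) {
        checkSameReference("main", "MAIN"); // passes: comparison is case-insensitive
        try {
            checkSameReference("main", "dev");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```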
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
nastra commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375094428 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java: ## @@ -347,4 +348,54 @@ private TableIdentifier identifierWithoutTableReference( protected Map<String, String> properties() { return catalogOptions; } + + @Override + protected ViewOperations newViewOps(TableIdentifier identifier) { +TableReference tr = parseTableReference(identifier); +return new NessieViewOperations( +ContentKey.of( + org.projectnessie.model.Namespace.of(identifier.namespace().levels()), tr.getName()), +client.withReference(tr.getReference(), tr.getHash()), +fileIO, +catalogOptions); + } + + @Override + public List<TableIdentifier> listViews(Namespace namespace) { +return client.listViews(namespace); + } + + @Override + public boolean dropView(TableIdentifier identifier) { +TableReference tableReference = parseTableReference(identifier); +return client +.withReference(tableReference.getReference(), tableReference.getHash()) +.dropView(identifierWithoutTableReference(identifier, tableReference), false); + } + + @Override + public void renameView(TableIdentifier from, TableIdentifier to) { +TableReference fromTableReference = parseTableReference(from); +TableReference toTableReference = parseTableReference(to); +String fromReference = +fromTableReference.hasReference() +? fromTableReference.getReference() +: client.getRef().getName(); +String toReference = +toTableReference.hasReference() +?
toTableReference.getReference() +: client.getRef().getName(); +Preconditions.checkArgument( +fromReference.equalsIgnoreCase(toReference), +"from: %s and to: %s reference name must be same", +fromReference, +toReference); + +client +.withReference(fromTableReference.getReference(), fromTableReference.getHash()) +.renameView( +identifierWithoutTableReference(from, fromTableReference), +NessieUtil.removeCatalogName( Review Comment: any particular reason why `NessieUtil.removeCatalogName(..)` is only called on the `to` ref and not on the `from` ref?
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
nastra commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375094889 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java: ## @@ -136,15 +143,23 @@ private UpdateableReference loadReference(String requestedRef, String hash) { } public List<TableIdentifier> listTables(Namespace namespace) { +return listContents(namespace, Content.Type.ICEBERG_TABLE); + } + + public List<TableIdentifier> listViews(Namespace namespace) { +return listContents(namespace, Content.Type.ICEBERG_VIEW); + } + + private List<TableIdentifier> listContents(Namespace namespace, Content.Type type) { Review Comment: ```suggestion private List<TableIdentifier> listContent(Namespace namespace, Content.Type type) { ```
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
nastra commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375097325 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java: ## @@ -355,21 +384,34 @@ public void renameTable(TableIdentifier from, TableIdentifier to) { // and removed by another. throw new RuntimeException( String.format( - "Cannot rename table '%s' to '%s': " + "ref '%s' no longer exists.", - from.name(), to.name(), getRef().getName()), + "Cannot rename %s '%s' to '%s': ref '%s' no longer exists.", + contentType, from, to, getRef().getName()), e); } catch (BaseNessieClientServerException e) { throw new CommitFailedException( e, - "Cannot rename table '%s' to '%s': " + "the current reference is not up to date.", - from.name(), - to.name()); + "Cannot rename %s '%s' to '%s': the current reference is not up to date.", + contentType, + from, + to); } catch (HttpClientException ex) { // Intentionally catch all nessie-client-exceptions here and not just the "timeout" variant // to catch all kinds of network errors (e.g. connection reset). Network code implementation // details and all kinds of network devices can induce unexpected behavior. So better be // safe than sorry. throw new CommitStateUnknownException(ex); +} catch (NessieBadRequestException ex) { + if (ex.getMessage().contains("already exists with content ID")) { Review Comment: same question: Is this an error msg that can be relied upon?
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
nastra commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375097118 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java: ## @@ -355,21 +384,34 @@ public void renameTable(TableIdentifier from, TableIdentifier to) { // and removed by another. throw new RuntimeException( String.format( - "Cannot rename table '%s' to '%s': " + "ref '%s' no longer exists.", - from.name(), to.name(), getRef().getName()), + "Cannot rename %s '%s' to '%s': ref '%s' no longer exists.", + contentType, from, to, getRef().getName()), e); } catch (BaseNessieClientServerException e) { throw new CommitFailedException( e, - "Cannot rename table '%s' to '%s': " + "the current reference is not up to date.", - from.name(), - to.name()); + "Cannot rename %s '%s' to '%s': the current reference is not up to date.", + contentType, + from, + to); } catch (HttpClientException ex) { // Intentionally catch all nessie-client-exceptions here and not just the "timeout" variant // to catch all kinds of network errors (e.g. connection reset). Network code implementation // details and all kinds of network devices can induce unexpected behavior. So better be // safe than sorry. throw new CommitStateUnknownException(ex); +} catch (NessieBadRequestException ex) { + if (ex.getMessage().contains("already exists with content ID")) { +if (isView) { Review Comment: I think something like this would be better here: ``` String errorMsg = isView ? "." : ".." throw new AlreadyExistsException(errorMsg) ```
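The suggestion above is to compute the entity-specific message once and throw a single exception, rather than duplicating the throw inside each branch of the `isView` check. A minimal sketch of that pattern (names are hypothetical; the real code would throw Iceberg's `AlreadyExistsException`, for which `IllegalStateException` stands in here):

```java
// Sketch of the "pick the message, then throw once" pattern from the review.
public class RenameConflictMessage {
    static RuntimeException alreadyExists(boolean isView, String from, String to) {
        String entity = isView ? "View" : "Table";
        // One throw site; only the message varies with the content kind.
        return new IllegalStateException(String.format(
            "Cannot rename %s to %s: %s with the same name already exists",
            from, to, entity.toLowerCase()));
    }

    public static void main(String[] args) {
        System.out.println(alreadyExists(true, "db.v1", "db.v2").getMessage());
    }
}
```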
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
nastra commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375097962 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java: ## @@ -378,27 +420,63 @@ public void renameTable(TableIdentifier from, TableIdentifier to) { // behavior. So better be safe than sorry. } + private void validateContentForRename( + TableIdentifier from, + TableIdentifier to, + boolean isView, + IcebergContent existingFromContent) { +if (existingFromContent == null) { + if (isView) { +throw new NoSuchViewException("View does not exist: %s", from); + } else { +throw new NoSuchTableException("Table does not exist: %s", from); + } +} + +IcebergContent existingToContent = isView ? view(to) : table(to); +if (existingToContent != null) { + if (isView) { +throw new AlreadyExistsException("Cannot rename %s to %s. View already exists", from, to); + } else { +throw new AlreadyExistsException("Cannot rename %s to %s. Table already exists", from, to); + } +} + } + public boolean dropTable(TableIdentifier identifier, boolean purge) { +return dropContent(identifier, purge, false); + } + + public boolean dropView(TableIdentifier identifier, boolean purge) { +return dropContent(identifier, purge, true); + } + + private boolean dropContent(TableIdentifier identifier, boolean purge, boolean isView) { Review Comment: rather than passing a boolean flag, maybe just pass the content type?
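The point of "pass the content type" is readability at the call site: `dropContent(id, purge, true)` says nothing, while `dropContent(id, purge, ContentType.ICEBERG_VIEW)` is self-describing and extends naturally if more content kinds appear. A tiny sketch under that assumption (the enum is a local stand-in for Nessie's `Content.Type`; the helper is hypothetical):

```java
// Keying helpers on a content-type enum instead of a boolean flag.
public class DropByContentType {
    enum ContentType { ICEBERG_TABLE, ICEBERG_VIEW }

    // With an enum parameter, call sites state intent explicitly and a switch
    // can cover new kinds later without re-reading what "true" meant.
    static String noun(ContentType type) {
        return type == ContentType.ICEBERG_VIEW ? "view" : "table";
    }

    public static void main(String[] args) {
        System.out.println(noun(ContentType.ICEBERG_VIEW));
        System.out.println(noun(ContentType.ICEBERG_TABLE));
    }
}
```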
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
nastra commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375098386 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieIcebergClient.java: ## @@ -540,4 +630,72 @@ public void close() { api.close(); } } + + public void commitView( + ViewMetadata base, + ViewMetadata metadata, + String newMetadataLocation, + String contentId, + ContentKey key) + throws NessieConflictException, NessieNotFoundException { +UpdateableReference updateableReference = getRef(); + +updateableReference.checkMutable(); + +Branch current = (Branch) updateableReference.getReference(); +Branch expectedHead = current; +if (base != null) { + String metadataCommitId = + base.properties() + .getOrDefault( + NessieTableOperations.NESSIE_COMMIT_ID_PROPERTY, expectedHead.getHash()); + if (metadataCommitId != null) { +expectedHead = Branch.of(expectedHead.getName(), metadataCommitId); + } +} + +long versionId = metadata.currentVersion().versionId(); + +ImmutableIcebergView.Builder newViewBuilder = ImmutableIcebergView.builder(); +// Directly casting to `SQLViewRepresentation` as only SQL type exist in +// `ViewRepresentation.Type`. +// Assuming only one engine's dialect will be used, Nessie IcebergView currently holds one +// representation. +// View loaded from catalog will have all the representation as it parses the view metadata +// file. +SQLViewRepresentation sqlViewRepresentation = +(SQLViewRepresentation) metadata.currentVersion().representations().get(0); Review Comment: it seems weird to only store one and not all of the representations
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
nastra commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1375098623 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java: ## @@ -135,71 +135,26 @@ protected void doCommit(TableMetadata base, TableMetadata metadata) { boolean newTable = base == null; String newMetadataLocation = writeNewMetadataIfRequired(newTable, metadata); -String refName = client.refName(); -boolean failure = false; +AtomicBoolean failure = new AtomicBoolean(false); Review Comment: why does this need to be `AtomicBoolean`?
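A likely answer, reading the diff (my inference, not confirmed in the thread): if the commit logic was moved into a lambda or callback, a plain local `boolean failure` can no longer be reassigned from inside it, because Java lambdas may only capture effectively-final locals. A single-element mutable holder such as `AtomicBoolean` sidesteps that rule. A minimal illustration:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class LambdaCapturedFlag {
    public static void main(String[] args) {
        AtomicBoolean failure = new AtomicBoolean(false);
        // With `boolean failure = false;`, writing `failure = true;` inside the
        // lambda would not compile: "local variables referenced from a lambda
        // expression must be final or effectively final". Mutating *through*
        // the captured holder is allowed.
        Runnable commitAttempt = () -> failure.set(true);
        commitAttempt.run();
        System.out.println(failure.get()); // true
    }
}
```

If no concurrency is involved, a one-element array or a small mutable holder class would work just as well; `AtomicBoolean` is simply the idiomatic off-the-shelf choice.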
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
danielcweeks commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375098695 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitReadyPayload.java: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import java.util.List; +import java.util.UUID; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.iceberg.avro.AvroSchemaUtil; + +/** + * A control event payload for events sent by a worker that indicates it has finished sending all + * data for a commit request. + */ +public class CommitReadyPayload implements Payload { + + private UUID commitId; + private List assignments; + private final Schema avroSchema; + + private static final Schema AVRO_SCHEMA = + SchemaBuilder.builder() + .record(CommitReadyPayload.class.getName()) + .fields() + .name("commitId") + .prop(AvroSchemaUtil.FIELD_ID_PROP, 1100) Review Comment: I notice that we're using a wide range of field identifier values for the same fields. Is this because we're trying to deconflict with other Payloads/Elements (e.g. 1100, 1200, 1300)? 
I can see the need if the avro schema embeds another element (like `TableName`). Just wondering if there's anything we would really need to be concerned about by having a more explicit/consistent mapping (e.g. `FIELD_ID_PROP = 1000` everywhere).
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
bryanck commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375099924 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitReadyPayload.java: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import java.util.List; +import java.util.UUID; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.iceberg.avro.AvroSchemaUtil; + +/** + * A control event payload for events sent by a worker that indicates it has finished sending all + * data for a commit request. + */ +public class CommitReadyPayload implements Payload { + + private UUID commitId; + private List assignments; + private final Schema avroSchema; + + private static final Schema AVRO_SCHEMA = + SchemaBuilder.builder() + .record(CommitReadyPayload.class.getName()) + .fields() + .name("commitId") + .prop(AvroSchemaUtil.FIELD_ID_PROP, 1100) Review Comment: I originally had it that way, with the same prop ID everywhere, as it isn't used for anything. I changed this to have unique field IDs to address one comment that having real field IDs would be helpful. I'm OK either way. 
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
danielcweeks commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375100608 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/TableName.java: ## @@ -0,0 +1,108 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import static java.util.stream.Collectors.toList; + +import java.util.Arrays; +import java.util.List; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.avro.util.Utf8; +import org.apache.iceberg.catalog.Namespace; +import org.apache.iceberg.catalog.TableIdentifier; + +public class TableName implements Element { Review Comment: I'm a little late to the party here, but I like `TableReference` which I don't believe is currently used anywhere. Also, I think we might want to include the catalog name as well. That I feel could be helpful with the CommitCompletePayload especially since it can help disambiguate where commits are coming from if there are multiple catalogs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Re: [PR] Spark 3.5: Honor Spark conf spark.sql.files.maxPartitionBytes in read split [iceberg]
jzhuge commented on PR #8922: URL: https://github.com/apache/iceberg/pull/8922#issuecomment-1783584994 The PR is ready for review. If approved, we will follow up with doc updates and backports to 3.4, 3.3, etc.
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
bryanck commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375104748 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/TableName.java: ## @@ -0,0 +1,108 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import static java.util.stream.Collectors.toList; + +import java.util.Arrays; +import java.util.List; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.avro.util.Utf8; +import org.apache.iceberg.catalog.Namespace; +import org.apache.iceberg.catalog.TableIdentifier; + +public class TableName implements Element { Review Comment: Yes, `TableReference` is great, I made this change. Currently the sink (which hasn't been submitted) only supports a single catalog, but I went ahead and made this change also. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
danielcweeks commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375105601 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitCompletePayload.java: ## @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import java.util.UUID; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.iceberg.avro.AvroSchemaUtil; + +/** + * A control event payload for events sent by a coordinator that indicates it has completed a commit + * cycle. Events with this payload are not consumed by the sink, they are informational and can be + * used by consumers to trigger downstream processes. + */ +public class CommitCompletePayload implements Payload { + + private UUID commitId; + private Long validThroughTs; + private final Schema avroSchema; Review Comment: Should we include any additional information in this (like a List of `TableName`s that were committed?). I think for this to be useful for downstream consumers, it has surprisingly little info. This might also be helpful because not all tables may get commits, right? 
Other information that could be useful (if available) would be stats (e.g. how many records were added, how many tables were updated). We can always add more to this, so I think the list of tables would be a good starting point.
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
bryanck commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375106832 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitCompletePayload.java: ## @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import java.util.UUID; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.iceberg.avro.AvroSchemaUtil; + +/** + * A control event payload for events sent by a coordinator that indicates it has completed a commit + * cycle. Events with this payload are not consumed by the sink, they are informational and can be + * used by consumers to trigger downstream processes. + */ +public class CommitCompletePayload implements Payload { + + private UUID commitId; + private Long validThroughTs; + private final Schema avroSchema; Review Comment: There is a separate message that is sent for each individual table commit, this message is sent after all of the individual table commit messages have been sent, as a sort of terminator. -- This is an automated message from the Apache Git Service. 
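The sequencing described above (one message per individual table commit, followed by CommitComplete as a terminator) can be sketched for a downstream consumer. This is an illustrative sketch only, assuming hypothetical event and field names, not the connector's real API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the commit cycle's event stream: a COMMIT_TABLE
// event per table, then a COMMIT_COMPLETE event that terminates the cycle.
public class CommitEventConsumer {

    enum Kind { COMMIT_TABLE, COMMIT_COMPLETE }

    static class Event {
        final Kind kind;
        final String commitId;
        final String table; // null for COMMIT_COMPLETE

        Event(Kind kind, String commitId, String table) {
            this.kind = kind;
            this.commitId = commitId;
            this.table = table;
        }
    }

    // Collect the tables committed in one cycle; the terminator itself
    // carries no table list, it only signals that the cycle is finished.
    static List<String> tablesForCommit(List<Event> stream, String commitId) {
        List<String> tables = new ArrayList<>();
        for (Event e : stream) {
            if (!e.commitId.equals(commitId)) {
                continue; // event belongs to a different commit cycle
            }
            if (e.kind == Kind.COMMIT_COMPLETE) {
                return tables; // terminator seen: the batch is complete
            }
            tables.add(e.table);
        }
        throw new IllegalStateException("commit " + commitId + " not yet complete");
    }

    public static void main(String[] args) {
        List<Event> stream = List.of(
            new Event(Kind.COMMIT_TABLE, "c1", "db.a"),
            new Event(Kind.COMMIT_TABLE, "c1", "db.b"),
            new Event(Kind.COMMIT_COMPLETE, "c1", null));
        System.out.println(tablesForCommit(stream, "c1")); // prints [db.a, db.b]
    }
}
```

A consumer that never sees the terminator for a commit ID knows the cycle is still in flight, which is exactly the "informational" role the payload javadoc describes.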
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
danielcweeks commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375106869 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitCompletePayload.java: ## @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import java.util.UUID; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.iceberg.avro.AvroSchemaUtil; + +/** + * A control event payload for events sent by a coordinator that indicates it has completed a commit + * cycle. Events with this payload are not consumed by the sink, they are informational and can be + * used by consumers to trigger downstream processes. + */ +public class CommitCompletePayload implements Payload { + + private UUID commitId; + private Long validThroughTs; + private final Schema avroSchema; Review Comment: Never mind, it looks like that info is part of `CommitTablePayload`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
danielcweeks commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375108167 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/TableName.java: ## @@ -0,0 +1,108 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import static java.util.stream.Collectors.toList; + +import java.util.Arrays; +import java.util.List; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.avro.util.Utf8; +import org.apache.iceberg.catalog.Namespace; +import org.apache.iceberg.catalog.TableIdentifier; + +public class TableName implements Element { Review Comment: I was just thinking that there may be multiple syncs running for different catalogs and if they want something to process those events across multiple topics, it really helps to have the additional context. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
danielcweeks commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375108603 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitReadyPayload.java: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import java.util.List; +import java.util.UUID; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.iceberg.avro.AvroSchemaUtil; + +/** + * A control event payload for events sent by a worker that indicates it has finished sending all + * data for a commit request. + */ +public class CommitReadyPayload implements Payload { + + private UUID commitId; + private List assignments; + private final Schema avroSchema; + + private static final Schema AVRO_SCHEMA = + SchemaBuilder.builder() + .record(CommitReadyPayload.class.getName()) + .fields() + .name("commitId") + .prop(AvroSchemaUtil.FIELD_ID_PROP, 1100) Review Comment: I'm fine with it, just wasn't sure if there was a reason behind the numbering. -- This is an automated message from the Apache Git Service. 
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
danielcweeks commented on PR #8701: URL: https://github.com/apache/iceberg/pull/8701#issuecomment-1783596341 I'm a +1 on moving forward with this. I think there might still be an open question about Iceberg/Avro Schema definitions, but I'm fine with either resolution.
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
rdblue commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1375115365 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/CommitCompletePayload.java: ## @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.connect.events; + +import java.util.UUID; +import org.apache.avro.Schema; +import org.apache.avro.SchemaBuilder; +import org.apache.iceberg.avro.AvroSchemaUtil; + +/** + * A control event payload for events sent by a coordinator that indicates it has completed a commit + * cycle. Events with this payload are not consumed by the sink, they are informational and can be + * used by consumers to trigger downstream processes. + */ +public class CommitCompletePayload implements Payload { + + private UUID commitId; + private Long validThroughTs; + private final Schema avroSchema; + + private static final Schema AVRO_SCHEMA = + SchemaBuilder.builder() + .record(CommitCompletePayload.class.getName()) + .fields() + .name("commitId") + .prop(AvroSchemaUtil.FIELD_ID_PROP, 1000) Review Comment: ID 1000 is used for partition fields. 
Any time those are embedded in a data_file the partition should use those IDs. Might want to start these values at 10,000 instead.
Re: [PR] Core: Ignore split offsets array when split offset is past file length [iceberg]
amogh-jahagirdar commented on code in PR #8925: URL: https://github.com/apache/iceberg/pull/8925#discussion_r1375117854 ## core/src/test/java/org/apache/iceberg/TestSplitPlanning.java: ## @@ -216,6 +217,34 @@ public void testSplitPlanningWithOffsets() { "We should get one task per row group", 32, Iterables.size(scan.planTasks())); } + @Test + public void testSplitPlanningWithCorruptedOffsets() throws IOException { Review Comment: Discussed offline: this test exercises the path but is not rigorous, since the offsets end up being null anyway. For a rigorous test we can go through TableScanUtil.planTaskGroups, which calls FileScanTask.split and exercises the split offset array path. The new test fails without the fix because the number of tasks differs: without the fix there are 2 tasks, 1 per split offset, but the last task is a bad task; with the fix, the split size of 1 byte is used for the 10-byte file, producing 10 tasks.
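The planning behavior described in that comment can be sketched in isolation. The names below are hypothetical stand-ins, not Iceberg's real internals; the sketch only illustrates the rule that when a recorded split offset lies past the end of the file, the offsets are treated as corrupt and the file falls back to fixed-size splits:

```java
import java.util.List;

// Sketch of split planning with possibly-corrupt split offsets.
public class SplitOffsetSketch {

    static boolean offsetsAreValid(List<Long> offsets, long fileLength) {
        // offsets are ascending, so checking the last one is sufficient
        return offsets != null
            && !offsets.isEmpty()
            && offsets.get(offsets.size() - 1) < fileLength;
    }

    // Number of planned tasks: one per offset when the offsets are valid,
    // otherwise ceil(fileLength / splitSize) fixed-size splits.
    static int planTaskCount(List<Long> offsets, long fileLength, long splitSize) {
        if (offsetsAreValid(offsets, fileLength)) {
            return offsets.size();
        }
        return (int) ((fileLength + splitSize - 1) / splitSize);
    }

    public static void main(String[] args) {
        // 10-byte file, 1-byte target splits, second offset past EOF:
        // the corrupt offsets are ignored, yielding 10 tasks instead of 2.
        System.out.println(planTaskCount(List.of(0L, 1000L), 10L, 1L)); // prints 10
    }
}
```

This mirrors the numbers in the comment: 2 tasks per the (bad) offsets without the fix, 10 fixed-size tasks with it.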
Re: [PR] Core: Ignore split offsets array when split offset is past file length [iceberg]
amogh-jahagirdar commented on code in PR #8925: URL: https://github.com/apache/iceberg/pull/8925#discussion_r1375118142 ## core/src/test/java/org/apache/iceberg/TestSplitPlanning.java: ## @@ -216,6 +217,34 @@ public void testSplitPlanningWithOffsets() { "We should get one task per row group", 32, Iterables.size(scan.planTasks())); } + @Test + public void testSplitPlanningWithCorruptedOffsets() throws IOException { +List files16Mb = newFilesWithInvalidOffset(16, 16 * 1024 * 1024, 2); +appendFiles(files16Mb); + +// There will be 4 tasks per file, instead of 2 tasks per file, since the offsets are invalid +// and will be ignored. +TableScan scan = +table +.newScan() +.option(TableProperties.SPLIT_OPEN_FILE_COST, String.valueOf(0)) +.option(TableProperties.SPLIT_SIZE, String.valueOf(4L * 1024 * 1024)); + +Assert.assertEquals( Review Comment: Done. This test is no longer applicable, but the new test uses AssertJ.
Re: [PR] Spec: add nanosecond timestamp types [iceberg]
jacobmarble commented on code in PR #8683: URL: https://github.com/apache/iceberg/pull/8683#discussion_r1375124724 ## format/spec.md: ## @@ -187,10 +189,11 @@ A **`map`** is a collection of key-value pairs with a key type and a value type. Notes: 1. Decimal scale is fixed and cannot be changed by schema evolution. Precision can only be widened. -2. All time and timestamp values are stored with microsecond precision. -- Timestamps _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical). -- Timestamps _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). Timestamp values are stored as a long that encodes microseconds from the unix epoch. +2. `time`, `timestamp`, and `timestamptz` values are represented with _microsecond precision_. `timestamp_ns` and `timstamptz_ns` values are represented with _nanosecond precision_. +- Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical). +- Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). 3. Character strings must be stored as UTF-8 encoded byte arrays. +4. `timestamp_ns` and `timstamptz_ns` are only supported in v3 tables. Review Comment: Ah, I missed that detail. How's this? https://github.com/apache/iceberg/pull/8683/commits/e81df5543b82ac30dd4bb7f38b22e58f43f71a51 -- This is an automated message from the Apache Git Service. 
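The distinction the spec text draws can be made concrete: a v2 `timestamp` value is a long counting microseconds from the Unix epoch, while the proposed `timestamp_ns` counts nanoseconds. A minimal sketch with plain `java.time` (the method names here are illustrative, not from the spec):

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Same instant, two long encodings at different resolutions.
public class TimestampPrecision {

    // microseconds from the Unix epoch (v2 `timestamp` representation)
    static long toEpochMicros(Instant ts) {
        return ChronoUnit.MICROS.between(Instant.EPOCH, ts);
    }

    // nanoseconds from the Unix epoch (proposed v3 `timestamp_ns`)
    static long toEpochNanos(Instant ts) {
        return ChronoUnit.NANOS.between(Instant.EPOCH, ts);
    }

    public static void main(String[] args) {
        Instant ts = Instant.parse("2017-11-17T01:10:34.000000001Z");
        // the microsecond encoding truncates the trailing nanosecond;
        // the nanosecond encoding keeps it
        System.out.println(toEpochNanos(ts) - 1_000L * toEpochMicros(ts)); // prints 1
    }
}
```

This also makes the v3-only restriction plausible: a reader that interprets the long as microseconds would be off by a factor of 1000 on a `timestamp_ns` column, so older readers must reject it.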
Re: [I] HadoopCatalog can't list toplevel tables [iceberg]
github-actions[bot] commented on issue #7130: URL: https://github.com/apache/iceberg/issues/7130#issuecomment-1783628604 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Re: [I] HadoopCatalog can't list toplevel tables [iceberg]
github-actions[bot] closed issue #7130: HadoopCatalog can't list toplevel tables URL: https://github.com/apache/iceberg/issues/7130
Re: [I] Unable to use GlueCatalog in flink environments without hadoop [iceberg]
rajcoolguy commented on issue #3044: URL: https://github.com/apache/iceberg/issues/3044#issuecomment-1783634807 @mgmarino - how did you make this work in KDA? The documentation is lacking, and I keep landing in a linkage error caused by the Hadoop jars, even though I have shaded and relocated the Hadoop classes.
Re: [PR] patch: Parquet Column Names with "Special Characters" fix [iceberg-python]
MarquisC commented on PR #109: URL: https://github.com/apache/iceberg-python/pull/109#issuecomment-1783661783 @Fokko thanks for the guidance and reference, I'll get things working with CI in a bit. Would I be able to have this assigned to me?
Re: [PR] Core: Use avro compression properties from table properties when writing manifests and manifest lists [iceberg]
wypoon commented on code in PR #6799: URL: https://github.com/apache/iceberg/pull/6799#discussion_r1375148065 ## core/src/main/java/org/apache/iceberg/ManifestFiles.java: ## @@ -157,11 +157,34 @@ public static ManifestWriter write(PartitionSpec spec, OutputFile outp */ public static ManifestWriter write( int formatVersion, PartitionSpec spec, OutputFile outputFile, Long snapshotId) { +return write(formatVersion, spec, outputFile, snapshotId, null, null); + } + + /** + * Create a new {@link ManifestWriter} for the given format version. + * + * @param formatVersion a target format version + * @param spec a {@link PartitionSpec} + * @param outputFile an {@link OutputFile} where the manifest will be written + * @param snapshotId a snapshot ID for the manifest entries, or null for an inherited ID + * @param compressionCodec compression codec for the manifest file + * @param compressionLevel compression level of the compressionCodec + * @return a manifest writer + */ + public static ManifestWriter write( Review Comment: @sumeetgajjar originally used a `Map` parameter for this , and @rdblue commented that "we don't want to pass a map of properties around. That's exposing too much where it doesn't need to be, and people tend to misuse generic arguments like this." What I propose to do then is to introduce a `ManifestWriter.Options` class and use that here (instead of a `Map`). I'll also introduce a `ManifestListWriter.Options` class and use that in `ManifestLists.write`. These `Options` classes define what additional parameters are applicable and may be set. If in future, additional parameters are needed, they can be added to these `Options` classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
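The `ManifestWriter.Options` class proposed in the review comment above does not exist in Iceberg yet. As a rough sketch of the pattern being suggested (all class, method, and field names here are hypothetical, not Iceberg's actual API), such an options holder might be an immutable class built via a builder:

```java
// Hypothetical sketch of the proposed Options pattern; names are illustrative only.
// An Options class constrains which extra parameters callers may set, unlike a
// generic Map<String, String> of properties.
class ManifestWriterOptions {
  private final String compressionCodec; // e.g. "gzip", "zstd"; null = use the default
  private final Integer compressionLevel; // null = use the codec's default level

  private ManifestWriterOptions(String codec, Integer level) {
    this.compressionCodec = codec;
    this.compressionLevel = level;
  }

  public static Builder builder() {
    return new Builder();
  }

  public String compressionCodec() {
    return compressionCodec;
  }

  public Integer compressionLevel() {
    return compressionLevel;
  }

  public static class Builder {
    private String codec;
    private Integer level;

    public Builder compressionCodec(String newCodec) {
      this.codec = newCodec;
      return this;
    }

    public Builder compressionLevel(int newLevel) {
      this.level = newLevel;
      return this;
    }

    public ManifestWriterOptions build() {
      return new ManifestWriterOptions(codec, level);
    }
  }
}
```

The design point from the quoted review: new knobs are added as explicit builder methods rather than free-form map keys, so misuse is caught at compile time.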
Re: [PR] added contributing.md file [iceberg-python]
onemriganka commented on PR #102: URL: https://github.com/apache/iceberg-python/pull/102#issuecomment-1783691176

> @onemriganka What do you think of pointing to https://py.iceberg.apache.org/contributing/? Otherwise we have to maintain this at two places

Yes, https://py.iceberg.apache.org/contributing/ is great. But I think beginners only look for familiar files like README.md on GitHub, so having a CONTRIBUTING.md file in the repo will make it easier for beginners to understand and contribute. Professionals check the project's website, but many beginners do not.
Re: [PR] Core: Enable column statistics filtering after planning [iceberg]
stevenzwu commented on code in PR #8803: URL: https://github.com/apache/iceberg/pull/8803#discussion_r1375163433

## core/src/main/java/org/apache/iceberg/BaseFile.java:

```java
@@ -504,6 +508,27 @@ private static Map toReadableByteBufferMap(Map
+  private static <TypeT> Map<Integer, TypeT> filterColumnsStats(
+      Map<Integer, TypeT> map, Set<Integer> columnIds) {
+    if (columnIds == null || columnIds.isEmpty()) {
+      return SerializableMap.copyOf(map);
+    }
+
+    if (map == null) {
+      return null;
+    }
+
+    Map<Integer, TypeT> filtered = Maps.newHashMapWithExpectedSize(columnIds.size());
+    for (Integer columnId : columnIds) {
+      TypeT value = map.get(columnId);
+      if (value != null) {
```

Review Comment: thanks. Double fetching is probably OK, since `columnsToKeepStats` is likely not big and `Map.get` is O(1) anyway. The alternative is to start a discussion and revisit the spec or the Parquet implementation.
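Restated outside the diff context, the filtering logic under review amounts to the following standalone sketch (it uses plain `HashMap` in place of Iceberg's `SerializableMap`/`Maps` helpers, so it is an approximation of the actual code, not a copy of it):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class ColumnStatsFilter {
  // Approximation of the filterColumnsStats logic discussed above,
  // using plain HashMap instead of Iceberg's SerializableMap/Maps helpers.
  static <T> Map<Integer, T> filterColumnsStats(Map<Integer, T> map, Set<Integer> columnIds) {
    if (columnIds == null || columnIds.isEmpty()) {
      // no filter requested: keep all column stats
      return map == null ? null : new HashMap<>(map);
    }
    if (map == null) {
      return null;
    }
    // one map.get per requested column: O(columnIds.size()) lookups overall
    Map<Integer, T> filtered = new HashMap<>(columnIds.size());
    for (Integer columnId : columnIds) {
      T value = map.get(columnId);
      if (value != null) {
        filtered.put(columnId, value);
      }
    }
    return filtered;
  }
}
```

Iterating over the (small) set of requested column IDs rather than the full stats map keeps the cost proportional to the number of columns whose stats are kept.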
Re: [PR] Spec: add nanosecond timestamp types [iceberg]
jacobmarble commented on PR #8683: URL: https://github.com/apache/iceberg/pull/8683#issuecomment-1783700784

> @jacobmarble indeed, thanks! do we expect any type promotions being allowed around the new types being added here?

@findepi this was discussed in the last community meeting. We agreed to scope this PR, and #8657, to just adding the type, and let type promotion be a separate issue.
Re: [I] Flink: Add support for Flink 1.18 [iceberg]
YesOrNo828 commented on issue #8930: URL: https://github.com/apache/iceberg/issues/8930#issuecomment-1783712015

The list of supported Flink versions for each connector: https://flink.apache.org/downloads/#apache-flink-connectors
Re: [I] Spark write abort result in table miss metadata location file [iceberg]
dyno commented on issue #8927: URL: https://github.com/apache/iceberg/issues/8927#issuecomment-1783720979

Probably the error message just refers to the data files and manifest files, not the metadata location file? I am sure the metadata location file was gone, and the S3 access log shows the deletion was initiated by the same cluster at around the same time.
Re: [I] Flink: Add support for Flink 1.18 [iceberg]
pvary commented on issue #8930: URL: https://github.com/apache/iceberg/issues/8930#issuecomment-1783722052

> > then we will have a transitive dependency on Flink 1.18. We can exclude the dependency, but how can we make sure that everything works as expected without running the tests.
>
> You're right. There'll be a transitive dependency. If Flink 1.17's APIs are incompatible with Flink 1.18, I think we can copy some tests into the Flink 1.17 module to ensure it works fine and add end-to-end tests to guarantee compatibility with different Flink versions

I think the issue here is that we cannot be sure in advance which dependencies need to be copied over. I tried to upgrade to Flink 1.18 yesterday and found that there is a Flink change which causes the TestFlinkMetaDataTable#testAllFilesPartitioned test to fail. See: https://apache-flink.slack.com/archives/C03G7LJTS2G/p1698402761715879?thread_ts=1698402761.715879&cid=C03G7LJTS2G

One can argue that these issues are found at upgrade time, but if we add features between migrations, then we will only find them if we run all of the tests. So I think running all of the tests on every supported Flink version is a must.

> > As an alternative we can move these classes around every time when it is needed - but this approach has its own maintenance costs.
>
> As of Flink 1.18, released a few days ago, the Flink community has externalized the [JDBC](https://github.com/apache/flink-connector-jdbc), [ES](https://github.com/apache/flink-connector-elasticsearch), [Kafka](https://github.com/apache/flink-connector-kafka), [pulsar](https://github.com/apache/flink-connector-pulsar), [HBase](https://github.com/apache/flink-connector-hbase), and so on connectors. That means the Flink API has become more compatible. So I think this maintenance cost between different Flink versions is acceptable.

I am currently working on extending the Flink Sink API to provide features needed for Iceberg.
See:
- https://cwiki.apache.org/confluence/display/FLINK/FLIP-371%3A+Provide+initialization+context+for+Committer+creation+in+TwoPhaseCommittingSink
- https://cwiki.apache.org/confluence/display/FLINK/FLIP-372%3A+Allow+TwoPhaseCommittingSink+WithPreCommitTopology+to+alter+the+type+of+the+Committable

I would like to use them ASAP, and this will cause differences between the 1.18/1.19 versions of the Iceberg connector. If we have flink-common, then until 1.18 this code should be in common; after 1.19 it should go to the version-specific code; and after we drop support for 1.18 it will go back to the common code again. We would also need to come up with the appropriate abstractions for separating the changing code from the fixed code. These issues might be even more pronounced when Flink 2.0 comes out, for which the discussion has already started. IMHO these are more complex tasks than simply backporting the changes to the other branches. Maybe that is why the Iceberg-Spark code is handled in the same way.
Re: [I] Spark write abort result in table miss metadata location file [iceberg]
dyno commented on issue #8927: URL: https://github.com/apache/iceberg/issues/8927#issuecomment-1783724655

Or the main problem is that Iceberg should not update the metadata location in the Hive Metastore before the write has actually completed. The symptom is that the write failed but the metadata location was updated, so even if the metadata location file were not deleted, the table would not be in a consistent state.
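For context, the commit ordering the comment above argues for (update the catalog pointer only after all files are durably written) can be sketched as a toy sequence. Everything below is illustrative pseudologic, not Iceberg's actual HiveTableOperations implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the commit ordering discussed above: the Hive Metastore
// metadata_location swap is the last step, so a failed write never leaves
// the catalog pointing at incomplete metadata. Names are hypothetical.
class CommitOrderSketch {
  static List<String> commit(boolean writeSucceeds) {
    List<String> log = new ArrayList<>();
    log.add("write data files");
    log.add("write new metadata.json");
    if (!writeSucceeds) {
      // on failure, clean up staged files and leave the catalog pointer untouched
      log.add("abort: delete staged files, keep old metadata location");
      return log;
    }
    // only now atomically swap the pointer in the Hive Metastore
    log.add("HMS: update metadata_location old -> new");
    return log;
  }
}
```

Under this ordering, the failure mode described in the issue (pointer updated, then metadata file deleted by abort cleanup) cannot occur, because the pointer is never moved until the write has fully succeeded.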