Re: [I] Move `_determine_partitions` to `pyarrow.py` [iceberg-python]
HonahX closed issue #896: Move `_determine_partitions` to `pyarrow.py` URL: https://github.com/apache/iceberg-python/issues/896 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Rename `data_sequence_number` to `sequence_number` [iceberg-python]
HonahX closed issue #893: Rename `data_sequence_number` to `sequence_number` URL: https://github.com/apache/iceberg-python/issues/893
Re: [I] Namespace not exists after creating the namespace by REST [iceberg]
nastra closed issue #10656: Namespace not exists after creating the namespace by REST URL: https://github.com/apache/iceberg/issues/10656
Re: [I] Namespace not exists after creating the namespace by REST [iceberg]
nastra commented on issue #10656: URL: https://github.com/apache/iceberg/issues/10656#issuecomment-210304

The Tabular catalog doesn't support nested namespaces like `Namespace.of("public", "default")`. Given that this is a vendor-specific thing, I'll go ahead and close this issue.
Re: [PR] mr:Fix issues 10639 [iceberg]
lurnagao-dahua commented on code in PR #10661: URL: https://github.com/apache/iceberg/pull/10661#discussion_r1671665042

## mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
## @@ -144,21 +147,32 @@ public List getSplits(JobContext context) {
 InputFormatConfig.InMemoryDataModel model =
     conf.getEnum(InputFormatConfig.IN_MEMORY_DATA_MODEL, InputFormatConfig.InMemoryDataModel.GENERIC);
-try (CloseableIterable tasksIterable = scan.planTasks()) {
-  Table serializableTable = SerializableTable.copyOf(table);
-  tasksIterable.forEach(
-      task -> {
-        if (applyResidual
-            && (model == InputFormatConfig.InMemoryDataModel.HIVE
-                || model == InputFormatConfig.InMemoryDataModel.PIG)) {
-          // TODO: We do not support residual evaluation for HIVE and PIG in memory data model
-          // yet
-          checkResiduals(task);
-        }
-        splits.add(new IcebergSplit(serializableTable, conf, task));
-      });
-} catch (IOException e) {
-  throw new UncheckedIOException(String.format("Failed to close table scan: %s", scan), e);
+final ExecutorService workerPool =
+    ThreadPools.newWorkerPool(
+        "iceberg-plan-worker-pool",
+        conf.getInt(
+            SystemConfigs.WORKER_THREAD_POOL_SIZE.propertyKey(),
+            ThreadPools.WORKER_THREAD_POOL_SIZE));
+try {
+  scan = scan.planWith(workerPool);

Review Comment: > This test can also run in non-kerberos env, so i think you can use `org.apache.hadoop.security.UserGroupInformation` to login or get ugi info, then you don't need to launch the MiniKdc service.

In the local file system, process users are used to read and write files, not UGI users. File permissions are determined by the permissions of the Linux system, so how can this issue be verified? Moreover, the worker pool shuts down after execution, so the actual UGI cannot be obtained in the thread, and the scope of the `planInputSplits` method is private. Can you provide further guidance on how to write this unit test case? Looking forward to your reply very much!
Re: [PR] mr:Fix issues 10639 [iceberg]
lurnagao-dahua commented on code in PR #10661: URL: https://github.com/apache/iceberg/pull/10661#discussion_r1673566838

## mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
Review Comment: > This test can also run in non-kerberos env, so i think you can use `org.apache.hadoop.security.UserGroupInformation` to login or get ugi info, then you don't need to launch the MiniKdc service.

Thank you for your reply! Sorry for bothering you. May I ask if you could review my code again when you have some free time? I would really appreciate it!
Re: [PR] mr:Fix issues 10639 [iceberg]
lurnagao-dahua commented on code in PR #10661: URL: https://github.com/apache/iceberg/pull/10661#discussion_r1671717882

## mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
Review Comment: Thank you for your reply! The scope of the `planInputSplits` method is private.
Re: [I] iceberg-aws-bundle jar includes org.slf4j.LoggerFactory [iceberg]
nastra commented on issue #10534: URL: https://github.com/apache/iceberg/issues/10534#issuecomment-268080

> Typically, I'd expect iceberg-aws-bundle to declare a dependency on slf4j-api

The general idea of `iceberg-aws-bundle` is to be a fat jar that bundles everything related to AWS dependencies, meaning that such a jar wouldn't declare any other dependencies. Have you tried excluding slf4j when you consume `iceberg-aws-bundle` as a dependency via `exclude group: 'org.slf4j'`?
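A hedged sketch of the exclusion nastra suggests, assuming a Gradle build; the version coordinate is illustrative, not taken from the thread:

```groovy
dependencies {
    implementation('org.apache.iceberg:iceberg-aws-bundle:1.6.0') {
        // Drop org.slf4j from this dependency's transitive graph, as suggested above.
        // Note: this only removes declared transitive dependencies; classes already
        // shaded into the fat jar itself are unaffected.
        exclude group: 'org.slf4j'
    }
}
```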
Re: [PR] mr:Fix issues 10639 [iceberg]
lurnagao-dahua commented on code in PR #10661: URL: https://github.com/apache/iceberg/pull/10661#discussion_r1670390955

## mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
Review Comment: > Looks good! After this change, each fetch task will new its own thread pool, so the ugi problem is gone.
>
> But it would be good if you can try to add a unit test.

Hi, thank you for your reply! I have been thinking for a while and believe that a unit test case covering the output results already exists. But to verify the fixed bug, should I grant the user permissions on the table? I'm not sure how to grant permissions in unit test cases. May I ask if you can give me some guidance on writing the unit test? Looking forward to your reply very much!
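The fix discussed in this thread creates a worker pool per planning call and tears it down afterwards, so no pool (or the credentials its threads captured) outlives the request. A language-neutral sketch of that pattern, with illustrative names (`plan_splits`, `plan_one_task` are not Iceberg API):

```python
from concurrent.futures import ThreadPoolExecutor

def plan_one_task(task):
    # Stand-in for turning one scan task into a split.
    return f"split:{task}"

def plan_splits(tasks, pool_size=4):
    # Fresh pool per call, mirroring ThreadPools.newWorkerPool(...) in the diff.
    worker_pool = ThreadPoolExecutor(
        max_workers=pool_size, thread_name_prefix="iceberg-plan-worker")
    try:
        # Plan every task on the short-lived pool; map preserves input order.
        return list(worker_pool.map(plan_one_task, tasks))
    finally:
        # Mirrors the workerPool shutdown in a finally block, so the pool
        # never leaks past this call.
        worker_pool.shutdown(wait=True)
```

Because the pool is scoped to the call, threads never carry state (such as a cached UGI) from one caller into another.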
Re: [PR] Build: Declare avro as an api dependency of iceberg-core [iceberg]
nastra commented on PR #10573: URL: https://github.com/apache/iceberg/pull/10573#issuecomment-329253

Adding some additional context around `api` vs `implementation` for other reviewers:

> Dependencies appearing in the api configurations will be transitively exposed to consumers of the library, and as such will appear on the compile classpath of consumers. Dependencies found in the implementation configuration will, on the other hand, not be exposed to consumers, and therefore not leak into the consumers' compile classpath
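The quoted Gradle documentation can be illustrated with a hedged config fragment (the coordinates and versions here are illustrative only, not the PR's actual dependency list):

```groovy
dependencies {
    // api: consumers of this library see avro on their own compile classpath,
    // so types from it may appear in this library's public signatures.
    api 'org.apache.avro:avro:1.11.3'
    // implementation: an internal detail; consumers get it at runtime but it
    // does not leak into their compile classpath.
    implementation 'org.apache.commons:commons-lang3:3.14.0'
}
```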
Re: [PR] PyArrow: Don't enforce the schema [iceberg-python]
Fokko commented on code in PR #902: URL: https://github.com/apache/iceberg-python/pull/902#discussion_r1673729169

## tests/integration/test_deletes.py:
## @@ -291,7 +291,7 @@ def test_partitioned_table_positional_deletes_sequence_number(spark: SparkSessio
 assert snapshots[2].summary == Summary(
     Operation.OVERWRITE,
     **{
-        "added-files-size": "1145",
+        "added-files-size": "1125",
Review Comment: Eagle eye, thanks!
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
Fokko commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1673737827

## pyiceberg/io/pyarrow.py:
## @@ -937,7 +937,7 @@ def primitive(self, primitive: pa.DataType) -> PrimitiveType:
 else:
     raise TypeError(f"Unsupported precision for timestamp type: {primitive.unit}")
-if primitive.tz == "UTC" or primitive.tz == "+00:00":
+if primitive.tz in ("UTC", "+00:00", "Etc/UTC"):
Review Comment: Supernit:
```suggestion
if primitive.tz in {"UTC", "+00:00", "Etc/UTC"}:
```
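The check above can be sketched in isolation; `UTC_ALIASES` and `is_utc` are illustrative names, not part of pyiceberg's API, but the accepted spellings come straight from the diff:

```python
# The three spellings of UTC handled by the diff: PyArrow's plain "UTC",
# the fixed offset "+00:00", and the IANA zone name "Etc/UTC".
UTC_ALIASES = {"UTC", "+00:00", "Etc/UTC"}

def is_utc(tz: str) -> bool:
    # Set membership is an average O(1) hash lookup, versus a linear scan
    # for a tuple, which is why the review suggests a set literal here.
    return tz in UTC_ALIASES
```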
Re: [PR] Add checkstyle rule for uppercase constant fields [iceberg]
findepi commented on code in PR #10673: URL: https://github.com/apache/iceberg/pull/10673#discussion_r1673751208

## .baseline/checkstyle/checkstyle.xml:
## @@ -284,6 +284,10 @@
Review Comment: Both messages are clear and actionable. My preference would be not to add configuration that is not needed. I noticed we do set this for other checks, and I believe we can remove those other configurations too, in places where they do not add much value.
Re: [PR] Add checkstyle rule for uppercase constant fields [iceberg]
findepi commented on code in PR #10673: URL: https://github.com/apache/iceberg/pull/10673#discussion_r1673753534

## .baseline/checkstyle/checkstyle.xml:
## @@ -284,6 +284,10 @@
Review Comment: to leverage possessive quantifiers replace
```
^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$
```
with
```
^[A-Z][A-Z0-9]*+(_[A-Z0-9]++)*+$
```
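The base pattern from this checkstyle rule can be exercised on its own; the possessive-quantifier variant (`*+`/`++`) accepts the same language and only prunes backtracking. A small sketch (the helper name is illustrative):

```python
import re

# Uppercase constant names: an uppercase alphanumeric head, then
# underscore-separated uppercase alphanumeric groups.
CONSTANT_NAME = re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$")

def is_constant_name(name: str) -> bool:
    # fullmatch mirrors the anchored ^...$ pattern used by checkstyle.
    return CONSTANT_NAME.fullmatch(name) is not None
```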
Re: [PR] DynConstructors cleanup [iceberg]
findepi commented on code in PR #10542: URL: https://github.com/apache/iceberg/pull/10542#discussion_r1673801761

## common/src/test/java/org/apache/iceberg/common/TestDynConstructors.java:
## @@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.common;
+
+import static org.assertj.core.api.Assertions.assertThat;
+import static org.assertj.core.api.Assertions.assertThatThrownBy;
+
+import org.junit.jupiter.api.Test;
+
+public class TestDynConstructors {
+  @Test
+  public void testImplNewInstance() throws Exception {
+    DynConstructors.Ctor ctor =
+        DynConstructors.builder().impl(MyClass.class).buildChecked();
+    assertThat(ctor.newInstance()).isInstanceOf(MyClass.class);
+  }
+
+  @Test
+  public void testInterfaceImplNewInstance() throws Exception {
+    DynConstructors.Ctor ctor =
+        DynConstructors.builder(MyInterface.class)
+            .impl("org.apache.iceberg.common.TestDynConstructors$MyClass")
+            .buildChecked();
+    assertThat(ctor.newInstance()).isInstanceOf(MyClass.class);
+  }
+
+  @Test
+  public void testInterfaceWrongImplString() throws Exception {
+    DynConstructors.Ctor ctor =
+        DynConstructors.builder(MyInterface.class)
+            // TODO this should throw, since the MyUnrelatedClass does not implement MyInterface
+            .impl("org.apache.iceberg.common.TestDynConstructors$MyUnrelatedClass")
+            .buildChecked();
+    assertThatThrownBy(ctor::newInstance)
+        .isInstanceOf(ClassCastException.class)
+        .hasMessage(
+            "org.apache.iceberg.common.TestDynConstructors$MyUnrelatedClass cannot be cast to org.apache.iceberg.common.TestDynConstructors$MyInterface");
+  }
+
+  @Test
+  public void testInterfaceWrongImplClass() throws Exception {
+    DynConstructors.Ctor ctor =
+        DynConstructors.builder(MyInterface.class)
+            // TODO this should throw or not compile at all, since the MyUnrelatedClass does not

Review Comment: I can give this a try @rdblue. One concern is that the DynConstructors and other Dyn* classes appear to be considered a part of the iceberg-common API, so changes around generics would technically be breaking changes. @rdblue how would you recommend going about this?
Re: [PR] PyArrow: Don't enforce the schema [iceberg-python]
Fokko merged PR #902: URL: https://github.com/apache/iceberg-python/pull/902
Re: [PR] Spec: Clarify which columns can be used for equality delete files. [iceberg]
Fokko commented on code in PR #8981: URL: https://github.com/apache/iceberg/pull/8981#discussion_r1673839996

## format/spec.md:
## @@ -842,7 +842,8 @@ The rows in the delete file must be sorted by `file_path` then `pos` to optimize
 Equality delete files identify deleted rows in a collection of data files by one or more column values, and may optionally contain additional columns of the deleted row.
-Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](#manifests). Float and double columns cannot be used as delete columns in equality delete files.
+Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](#manifests). The column restrictions for columns used in equality delete files are the same as those for [identifier fields](#identifier-field-ids) with the exception that optional columns and columns nested under optional structs are allowed (if a
+parent struct column is null it implies the leaf column is null).
Review Comment: Nit: removes the linebreak
```suggestion
Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](#manifests). The column restrictions for columns used in equality delete files are the same as those for [identifier fields](#identifier-field-ids) with the exception that optional columns and columns nested under optional structs are allowed (if a parent struct column is null it implies the leaf column is null).
```

Review Comment: Thanks for the clarification, this makes sense to me 👍
Re: [PR] Spec: Clarify time travel implementation in Iceberg [iceberg]
Fokko commented on code in PR #8982: URL: https://github.com/apache/iceberg/pull/8982#discussion_r1673841605

## format/spec.md:
## @@ -1370,3 +1370,16 @@ Writing v2 metadata:
 * `sort_columns` was removed
 Note that these requirements apply when writing data to a v2 table. Tables that are upgraded from v1 may contain metadata that does not follow these requirements. Implementations should remain backward-compatible with v1 metadata requirements.
+
+## Appendix F: Implementation Notes
+
+This section covers topics not required by the specification but recommendations for systems implementing the Iceberg specification
+to help maintain a uniform experience.
+
+### Point in Time Reads (Time Travel)
+
+Iceberg supports two types of histories for tables. A history of previous "current snapshots" stored in ["snapshot-log" table metadata](#table-metadata-fields) and [parent-child lineage stored in "snapshots"](#table-metadata-fields). These two histories
Review Comment: Gentle ping @aokolnychyi
Re: [PR] Spec: Clarify time travel implementation in Iceberg [iceberg]
Fokko commented on code in PR #8982: URL: https://github.com/apache/iceberg/pull/8982#discussion_r1673855957

## format/spec.md:
+Iceberg supports two types of histories for tables. A history of previous "current snapshots" stored in ["snapshot-log" table metadata](#table-metadata-fields) and [parent-child lineage stored in "snapshots"](#table-metadata-fields). These two histories
+might indicate different snapshot IDs for a specific timestamp. The discrepencies can be caused by a variety of table operations (e.g. branch-merge table workflows or forcing the current state of table to specific snapshot ID).
Review Comment: I think this explains it all, after reading the comment https://github.com/apache/iceberg/pull/5364#issuecomment-1227902420:
```suggestion
might indicate different snapshot IDs for a specific timestamp. The discrepancies can be caused by a variety of table operations (e.g. updating the `current-snapshot-id` of the table).
```
Re: [PR] Spec: Clarify time travel implementation in Iceberg [iceberg]
Fokko commented on code in PR #8982: URL: https://github.com/apache/iceberg/pull/8982#discussion_r1673855957

## format/spec.md:
Review Comment: I think this explains it all (to me at least :), after reading the comment on PR https://github.com/apache/iceberg/pull/5364#issuecomment-1227902420:
```suggestion
might indicate different snapshot IDs for a specific timestamp. The discrepancies can be caused by a variety of table operations (e.g. updating the `current-snapshot-id` of the table).
```
Re: [PR] mr:Fix issues 10639 [iceberg]
lurnagao-dahua commented on PR #10661: URL: https://github.com/apache/iceberg/pull/10661#issuecomment-698145 @lurnagao-dahua -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Support building with Java 21 [iceberg]
nastra commented on code in PR #10474: URL: https://github.com/apache/iceberg/pull/10474#discussion_r1673871100

## baseline.gradle: ## @@ -46,7 +46,13 @@ subprojects {

  apply plugin: 'com.palantir.baseline-exact-dependencies'
  apply plugin: 'com.palantir.baseline-release-compatibility'
  // We need to update Google Java Format to 1.17.0+ to run spotless on JDK 8, but that requires dropping support for JDK 8.
- if (JavaVersion.current() != JavaVersion.VERSION_21) {
+ if (JavaVersion.current() == JavaVersion.VERSION_21) {
+   task spotlessApply {
+     doLast {
+       throw new GradleException("Spotless plugin is currently disabled when running on JDK 21 (until we drop JDK 8). To run spotlessApply please use a different JDK version.")

Review Comment: thanks for adding that msg, that's going to be helpful

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add checkstyle rule for uppercase constant fields [iceberg]
attilakreiner commented on code in PR #10673: URL: https://github.com/apache/iceberg/pull/10673#discussion_r1673884503 ## .baseline/checkstyle/checkstyle.xml: ## @@ -284,6 +284,10 @@ + + Review Comment: Thanks, I see. I experimented a bit with the possessive quantifiers in the regexes, as well as removing some custom messages that seem redundant. I created a draft PR with these changes: #10681. Even though it seems like just a few lines of changes, it has an impact on the whole codebase and I don't want to hold up this one. (I'd prefer to keep the scope of this PR tight, really just renaming the constants to uppercase and enforcing that rule.) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add checkstyle rule for uppercase constant fields [iceberg]
attilakreiner commented on code in PR #10673: URL: https://github.com/apache/iceberg/pull/10673#discussion_r1673885080 ## .baseline/checkstyle/checkstyle.xml: ## @@ -284,6 +284,10 @@ + + + Review Comment: Got it. Pls see my previous comment about https://github.com/apache/iceberg/pull/10681. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add checkstyle rule for uppercase constant fields [iceberg]
nastra commented on code in PR #10673: URL: https://github.com/apache/iceberg/pull/10673#discussion_r1673890383 ## .baseline/checkstyle/checkstyle.xml: ## @@ -284,6 +284,10 @@ + + Review Comment: I think for this PR we should just stick with the [default regex pattern of checkstyle](https://checkstyle.sourceforge.io/checks/naming/constantname.html), which I believe is what you had initially. Improving the regex can be done in a separate PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
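Checkstyle naming rules are plain Java regexes, so the effect of the default `ConstantName` pattern referenced above is easy to sanity-check. The pattern below is what checkstyle documents as the `ConstantName` default (stated here from its docs, so treat it as an assumption and verify against your checkstyle version); Java and Python regex syntax agree for this expression:

```python
import re

# Checkstyle's documented default ConstantName format (assumption:
# copied from the checkstyle docs, minus the log(ger) exception the
# thread mentions removing).
CONSTANT_NAME = re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$")

for name in ["MAX_FILE_SIZE", "V2", "maxFileSize", "MAX__SIZE"]:
    ok = CONSTANT_NAME.match(name) is not None
    # uppercase words joined by single underscores pass; camelCase
    # and doubled underscores are flagged as violations
    print(f"{name}: {'ok' if ok else 'violation'}")
```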
Re: [PR] Deprecate `oauth/tokens` endpoint [iceberg]
jbonofre commented on code in PR #10603: URL: https://github.com/apache/iceberg/pull/10603#discussion_r1673893448

## open-api/rest-catalog-open-api.yaml: ## @@ -134,9 +134,17 @@ paths:
  post:
    tags:
      - OAuth2 API
-   summary: Get a token using an OAuth2 flow
+   summary: Get a token using an OAuth2 flow (DEPRECATED for REMOVAL)
+   deprecated: true
    operationId: getToken
    description:
+     The `oauth/tokens` endpoint is **DEPRECATED for REMOVAL**. It is _not_ recommended to

Review Comment: @danielcweeks I'm confused. I think there's no problem merging the deprecation flag and message right now, correct? So, I propose to merge this PR right now, and plan "removal" (impacting clients) in 2.0. @snazy wdyt about just updating the deprecation message to target 2.0.0? @danielcweeks maybe worth starting to discuss 2.0.0 somehow? 😄

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add checkstyle rule for uppercase constant fields [iceberg]
attilakreiner commented on code in PR #10673: URL: https://github.com/apache/iceberg/pull/10673#discussion_r1673906437 ## .baseline/checkstyle/checkstyle.xml: ## @@ -284,6 +284,10 @@ + + Review Comment: @nastra, yep, same thought here, hence the spin-off draft PR. In this PR the definition I used is this (this is the current state right now): `` the default you linked is this: `` So the only difference is I removed the first part that allows for the lowercase `log(ger)` as this codebase doesn't use it, so it was redundant. Also we have the custom message in place to be consistent with the rest of the definitions for now. Pls LMK if this is all good as it is right now or we need some changes here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Update Snapshot Retention Properties [iceberg-python]
chinmay-bhat opened a new pull request, #913: URL: https://github.com/apache/iceberg-python/pull/913 WIP - will be ready to review once rollback PR 758 is merged into main. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Update checkstyle definition [iceberg]
findepi commented on code in PR #10681: URL: https://github.com/apache/iceberg/pull/10681#discussion_r1673961429 ## .baseline/checkstyle/checkstyle.xml: ## @@ -480,23 +475,20 @@ - + - - - + Review Comment: This one imo can't be possessive because of the trailing `T` requirement. For example `FooT` matched the previous pattern, but does not match the new pattern. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Expire Snapshots [iceberg-python]
chinmay-bhat opened a new pull request, #914: URL: https://github.com/apache/iceberg-python/pull/914 WIP - [x] initial skeleton - [ ] clean expired snapshots - [ ] write tests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Update checkstyle definition [iceberg]
attilakreiner commented on code in PR #10681: URL: https://github.com/apache/iceberg/pull/10681#discussion_r1673979787 ## .baseline/checkstyle/checkstyle.xml: ## @@ -480,23 +475,20 @@ - + - - - + Review Comment: You're right, thx for pointing it out. I reverted this one change. I double checked the other changes for similar issues and they LGTM. Pls chk and LMK. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Update checkstyle definition [iceberg]
findepi commented on code in PR #10681: URL: https://github.com/apache/iceberg/pull/10681#discussion_r1673988656 ## .baseline/checkstyle/checkstyle.xml: ## @@ -480,23 +475,20 @@ - + - - - + Review Comment: thanks. i also did check other changes. LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: don't include slf4j-api in bundled JARs [iceberg]
bryanck commented on PR #10665: URL: https://github.com/apache/iceberg/pull/10665#issuecomment-975393 I think this makes sense. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add checkstyle rule for uppercase constant fields [iceberg]
nastra merged PR #10673: URL: https://github.com/apache/iceberg/pull/10673 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Expose table incremental scan for appends API in SerializableTable [iceberg]
nastra commented on PR #10682: URL: https://github.com/apache/iceberg/pull/10682#issuecomment-981956 @deniskuzZ can you add the missing `@Override` please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: don't include slf4j-api in bundled JARs [iceberg]
bryanck merged PR #10665: URL: https://github.com/apache/iceberg/pull/10665 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] iceberg-aws-bundle jar includes org.slf4j.LoggerFactory [iceberg]
bryanck closed issue #10534: iceberg-aws-bundle jar includes org.slf4j.LoggerFactory URL: https://github.com/apache/iceberg/issues/10534 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Fix NPE during conflict handling of NULL partitions [iceberg]
deniskuzZ commented on code in PR #10680: URL: https://github.com/apache/iceberg/pull/10680#discussion_r1674048472 ## core/src/main/java/org/apache/iceberg/util/PartitionSet.java: ## @@ -200,7 +200,8 @@ public String toString() { StringBuilder partitionStringBuilder = new StringBuilder(); partitionStringBuilder.append(structType.fields().get(i).name()); partitionStringBuilder.append("="); - partitionStringBuilder.append(s.get(i, Object.class).toString()); + Object value = s.get(i, Object.class); + partitionStringBuilder.append(value == null ? "null" : value.toString()); Review Comment: please use `String.valueOf(s.get(i, Object.class))` instead -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
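The suggested fix works because Java's `String.valueOf(Object)` renders `null` as the string `"null"` instead of throwing the NPE that `s.get(i, Object.class).toString()` does. A rough Python analog of the rendering, with hypothetical names (this is not Iceberg code, just a sketch of the null-safe pattern):

```python
def partition_to_string(fields, values):
    """Render a partition tuple as '{field=value, ...}', tolerating None
    partition values the way Java's String.valueOf(null) yields "null"."""
    parts = []
    for name, value in zip(fields, values):
        # str(None) would give 'None'; normalize to 'null' to mirror
        # the Java String.valueOf behavior
        rendered = "null" if value is None else str(value)
        parts.append(f"{name}={rendered}")
    return "{" + ", ".join(parts) + "}"

print(partition_to_string(["dt", "region"], ["2024-07-01", None]))
# -> {dt=2024-07-01, region=null}
```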
Re: [PR] Core: Expose table incremental scan for appends API in SerializableTable [iceberg]
nastra commented on code in PR #10682: URL: https://github.com/apache/iceberg/pull/10682#discussion_r1674049469 ## core/src/main/java/org/apache/iceberg/SerializableTable.java: ## @@ -278,6 +278,10 @@ public TableScan newScan() { return lazyTable().newScan(); } + public IncrementalAppendScan newIncrementalAppendScan() { Review Comment: can you also override `newIncrementalChangelogScan()` here please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
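The `SerializableTable` change forwards one more scan factory to the lazily loaded underlying table. A hypothetical Python analog of that delegation pattern (names are mine, not PyIceberg or Iceberg API):

```python
class SerializableTable:
    """Sketch of the lazy-delegation pattern: the wrapper defers loading
    the real table until a scan is requested, then forwards calls."""

    def __init__(self, load_table):
        self._load_table = load_table
        self._table = None

    def _lazy_table(self):
        # Load the underlying table once, on first use
        if self._table is None:
            self._table = self._load_table()
        return self._table

    def new_scan(self):
        return self._lazy_table().new_scan()

    # The PR adds the same forwarding for incremental append scans
    def new_incremental_append_scan(self):
        return self._lazy_table().new_incremental_append_scan()


class _FakeTable:
    """Stand-in for the lazily loaded table, for demonstration only."""
    def new_scan(self):
        return "table-scan"
    def new_incremental_append_scan(self):
        return "incremental-append-scan"


st = SerializableTable(lambda: _FakeTable())
print(st.new_incremental_append_scan())  # -> incremental-append-scan
```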
[PR] Spark timestamptz discrepancy [iceberg-python]
syun64 opened a new pull request, #915: URL: https://github.com/apache/iceberg-python/pull/915 I ran into this issue while trying to improve our tests for verifying write integrity for the timestamp types, following our recent implementations for supporting inputs of other precisions and timezones. This draft PR demonstrates a discrepancy with Spark Iceberg, where TimestamptzType is read back as TimestampType, whereas PyIceberg correctly reads it as TimestamptzType. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Bump mypy-boto3-glue from 1.34.136 to 1.34.143 [iceberg-python]
Fokko merged PR #912: URL: https://github.com/apache/iceberg-python/pull/912 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[I] Detecting duplicates in the Flink Data Stream API [iceberg]
lkokhreidze opened a new issue, #10683: URL: https://github.com/apache/iceberg/issues/10683 ### Query engine Flink ### Question Hi, I was wondering if there's a way we could detect whether an ongoing batch written to the Iceberg table would perform an upsert? Context: I have an Iceberg table that is being written with the Flink DataStream API with upsert enabled. The identifier field is configured to be the `ID` of the domain. My use case is to mark the rows that are upserted with an appropriate flag, and I was wondering if there's a way to get a hold of this information within the DataStream API? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Will not be merged: Spark timestamptz discrepancy [iceberg-python]
syun64 commented on PR #915: URL: https://github.com/apache/iceberg-python/pull/915#issuecomment-2223035079 Here's the dataframe schema that represents the Iceberg tables that are loaded through: 1. Spark Iceberg 2. PyIceberg Spark Iceberg loads both timestamptz and timestamp types as timestamp, whereas PyIceberg is able to distinguish these two types correctly

```
tests/integration/test_writes/test_writes.py:1063: AssertionError
- Captured stdout call -
timestamp_s               datetime64[ns]
timestamptz_s             datetime64[us]
timestamp_ms              datetime64[ns]
timestamptz_ms            datetime64[us]
timestamp_us              datetime64[ns]
timestamptz_us            datetime64[us]
timestamp_ns              datetime64[ns]
timestamptz_ns            datetime64[us]
timestamptz_us_etc_utc    datetime64[us]
timestamptz_us_z          datetime64[us]
dtype: object
timestamp_s               datetime64[us]
timestamptz_s             datetime64[us, UTC]
timestamp_ms              datetime64[us]
timestamptz_ms            datetime64[us, UTC]
timestamp_us              datetime64[us]
timestamptz_us            datetime64[us, UTC]
timestamp_ns              datetime64[us]
timestamptz_ns            datetime64[us, UTC]
timestamptz_us_etc_utc    datetime64[us, UTC]
timestamptz_us_z          datetime64[us, UTC]
dtype: object
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
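The dtype listings reflect the underlying spec distinction: Iceberg's `timestamptz` is UTC-adjusted (timezone-aware) while `timestamp` is zone-less (naive), which pandas reports as `datetime64[us, UTC]` vs `datetime64[us]`. In stdlib terms, the same split looks like this (a sketch of the concept, not the PyIceberg read path):

```python
from datetime import datetime, timezone

# Iceberg timestamp: no zone attached (naive), like datetime64[us]
ts = datetime(2017, 11, 16, 22, 31, 8)

# Iceberg timestamptz: UTC-adjusted (aware), like datetime64[us, UTC]
tstz = datetime(2017, 11, 16, 22, 31, 8, tzinfo=timezone.utc)

print(ts.tzinfo)    # None -> naive, matching the first dtype block
print(tstz.tzinfo)  # UTC  -> aware, matching the second dtype block
```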
Re: [PR] Update checkstyle definition [iceberg]
attilakreiner commented on PR #10681: URL: https://github.com/apache/iceberg/pull/10681#issuecomment-2223052922 Merged in the `main` branch and fixed `ConstantName`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] OpenAPI: Deprecate `oauth/tokens` endpoint [iceberg]
snazy commented on PR #10603: URL: https://github.com/apache/iceberg/pull/10603#issuecomment-2223065233 I've updated the PR to mention "2.0". The CI failures look unrelated, but I don't have the power to rerun those. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Fix NPE during conflict handling of NULL partitions [iceberg]
boroknagyz commented on code in PR #10680: URL: https://github.com/apache/iceberg/pull/10680#discussion_r1674098559 ## core/src/main/java/org/apache/iceberg/util/PartitionSet.java: ## @@ -200,7 +200,8 @@ public String toString() { StringBuilder partitionStringBuilder = new StringBuilder(); partitionStringBuilder.append(structType.fields().get(i).name()); partitionStringBuilder.append("="); - partitionStringBuilder.append(s.get(i, Object.class).toString()); + Object value = s.get(i, Object.class); + partitionStringBuilder.append(value == null ? "null" : value.toString()); Review Comment: Yeah, that's nicer. Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]
epgif commented on code in PR #9008: URL: https://github.com/apache/iceberg/pull/9008#discussion_r1674134865 ## api/src/test/java/org/apache/iceberg/transforms/TestBucketing.java: ## @@ -112,12 +112,6 @@ public void testSpecValues() { .as("Spec example: hash(2017-11-16T22:31:08) = -204791") .isEqualTo(-204791); -Literal timestamptzVal = Review Comment: Oops. Undid that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]
epgif commented on code in PR #9008: URL: https://github.com/apache/iceberg/pull/9008#discussion_r1674135249 ## api/src/test/java/org/apache/iceberg/transforms/TestBucketing.java: ## @@ -165,6 +159,68 @@ public void testLong() { .isEqualTo(hashBytes(buffer.array())); } + @Test + public void testTimestampNanoPromotion() { +// Values from spec Appendix B: 32-bit Hash Requirements +String timestamp1 = "2017-11-16T22:31:08"; +long expectedHash1 = -204791; +String timestamp2 = "2017-11-16T22:31:08.01"; +long expectedHash2 = -1207196810; +String timestampNs1 = "2017-11-16T22:31:08"; +String timestampNs2 = "2017-11-16T22:31:08.01001"; + +Types.TimestampType tsType = Types.TimestampType.withoutZone(); +Types.TimestampNanoType tsNsType = Types.TimestampNanoType.withoutZone(); + +Bucket tsNsBucket = Bucket.get(tsNsType, 1); +Bucket tsBucket = Bucket.get(tsType, 1); + +assertThat(tsBucket.hash(Literal.of(timestamp1).to(tsType).value())) +.as("Timestamp and TimestampNano bucket results should match") Review Comment: done
Re: [PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]
epgif commented on code in PR #9008: URL: https://github.com/apache/iceberg/pull/9008#discussion_r1674135937 ## api/src/test/java/org/apache/iceberg/transforms/TestBucketing.java: ## @@ -165,6 +159,68 @@ public void testLong() { .isEqualTo(hashBytes(buffer.array())); } + @Test + public void testTimestampNanoPromotion() { +// Values from spec Appendix B: 32-bit Hash Requirements +String timestamp1 = "2017-11-16T22:31:08"; +long expectedHash1 = -204791; +String timestamp2 = "2017-11-16T22:31:08.01"; +long expectedHash2 = -1207196810; +String timestampNs1 = "2017-11-16T22:31:08"; Review Comment: This is obviated by the requested inlining. ## api/src/test/java/org/apache/iceberg/transforms/TestBucketing.java: ## @@ -165,6 +159,68 @@ public void testLong() { .isEqualTo(hashBytes(buffer.array())); } + @Test + public void testTimestampNanoPromotion() { +// Values from spec Appendix B: 32-bit Hash Requirements +String timestamp1 = "2017-11-16T22:31:08"; Review Comment: done
Re: [PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]
epgif commented on PR #9008: URL: https://github.com/apache/iceberg/pull/9008#issuecomment-2223123048
> overall this LGTM once comments in `TestBucketing` have been addressed

I've addressed these. Thanks!
Re: [PR] Check if schema is compatible in `add_files` API [iceberg-python]
syun64 commented on code in PR #907: URL: https://github.com/apache/iceberg-python/pull/907#discussion_r1674162322 ## pyiceberg/io/pyarrow.py: ## @@ -2026,6 +2072,8 @@ def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_ f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids" ) schema = table_metadata.schema() +_check_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema()) Review Comment: Thank you for the catch @HonahX - I think you are right. Adding a nanosecond-timestamp file means Spark Iceberg cannot read it correctly; reads instead fail with exceptions like: ``` ValueError: year 53177 is out of range ``` I will make `downcast_ns_timestamp_to_us_on_write` an input argument to `_check_schema_compatible`, so that we can prevent nanosecond timestamp types from being added through `add_files`, but continue to support downcasting them in overwrite/append.
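The "year 53177 is out of range" failure is characteristic of a nanosecond epoch value being interpreted at microsecond precision: the instant inflates by a factor of 1000 and lands tens of thousands of years in the future. A stdlib-only sketch of the effect (illustrative only, not the PyIceberg code path; the exact year and exception type vary by value and platform):

```python
from datetime import datetime, timezone

NS_2024 = 1_704_067_200_000_000_000  # 2024-01-01T00:00:00Z in nanoseconds

# misreading the nanosecond count as microseconds lands ~54,000 years out,
# beyond datetime's year-9999 ceiling, so conversion blows up
failed = False
try:
    datetime.fromtimestamp(NS_2024 / 1_000_000, tz=timezone.utc)
except (ValueError, OverflowError, OSError):
    failed = True
assert failed

# downcasting ns -> us first (what downcast_ns_timestamp_to_us_on_write
# controls on the write path) recovers the intended instant
micros = NS_2024 // 1_000
dt = datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)
assert dt.year == 2024
```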
Re: [PR] Kafka Connect: Commit coordination [iceberg]
danielcweeks commented on PR #10351: URL: https://github.com/apache/iceberg/pull/10351#issuecomment-2223231870 Thanks @bryanck and @fqaiser94. It's really great to get this one in.
Re: [PR] Kafka Connect: Commit coordination [iceberg]
danielcweeks merged PR #10351: URL: https://github.com/apache/iceberg/pull/10351
Re: [PR] OpenAPI: Deprecate `oauth/tokens` endpoint [iceberg]
jackye1995 commented on PR #10603: URL: https://github.com/apache/iceberg/pull/10603#issuecomment-2223257678 Can you try rebase to see if it fixes the CI?
Re: [PR] API: add resultSchema() method to StructTransform [iceberg]
stevenzwu commented on PR #10496: URL: https://github.com/apache/iceberg/pull/10496#issuecomment-2223275495 Hmm, `TestSparkDataFile` failed after this change. ``` Caused by: org.apache.iceberg.exceptions.ValidationException: Invalid schema: multiple fields for name ts: 9 and 9 at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:49) at org.apache.iceberg.types.IndexByName.addField(IndexByName.java:195) at org.apache.iceberg.types.IndexByName.field(IndexByName.java:161) at org.apache.iceberg.types.IndexByName.field(IndexByName.java:34) at org.apache.iceberg.types.TypeUtil.visit(TypeUtil.java:611) at org.apache.iceberg.types.TypeUtil.indexNameById(TypeUtil.java:173) at org.apache.iceberg.Schema.lazyIdToName(Schema.java:213) at org.apache.iceberg.Schema.(Schema.java:142) at org.apache.iceberg.Schema.(Schema.java:99) at org.apache.iceberg.Schema.(Schema.java:87) at org.apache.iceberg.Schema.(Schema.java:83) at org.apache.iceberg.StructTransform.(StructTransform.java:82) at org.apache.iceberg.PartitionKey.(PartitionKey.java:37) at org.apache.iceberg.spark.source.SparkWrite$PartitionedDataWriter.(SparkWrite.java:774) ``` This test class applies two partition spec transforms (identity and hour) to the same field `ts`. Is that a legitimate use case? If yes, then we have a problem with this proposed change. ``` private static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA) .identity("b") .bucket("i", 2) .identity("l") .identity("f") .identity("d") .identity("date") .hour("ts") .identity("ts") .identity("tsntz") .truncate("s", 2) .identity("bytes") .bucket("dec_9_0", 2) .bucket("dec_11_2", 2) .bucket("dec_38_10", 2) .build(); ``` cc @rdblue @aokolnychyi @RussellSpitzer @szehon-ho
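The failure reproduces whenever the result schema is built from source-column names alone: both hour("ts") and identity("ts") would contribute a field named ts. A small sketch (hypothetical helper, not the StructTransform code) contrasting transform-aware naming, which Iceberg's partition fields use (e.g. "ts_hour"), with the naive naming that collides:

```python
def check_unique(names):
    # raise on duplicate field names, like Schema's name index does
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"Invalid schema: multiple fields for name {', '.join(duplicates)}")
    return names


def result_field_names(partition_fields, transform_aware=True):
    # partition_fields: (source_column, transform) pairs
    if transform_aware:
        # identity keeps the source name; other transforms get a suffix
        names = [s if t == "identity" else f"{s}_{t}" for s, t in partition_fields]
    else:
        # naive: name result fields after the source column only
        names = [s for s, _ in partition_fields]
    return check_unique(names)


fields = [("ts", "hour"), ("ts", "identity")]
# transform-aware naming keeps hour + identity on the same column distinct
assert result_field_names(fields) == ["ts_hour", "ts"]

# source-only naming collides: the shape of the TestSparkDataFile failure
collided = False
try:
    result_field_names(fields, transform_aware=False)
except ValueError:
    collided = True
assert collided
```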
Re: [PR] OpenAPI: Deprecate `oauth/tokens` endpoint [iceberg]
danielcweeks commented on code in PR #10603: URL: https://github.com/apache/iceberg/pull/10603#discussion_r1674264095 ## open-api/rest-catalog-open-api.yaml: ## @@ -134,9 +134,17 @@ paths: post: tags: - OAuth2 API - summary: Get a token using an OAuth2 flow + summary: Get a token using an OAuth2 flow (DEPRECATED for REMOVAL) + deprecated: true operationId: getToken description: +The `oauth/tokens` endpoint is **DEPRECATED for REMOVAL**. It is _not_ recommended to Review Comment: I think we're good now. The issue was that the target removal was 1.7 instead of 2.0, which I think was just confusion about deprecation vs. removal.
Re: [PR] OpenAPI: Deprecate `oauth/tokens` endpoint [iceberg]
jbonofre commented on code in PR #10603: URL: https://github.com/apache/iceberg/pull/10603#discussion_r1674277456 ## open-api/rest-catalog-open-api.yaml: ## @@ -134,9 +134,17 @@ paths: post: tags: - OAuth2 API - summary: Get a token using an OAuth2 flow + summary: Get a token using an OAuth2 flow (DEPRECATED for REMOVAL) + deprecated: true operationId: getToken description: +The `oauth/tokens` endpoint is **DEPRECATED for REMOVAL**. It is _not_ recommended to Review Comment: @danielcweeks yes, that's why I was confused 😄 It seems we were "mixing" deprecation and removal. I agree it looks good now. Thanks!
Re: [PR] OpenAPI: Deprecate `oauth/tokens` endpoint [iceberg]
danielcweeks commented on code in PR #10603: URL: https://github.com/apache/iceberg/pull/10603#discussion_r1674279099 ## open-api/rest-catalog-open-api.yaml: ## @@ -134,9 +134,22 @@ paths: post: tags: - OAuth2 API - summary: Get a token using an OAuth2 flow + summary: Get a token using an OAuth2 flow (DEPRECATED for REMOVAL) + deprecated: true operationId: getToken description: +The `oauth/tokens` endpoint is **DEPRECATED for REMOVAL**. It is _not_ recommended to +implement this endpoint, unless you are fully aware of the potential security implications. + +All clients are encouraged to explicitly set the configuration property `oauth2-server-uri` +to the correct OAuth endpoint. + +Deprecated since Iceberg (Java) 1.6.0. The endpoint and related types will be removed from +this spec in Iceberg (Java) 2.0, whichever is released sooner. Review Comment: ```suggestion this spec in Iceberg (Java) 2.0. ```
Re: [PR] Core: use bulk delete when removing old metadata.json files [iceberg]
dramaticlly commented on PR #10679: URL: https://github.com/apache/iceberg/pull/10679#issuecomment-2223373221
> I think this is fine, I mentioned this in another similar PR though so I thought I would note it here as well. We need to make sure our test case now runs with Filesystems that both support and don't support BulkOperations otherwise we have an untested branch. Is that the case here? We may need to start parameterizing the tests again along the axis of File System Capabilities.

I think so. I was using `testMetadataFileLocationsRemovalAfterCommit` in `CatalogTests.java` to verify the correctness of this change. It is parameterized to test various catalogs: the in-memory catalog uses an in-memory FileIO (no bulk deletion support) and the Hive catalog uses Hadoop FileIO (bulk deletion supported), plus various other catalogs.
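The capability split discussed here — some FileIO implementations support bulk deletes while others only delete per file — is why both branches need test coverage. A Python sketch (hypothetical classes, not Iceberg's Java API) of dispatching on the capability the way the metadata cleanup does:

```python
class PerFileIO:
    # baseline FileIO: only single-file deletion
    def __init__(self):
        self.deleted = []

    def delete_file(self, path):
        self.deleted.append(path)


class BulkIO(PerFileIO):
    # analogue of a FileIO implementing SupportsBulkOperations
    def delete_files(self, paths):
        self.deleted.extend(paths)


def delete_old_metadata(io, paths):
    # one bulk call when supported, otherwise fall back to per-file deletes
    if hasattr(io, "delete_files"):
        io.delete_files(list(paths))
    else:
        for path in paths:
            io.delete_file(path)


old = ["v1.metadata.json", "v2.metadata.json"]
for io in (PerFileIO(), BulkIO()):
    delete_old_metadata(io, old)
    assert io.deleted == old  # both capability branches remove the same files
```

Parameterizing tests over both IO classes, as the catalog tests do over catalogs, exercises the otherwise-untested fallback branch.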
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
syun64 commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674334862 ## pyiceberg/table/__init__.py: ## @@ -528,10 +528,6 @@ def append(self, df: pa.Table, snapshot_properties: Dict[str, str] = EMPTY_DICT) ) _check_schema_compatible(self._table.schema(), other_schema=df.schema) -# cast if the two schemas are compatible but not equal -table_arrow_schema = self._table.schema().as_arrow() -if table_arrow_schema != df.schema: -df = df.cast(table_arrow_schema) Review Comment: Removed this cast function - `to_requested_schema` should be responsible for casting the types to their desired schema, instead of casting it here
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
syun64 commented on PR #910: URL: https://github.com/apache/iceberg-python/pull/910#issuecomment-2223408958 @Fokko @HonahX - thank you for your reviews. I've updated the integration test to make the [check more comprehensive](https://github.com/apache/iceberg-python/pull/910/files#diff-7f3dd1244d08ce27c003cd091da10aa049f7bb0c7d5397acb4ec69767036accdR1063-R1074), and removed the [pyarrow table casting](https://github.com/apache/iceberg-python/pull/910/files#diff-23e8153e0fd497a9212215bd2067068f3b56fa071770c7ef326db3d3d03cee9bL590-L593) in `append` and `overwrite` so that each column can be cast with a more targeted `safe` flag for each field.
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
syun64 commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674351468 ## pyiceberg/io/pyarrow.py: ## @@ -1320,7 +1321,16 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: and pa.types.is_timestamp(values.type) and values.type.unit == "ns" ): -return values.cast(target_type, safe=False) +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type, safe=False) +if ( +pa.types.is_timestamp(target_type) +and target_type.unit == "us" +and pa.types.is_timestamp(values.type) +and values.type.unit in {"s", "ms", "us"} +): +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type) Review Comment: Does this look correct — casting the types here so that we follow the Iceberg spec for Parquet physical and logical types? https://iceberg.apache.org/spec/#parquet
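The branching above separates widening unit conversions from narrowing ones: s/ms → us multiplies and is lossless (a plain safe cast), while ns → us divides and drops sub-microsecond digits (hence Arrow's `safe=False`). A stdlib-only sketch of the two cases (illustrative integer arithmetic, not the PyIceberg code path):

```python
US_PER_UNIT = {"s": 1_000_000, "ms": 1_000, "us": 1}

def to_micros(value: int, unit: str) -> int:
    # widening: seconds/milliseconds scale up to microseconds exactly
    if unit in US_PER_UNIT:
        return value * US_PER_UNIT[unit]
    # narrowing: nanoseconds truncate to microseconds -- precision loss,
    # which is why the Arrow cast for this case needs safe=False
    if unit == "ns":
        return value // 1_000
    raise ValueError(f"unknown timestamp unit: {unit}")


assert to_micros(1, "s") == 1_000_000       # lossless
assert to_micros(5, "ms") == 5_000          # lossless
assert to_micros(1_234_567, "ns") == 1_234  # trailing 567 ns dropped
```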
Re: [I] Upgrade to Gradle 8.9 [iceberg]
jbonofre commented on issue #10685: URL: https://github.com/apache/iceberg/issues/10685#issuecomment-2223424034 @Fokko @nastra I'm working on the Gradle update PR. Would you guys mind assigning this ticket to me and setting the milestone to 1.6.0? Thanks!
Re: [PR] Upgrade to Gradle 8.9 [iceberg]
jbonofre commented on PR #10686: URL: https://github.com/apache/iceberg/pull/10686#issuecomment-2223435089 This PR closes #10685
Re: [PR] Upgrade to Gradle 8.9 [iceberg]
jbonofre commented on PR #10686: URL: https://github.com/apache/iceberg/pull/10686#issuecomment-2223435767 @Fokko @nastra thanks in advance, gentlemen! 😄
Re: [I] Running MERGE INTO with more than one WHEN condition fails if the number of columns in the target table is > 321 [iceberg]
andreaschiappacasse commented on issue #10294: URL: https://github.com/apache/iceberg/issues/10294#issuecomment-2223558105 @krishan711 we ended up using Spark instead of Athena for the upsert/delete operation. It is still very unfortunate, because it is much more expensive and adds some complexity to our architecture, but in case it helps you: that's what our team did in our case.
Re: [PR] Core: Expose table incremental scan for appends API in SerializableTable [iceberg]
deniskuzZ commented on code in PR #10682: URL: https://github.com/apache/iceberg/pull/10682#discussion_r1674500091 ## core/src/main/java/org/apache/iceberg/SerializableTable.java: ## @@ -278,6 +278,10 @@ public TableScan newScan() { return lazyTable().newScan(); } + public IncrementalAppendScan newIncrementalAppendScan() { Review Comment: thanks @nastra for the review! added missing `@Override` annotation
Re: [PR] Core: Expose table incremental scan for appends API in SerializableTable [iceberg]
deniskuzZ commented on code in PR #10682: URL: https://github.com/apache/iceberg/pull/10682#discussion_r1674506153 ## core/src/main/java/org/apache/iceberg/SerializableTable.java: ## @@ -278,6 +278,10 @@ public TableScan newScan() { return lazyTable().newScan(); } + public IncrementalAppendScan newIncrementalAppendScan() { Review Comment: `+` newIncrementalChangelogScan() override
Re: [PR] Upgrade to Gradle 8.9 [iceberg]
Fokko merged PR #10686: URL: https://github.com/apache/iceberg/pull/10686
Re: [I] Upgrade to Gradle 8.9 [iceberg]
Fokko closed issue #10685: Upgrade to Gradle 8.9 URL: https://github.com/apache/iceberg/issues/10685
Re: [PR] Upgrade to Gradle 8.9 [iceberg]
Fokko commented on PR #10686: URL: https://github.com/apache/iceberg/pull/10686#issuecomment-2223698571 Thanks @jbonofre for bumping Gradle here, and thanks @nastra, @ajantha-bhat and @snazy for the prompt review 🚀
Re: [PR] OpenAPI: Deprecate `oauth/tokens` endpoint [iceberg]
snazy commented on PR #10603: URL: https://github.com/apache/iceberg/pull/10603#issuecomment-2223703737
> Can you try rebase to see if it fixes the CI?

CI looking good
Re: [PR] mr:Fix issues 10639 [iceberg]
deniskuzZ commented on PR #10661: URL: https://github.com/apache/iceberg/pull/10661#issuecomment-2223715282 LGTM +1
Re: [PR] mr:Fix issues 10639 [iceberg]
deniskuzZ commented on code in PR #10661: URL: https://github.com/apache/iceberg/pull/10661#discussion_r1674554724 ## mr/src/test/java/org/apache/iceberg/mr/TestIcebergInputFormats.java: ## @@ -381,6 +386,46 @@ public void testCustomCatalog() throws IOException { testInputFormat.create(builder.conf()).validate(expectedRecords); } + @TestTemplate + public void testWorkerPool() throws Exception { +// 1.The ugi in the same thread will not change +final ExecutorService workerPool1 = ThreadPools.newWorkerPool("iceberg-plan-worker-pool", 1); +UserGroupInformation user1 = +UserGroupInformation.createUserForTesting("user1", new String[] {}); +UserGroupInformation user2 = +UserGroupInformation.createUserForTesting("user2", new String[] {}); +assertThat(setAndGetUgi(user1, workerPool1)).isEqualTo("user1"); +assertThat(setAndGetUgi(user2, workerPool1)).isEqualTo("user1"); +workerPool1.shutdown(); + +// 2.The ugi in different threads will be different +final ExecutorService workerPool2 = ThreadPools.newWorkerPool("iceberg-plan-worker-pool", 1); +assertThat(setAndGetUgi(user2, workerPool2)).isEqualTo("user2"); +workerPool2.shutdown(); + } + + private String setAndGetUgi(UserGroupInformation ugi, ExecutorService workpool) Review Comment: @lurnagao-dahua, the test looks good to me, thanks for adding it! Unfortunately, I don't have committer rights in the iceberg project, but I'll add my +1
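The two-part behavior this test pins down — a pooled worker thread keeps the identity it was created under, while a fresh pool picks up the current one — can be reproduced with a plain thread-local in Python (an analogy for UGI caching, not Hadoop code):

```python
from concurrent.futures import ThreadPoolExecutor
import threading

_state = threading.local()

def set_and_get_user(user: str) -> str:
    # like UGI: the worker thread caches the identity it first ran under
    if not hasattr(_state, "user"):
        _state.user = user
    return _state.user


# 1. within one single-threaded pool, the cached identity never changes
pool1 = ThreadPoolExecutor(max_workers=1)
assert pool1.submit(set_and_get_user, "user1").result() == "user1"
assert pool1.submit(set_and_get_user, "user2").result() == "user1"  # stale
pool1.shutdown()

# 2. a new pool means a new worker thread, so the new identity is observed
pool2 = ThreadPoolExecutor(max_workers=1)
assert pool2.submit(set_and_get_user, "user2").result() == "user2"
pool2.shutdown()
```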
Re: [PR] mr:Fix issues 10639 [iceberg]
deniskuzZ commented on code in PR #10661: URL: https://github.com/apache/iceberg/pull/10661#discussion_r1674558387 ## mr/src/test/java/org/apache/iceberg/mr/TestIcebergInputFormats.java: ## @@ -381,6 +386,56 @@ public void testCustomCatalog() throws IOException { testInputFormat.create(builder.conf()).validate(expectedRecords); } + @TestTemplate + public void testWorkerPool() throws Exception { +Table table = helper.createUnpartitionedTable(); +List records = helper.generateRandomRecords(1, 0L); +helper.appendToTable(null, records); Review Comment: do we really need to inject some data?
Re: [PR] mr:Fix issues 10639 [iceberg]
lurnagao-dahua commented on code in PR #10661: URL: https://github.com/apache/iceberg/pull/10661#discussion_r1674565642 ## mr/src/test/java/org/apache/iceberg/mr/TestIcebergInputFormats.java: ## @@ -381,6 +386,56 @@ public void testCustomCatalog() throws IOException { testInputFormat.create(builder.conf()).validate(expectedRecords); } + @TestTemplate + public void testWorkerPool() throws Exception { +Table table = helper.createUnpartitionedTable(); +List records = helper.generateRandomRecords(1, 0L); +helper.appendToTable(null, records); Review Comment: I think only inserting data creates the thread in the worker pool and initializes the UGI.
Re: [PR] mr:Fix issues 10639 [iceberg]
deniskuzZ commented on code in PR #10661: URL: https://github.com/apache/iceberg/pull/10661#discussion_r1674568873 ## mr/src/test/java/org/apache/iceberg/mr/TestIcebergInputFormats.java: ## @@ -381,6 +386,56 @@ public void testCustomCatalog() throws IOException { testInputFormat.create(builder.conf()).validate(expectedRecords); } + @TestTemplate + public void testWorkerPool() throws Exception { +Table table = helper.createUnpartitionedTable(); +List records = helper.generateRandomRecords(1, 0L); +helper.appendToTable(null, records); Review Comment: 👍, I thought we could save some CPU cycles here
Re: [PR] Check if schema is compatible in `add_files` API [iceberg-python]
Fokko commented on code in PR #907: URL: https://github.com/apache/iceberg-python/pull/907#discussion_r1674603748 ## pyiceberg/io/pyarrow.py: ## @@ -2026,6 +2072,8 @@ def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_ f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids" ) schema = table_metadata.schema() +_check_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema()) Review Comment: In general, I think it is okay to be able to read more broadly, but we do need tests to ensure that it works correctly. Looking at Arrow, there are already some physical types that we don't support (`date64`, etc.). In Java we do support reading timestamps that are encoded as `INT96`, but we should not produce them.
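The `_check_schema_compatible` call being discussed validates each incoming Parquet file's schema against the table schema before the file is registered. As a rough, hypothetical illustration of the idea only (not PyIceberg's actual implementation; the helper name, the dict-based schema representation, and the rules shown are simplified assumptions), such a check walks the table schema and flags missing required fields and type mismatches:

```python
# Toy sketch of a schema-compatibility check for an `add_files`-style API.
# A "schema" here is just a mapping; the real implementation works on
# Iceberg Schema and Arrow Schema objects with field-ID resolution.

def check_schema_compatible(table_schema: dict, file_schema: dict) -> list:
    """Return a list of human-readable problems; an empty list means compatible.

    table_schema: name -> (type_name, required)
    file_schema:  name -> type_name
    """
    problems = []
    for name, (type_name, required) in table_schema.items():
        if name not in file_schema:
            # Optional fields may be absent; required fields may not.
            if required:
                problems.append(f"missing required field: {name}")
        elif file_schema[name] != type_name:
            problems.append(
                f"type mismatch for {name}: table={type_name}, file={file_schema[name]}"
            )
    return problems


table = {"id": ("long", True), "ts": ("timestamp", False)}
assert check_schema_compatible(table, {"id": "long", "ts": "timestamp"}) == []
assert len(check_schema_compatible(table, {"ts": "timestamptz"})) == 2
```

A real check would additionally consult type-promotion rules (e.g. `int` → `long`), which is exactly where the "read more broadly" question in the review comes in.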
Re: [PR] Check if schema is compatible in `add_files` API [iceberg-python]
Fokko commented on code in PR #907: URL: https://github.com/apache/iceberg-python/pull/907#discussion_r1674607135 ## pyiceberg/io/pyarrow.py: ## @@ -166,6 +166,7 @@ ONE_MEGABYTE = 1024 * 1024 BUFFER_SIZE = "buffer-size" + Review Comment: ```suggestion ```
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
Fokko commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674682639 ## pyiceberg/io/pyarrow.py: ## @@ -1320,7 +1321,16 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: and pa.types.is_timestamp(values.type) and values.type.unit == "ns" ): -return values.cast(target_type, safe=False) +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type, safe=False) +if ( +pa.types.is_timestamp(target_type) +and target_type.unit == "us" +and pa.types.is_timestamp(values.type) +and values.type.unit in {"s", "ms", "us"} +): +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type) Review Comment: Yes, it looks good. Arrow writes `INT64` by default for timestamps: https://arrow.apache.org/docs/cpp/parquet.html#logical-types
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
Fokko commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674684198 ## pyiceberg/io/pyarrow.py: ## @@ -1320,7 +1321,16 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: and pa.types.is_timestamp(values.type) and values.type.unit == "ns" ): -return values.cast(target_type, safe=False) +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type, safe=False) +if ( +pa.types.is_timestamp(target_type) +and target_type.unit == "us" +and pa.types.is_timestamp(values.type) +and values.type.unit in {"s", "ms", "us"} +): +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: Review Comment: Can we add parentheses?
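For context, the parentheses request is about readability rather than behavior: in Python, `and` binds more tightly than `or`, so `a and b or c and d` already parses as `(a and b) or (c and d)`, which is the grouping the condition intends. A small self-contained sketch (helper name and alias list are illustrative assumptions) showing the intended grouping made explicit:

```python
# The review's condition, factored out with explicit parentheses.
# Without them, Python still groups the two `and` clauses first,
# because `and` has higher precedence than `or`.

def compatible(target_tz, values_tz, utc_aliases=("UTC", "Etc/UTC", "+00:00")):
    """True when both sides are UTC-equivalent or both are timezone-naive."""
    return (target_tz == "UTC" and values_tz in utc_aliases) or (
        not target_tz and not values_tz
    )


assert compatible("UTC", "Etc/UTC") is True   # both UTC-equivalent
assert compatible(None, None) is True         # both naive
assert compatible("UTC", None) is False       # aware vs. naive: incompatible
```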
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
Fokko commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674686353 ## pyiceberg/io/pyarrow.py: ## @@ -1320,7 +1321,16 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: and pa.types.is_timestamp(values.type) and values.type.unit == "ns" ): -return values.cast(target_type, safe=False) +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type, safe=False) +if ( +pa.types.is_timestamp(target_type) +and target_type.unit == "us" +and pa.types.is_timestamp(values.type) +and values.type.unit in {"s", "ms", "us"} Review Comment: When we're requesting `us` and the file provides `us`, I don't think we need to cast? ```suggestion and values.type.unit in {"s", "ms"} ```
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
syun64 commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674689843 ## pyiceberg/io/pyarrow.py: ## @@ -1320,7 +1321,16 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: and pa.types.is_timestamp(values.type) and values.type.unit == "ns" ): -return values.cast(target_type, safe=False) +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type, safe=False) +if ( +pa.types.is_timestamp(target_type) +and target_type.unit == "us" +and pa.types.is_timestamp(values.type) +and values.type.unit in {"s", "ms", "us"} Review Comment: Sorry, this is a bit confusing, I agree 😓. I included it here because I wanted to support casting `pa.timestamp('us', tz='Etc/UTC')` to `pa.timestamp('us', tz='UTC')` within the same condition. I think we won't hit this condition if both the input and requested types are `pa.timestamp('us')`, because we enter this block only if target_type and values.type are not equal: https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L1315
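Why casting `pa.timestamp('us', tz='Etc/UTC')` to `pa.timestamp('us', tz='UTC')` is lossless can be shown with the standard library alone: `Etc/UTC` is an IANA alias for UTC, so both labels denote the same zero offset and the same instant. A minimal sketch, assuming the system's `tzdata` zone database is available:

```python
# "Etc/UTC" and "UTC" are two names for the same timezone, so a cast
# between the two timestamptz types only relabels the data.
from datetime import datetime
from zoneinfo import ZoneInfo

t_alias = datetime(2024, 7, 11, 12, 0, tzinfo=ZoneInfo("Etc/UTC"))
t_utc = datetime(2024, 7, 11, 12, 0, tzinfo=ZoneInfo("UTC"))

assert t_alias.utcoffset() == t_utc.utcoffset()    # both are +00:00
assert t_alias == t_utc                            # same instant
assert t_alias.timestamp() == t_utc.timestamp()    # same epoch value
```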
Re: [PR] Support Spark Column Stats [iceberg]
singhpk234 commented on code in PR #10659: URL: https://github.com/apache/iceberg/pull/10659#discussion_r1674683253 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: ## @@ -175,7 +181,25 @@ public Statistics estimateStatistics() { protected Statistics estimateStatistics(Snapshot snapshot) { // its a fresh table, no data if (snapshot == null) { - return new Stats(0L, 0L); + return new Stats(0L, 0L, Maps.newHashMap()); +} + +Map<NamedReference, ColumnStatistics> map = Maps.newHashMap(); + +if (readConf.enableColumnStats()) { Review Comment: [Q] NDV can't be used in non-CBO cases. Do we need to additionally couple this flag with a check that CBO is enabled (`spark.sql.cbo.enabled`)? Otherwise `SizeInBytesOnlyStatsPlanVisitor` will be used.
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
syun64 commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674691317 ## pyiceberg/io/pyarrow.py: ## @@ -1320,7 +1321,16 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: and pa.types.is_timestamp(values.type) and values.type.unit == "ns" ): -return values.cast(target_type, safe=False) +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type, safe=False) +if ( +pa.types.is_timestamp(target_type) +and target_type.unit == "us" +and pa.types.is_timestamp(values.type) +and values.type.unit in {"s", "ms", "us"} +): +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: Review Comment: Added
Re: [PR] Support Spark Column Stats [iceberg]
szehon-ho commented on code in PR #10659: URL: https://github.com/apache/iceberg/pull/10659#discussion_r1674693714 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java: ## @@ -97,6 +117,36 @@ public static Object[][] parameters() { }; } + @BeforeAll Review Comment: Why is it needed? ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: ## @@ -175,7 +181,25 @@ public Statistics estimateStatistics() { protected Statistics estimateStatistics(Snapshot snapshot) { // its a fresh table, no data if (snapshot == null) { - return new Stats(0L, 0L); + return new Stats(0L, 0L, Maps.newHashMap()); +} + +Map<NamedReference, ColumnStatistics> map = Maps.newHashMap(); + +if (readConf.enableColumnStats()) { + List<StatisticsFile> files = table.statisticsFiles(); + if (!files.isEmpty()) { +List<BlobMetadata> metadataList = (files.get(0)).blobMetadata(); + +for (BlobMetadata blobMetadata : metadataList) { + long ndv = Long.parseLong(blobMetadata.properties().get("ndv")); + ColumnStatistics colStats = new ColStats(ndv, null, null, 0L, 0L, 0L, null); Review Comment: Should we fill the other values from the manifest file? ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: ## @@ -175,7 +181,25 @@ public Statistics estimateStatistics() { protected Statistics estimateStatistics(Snapshot snapshot) { // its a fresh table, no data if (snapshot == null) { - return new Stats(0L, 0L); + return new Stats(0L, 0L, Maps.newHashMap()); +} + +Map<NamedReference, ColumnStatistics> map = Maps.newHashMap(); Review Comment: Can we have a more descriptive name? 
## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: ## @@ -175,7 +181,25 @@ public Statistics estimateStatistics() { protected Statistics estimateStatistics(Snapshot snapshot) { // its a fresh table, no data if (snapshot == null) { - return new Stats(0L, 0L); + return new Stats(0L, 0L, Maps.newHashMap()); +} + +Map<NamedReference, ColumnStatistics> map = Maps.newHashMap(); + +if (readConf.enableColumnStats()) { + List<StatisticsFile> files = table.statisticsFiles(); + if (!files.isEmpty()) { +List<BlobMetadata> metadataList = (files.get(0)).blobMetadata(); + +for (BlobMetadata blobMetadata : metadataList) { + long ndv = Long.parseLong(blobMetadata.properties().get("ndv")); Review Comment: I think we should be more defensive here: what if there is no value, or a `NumberFormatException` is thrown? ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java: ## @@ -90,4 +90,8 @@ private SparkSQLProperties() {} public static final String EXECUTOR_CACHE_LOCALITY_ENABLED = "spark.sql.iceberg.executor-cache.locality.enabled"; public static final boolean EXECUTOR_CACHE_LOCALITY_ENABLED_DEFAULT = false; + + // Controls whether column statistics are enabled when estimating statistics Review Comment: the comment doesn't read as meaningful at first; can we be more precise? ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/ColStats.java: ## @@ -0,0 +1,87 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.source; + +import java.util.Optional; +import java.util.OptionalLong; +import org.apache.spark.sql.connector.read.colstats.ColumnStatistics; +import org.apache.spark.sql.connector.read.colstats.Histogram; + +class ColStats implements ColumnStatistics { Review Comment: Nit: this is a bit confusing; `ColStats` just abbreviates the Spark class name. How about `SparkColumnStatistics`? (It looks like many classes in this package are named SparkXXX in this situation.)
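The defensiveness requested around `Long.parseLong(blobMetadata.properties().get("ndv"))` covers two failure modes: a missing `ndv` key (passing `null` into `parseLong`) and a malformed value (a `NumberFormatException`). A sketch of the safe-parse idea, written in Python for brevity (the function name and the return-`None` convention are illustrative assumptions, not the PR's code, which is Java):

```python
# Parse an optional numeric property without throwing, returning None
# when the key is absent or the value is not an integer.
from typing import Optional


def parse_ndv(properties: dict) -> Optional[int]:
    """Return the NDV as an int, or None when absent or unparseable."""
    raw = properties.get("ndv")
    if raw is None:
        return None  # key missing: analogous to parseLong(null) blowing up
    try:
        return int(raw)
    except ValueError:
        return None  # malformed value: NumberFormatException analogue


assert parse_ndv({"ndv": "42"}) == 42
assert parse_ndv({}) is None
assert parse_ndv({"ndv": "oops"}) is None
```

The Java equivalent would guard the `get("ndv")` result for `null` and wrap the parse in a `try/catch (NumberFormatException e)`, skipping the column (or logging) instead of failing the whole statistics estimate.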
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
Fokko commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674710445 ## pyiceberg/io/pyarrow.py: ## @@ -1320,7 +1321,16 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: and pa.types.is_timestamp(values.type) and values.type.unit == "ns" ): -return values.cast(target_type, safe=False) +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type, safe=False) +if ( +pa.types.is_timestamp(target_type) +and target_type.unit == "us" +and pa.types.is_timestamp(values.type) +and values.type.unit in {"s", "ms", "us"} Review Comment: I think we might need to do some more work here. In Iceberg we're rather strict on the distinction between Timestamp and TimestampTZ. A good way of showing this can be found here: https://github.com/apache/iceberg-python/blob/aceed2ad9898eaa8b0c883a195e58af5d0e3/pyiceberg/expressions/literals.py#L566-L572 This is when we parse a string from a literal, which often comes from an expression: `dt >= '1925-05-22T00:00:00'`. If the value has a timezone, even UTC, we reject it as `Timestamp`. When it has a timezone, we [normalize it to UTC and then store the integer](https://github.com/apache/iceberg-python/blob/aceed2ad9898eaa8b0c883a195e58af5d0e3/pyiceberg/utils/datetime.py#L75).
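The normalization described above (convert an aware timestamp to UTC, then store an integer) can be sketched with the standard library; the function name and epoch arithmetic below are illustrative assumptions, not PyIceberg's exact code:

```python
# Normalize a timezone-aware datetime to UTC and store it as epoch
# microseconds: any input offset for the same instant yields the same value.
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timestamptz_to_micros(dt: datetime) -> int:
    """Normalize an aware datetime to UTC and return epoch microseconds."""
    if dt.tzinfo is None:
        # Mirrors Iceberg's strictness: a timestamptz value must be aware.
        raise ValueError("timestamptz requires a timezone-aware datetime")
    delta = dt.astimezone(timezone.utc) - EPOCH
    return delta // timedelta(microseconds=1)


# The same instant written with two different offsets stores identically:
a = datetime(2024, 7, 11, 2, 0, tzinfo=timezone.utc)
b = datetime(2024, 7, 11, 4, 0, tzinfo=timezone(timedelta(hours=2)))
assert timestamptz_to_micros(a) == timestamptz_to_micros(b)
```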
Re: [I] Support writing to a branch [iceberg-python]
kevinjqliu commented on issue #306: URL: https://github.com/apache/iceberg-python/issues/306#issuecomment-2224059233 @vinjai yes! Please go ahead.
Re: [PR] [Docs] Add examples for DataFrame branch writes [iceberg]
szehon-ho commented on code in PR #10644: URL: https://github.com/apache/iceberg/pull/10644#discussion_r1674784383 ## docs/docs/spark-writes.md: ## @@ -332,6 +332,30 @@ The writer must enable the `mergeSchema` option. ```scala data.writeTo("prod.db.sample").option("mergeSchema","true").append() ``` + +### Writing to Branches +The branch must exist before performing a write. The operation does **not** create the branch if it does not exist. Review Comment: Initially, "Writing to Branches" seemed like a first-level header (along with "Writing with SQL" and "Writing with Dataframes"); was that intentional? Do you think we can just add an example using a DataFrame here? My thought is that then we don't have to repeat this block of text, which we would have to do if we downgrade it to subsections within "Writing with SQL" and "Writing with Dataframes".
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
syun64 commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674797591 ## pyiceberg/io/pyarrow.py: ## @@ -1296,31 +1297,49 @@ def to_requested_schema( class ArrowProjectionVisitor(SchemaWithPartnerVisitor[pa.Array, Optional[pa.Array]]): -file_schema: Schema +_file_schema: Schema _include_field_ids: bool +_downcast_ns_timestamp_to_us: bool def __init__(self, file_schema: Schema, downcast_ns_timestamp_to_us: bool = False, include_field_ids: bool = False) -> None: -self.file_schema = file_schema +self._file_schema = file_schema self._include_field_ids = include_field_ids -self.downcast_ns_timestamp_to_us = downcast_ns_timestamp_to_us +self._downcast_ns_timestamp_to_us = downcast_ns_timestamp_to_us def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: -file_field = self.file_schema.find_field(field.field_id) +file_field = self._file_schema.find_field(field.field_id) if field.field_type.is_primitive: if field.field_type != file_field.field_type: return values.cast( schema_to_pyarrow(promote(file_field.field_type, field.field_type), include_field_ids=self._include_field_ids) ) elif (target_type := schema_to_pyarrow(field.field_type, include_field_ids=self._include_field_ids)) != values.type: -# Downcasting of nanoseconds to microseconds -if ( -pa.types.is_timestamp(target_type) -and target_type.unit == "us" -and pa.types.is_timestamp(values.type) -and values.type.unit == "ns" -): -return values.cast(target_type, safe=False) +if field.field_type == TimestampType(): Review Comment: stricter and clearer `field_type` driven checks ## tests/io/test_pyarrow.py: ## @@ -1798,3 +1799,35 @@ def test_identity_partition_on_multi_columns() -> None: ("n_legs", "ascending"), ("animal", "ascending"), ]) == arrow_table.sort_by([("born_year", "ascending"), ("n_legs", "ascending"), ("animal", "ascending")]) + + +def test_to_requested_schema_timestamps( Review Comment: I noticed that there weren't any tests for 
`to_requested_schema` although it is a public API, so I added this in. Do we want to keep `to_requested_schema` as a public API? ## pyiceberg/io/pyarrow.py: ## @@ -1320,7 +1321,16 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: and pa.types.is_timestamp(values.type) and values.type.unit == "ns" ): -return values.cast(target_type, safe=False) +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type, safe=False) +if ( +pa.types.is_timestamp(target_type) +and target_type.unit == "us" +and pa.types.is_timestamp(values.type) +and values.type.unit in {"s", "ms", "us"} Review Comment: Sounds good @Fokko, thank you for the review. I've made these checks stricter and also clearer for users to follow.
[PR] Bump coverage from 7.5.4 to 7.6.0 [iceberg-python]
dependabot[bot] opened a new pull request, #917: URL: https://github.com/apache/iceberg-python/pull/917 Bumps [coverage](https://github.com/nedbat/coveragepy) from 7.5.4 to 7.6.0. Changelog (sourced from coverage's CHANGES.rst), Version 7.6.0 — 2024-07-11:
- Exclusion patterns can now be multi-line, thanks to Daniel Diniz. This enables many interesting exclusion use cases, including those requested in issues 118 (entire files), 996 (multiple lines only when appearing together), 1741 (remainder of a function), and 1803 (arbitrary sequence of marked lines). See the `multi_line_exclude` section of the docs for more details and examples.
- The JSON report now includes per-function and per-class coverage information. Thanks to Daniel Diniz for getting the work started. This closes issues 1793 and 1532.
- Fixed an incorrect calculation of "(no class)" lines in the HTML classes report.
- Python 3.13.0b3 is supported.
Referenced issues and pulls: nedbat/coveragepy#118, #996, #1532, #1741, #1793, #1803, #1807, #1809.
Commits:
- 59a3cd7 docs: sample HTML for 7.6.0
- 7f27fa7 docs: prep for 7.6.0
- 6a268b0 docs: issues closed by the json region reporting
- 5bfe9e7 chore: bump actions/setup-python from 5.1.0 to 5.1.1 (#1814)
- ab609ef docs: mention json region reporting in the changes
- 92d96b9 fix: json report needs 'no class' and 'no function' also
- e47e7e7 refactor: move duplicate code into methods
- 3d6be2b fix: json format should bump for regions
- a9992d2 test: add a test of json regions with branches
- 8b89764 test: json expectations should have explicit format number
Additional commits are viewable in the compare view: https://github.com/nedbat/coveragepy/compare/7.5.4...7.6.0
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
Dependabot commands and options: You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ig