Re: [I] Move `_determine_partitions` to `pyarrow.py` [iceberg-python]
HonahX closed issue #896: Move `_determine_partitions` to `pyarrow.py` URL: https://github.com/apache/iceberg-python/issues/896 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Rename `data_sequence_number` to `sequence_number` [iceberg-python]
HonahX closed issue #893: Rename `data_sequence_number` to `sequence_number` URL: https://github.com/apache/iceberg-python/issues/893
Re: [I] Namespace not exists after creating the namespace by REST [iceberg]
nastra closed issue #10656: Namespace not exists after creating the namespace by REST URL: https://github.com/apache/iceberg/issues/10656
Re: [I] Namespace not exists after creating the namespace by REST [iceberg]
nastra commented on issue #10656: URL: https://github.com/apache/iceberg/issues/10656#issuecomment-210304

The Tabular catalog doesn't support nested namespaces like `Namespace.of("public", "default")`. Given that this is a vendor-specific thing, I'll go ahead and close this issue.
Re: [PR] mr:Fix issues 10639 [iceberg]
lurnagao-dahua commented on code in PR #10661: URL: https://github.com/apache/iceberg/pull/10661#discussion_r1671665042

## mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
## @@ -144,21 +147,32 @@ public List getSplits(JobContext context) {
 InputFormatConfig.InMemoryDataModel model =
     conf.getEnum(InputFormatConfig.IN_MEMORY_DATA_MODEL, InputFormatConfig.InMemoryDataModel.GENERIC);
-try (CloseableIterable tasksIterable = scan.planTasks()) {
-  Table serializableTable = SerializableTable.copyOf(table);
-  tasksIterable.forEach(
-      task -> {
-        if (applyResidual
-            && (model == InputFormatConfig.InMemoryDataModel.HIVE
-                || model == InputFormatConfig.InMemoryDataModel.PIG)) {
-          // TODO: We do not support residual evaluation for HIVE and PIG in memory data model
-          // yet
-          checkResiduals(task);
-        }
-        splits.add(new IcebergSplit(serializableTable, conf, task));
-      });
-} catch (IOException e) {
-  throw new UncheckedIOException(String.format("Failed to close table scan: %s", scan), e);
+final ExecutorService workerPool =
+    ThreadPools.newWorkerPool(
+        "iceberg-plan-worker-pool",
+        conf.getInt(
+            SystemConfigs.WORKER_THREAD_POOL_SIZE.propertyKey(),
+            ThreadPools.WORKER_THREAD_POOL_SIZE));
+try {
+  scan = scan.planWith(workerPool);

Review Comment: > This test can also run in non-kerberos env, so i think you can use `org.apache.hadoop.security.UserGroupInformation` to login or get ugi info, then you don't need to launch the MiniKdc service.

In the local file system, process users are used to read and write files, not UGI users. File permissions are determined by the permissions of the Linux system, so how can this issue be verified? Moreover, the worker pool shuts down after execution, so the actual UGI cannot be obtained in the thread, and the scope of the `planInputSplits` method is private. Can you provide further guidance on how to write this unit test case? Looking forward to your reply very much!
Re: [PR] mr:Fix issues 10639 [iceberg]
lurnagao-dahua commented on code in PR #10661: URL: https://github.com/apache/iceberg/pull/10661#discussion_r1673566838

## mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
Review Comment: > This test can also run in non-kerberos env, so i think you can use `org.apache.hadoop.security.UserGroupInformation` to login or get ugi info, then you don't need to launch the MiniKdc service.

Thank you for your reply! Sorry for bothering you. May I ask if you could review my code again when you have some free time? I would really appreciate it!
Re: [PR] mr:Fix issues 10639 [iceberg]
lurnagao-dahua commented on code in PR #10661: URL: https://github.com/apache/iceberg/pull/10661#discussion_r1671717882

## mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
Review Comment: Thank you for your reply! The scope of the `planInputSplits` method is private.
Re: [I] iceberg-aws-bundle jar includes org.slf4j.LoggerFactory [iceberg]
nastra commented on issue #10534: URL: https://github.com/apache/iceberg/issues/10534#issuecomment-268080

> Typically, I'd expect iceberg-aws-bundle to declare a dependency on slf4j-api

The general idea of `iceberg-aws-bundle` is to be a fat jar that bundles everything related to AWS dependencies, meaning that such a jar wouldn't declare any other dependencies. Have you tried excluding slf4j when you consume `iceberg-aws-bundle` as a dependency via `exclude group: 'org.slf4j'`?
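A hedged sketch of the exclusion nastra suggests, assuming a Gradle build; the version coordinate is illustrative, not taken from the thread:

```groovy
dependencies {
    implementation('org.apache.iceberg:iceberg-aws-bundle:1.6.0') {
        // Drop org.slf4j from this dependency's transitive graph, as suggested above.
        // Note: this only removes declared transitive dependencies; classes already
        // shaded into the fat jar itself are unaffected.
        exclude group: 'org.slf4j'
    }
}
```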
Re: [PR] mr:Fix issues 10639 [iceberg]
lurnagao-dahua commented on code in PR #10661: URL: https://github.com/apache/iceberg/pull/10661#discussion_r1670390955

## mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
Review Comment: > Looks good! After this change, each fetch task will new its own thread pool, so the ugi problem is gone.
>
> But it would be good if you can try to add a unit test.

Hi, thank you for your reply! I have been thinking for a while and believe that a unit test case covering the output results already exists. But to verify the fixed bug, should I grant the user permissions on the table? I'm not sure how to grant permissions in unit test cases. May I ask if you can give me some guidance on writing the unit test? Looking forward to your reply very much!
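The fix discussed in this thread creates a worker pool per planning call and tears it down afterwards, so no pool (or the credentials its threads captured) outlives the request. A language-neutral sketch of that pattern, with illustrative names (`plan_splits`, `plan_one_task` are not Iceberg API):

```python
from concurrent.futures import ThreadPoolExecutor

def plan_one_task(task):
    # Stand-in for turning one scan task into a split.
    return f"split:{task}"

def plan_splits(tasks, pool_size=4):
    # Fresh pool per call, mirroring ThreadPools.newWorkerPool(...) in the diff.
    worker_pool = ThreadPoolExecutor(
        max_workers=pool_size, thread_name_prefix="iceberg-plan-worker")
    try:
        # Plan every task on the short-lived pool; map preserves input order.
        return list(worker_pool.map(plan_one_task, tasks))
    finally:
        # Mirrors the workerPool shutdown in a finally block, so the pool
        # never leaks past this call.
        worker_pool.shutdown(wait=True)
```

Because the pool is scoped to the call, threads never carry state (such as a cached UGI) from one caller into another.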
Re: [PR] Build: Declare avro as an api dependency of iceberg-core [iceberg]
nastra commented on PR #10573: URL: https://github.com/apache/iceberg/pull/10573#issuecomment-329253

Adding some additional context around `api` vs `implementation` for other reviewers:

> Dependencies appearing in the api configurations will be transitively exposed to consumers of the library, and as such will appear on the compile classpath of consumers. Dependencies found in the implementation configuration will, on the other hand, not be exposed to consumers, and therefore not leak into the consumers' compile classpath
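The quoted Gradle documentation can be illustrated with a hedged config fragment (the coordinates and versions here are illustrative only, not the PR's actual dependency list):

```groovy
dependencies {
    // api: consumers of this library see avro on their own compile classpath,
    // so types from it may appear in this library's public signatures.
    api 'org.apache.avro:avro:1.11.3'
    // implementation: an internal detail; consumers get it at runtime but it
    // does not leak into their compile classpath.
    implementation 'org.apache.commons:commons-lang3:3.14.0'
}
```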
Re: [PR] PyArrow: Don't enforce the schema [iceberg-python]
Fokko commented on code in PR #902: URL: https://github.com/apache/iceberg-python/pull/902#discussion_r1673729169

## tests/integration/test_deletes.py:
## @@ -291,7 +291,7 @@ def test_partitioned_table_positional_deletes_sequence_number(spark: SparkSessio
 assert snapshots[2].summary == Summary(
     Operation.OVERWRITE,
     **{
-        "added-files-size": "1145",
+        "added-files-size": "1125",
Review Comment: Eagle eye, thanks!
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
Fokko commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1673737827

## pyiceberg/io/pyarrow.py:
## @@ -937,7 +937,7 @@ def primitive(self, primitive: pa.DataType) -> PrimitiveType:
 else:
     raise TypeError(f"Unsupported precision for timestamp type: {primitive.unit}")
-if primitive.tz == "UTC" or primitive.tz == "+00:00":
+if primitive.tz in ("UTC", "+00:00", "Etc/UTC"):
Review Comment: Supernit:
```suggestion
if primitive.tz in {"UTC", "+00:00", "Etc/UTC"}:
```
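The check above can be sketched in isolation; `UTC_ALIASES` and `is_utc` are illustrative names, not part of pyiceberg's API, but the accepted spellings come straight from the diff:

```python
# The three spellings of UTC handled by the diff: PyArrow's plain "UTC",
# the fixed offset "+00:00", and the IANA zone name "Etc/UTC".
UTC_ALIASES = {"UTC", "+00:00", "Etc/UTC"}

def is_utc(tz: str) -> bool:
    # Set membership is an average O(1) hash lookup, versus a linear scan
    # for a tuple, which is why the review suggests a set literal here.
    return tz in UTC_ALIASES
```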
Re: [PR] Add checkstyle rule for uppercase constant fields [iceberg]
findepi commented on code in PR #10673: URL: https://github.com/apache/iceberg/pull/10673#discussion_r1673751208

## .baseline/checkstyle/checkstyle.xml:
## @@ -284,6 +284,10 @@
Review Comment: Both messages are clear and actionable. My preference would be not to add configuration that is not needed. I noticed we do set this for other checks, and I believe we can remove those other configurations too, in places where they do not add much value.
Re: [PR] Add checkstyle rule for uppercase constant fields [iceberg]
findepi commented on code in PR #10673: URL: https://github.com/apache/iceberg/pull/10673#discussion_r1673753534

## .baseline/checkstyle/checkstyle.xml:
## @@ -284,6 +284,10 @@
Review Comment: to leverage possessive quantifiers replace
```
^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$
```
with
```
^[A-Z][A-Z0-9]*+(_[A-Z0-9]++)*+$
```
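The base pattern from this checkstyle rule can be exercised on its own; the possessive-quantifier variant (`*+`/`++`) accepts the same language and only prunes backtracking. A small sketch (the helper name is illustrative):

```python
import re

# Uppercase constant names: an uppercase alphanumeric head, then
# underscore-separated uppercase alphanumeric groups.
CONSTANT_NAME = re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$")

def is_constant_name(name: str) -> bool:
    # fullmatch mirrors the anchored ^...$ pattern used by checkstyle.
    return CONSTANT_NAME.fullmatch(name) is not None
```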
Re: [PR] DynConstructors cleanup [iceberg]
findepi commented on code in PR #10542: URL: https://github.com/apache/iceberg/pull/10542#discussion_r1673801761

## common/src/test/java/org/apache/iceberg/common/TestDynConstructors.java:
## @@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.common;
+
+import static org.assertj.core.api.Assertions.assertThat;
+import static org.assertj.core.api.Assertions.assertThatThrownBy;
+
+import org.junit.jupiter.api.Test;
+
+public class TestDynConstructors {
+  @Test
+  public void testImplNewInstance() throws Exception {
+    DynConstructors.Ctor ctor =
+        DynConstructors.builder().impl(MyClass.class).buildChecked();
+    assertThat(ctor.newInstance()).isInstanceOf(MyClass.class);
+  }
+
+  @Test
+  public void testInterfaceImplNewInstance() throws Exception {
+    DynConstructors.Ctor ctor =
+        DynConstructors.builder(MyInterface.class)
+            .impl("org.apache.iceberg.common.TestDynConstructors$MyClass")
+            .buildChecked();
+    assertThat(ctor.newInstance()).isInstanceOf(MyClass.class);
+  }
+
+  @Test
+  public void testInterfaceWrongImplString() throws Exception {
+    DynConstructors.Ctor ctor =
+        DynConstructors.builder(MyInterface.class)
+            // TODO this should throw, since the MyUnrelatedClass does not implement MyInterface
+            .impl("org.apache.iceberg.common.TestDynConstructors$MyUnrelatedClass")
+            .buildChecked();
+    assertThatThrownBy(ctor::newInstance)
+        .isInstanceOf(ClassCastException.class)
+        .hasMessage(
+            "org.apache.iceberg.common.TestDynConstructors$MyUnrelatedClass cannot be cast to org.apache.iceberg.common.TestDynConstructors$MyInterface");
+  }
+
+  @Test
+  public void testInterfaceWrongImplClass() throws Exception {
+    DynConstructors.Ctor ctor =
+        DynConstructors.builder(MyInterface.class)
+            // TODO this should throw or not compile at all, since the MyUnrelatedClass does not

Review Comment: I can give this a try @rdblue. One concern is that the DynConstructors and other Dyn* classes appear to be considered a part of the iceberg-common API, so changes around generics would technically be breaking changes. @rdblue how would you recommend going about this?
Re: [PR] PyArrow: Don't enforce the schema [iceberg-python]
Fokko merged PR #902: URL: https://github.com/apache/iceberg-python/pull/902
Re: [PR] Spec: Clarify which columns can be used for equality delete files. [iceberg]
Fokko commented on code in PR #8981: URL: https://github.com/apache/iceberg/pull/8981#discussion_r1673839996

## format/spec.md:
## @@ -842,7 +842,8 @@ The rows in the delete file must be sorted by `file_path` then `pos` to optimize
 Equality delete files identify deleted rows in a collection of data files by one or more column values, and may optionally contain additional columns of the deleted row.
-Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](#manifests). Float and double columns cannot be used as delete columns in equality delete files.
+Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](#manifests). The column restrictions for columns used in equality delete files are the same as those for [identifier fields](#identifier-field-ids) with the exception that optional columns and columns nested under optional structs are allowed (if a
+parent struct column is null it implies the leaf column is null).
Review Comment: Nit: removes the linebreak
```suggestion
Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](#manifests). The column restrictions for columns used in equality delete files are the same as those for [identifier fields](#identifier-field-ids) with the exception that optional columns and columns nested under optional structs are allowed (if a parent struct column is null it implies the leaf column is null).
```

Review Comment: Thanks for the clarification, this makes sense to me 👍
Re: [PR] Spec: Clarify time travel implementation in Iceberg [iceberg]
Fokko commented on code in PR #8982: URL: https://github.com/apache/iceberg/pull/8982#discussion_r1673841605

## format/spec.md:
## @@ -1370,3 +1370,16 @@ Writing v2 metadata:
 * `sort_columns` was removed
 Note that these requirements apply when writing data to a v2 table. Tables that are upgraded from v1 may contain metadata that does not follow these requirements. Implementations should remain backward-compatible with v1 metadata requirements.
+
+## Appendix F: Implementation Notes
+
+This section covers topics not required by the specification but recommendations for systems implementing the Iceberg specification
+to help maintain a uniform experience.
+
+### Point in Time Reads (Time Travel)
+
+Iceberg supports two types of histories for tables. A history of previous "current snapshots" stored in ["snapshot-log" table metadata](#table-metadata-fields) and [parent-child lineage stored in "snapshots"](#table-metadata-fields). These two histories
Review Comment: Gentle ping @aokolnychyi
Re: [PR] Spec: Clarify time travel implementation in Iceberg [iceberg]
Fokko commented on code in PR #8982: URL: https://github.com/apache/iceberg/pull/8982#discussion_r1673855957

## format/spec.md:
+Iceberg supports two types of histories for tables. A history of previous "current snapshots" stored in ["snapshot-log" table metadata](#table-metadata-fields) and [parent-child lineage stored in "snapshots"](#table-metadata-fields). These two histories
+might indicate different snapshot IDs for a specific timestamp. The discrepencies can be caused by a variety of table operations (e.g. branch-merge table workflows or forcing the current state of table to specific snapshot ID).
Review Comment: I think this explains it all, after reading the comment https://github.com/apache/iceberg/pull/5364#issuecomment-1227902420:
```suggestion
might indicate different snapshot IDs for a specific timestamp. The discrepancies can be caused by a variety of table operations (e.g. updating the `current-snapshot-id` of the table).
```
Re: [PR] Spec: Clarify time travel implementation in Iceberg [iceberg]
Fokko commented on code in PR #8982: URL: https://github.com/apache/iceberg/pull/8982#discussion_r1673855957

## format/spec.md:
Review Comment: I think this explains it all (to me at least :), after reading the comment on PR https://github.com/apache/iceberg/pull/5364#issuecomment-1227902420:
```suggestion
might indicate different snapshot IDs for a specific timestamp. The discrepancies can be caused by a variety of table operations (e.g. updating the `current-snapshot-id` of the table).
```
Re: [PR] mr:Fix issues 10639 [iceberg]
lurnagao-dahua commented on PR #10661: URL: https://github.com/apache/iceberg/pull/10661#issuecomment-698145 @lurnagao-dahua -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Support building with Java 21 [iceberg]
nastra commented on code in PR #10474: URL: https://github.com/apache/iceberg/pull/10474#discussion_r1673871100

## baseline.gradle: ## @@ -46,7 +46,13 @@ subprojects {

  apply plugin: 'com.palantir.baseline-exact-dependencies'
  apply plugin: 'com.palantir.baseline-release-compatibility'
  // We need to update Google Java Format to 1.17.0+ to run spotless on JDK 8, but that requires dropping support for JDK 8.
- if (JavaVersion.current() != JavaVersion.VERSION_21) {
+ if (JavaVersion.current() == JavaVersion.VERSION_21) {
+   task spotlessApply {
+     doLast {
+       throw new GradleException("Spotless plugin is currently disabled when running on JDK 21 (until we drop JDK 8). To run spotlessApply please use a different JDK version.")

Review Comment: thanks for adding that msg, that's going to be helpful

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add checkstyle rule for uppercase constant fields [iceberg]
attilakreiner commented on code in PR #10673: URL: https://github.com/apache/iceberg/pull/10673#discussion_r1673884503 ## .baseline/checkstyle/checkstyle.xml: ## @@ -284,6 +284,10 @@ + + Review Comment: Thanks, I see. I experimented a bit with the possessive quantifiers in the regexes, as well as removing some custom messages that seem redundant. I created a draft PR with these changes: #10681. Even though it seems like just a few lines of changes, it has an impact on the whole codebase and I don't want to hold up this one. (I'd prefer to keep the scope of this PR tight, really just renaming the constants to uppercase and enforcing that rule.) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add checkstyle rule for uppercase constant fields [iceberg]
attilakreiner commented on code in PR #10673: URL: https://github.com/apache/iceberg/pull/10673#discussion_r1673885080 ## .baseline/checkstyle/checkstyle.xml: ## @@ -284,6 +284,10 @@ + + + Review Comment: Got it. Pls see my previous comment about https://github.com/apache/iceberg/pull/10681. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add checkstyle rule for uppercase constant fields [iceberg]
nastra commented on code in PR #10673: URL: https://github.com/apache/iceberg/pull/10673#discussion_r1673890383 ## .baseline/checkstyle/checkstyle.xml: ## @@ -284,6 +284,10 @@ + + Review Comment: I think for this PR we should just stick with the [default regex pattern of checkstyle](https://checkstyle.sourceforge.io/checks/naming/constantname.html), which I believe is what you had initially. Improving the regex can be done in a separate PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
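Checkstyle naming rules are plain Java regexes, so the effect of the default `ConstantName` pattern referenced above is easy to sanity-check. The pattern below is what checkstyle documents as the `ConstantName` default (stated here from its docs, so treat it as an assumption and verify against your checkstyle version); Java and Python regex syntax agree for this expression:

```python
import re

# Checkstyle's documented default ConstantName format (assumption:
# copied from the checkstyle docs, minus the log(ger) exception the
# thread mentions removing).
CONSTANT_NAME = re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$")

for name in ["MAX_FILE_SIZE", "V2", "maxFileSize", "MAX__SIZE"]:
    ok = CONSTANT_NAME.match(name) is not None
    # uppercase words joined by single underscores pass; camelCase
    # and doubled underscores are flagged as violations
    print(f"{name}: {'ok' if ok else 'violation'}")
```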
Re: [PR] Deprecate `oauth/tokens` endpoint [iceberg]
jbonofre commented on code in PR #10603: URL: https://github.com/apache/iceberg/pull/10603#discussion_r1673893448

## open-api/rest-catalog-open-api.yaml: ## @@ -134,9 +134,17 @@ paths:
  post:
    tags:
      - OAuth2 API
-   summary: Get a token using an OAuth2 flow
+   summary: Get a token using an OAuth2 flow (DEPRECATED for REMOVAL)
+   deprecated: true
    operationId: getToken
    description:
+     The `oauth/tokens` endpoint is **DEPRECATED for REMOVAL**. It is _not_ recommended to

Review Comment: @danielcweeks I'm confused. I think there's no problem merging the deprecation flag and message right now, correct? So, I propose to merge this PR right now, and plan "removal" (impacting clients) in 2.0. @snazy wdyt about just updating the deprecation message to target 2.0.0? @danielcweeks maybe worth starting to discuss 2.0.0 somehow? 😄

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add checkstyle rule for uppercase constant fields [iceberg]
attilakreiner commented on code in PR #10673: URL: https://github.com/apache/iceberg/pull/10673#discussion_r1673906437 ## .baseline/checkstyle/checkstyle.xml: ## @@ -284,6 +284,10 @@ + + Review Comment: @nastra, yep, same thought here, hence the spin-off draft PR. In this PR the definition I used is this (this is the current state right now): `` the default you linked is this: `` So the only difference is I removed the first part that allows for the lowercase `log(ger)` as this codebase doesn't use it, so it was redundant. Also we have the custom message in place to be consistent with the rest of the definitions for now. Pls LMK if this is all good as it is right now or we need some changes here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Update Snapshot Retention Properties [iceberg-python]
chinmay-bhat opened a new pull request, #913: URL: https://github.com/apache/iceberg-python/pull/913 WIP - will be ready to review once rollback PR 758 is merged into main. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Update checkstyle definition [iceberg]
findepi commented on code in PR #10681: URL: https://github.com/apache/iceberg/pull/10681#discussion_r1673961429 ## .baseline/checkstyle/checkstyle.xml: ## @@ -480,23 +475,20 @@ - + - - - + Review Comment: This one imo can't be possessive because of the trailing `T` requirement. For example `FooT` matched the previous pattern, but does not match the new pattern. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Expire Snapshots [iceberg-python]
chinmay-bhat opened a new pull request, #914: URL: https://github.com/apache/iceberg-python/pull/914 WIP - [x] initial skeleton - [ ] clean expired snapshots - [ ] write tests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Update checkstyle definition [iceberg]
attilakreiner commented on code in PR #10681: URL: https://github.com/apache/iceberg/pull/10681#discussion_r1673979787 ## .baseline/checkstyle/checkstyle.xml: ## @@ -480,23 +475,20 @@ - + - - - + Review Comment: You're right, thx for pointing it out. I reverted this one change. I double checked the other changes for similar issues and they LGTM. Pls chk and LMK. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Update checkstyle definition [iceberg]
findepi commented on code in PR #10681: URL: https://github.com/apache/iceberg/pull/10681#discussion_r1673988656 ## .baseline/checkstyle/checkstyle.xml: ## @@ -480,23 +475,20 @@ - + - - - + Review Comment: thanks. i also did check other changes. LGTM -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: don't include slf4j-api in bundled JARs [iceberg]
bryanck commented on PR #10665: URL: https://github.com/apache/iceberg/pull/10665#issuecomment-975393 I think this makes sense. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add checkstyle rule for uppercase constant fields [iceberg]
nastra merged PR #10673: URL: https://github.com/apache/iceberg/pull/10673 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Expose table incremental scan for appends API in SerializableTable [iceberg]
nastra commented on PR #10682: URL: https://github.com/apache/iceberg/pull/10682#issuecomment-981956 @deniskuzZ can you add the missing `@Override` please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Build: don't include slf4j-api in bundled JARs [iceberg]
bryanck merged PR #10665: URL: https://github.com/apache/iceberg/pull/10665 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] iceberg-aws-bundle jar includes org.slf4j.LoggerFactory [iceberg]
bryanck closed issue #10534: iceberg-aws-bundle jar includes org.slf4j.LoggerFactory URL: https://github.com/apache/iceberg/issues/10534 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Fix NPE during conflict handling of NULL partitions [iceberg]
deniskuzZ commented on code in PR #10680: URL: https://github.com/apache/iceberg/pull/10680#discussion_r1674048472 ## core/src/main/java/org/apache/iceberg/util/PartitionSet.java: ## @@ -200,7 +200,8 @@ public String toString() { StringBuilder partitionStringBuilder = new StringBuilder(); partitionStringBuilder.append(structType.fields().get(i).name()); partitionStringBuilder.append("="); - partitionStringBuilder.append(s.get(i, Object.class).toString()); + Object value = s.get(i, Object.class); + partitionStringBuilder.append(value == null ? "null" : value.toString()); Review Comment: please use `String.valueOf(s.get(i, Object.class))` instead -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
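The suggested fix works because Java's `String.valueOf(Object)` renders `null` as the string `"null"` instead of throwing the NPE that `s.get(i, Object.class).toString()` does. A rough Python analog of the rendering, with hypothetical names (this is not Iceberg code, just a sketch of the null-safe pattern):

```python
def partition_to_string(fields, values):
    """Render a partition tuple as '{field=value, ...}', tolerating None
    partition values the way Java's String.valueOf(null) yields "null"."""
    parts = []
    for name, value in zip(fields, values):
        # str(None) would give 'None'; normalize to 'null' to mirror
        # the Java String.valueOf behavior
        rendered = "null" if value is None else str(value)
        parts.append(f"{name}={rendered}")
    return "{" + ", ".join(parts) + "}"

print(partition_to_string(["dt", "region"], ["2024-07-01", None]))
# -> {dt=2024-07-01, region=null}
```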
Re: [PR] Core: Expose table incremental scan for appends API in SerializableTable [iceberg]
nastra commented on code in PR #10682: URL: https://github.com/apache/iceberg/pull/10682#discussion_r1674049469 ## core/src/main/java/org/apache/iceberg/SerializableTable.java: ## @@ -278,6 +278,10 @@ public TableScan newScan() { return lazyTable().newScan(); } + public IncrementalAppendScan newIncrementalAppendScan() { Review Comment: can you also override `newIncrementalChangelogScan()` here please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
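The `SerializableTable` change forwards one more scan factory to the lazily loaded underlying table. A hypothetical Python analog of that delegation pattern (names are mine, not PyIceberg or Iceberg API):

```python
class SerializableTable:
    """Sketch of the lazy-delegation pattern: the wrapper defers loading
    the real table until a scan is requested, then forwards calls."""

    def __init__(self, load_table):
        self._load_table = load_table
        self._table = None

    def _lazy_table(self):
        # Load the underlying table once, on first use
        if self._table is None:
            self._table = self._load_table()
        return self._table

    def new_scan(self):
        return self._lazy_table().new_scan()

    # The PR adds the same forwarding for incremental append scans
    def new_incremental_append_scan(self):
        return self._lazy_table().new_incremental_append_scan()


class _FakeTable:
    """Stand-in for the lazily loaded table, for demonstration only."""
    def new_scan(self):
        return "table-scan"
    def new_incremental_append_scan(self):
        return "incremental-append-scan"


st = SerializableTable(lambda: _FakeTable())
print(st.new_incremental_append_scan())  # -> incremental-append-scan
```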
[PR] Spark timestamptz discrepancy [iceberg-python]
syun64 opened a new pull request, #915: URL: https://github.com/apache/iceberg-python/pull/915 I ran into this issue while trying to improve our tests for verifying write integrity for the timestamp types, following our recent implementations for supporting inputs of other precisions and timezones. This draft PR demonstrates a discrepancy with Spark Iceberg, where TimestamptzType is read back as TimestampType, whereas PyIceberg correctly reads it as TimestamptzType. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Bump mypy-boto3-glue from 1.34.136 to 1.34.143 [iceberg-python]
Fokko merged PR #912: URL: https://github.com/apache/iceberg-python/pull/912 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[I] Detecting duplicates in the Flink Data Stream API [iceberg]
lkokhreidze opened a new issue, #10683: URL: https://github.com/apache/iceberg/issues/10683 ### Query engine Flink ### Question Hi, I was wondering if there's a way we could detect whether an ongoing batch written to the Iceberg table would perform an upsert? Context: I have an Iceberg table that is being written with the Flink DataStream API with upsert enabled. The identifier field is configured to be the `ID` of the domain. My use case is to mark the rows that are upserted with an appropriate flag, and I was wondering if there's a way to get a hold of this information within the DataStream API? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Will not be merged: Spark timestamptz discrepancy [iceberg-python]
syun64 commented on PR #915: URL: https://github.com/apache/iceberg-python/pull/915#issuecomment-2223035079 Here's the dataframe schema that represents the Iceberg tables that are loaded through: 1. Spark Iceberg 2. PyIceberg Spark Iceberg loads both timestamptz and timestamp types as timestamp, whereas PyIceberg is able to distinguish these two types correctly

```
tests/integration/test_writes/test_writes.py:1063: AssertionError
- Captured stdout call -
timestamp_s               datetime64[ns]
timestamptz_s             datetime64[us]
timestamp_ms              datetime64[ns]
timestamptz_ms            datetime64[us]
timestamp_us              datetime64[ns]
timestamptz_us            datetime64[us]
timestamp_ns              datetime64[ns]
timestamptz_ns            datetime64[us]
timestamptz_us_etc_utc    datetime64[us]
timestamptz_us_z          datetime64[us]
dtype: object
timestamp_s               datetime64[us]
timestamptz_s             datetime64[us, UTC]
timestamp_ms              datetime64[us]
timestamptz_ms            datetime64[us, UTC]
timestamp_us              datetime64[us]
timestamptz_us            datetime64[us, UTC]
timestamp_ns              datetime64[us]
timestamptz_ns            datetime64[us, UTC]
timestamptz_us_etc_utc    datetime64[us, UTC]
timestamptz_us_z          datetime64[us, UTC]
dtype: object
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
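The dtype listings reflect the underlying spec distinction: Iceberg's `timestamptz` is UTC-adjusted (timezone-aware) while `timestamp` is zone-less (naive), which pandas reports as `datetime64[us, UTC]` vs `datetime64[us]`. In stdlib terms, the same split looks like this (a sketch of the concept, not the PyIceberg read path):

```python
from datetime import datetime, timezone

# Iceberg timestamp: no zone attached (naive), like datetime64[us]
ts = datetime(2017, 11, 16, 22, 31, 8)

# Iceberg timestamptz: UTC-adjusted (aware), like datetime64[us, UTC]
tstz = datetime(2017, 11, 16, 22, 31, 8, tzinfo=timezone.utc)

print(ts.tzinfo)    # None -> naive, matching the first dtype block
print(tstz.tzinfo)  # UTC  -> aware, matching the second dtype block
```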
Re: [PR] Update checkstyle definition [iceberg]
attilakreiner commented on PR #10681: URL: https://github.com/apache/iceberg/pull/10681#issuecomment-2223052922 Merged in the `main` branch and fixed `ConstantName`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] OpenAPI: Deprecate `oauth/tokens` endpoint [iceberg]
snazy commented on PR #10603: URL: https://github.com/apache/iceberg/pull/10603#issuecomment-2223065233 I've updated the PR to mention "2.0". The CI failures look unrelated, but I don't have the power to rerun those. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Fix NPE during conflict handling of NULL partitions [iceberg]
boroknagyz commented on code in PR #10680: URL: https://github.com/apache/iceberg/pull/10680#discussion_r1674098559 ## core/src/main/java/org/apache/iceberg/util/PartitionSet.java: ## @@ -200,7 +200,8 @@ public String toString() { StringBuilder partitionStringBuilder = new StringBuilder(); partitionStringBuilder.append(structType.fields().get(i).name()); partitionStringBuilder.append("="); - partitionStringBuilder.append(s.get(i, Object.class).toString()); + Object value = s.get(i, Object.class); + partitionStringBuilder.append(value == null ? "null" : value.toString()); Review Comment: Yeah, that's nicer. Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]
epgif commented on code in PR #9008: URL: https://github.com/apache/iceberg/pull/9008#discussion_r1674134865 ## api/src/test/java/org/apache/iceberg/transforms/TestBucketing.java: ## @@ -112,12 +112,6 @@ public void testSpecValues() { .as("Spec example: hash(2017-11-16T22:31:08) = -204791") .isEqualTo(-204791); -Literal timestamptzVal = Review Comment: Oops. Undid that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]
epgif commented on code in PR #9008: URL: https://github.com/apache/iceberg/pull/9008#discussion_r1674135249 ## api/src/test/java/org/apache/iceberg/transforms/TestBucketing.java: ## @@ -165,6 +159,68 @@ public void testLong() { .isEqualTo(hashBytes(buffer.array())); } + @Test + public void testTimestampNanoPromotion() { +// Values from spec Appendix B: 32-bit Hash Requirements +String timestamp1 = "2017-11-16T22:31:08"; +long expectedHash1 = -204791; +String timestamp2 = "2017-11-16T22:31:08.01"; +long expectedHash2 = -1207196810; +String timestampNs1 = "2017-11-16T22:31:08"; +String timestampNs2 = "2017-11-16T22:31:08.01001"; + +Types.TimestampType tsType = Types.TimestampType.withoutZone(); +Types.TimestampNanoType tsNsType = Types.TimestampNanoType.withoutZone(); + +Bucket tsNsBucket = Bucket.get(tsNsType, 1); +Bucket tsBucket = Bucket.get(tsType, 1); + +assertThat(tsBucket.hash(Literal.of(timestamp1).to(tsType).value())) +.as("Timestamp and TimestampNano bucket results should match") Review Comment: done
Re: [PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]
epgif commented on code in PR #9008: URL: https://github.com/apache/iceberg/pull/9008#discussion_r1674135937 ## api/src/test/java/org/apache/iceberg/transforms/TestBucketing.java: ## @@ -165,6 +159,68 @@ public void testLong() { .isEqualTo(hashBytes(buffer.array())); } + @Test + public void testTimestampNanoPromotion() { +// Values from spec Appendix B: 32-bit Hash Requirements +String timestamp1 = "2017-11-16T22:31:08"; +long expectedHash1 = -204791; +String timestamp2 = "2017-11-16T22:31:08.01"; +long expectedHash2 = -1207196810; +String timestampNs1 = "2017-11-16T22:31:08"; Review Comment: This is obviated by the requested inlining. ## api/src/test/java/org/apache/iceberg/transforms/TestBucketing.java: ## @@ -165,6 +159,68 @@ public void testLong() { .isEqualTo(hashBytes(buffer.array())); } + @Test + public void testTimestampNanoPromotion() { +// Values from spec Appendix B: 32-bit Hash Requirements +String timestamp1 = "2017-11-16T22:31:08"; Review Comment: done
Re: [PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]
epgif commented on PR #9008: URL: https://github.com/apache/iceberg/pull/9008#issuecomment-2223123048
> overall this LGTM once comments in `TestBucketing` have been addressed

I've addressed these. Thanks!
Re: [PR] Check if schema is compatible in `add_files` API [iceberg-python]
syun64 commented on code in PR #907: URL: https://github.com/apache/iceberg-python/pull/907#discussion_r1674162322 ## pyiceberg/io/pyarrow.py: ## @@ -2026,6 +2072,8 @@ def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_ f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids" ) schema = table_metadata.schema() +_check_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema()) Review Comment: Thank you for the catch @HonahX - I think you are right. Adding a nanosecond-timestamp file means Spark Iceberg cannot read it correctly; reads instead fail with exceptions like: ``` ValueError: year 53177 is out of range ``` I will make `downcast_ns_timestamp_to_us_on_write` an input argument to `_check_schema_compatible`, so that we can prevent nanosecond timestamp types from being added through `add_files`, but continue to support downcasting them in overwrite/append.
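The "year 53177 is out of range" failure is characteristic of a nanosecond epoch value being interpreted at microsecond precision: the instant inflates by a factor of 1000 and lands tens of thousands of years in the future. A stdlib-only sketch of the effect (illustrative only, not the PyIceberg code path; the exact year and exception type vary by value and platform):

```python
from datetime import datetime, timezone

NS_2024 = 1_704_067_200_000_000_000  # 2024-01-01T00:00:00Z in nanoseconds

# misreading the nanosecond count as microseconds lands ~54,000 years out,
# beyond datetime's year-9999 ceiling, so conversion blows up
failed = False
try:
    datetime.fromtimestamp(NS_2024 / 1_000_000, tz=timezone.utc)
except (ValueError, OverflowError, OSError):
    failed = True
assert failed

# downcasting ns -> us first (what downcast_ns_timestamp_to_us_on_write
# controls on the write path) recovers the intended instant
micros = NS_2024 // 1_000
dt = datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)
assert dt.year == 2024
```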
Re: [PR] Kafka Connect: Commit coordination [iceberg]
danielcweeks commented on PR #10351: URL: https://github.com/apache/iceberg/pull/10351#issuecomment-2223231870 Thanks @bryanck and @fqaiser94. It's really great to get this one in.
Re: [PR] Kafka Connect: Commit coordination [iceberg]
danielcweeks merged PR #10351: URL: https://github.com/apache/iceberg/pull/10351
Re: [PR] OpenAPI: Deprecate `oauth/tokens` endpoint [iceberg]
jackye1995 commented on PR #10603: URL: https://github.com/apache/iceberg/pull/10603#issuecomment-2223257678 Can you try rebase to see if it fixes the CI?
Re: [PR] API: add resultSchema() method to StructTransform [iceberg]
stevenzwu commented on PR #10496: URL: https://github.com/apache/iceberg/pull/10496#issuecomment-2223275495 Hmm, `TestSparkDataFile` failed after this change. ``` Caused by: org.apache.iceberg.exceptions.ValidationException: Invalid schema: multiple fields for name ts: 9 and 9 at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:49) at org.apache.iceberg.types.IndexByName.addField(IndexByName.java:195) at org.apache.iceberg.types.IndexByName.field(IndexByName.java:161) at org.apache.iceberg.types.IndexByName.field(IndexByName.java:34) at org.apache.iceberg.types.TypeUtil.visit(TypeUtil.java:611) at org.apache.iceberg.types.TypeUtil.indexNameById(TypeUtil.java:173) at org.apache.iceberg.Schema.lazyIdToName(Schema.java:213) at org.apache.iceberg.Schema.(Schema.java:142) at org.apache.iceberg.Schema.(Schema.java:99) at org.apache.iceberg.Schema.(Schema.java:87) at org.apache.iceberg.Schema.(Schema.java:83) at org.apache.iceberg.StructTransform.(StructTransform.java:82) at org.apache.iceberg.PartitionKey.(PartitionKey.java:37) at org.apache.iceberg.spark.source.SparkWrite$PartitionedDataWriter.(SparkWrite.java:774) ``` This test class applies two partition spec transforms (identity and hour) to the same field `ts`. Is that a legitimate use case? If yes, then we have a problem with this proposed change. ``` private static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA) .identity("b") .bucket("i", 2) .identity("l") .identity("f") .identity("d") .identity("date") .hour("ts") .identity("ts") .identity("tsntz") .truncate("s", 2) .identity("bytes") .bucket("dec_9_0", 2) .bucket("dec_11_2", 2) .bucket("dec_38_10", 2) .build(); ``` cc @rdblue @aokolnychyi @RussellSpitzer @szehon-ho
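The failure reproduces whenever the result schema is built from source-column names alone: both hour("ts") and identity("ts") would contribute a field named ts. A small sketch (hypothetical helper, not the StructTransform code) contrasting transform-aware naming, which Iceberg's partition fields use (e.g. "ts_hour"), with the naive naming that collides:

```python
def check_unique(names):
    # raise on duplicate field names, like Schema's name index does
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"Invalid schema: multiple fields for name {', '.join(duplicates)}")
    return names


def result_field_names(partition_fields, transform_aware=True):
    # partition_fields: (source_column, transform) pairs
    if transform_aware:
        # identity keeps the source name; other transforms get a suffix
        names = [s if t == "identity" else f"{s}_{t}" for s, t in partition_fields]
    else:
        # naive: name result fields after the source column only
        names = [s for s, _ in partition_fields]
    return check_unique(names)


fields = [("ts", "hour"), ("ts", "identity")]
# transform-aware naming keeps hour + identity on the same column distinct
assert result_field_names(fields) == ["ts_hour", "ts"]

# source-only naming collides: the shape of the TestSparkDataFile failure
collided = False
try:
    result_field_names(fields, transform_aware=False)
except ValueError:
    collided = True
assert collided
```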
Re: [PR] OpenAPI: Deprecate `oauth/tokens` endpoint [iceberg]
danielcweeks commented on code in PR #10603: URL: https://github.com/apache/iceberg/pull/10603#discussion_r1674264095 ## open-api/rest-catalog-open-api.yaml: ## @@ -134,9 +134,17 @@ paths: post: tags: - OAuth2 API - summary: Get a token using an OAuth2 flow + summary: Get a token using an OAuth2 flow (DEPRECATED for REMOVAL) + deprecated: true operationId: getToken description: +The `oauth/tokens` endpoint is **DEPRECATED for REMOVAL**. It is _not_ recommended to Review Comment: I think we're good now. The issue was that the target removal was 1.7 instead of 2.0, which I think was just confusion about deprecation vs. removal.
Re: [PR] OpenAPI: Deprecate `oauth/tokens` endpoint [iceberg]
jbonofre commented on code in PR #10603: URL: https://github.com/apache/iceberg/pull/10603#discussion_r1674277456 ## open-api/rest-catalog-open-api.yaml: ## @@ -134,9 +134,17 @@ paths: post: tags: - OAuth2 API - summary: Get a token using an OAuth2 flow + summary: Get a token using an OAuth2 flow (DEPRECATED for REMOVAL) + deprecated: true operationId: getToken description: +The `oauth/tokens` endpoint is **DEPRECATED for REMOVAL**. It is _not_ recommended to Review Comment: @danielcweeks yes, that's why I was confused 😄 It seems we were "mixing" deprecation and removal. I agree it looks good now. Thanks!
Re: [PR] OpenAPI: Deprecate `oauth/tokens` endpoint [iceberg]
danielcweeks commented on code in PR #10603: URL: https://github.com/apache/iceberg/pull/10603#discussion_r1674279099 ## open-api/rest-catalog-open-api.yaml: ## @@ -134,9 +134,22 @@ paths: post: tags: - OAuth2 API - summary: Get a token using an OAuth2 flow + summary: Get a token using an OAuth2 flow (DEPRECATED for REMOVAL) + deprecated: true operationId: getToken description: +The `oauth/tokens` endpoint is **DEPRECATED for REMOVAL**. It is _not_ recommended to +implement this endpoint, unless you are fully aware of the potential security implications. + +All clients are encouraged to explicitly set the configuration property `oauth2-server-uri` +to the correct OAuth endpoint. + +Deprecated since Iceberg (Java) 1.6.0. The endpoint and related types will be removed from +this spec in Iceberg (Java) 2.0, whichever is released sooner. Review Comment: ```suggestion this spec in Iceberg (Java) 2.0. ```
Re: [PR] Core: use bulk delete when removing old metadata.json files [iceberg]
dramaticlly commented on PR #10679: URL: https://github.com/apache/iceberg/pull/10679#issuecomment-2223373221
> I think this is fine, I mentioned this in another similar PR though so I thought I would note it here as well. We need to make sure our test case now runs with Filesystems that both support and don't support BulkOperations otherwise we have an untested branch. Is that the case here? We may need to start parameterizing the tests again along the axis of File System Capabilities.

I think so. I was using `testMetadataFileLocationsRemovalAfterCommit` in `CatalogTests.java` to verify the correctness of this change. It is parameterized to test various catalogs: the in-memory catalog uses an in-memory FileIO (no bulk deletion support) and the Hive catalog uses Hadoop FileIO (bulk deletion supported), plus various other catalogs.
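The capability split discussed here — some FileIO implementations support bulk deletes while others only delete per file — is why both branches need test coverage. A Python sketch (hypothetical classes, not Iceberg's Java API) of dispatching on the capability the way the metadata cleanup does:

```python
class PerFileIO:
    # baseline FileIO: only single-file deletion
    def __init__(self):
        self.deleted = []

    def delete_file(self, path):
        self.deleted.append(path)


class BulkIO(PerFileIO):
    # analogue of a FileIO implementing SupportsBulkOperations
    def delete_files(self, paths):
        self.deleted.extend(paths)


def delete_old_metadata(io, paths):
    # one bulk call when supported, otherwise fall back to per-file deletes
    if hasattr(io, "delete_files"):
        io.delete_files(list(paths))
    else:
        for path in paths:
            io.delete_file(path)


old = ["v1.metadata.json", "v2.metadata.json"]
for io in (PerFileIO(), BulkIO()):
    delete_old_metadata(io, old)
    assert io.deleted == old  # both capability branches remove the same files
```

Parameterizing tests over both IO classes, as the catalog tests do over catalogs, exercises the otherwise-untested fallback branch.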
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
syun64 commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674334862 ## pyiceberg/table/__init__.py: ## @@ -528,10 +528,6 @@ def append(self, df: pa.Table, snapshot_properties: Dict[str, str] = EMPTY_DICT) ) _check_schema_compatible(self._table.schema(), other_schema=df.schema) -# cast if the two schemas are compatible but not equal -table_arrow_schema = self._table.schema().as_arrow() -if table_arrow_schema != df.schema: -df = df.cast(table_arrow_schema) Review Comment: Removed this cast function - `to_requested_schema` should be responsible for casting the types to their desired schema, instead of casting it here
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
syun64 commented on PR #910: URL: https://github.com/apache/iceberg-python/pull/910#issuecomment-2223408958 @Fokko @HonahX - thank you for your reviews. I've updated the integration test to make the [check more comprehensive](https://github.com/apache/iceberg-python/pull/910/files#diff-7f3dd1244d08ce27c003cd091da10aa049f7bb0c7d5397acb4ec69767036accdR1063-R1074), and removed the [pyarrow table casting](https://github.com/apache/iceberg-python/pull/910/files#diff-23e8153e0fd497a9212215bd2067068f3b56fa071770c7ef326db3d3d03cee9bL590-L593) in `append` and `overwrite` so that each column can be cast with a more targeted `safe` flag for each field.
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
syun64 commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674351468 ## pyiceberg/io/pyarrow.py: ## @@ -1320,7 +1321,16 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: and pa.types.is_timestamp(values.type) and values.type.unit == "ns" ): -return values.cast(target_type, safe=False) +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type, safe=False) +if ( +pa.types.is_timestamp(target_type) +and target_type.unit == "us" +and pa.types.is_timestamp(values.type) +and values.type.unit in {"s", "ms", "us"} +): +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type) Review Comment: Does this look correct — casting the types here so that we follow the Iceberg spec for Parquet physical and logical types? https://iceberg.apache.org/spec/#parquet
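The branching above separates widening unit conversions from narrowing ones: s/ms → us multiplies and is lossless (a plain safe cast), while ns → us divides and drops sub-microsecond digits (hence Arrow's `safe=False`). A stdlib-only sketch of the two cases (illustrative integer arithmetic, not the PyIceberg code path):

```python
US_PER_UNIT = {"s": 1_000_000, "ms": 1_000, "us": 1}

def to_micros(value: int, unit: str) -> int:
    # widening: seconds/milliseconds scale up to microseconds exactly
    if unit in US_PER_UNIT:
        return value * US_PER_UNIT[unit]
    # narrowing: nanoseconds truncate to microseconds -- precision loss,
    # which is why the Arrow cast for this case needs safe=False
    if unit == "ns":
        return value // 1_000
    raise ValueError(f"unknown timestamp unit: {unit}")


assert to_micros(1, "s") == 1_000_000       # lossless
assert to_micros(5, "ms") == 5_000          # lossless
assert to_micros(1_234_567, "ns") == 1_234  # trailing 567 ns dropped
```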
Re: [I] Upgrade to Gradle 8.9 [iceberg]
jbonofre commented on issue #10685: URL: https://github.com/apache/iceberg/issues/10685#issuecomment-2223424034 @Fokko @nastra I'm working on the Gradle update PR. Would you guys mind assigning this ticket to me and setting the milestone to 1.6.0? Thanks!
Re: [PR] Upgrade to Gradle 8.9 [iceberg]
jbonofre commented on PR #10686: URL: https://github.com/apache/iceberg/pull/10686#issuecomment-2223435089 This PR closes #10685
Re: [PR] Upgrade to Gradle 8.9 [iceberg]
jbonofre commented on PR #10686: URL: https://github.com/apache/iceberg/pull/10686#issuecomment-2223435767 @Fokko @nastra thanks in advance, gentlemen! 😄
Re: [I] Running MERGE INTO with more than one WHEN condition fails if the number of columns in the target table is > 321 [iceberg]
andreaschiappacasse commented on issue #10294: URL: https://github.com/apache/iceberg/issues/10294#issuecomment-2223558105 @krishan711 we ended up using Spark instead of Athena for the upsert/delete operation. It is still very unfortunate, because it is much more expensive and adds some complexity to our architecture, but in case it helps you: that's what our team did in our case.
Re: [PR] Core: Expose table incremental scan for appends API in SerializableTable [iceberg]
deniskuzZ commented on code in PR #10682: URL: https://github.com/apache/iceberg/pull/10682#discussion_r1674500091 ## core/src/main/java/org/apache/iceberg/SerializableTable.java: ## @@ -278,6 +278,10 @@ public TableScan newScan() { return lazyTable().newScan(); } + public IncrementalAppendScan newIncrementalAppendScan() { Review Comment: thanks @nastra for the review! added missing `@Override` annotation
Re: [PR] Core: Expose table incremental scan for appends API in SerializableTable [iceberg]
deniskuzZ commented on code in PR #10682: URL: https://github.com/apache/iceberg/pull/10682#discussion_r1674506153 ## core/src/main/java/org/apache/iceberg/SerializableTable.java: ## @@ -278,6 +278,10 @@ public TableScan newScan() { return lazyTable().newScan(); } + public IncrementalAppendScan newIncrementalAppendScan() { Review Comment: `+` newIncrementalChangelogScan() override
Re: [PR] Upgrade to Gradle 8.9 [iceberg]
Fokko merged PR #10686: URL: https://github.com/apache/iceberg/pull/10686
Re: [I] Upgrade to Gradle 8.9 [iceberg]
Fokko closed issue #10685: Upgrade to Gradle 8.9 URL: https://github.com/apache/iceberg/issues/10685
Re: [PR] Upgrade to Gradle 8.9 [iceberg]
Fokko commented on PR #10686: URL: https://github.com/apache/iceberg/pull/10686#issuecomment-2223698571 Thanks @jbonofre for bumping Gradle here, and thanks @nastra, @ajantha-bhat and @snazy for the prompt review 🚀
Re: [PR] OpenAPI: Deprecate `oauth/tokens` endpoint [iceberg]
snazy commented on PR #10603: URL: https://github.com/apache/iceberg/pull/10603#issuecomment-2223703737
> Can you try rebase to see if it fixes the CI?

CI looking good
Re: [PR] mr:Fix issues 10639 [iceberg]
deniskuzZ commented on PR #10661: URL: https://github.com/apache/iceberg/pull/10661#issuecomment-2223715282 LGTM +1
Re: [PR] mr:Fix issues 10639 [iceberg]
deniskuzZ commented on code in PR #10661: URL: https://github.com/apache/iceberg/pull/10661#discussion_r1674554724 ## mr/src/test/java/org/apache/iceberg/mr/TestIcebergInputFormats.java: ## @@ -381,6 +386,46 @@ public void testCustomCatalog() throws IOException { testInputFormat.create(builder.conf()).validate(expectedRecords); } + @TestTemplate + public void testWorkerPool() throws Exception { +// 1.The ugi in the same thread will not change +final ExecutorService workerPool1 = ThreadPools.newWorkerPool("iceberg-plan-worker-pool", 1); +UserGroupInformation user1 = +UserGroupInformation.createUserForTesting("user1", new String[] {}); +UserGroupInformation user2 = +UserGroupInformation.createUserForTesting("user2", new String[] {}); +assertThat(setAndGetUgi(user1, workerPool1)).isEqualTo("user1"); +assertThat(setAndGetUgi(user2, workerPool1)).isEqualTo("user1"); +workerPool1.shutdown(); + +// 2.The ugi in different threads will be different +final ExecutorService workerPool2 = ThreadPools.newWorkerPool("iceberg-plan-worker-pool", 1); +assertThat(setAndGetUgi(user2, workerPool2)).isEqualTo("user2"); +workerPool2.shutdown(); + } + + private String setAndGetUgi(UserGroupInformation ugi, ExecutorService workpool) Review Comment: @lurnagao-dahua, the test looks good to me, thanks for adding it! Unfortunately, I don't have committer rights in the iceberg project, but I'll add my +1
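The two-part behavior this test pins down — a pooled worker thread keeps the identity it was created under, while a fresh pool picks up the current one — can be reproduced with a plain thread-local in Python (an analogy for UGI caching, not Hadoop code):

```python
from concurrent.futures import ThreadPoolExecutor
import threading

_state = threading.local()

def set_and_get_user(user: str) -> str:
    # like UGI: the worker thread caches the identity it first ran under
    if not hasattr(_state, "user"):
        _state.user = user
    return _state.user


# 1. within one single-threaded pool, the cached identity never changes
pool1 = ThreadPoolExecutor(max_workers=1)
assert pool1.submit(set_and_get_user, "user1").result() == "user1"
assert pool1.submit(set_and_get_user, "user2").result() == "user1"  # stale
pool1.shutdown()

# 2. a new pool means a new worker thread, so the new identity is observed
pool2 = ThreadPoolExecutor(max_workers=1)
assert pool2.submit(set_and_get_user, "user2").result() == "user2"
pool2.shutdown()
```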
Re: [PR] mr:Fix issues 10639 [iceberg]
deniskuzZ commented on code in PR #10661: URL: https://github.com/apache/iceberg/pull/10661#discussion_r1674558387 ## mr/src/test/java/org/apache/iceberg/mr/TestIcebergInputFormats.java: ## @@ -381,6 +386,56 @@ public void testCustomCatalog() throws IOException { testInputFormat.create(builder.conf()).validate(expectedRecords); } + @TestTemplate + public void testWorkerPool() throws Exception { +Table table = helper.createUnpartitionedTable(); +List records = helper.generateRandomRecords(1, 0L); +helper.appendToTable(null, records); Review Comment: do we really need to inject some data?
Re: [PR] mr:Fix issues 10639 [iceberg]
lurnagao-dahua commented on code in PR #10661: URL: https://github.com/apache/iceberg/pull/10661#discussion_r1674565642 ## mr/src/test/java/org/apache/iceberg/mr/TestIcebergInputFormats.java: ## @@ -381,6 +386,56 @@ public void testCustomCatalog() throws IOException { testInputFormat.create(builder.conf()).validate(expectedRecords); } + @TestTemplate + public void testWorkerPool() throws Exception { +Table table = helper.createUnpartitionedTable(); +List records = helper.generateRandomRecords(1, 0L); +helper.appendToTable(null, records); Review Comment: I think only inserting data creates the thread in the worker pool and initializes the UGI.
Re: [PR] mr:Fix issues 10639 [iceberg]
deniskuzZ commented on code in PR #10661: URL: https://github.com/apache/iceberg/pull/10661#discussion_r1674568873 ## mr/src/test/java/org/apache/iceberg/mr/TestIcebergInputFormats.java: ## @@ -381,6 +386,56 @@ public void testCustomCatalog() throws IOException { testInputFormat.create(builder.conf()).validate(expectedRecords); } + @TestTemplate + public void testWorkerPool() throws Exception { +Table table = helper.createUnpartitionedTable(); +List records = helper.generateRandomRecords(1, 0L); +helper.appendToTable(null, records); Review Comment: 👍, I thought we could save some CPU cycles here
Re: [PR] Check if schema is compatible in `add_files` API [iceberg-python]
Fokko commented on code in PR #907: URL: https://github.com/apache/iceberg-python/pull/907#discussion_r1674603748 ## pyiceberg/io/pyarrow.py: ## @@ -2026,6 +2072,8 @@ def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_ f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids" ) schema = table_metadata.schema() +_check_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema()) Review Comment: In general, I think it is okay to be able to read more broadly, but we do need tests to ensure that it works correctly. Looking at Arrow, there are already some physical types that we don't support (`date64`, etc.). In Java we do support reading timestamps that are encoded as `INT96`, but we should not produce them.
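The `_check_schema_compatible` call being discussed validates each incoming Parquet file's schema against the table schema before the file is registered. As a rough, hypothetical illustration of the idea only (not PyIceberg's actual implementation; the helper name, the dict-based schema representation, and the rules shown are simplified assumptions), such a check walks the table schema and flags missing required fields and type mismatches:

```python
# Toy sketch of a schema-compatibility check for an `add_files`-style API.
# A "schema" here is just a mapping; the real implementation works on
# Iceberg Schema and Arrow Schema objects with field-ID resolution.

def check_schema_compatible(table_schema: dict, file_schema: dict) -> list:
    """Return a list of human-readable problems; an empty list means compatible.

    table_schema: name -> (type_name, required)
    file_schema:  name -> type_name
    """
    problems = []
    for name, (type_name, required) in table_schema.items():
        if name not in file_schema:
            # Optional fields may be absent; required fields may not.
            if required:
                problems.append(f"missing required field: {name}")
        elif file_schema[name] != type_name:
            problems.append(
                f"type mismatch for {name}: table={type_name}, file={file_schema[name]}"
            )
    return problems


table = {"id": ("long", True), "ts": ("timestamp", False)}
assert check_schema_compatible(table, {"id": "long", "ts": "timestamp"}) == []
assert len(check_schema_compatible(table, {"ts": "timestamptz"})) == 2
```

A real check would additionally consult type-promotion rules (e.g. `int` → `long`), which is exactly where the "read more broadly" question in the review comes in.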
Re: [PR] Check if schema is compatible in `add_files` API [iceberg-python]
Fokko commented on code in PR #907: URL: https://github.com/apache/iceberg-python/pull/907#discussion_r1674607135 ## pyiceberg/io/pyarrow.py: ## @@ -166,6 +166,7 @@ ONE_MEGABYTE = 1024 * 1024 BUFFER_SIZE = "buffer-size" + Review Comment: ```suggestion ```
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
Fokko commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674682639 ## pyiceberg/io/pyarrow.py: ## @@ -1320,7 +1321,16 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: and pa.types.is_timestamp(values.type) and values.type.unit == "ns" ): -return values.cast(target_type, safe=False) +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type, safe=False) +if ( +pa.types.is_timestamp(target_type) +and target_type.unit == "us" +and pa.types.is_timestamp(values.type) +and values.type.unit in {"s", "ms", "us"} +): +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type) Review Comment: Yes, it looks good. Arrow writes `INT64` by default for timestamps: https://arrow.apache.org/docs/cpp/parquet.html#logical-types
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
Fokko commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674684198 ## pyiceberg/io/pyarrow.py: ## @@ -1320,7 +1321,16 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: and pa.types.is_timestamp(values.type) and values.type.unit == "ns" ): -return values.cast(target_type, safe=False) +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type, safe=False) +if ( +pa.types.is_timestamp(target_type) +and target_type.unit == "us" +and pa.types.is_timestamp(values.type) +and values.type.unit in {"s", "ms", "us"} +): +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: Review Comment: Can we add parentheses?
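For context, the parentheses request is about readability rather than behavior: in Python, `and` binds more tightly than `or`, so `a and b or c and d` already parses as `(a and b) or (c and d)`, which is the grouping the condition intends. A small self-contained sketch (helper name and alias list are illustrative assumptions) showing the intended grouping made explicit:

```python
# The review's condition, factored out with explicit parentheses.
# Without them, Python still groups the two `and` clauses first,
# because `and` has higher precedence than `or`.

def compatible(target_tz, values_tz, utc_aliases=("UTC", "Etc/UTC", "+00:00")):
    """True when both sides are UTC-equivalent or both are timezone-naive."""
    return (target_tz == "UTC" and values_tz in utc_aliases) or (
        not target_tz and not values_tz
    )


assert compatible("UTC", "Etc/UTC") is True   # both UTC-equivalent
assert compatible(None, None) is True         # both naive
assert compatible("UTC", None) is False       # aware vs. naive: incompatible
```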
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
Fokko commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674686353 ## pyiceberg/io/pyarrow.py: ## @@ -1320,7 +1321,16 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: and pa.types.is_timestamp(values.type) and values.type.unit == "ns" ): -return values.cast(target_type, safe=False) +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type, safe=False) +if ( +pa.types.is_timestamp(target_type) +and target_type.unit == "us" +and pa.types.is_timestamp(values.type) +and values.type.unit in {"s", "ms", "us"} Review Comment: When we're requesting `us` and the file provides `us`, I don't think we need to cast? ```suggestion and values.type.unit in {"s", "ms"} ```
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
syun64 commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674689843 ## pyiceberg/io/pyarrow.py: ## @@ -1320,7 +1321,16 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: and pa.types.is_timestamp(values.type) and values.type.unit == "ns" ): -return values.cast(target_type, safe=False) +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type, safe=False) +if ( +pa.types.is_timestamp(target_type) +and target_type.unit == "us" +and pa.types.is_timestamp(values.type) +and values.type.unit in {"s", "ms", "us"} Review Comment: Sorry, this is a bit confusing, I agree 😓. I included it here because I wanted to support casting `pa.timestamp('us', tz='Etc/UTC')` to `pa.timestamp('us', tz='UTC')` within the same condition. I think we won't hit this condition if both the input and requested types are `pa.timestamp('us')`, because we enter this block only if target_type and values.type are not equal: https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L1315
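Why casting `pa.timestamp('us', tz='Etc/UTC')` to `pa.timestamp('us', tz='UTC')` is lossless can be shown with the standard library alone: `Etc/UTC` is an IANA alias for UTC, so both labels denote the same zero offset and the same instant. A minimal sketch, assuming the system's `tzdata` zone database is available:

```python
# "Etc/UTC" and "UTC" are two names for the same timezone, so a cast
# between the two timestamptz types only relabels the data.
from datetime import datetime
from zoneinfo import ZoneInfo

t_alias = datetime(2024, 7, 11, 12, 0, tzinfo=ZoneInfo("Etc/UTC"))
t_utc = datetime(2024, 7, 11, 12, 0, tzinfo=ZoneInfo("UTC"))

assert t_alias.utcoffset() == t_utc.utcoffset()    # both are +00:00
assert t_alias == t_utc                            # same instant
assert t_alias.timestamp() == t_utc.timestamp()    # same epoch value
```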
Re: [PR] Support Spark Column Stats [iceberg]
singhpk234 commented on code in PR #10659: URL: https://github.com/apache/iceberg/pull/10659#discussion_r1674683253 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: ## @@ -175,7 +181,25 @@ public Statistics estimateStatistics() { protected Statistics estimateStatistics(Snapshot snapshot) { // its a fresh table, no data if (snapshot == null) { - return new Stats(0L, 0L); + return new Stats(0L, 0L, Maps.newHashMap()); +} + +Map<NamedReference, ColumnStatistics> map = Maps.newHashMap(); + +if (readConf.enableColumnStats()) { Review Comment: [Q] NDV can't be used in non-CBO cases. Do we need to additionally couple this flag with a check that CBO is enabled (`spark.sql.cbo.enabled`)? Otherwise `SizeInBytesOnlyStatsPlanVisitor` will be used.
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
syun64 commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674691317 ## pyiceberg/io/pyarrow.py: ## @@ -1320,7 +1321,16 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: and pa.types.is_timestamp(values.type) and values.type.unit == "ns" ): -return values.cast(target_type, safe=False) +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type, safe=False) +if ( +pa.types.is_timestamp(target_type) +and target_type.unit == "us" +and pa.types.is_timestamp(values.type) +and values.type.unit in {"s", "ms", "us"} +): +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: Review Comment: Added
Re: [PR] Support Spark Column Stats [iceberg]
szehon-ho commented on code in PR #10659: URL: https://github.com/apache/iceberg/pull/10659#discussion_r1674693714 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkScan.java: ## @@ -97,6 +117,36 @@ public static Object[][] parameters() { }; } + @BeforeAll Review Comment: Why is it needed? ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: ## @@ -175,7 +181,25 @@ public Statistics estimateStatistics() { protected Statistics estimateStatistics(Snapshot snapshot) { // its a fresh table, no data if (snapshot == null) { - return new Stats(0L, 0L); + return new Stats(0L, 0L, Maps.newHashMap()); +} + +Map<NamedReference, ColumnStatistics> map = Maps.newHashMap(); + +if (readConf.enableColumnStats()) { + List<StatisticsFile> files = table.statisticsFiles(); + if (!files.isEmpty()) { +List<BlobMetadata> metadataList = (files.get(0)).blobMetadata(); + +for (BlobMetadata blobMetadata : metadataList) { + long ndv = Long.parseLong(blobMetadata.properties().get("ndv")); + ColumnStatistics colStats = new ColStats(ndv, null, null, 0L, 0L, 0L, null); Review Comment: Should we fill the other values from the manifest file? ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: ## @@ -175,7 +181,25 @@ public Statistics estimateStatistics() { protected Statistics estimateStatistics(Snapshot snapshot) { // its a fresh table, no data if (snapshot == null) { - return new Stats(0L, 0L); + return new Stats(0L, 0L, Maps.newHashMap()); +} + +Map<NamedReference, ColumnStatistics> map = Maps.newHashMap(); Review Comment: Can we have a more descriptive name? 
## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java: ## @@ -175,7 +181,25 @@ public Statistics estimateStatistics() { protected Statistics estimateStatistics(Snapshot snapshot) { // its a fresh table, no data if (snapshot == null) { - return new Stats(0L, 0L); + return new Stats(0L, 0L, Maps.newHashMap()); +} + +Map<NamedReference, ColumnStatistics> map = Maps.newHashMap(); + +if (readConf.enableColumnStats()) { + List<StatisticsFile> files = table.statisticsFiles(); + if (!files.isEmpty()) { +List<BlobMetadata> metadataList = (files.get(0)).blobMetadata(); + +for (BlobMetadata blobMetadata : metadataList) { + long ndv = Long.parseLong(blobMetadata.properties().get("ndv")); Review Comment: I think we should be more defensive here: what if there is no value, or a `NumberFormatException` is thrown? ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java: ## @@ -90,4 +90,8 @@ private SparkSQLProperties() {} public static final String EXECUTOR_CACHE_LOCALITY_ENABLED = "spark.sql.iceberg.executor-cache.locality.enabled"; public static final boolean EXECUTOR_CACHE_LOCALITY_ENABLED_DEFAULT = false; + + // Controls whether column statistics are enabled when estimating statistics Review Comment: the comment doesn't read as meaningful at first; can we be more precise? ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/ColStats.java: ## @@ -0,0 +1,87 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.spark.source; + +import java.util.Optional; +import java.util.OptionalLong; +import org.apache.spark.sql.connector.read.colstats.ColumnStatistics; +import org.apache.spark.sql.connector.read.colstats.Histogram; + +class ColStats implements ColumnStatistics { Review Comment: Nit: this is a bit confusing; `ColStats` just abbreviates the Spark class name. How about `SparkColumnStatistics`? (It looks like many classes in this package are named SparkXXX in this situation.)
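The defensiveness requested around `Long.parseLong(blobMetadata.properties().get("ndv"))` covers two failure modes: a missing `ndv` key (passing `null` into `parseLong`) and a malformed value (a `NumberFormatException`). A sketch of the safe-parse idea, written in Python for brevity (the function name and the return-`None` convention are illustrative assumptions, not the PR's code, which is Java):

```python
# Parse an optional numeric property without throwing, returning None
# when the key is absent or the value is not an integer.
from typing import Optional


def parse_ndv(properties: dict) -> Optional[int]:
    """Return the NDV as an int, or None when absent or unparseable."""
    raw = properties.get("ndv")
    if raw is None:
        return None  # key missing: analogous to parseLong(null) blowing up
    try:
        return int(raw)
    except ValueError:
        return None  # malformed value: NumberFormatException analogue


assert parse_ndv({"ndv": "42"}) == 42
assert parse_ndv({}) is None
assert parse_ndv({"ndv": "oops"}) is None
```

The Java equivalent would guard the `get("ndv")` result for `null` and wrap the parse in a `try/catch (NumberFormatException e)`, skipping the column (or logging) instead of failing the whole statistics estimate.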
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
Fokko commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674710445 ## pyiceberg/io/pyarrow.py: ## @@ -1320,7 +1321,16 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: and pa.types.is_timestamp(values.type) and values.type.unit == "ns" ): -return values.cast(target_type, safe=False) +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type, safe=False) +if ( +pa.types.is_timestamp(target_type) +and target_type.unit == "us" +and pa.types.is_timestamp(values.type) +and values.type.unit in {"s", "ms", "us"} Review Comment: I think we might need to do some more work here. In Iceberg we're rather strict on the distinction between Timestamp and TimestampTZ. A good way of showing this can be found here: https://github.com/apache/iceberg-python/blob/aceed2ad9898eaa8b0c883a195e58af5d0e3/pyiceberg/expressions/literals.py#L566-L572 This is when we parse a string from a literal, which often comes from an expression: `dt >= '1925-05-22T00:00:00'`. If the value has a timezone, even UTC, we reject it as `Timestamp`. When it has a timezone, we [normalize it to UTC and then store the integer](https://github.com/apache/iceberg-python/blob/aceed2ad9898eaa8b0c883a195e58af5d0e3/pyiceberg/utils/datetime.py#L75).
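The normalization described above (convert an aware timestamp to UTC, then store an integer) can be sketched with the standard library; the function name and epoch arithmetic below are illustrative assumptions, not PyIceberg's exact code:

```python
# Normalize a timezone-aware datetime to UTC and store it as epoch
# microseconds: any input offset for the same instant yields the same value.
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timestamptz_to_micros(dt: datetime) -> int:
    """Normalize an aware datetime to UTC and return epoch microseconds."""
    if dt.tzinfo is None:
        # Mirrors Iceberg's strictness: a timestamptz value must be aware.
        raise ValueError("timestamptz requires a timezone-aware datetime")
    delta = dt.astimezone(timezone.utc) - EPOCH
    return delta // timedelta(microseconds=1)


# The same instant written with two different offsets stores identically:
a = datetime(2024, 7, 11, 2, 0, tzinfo=timezone.utc)
b = datetime(2024, 7, 11, 4, 0, tzinfo=timezone(timedelta(hours=2)))
assert timestamptz_to_micros(a) == timestamptz_to_micros(b)
```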
Re: [I] Support writing to a branch [iceberg-python]
kevinjqliu commented on issue #306: URL: https://github.com/apache/iceberg-python/issues/306#issuecomment-2224059233 @vinjai yes! Please go ahead.
Re: [PR] [Docs] Add examples for DataFrame branch writes [iceberg]
szehon-ho commented on code in PR #10644: URL: https://github.com/apache/iceberg/pull/10644#discussion_r1674784383 ## docs/docs/spark-writes.md: ## @@ -332,6 +332,30 @@ The writer must enable the `mergeSchema` option. ```scala data.writeTo("prod.db.sample").option("mergeSchema","true").append() ``` + +### Writing to Branches +The branch must exist before performing a write. The operation does **not** create the branch if it does not exist. Review Comment: Initially, "Writing to Branches" seemed like a first-level header (along with "Writing with SQL" and "Writing with Dataframes"); was that intentional? Do you think we can just add an example using a DataFrame here? My thought is that then we don't have to repeat this block of text, which we would have to do if we downgrade it to subsections within "Writing with SQL" and "Writing with Dataframes".
Re: [PR] support PyArrow timestamptz with Etc/UTC [iceberg-python]
syun64 commented on code in PR #910: URL: https://github.com/apache/iceberg-python/pull/910#discussion_r1674797591 ## pyiceberg/io/pyarrow.py: ## @@ -1296,31 +1297,49 @@ def to_requested_schema( class ArrowProjectionVisitor(SchemaWithPartnerVisitor[pa.Array, Optional[pa.Array]]): -file_schema: Schema +_file_schema: Schema _include_field_ids: bool +_downcast_ns_timestamp_to_us: bool def __init__(self, file_schema: Schema, downcast_ns_timestamp_to_us: bool = False, include_field_ids: bool = False) -> None: -self.file_schema = file_schema +self._file_schema = file_schema self._include_field_ids = include_field_ids -self.downcast_ns_timestamp_to_us = downcast_ns_timestamp_to_us +self._downcast_ns_timestamp_to_us = downcast_ns_timestamp_to_us def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: -file_field = self.file_schema.find_field(field.field_id) +file_field = self._file_schema.find_field(field.field_id) if field.field_type.is_primitive: if field.field_type != file_field.field_type: return values.cast( schema_to_pyarrow(promote(file_field.field_type, field.field_type), include_field_ids=self._include_field_ids) ) elif (target_type := schema_to_pyarrow(field.field_type, include_field_ids=self._include_field_ids)) != values.type: -# Downcasting of nanoseconds to microseconds -if ( -pa.types.is_timestamp(target_type) -and target_type.unit == "us" -and pa.types.is_timestamp(values.type) -and values.type.unit == "ns" -): -return values.cast(target_type, safe=False) +if field.field_type == TimestampType(): Review Comment: stricter and clearer `field_type` driven checks ## tests/io/test_pyarrow.py: ## @@ -1798,3 +1799,35 @@ def test_identity_partition_on_multi_columns() -> None: ("n_legs", "ascending"), ("animal", "ascending"), ]) == arrow_table.sort_by([("born_year", "ascending"), ("n_legs", "ascending"), ("animal", "ascending")]) + + +def test_to_requested_schema_timestamps( Review Comment: I noticed that there weren't any tests for 
`to_requested_schema` although it is a public API, so I added this in. Do we want to keep `to_requested_schema` as a public API? ## pyiceberg/io/pyarrow.py: ## @@ -1320,7 +1321,16 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: and pa.types.is_timestamp(values.type) and values.type.unit == "ns" ): -return values.cast(target_type, safe=False) +if target_type.tz == "UTC" and values.type.tz in UTC_ALIASES or not target_type.tz and not values.type.tz: +return values.cast(target_type, safe=False) +if ( +pa.types.is_timestamp(target_type) +and target_type.unit == "us" +and pa.types.is_timestamp(values.type) +and values.type.unit in {"s", "ms", "us"} Review Comment: Sounds good @Fokko, thank you for the review. I've made these checks stricter and also clearer for users to follow.
[PR] Bump coverage from 7.5.4 to 7.6.0 [iceberg-python]
dependabot[bot] opened a new pull request, #917: URL: https://github.com/apache/iceberg-python/pull/917 Bumps [coverage](https://github.com/nedbat/coveragepy) from 7.5.4 to 7.6.0. Changelog (sourced from coverage's CHANGES.rst), Version 7.6.0 — 2024-07-11:
- Exclusion patterns can now be multi-line, thanks to Daniel Diniz. This enables many interesting exclusion use cases, including those requested in issues 118 (entire files), 996 (multiple lines only when appearing together), 1741 (remainder of a function), and 1803 (arbitrary sequence of marked lines). See the `multi_line_exclude` section of the docs for more details and examples.
- The JSON report now includes per-function and per-class coverage information. Thanks to Daniel Diniz for getting the work started. This closes issues 1793 and 1532.
- Fixed an incorrect calculation of "(no class)" lines in the HTML classes report.
- Python 3.13.0b3 is supported.
Referenced issues and pulls: nedbat/coveragepy#118, #996, #1532, #1741, #1793, #1803, #1807, #1809.
Commits:
- 59a3cd7 docs: sample HTML for 7.6.0
- 7f27fa7 docs: prep for 7.6.0
- 6a268b0 docs: issues closed by the json region reporting
- 5bfe9e7 chore: bump actions/setup-python from 5.1.0 to 5.1.1 (#1814)
- ab609ef docs: mention json region reporting in the changes
- 92d96b9 fix: json report needs 'no class' and 'no function' also
- e47e7e7 refactor: move duplicate code into methods
- 3d6be2b fix: json format should bump for regions
- a9992d2 test: add a test of json regions with branches
- 8b89764 test: json expectations should have explicit format number
Additional commits are viewable in the compare view: https://github.com/nedbat/coveragepy/compare/7.5.4...7.6.0
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
Dependabot commands and options: You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ig