[PR] Core: Add Catalog Transactions API [iceberg]

2024-09-09 Thread via GitHub


nastra opened a new pull request, #6948:
URL: https://github.com/apache/iceberg/pull/6948

   I have written up a design doc, which is available 
[here](https://docs.google.com/document/d/1UxXifU8iqP_byaW4E2RuKZx1nobxmAvc5urVcWas1B8/edit?usp=sharing)
   
   I think eventually we'd want to split this into 2 PRs, so that the 
introduced APIs can be reviewed independently from the REST-specific things.





Re: [PR] feat: Reassign field ids for schema [iceberg-rust]

2024-09-09 Thread via GitHub


c-thiel commented on code in PR #615:
URL: https://github.com/apache/iceberg-rust/pull/615#discussion_r1749940779


##
crates/iceberg/src/spec/schema.rs:
##
@@ -86,6 +87,16 @@ impl SchemaBuilder {
         self
     }
 
+    /// Reassign all field-ids (nested) on build.
+    /// If `start_from` is provided, it will start reassigning from that id (inclusive).
+    /// If not provided, it will start from 0.
+    ///
+    /// All specified aliases and identifier fields will be updated to the new field-ids.
+    pub fn with_reassigned_field_ids(mut self, start_from: Option<i32>) -> Self {
+        self.reassign_field_ids_from = Some(start_from.unwrap_or(0));

Review Comment:
   @Xuanwo my only doubt with a plain i32 is that people might pass 1, assuming some value is required - and it's shorter than `Default::default()`.
   The Rust tests are quite a good example that this can happen: we have many tests where field numbering starts from 1, so I had to rewrite them to be compatible with the new builder.
   
   I am OK with the i32 as well, just wanted to bring up that a `None` is probably more verbose.
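   
   (A minimal, self-contained sketch of the ergonomics being debated - the builder below is illustrative only, not the actual iceberg-rust `SchemaBuilder`:)
   ```rust
   struct Builder {
       reassign_from: Option<i32>,
   }
   
   impl Builder {
       // Option<i32>: `None` states "use the default start" explicitly.
       fn with_reassigned_field_ids(mut self, start_from: Option<i32>) -> Self {
           self.reassign_from = Some(start_from.unwrap_or(0));
           self
       }
   }
   
   fn main() {
       let b = Builder { reassign_from: None };
       // Callers either opt in to the default start...
       let b = b.with_reassigned_field_ids(None);
       // ...or pick an explicit starting id (e.g. tests that number fields from 1).
       let _b = b.with_reassigned_field_ids(Some(1));
   }
   ```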






Re: [PR] DRAFT - Issue 10275 - Reward support for nulls [iceberg]

2024-09-09 Thread via GitHub


nastra commented on code in PR #10953:
URL: https://github.com/apache/iceberg/pull/10953#discussion_r1749992612


##
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java:
##
@@ -140,14 +141,16 @@ public static class ConstantVectorHolder<T> extends VectorHolder {
     private final int numRows;
 
     public ConstantVectorHolder(int numRows) {
+      super(new NullVector("_dummy_", numRows), null, null);
       this.numRows = numRows;
       this.constantValue = null;
     }
 
     public ConstantVectorHolder(Types.NestedField icebergField, int numRows, T constantValue) {
-      super(icebergField);
+      super(new NullVector(icebergField.name(), numRows), icebergField, new NullabilityHolder(numRows));
       this.numRows = numRows;
       this.constantValue = constantValue;
+      this.vector = new NullVector(icebergField.name(), numRows);

Review Comment:
   I don't think we should be creating two instances of `NullVector`. Either 
use the one you pass via `super(...)` or this one, but not both
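   
   (A hedged sketch of the first option - keep only the `NullVector` handed to `super(...)`; the constructor shape follows the diff above and is not necessarily the final code:)
   ```java
   public ConstantVectorHolder(Types.NestedField icebergField, int numRows, T constantValue) {
     // Single NullVector instance, owned by the base class; no second copy assigned to this.vector.
     super(new NullVector(icebergField.name(), numRows), icebergField, new NullabilityHolder(numRows));
     this.numRows = numRows;
     this.constantValue = constantValue;
   }
   ```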






Re: [PR] DRAFT - Issue 10275 - Reward support for nulls [iceberg]

2024-09-09 Thread via GitHub


nastra commented on code in PR #10953:
URL: https://github.com/apache/iceberg/pull/10953#discussion_r1749997916


##
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactoryTest.java:
##
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.arrow.vectorized;
+
+import static org.assertj.core.api.Assertions.assertThat;
+import static org.assertj.core.api.Assertions.assertThatThrownBy;
+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.when;
+
+import java.math.BigDecimal;
+import java.util.function.Supplier;
+import org.apache.arrow.vector.IntVector;
+import org.apache.iceberg.types.Types;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.schema.PrimitiveType;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.mockito.InjectMocks;
+import org.mockito.Mock;
+import org.mockito.MockitoAnnotations;
+
+class GenericArrowVectorAccessorFactoryTest {
+  @Mock
+  Supplier> decimalFactorySupplier;
+
+  @Mock Supplier> stringFactorySupplier;
+
+  @Mock
+  Supplier> structChildFactorySupplier;
+
+  @Mock
+  Supplier> arrayFactorySupplier;
+
+  @InjectMocks GenericArrowVectorAccessorFactory genericArrowVectorAccessorFactory;
+
+  @BeforeEach
+  void before() {
+    MockitoAnnotations.openMocks(this);
+  }
+
+  @Test
+  void testGetVectorAccessorWithIntVector() {
+    IntVector vector = mock(IntVector.class);
+    when(vector.get(0)).thenReturn(88);
+
+    Types.NestedField nestedField = Types.NestedField.optional(0, "a1", Types.IntegerType.get());
+    ColumnDescriptor columnDescriptor =
+        new ColumnDescriptor(
+            new String[] {nestedField.name()}, PrimitiveType.PrimitiveTypeName.INT32, 0, 1);
+    NullabilityHolder nullabilityHolder = new NullabilityHolder(1);
+    VectorHolder vectorHolder =
+        new VectorHolder(columnDescriptor, vector, false, null, nullabilityHolder, nestedField);
+    ArrowVectorAccessor actual = genericArrowVectorAccessorFactory.getVectorAccessor(vectorHolder);
+    assertThat(actual).isNotNull();
+    assertThat(actual).isInstanceOf(ArrowVectorAccessor.class);
+    int intValue = actual.getInt(0);
+    assertThat(intValue).isEqualTo(88);
+  }
+
+  @Test
+  void testGetVectorAccessorWithNullVector() {
+    assertThatThrownBy(
+        () -> {

Review Comment:
   the {} can be removed
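   
   (For illustration, the brace-less form being asked for - the call inside the lambda and the expected exception are hypothetical, since the original lambda body is cut off above:)
   ```java
   // hypothetical example; substitute the actual call and expected exception from the test
   assertThatThrownBy(() -> genericArrowVectorAccessorFactory.getVectorAccessor(nullVectorHolder))
       .isInstanceOf(UnsupportedOperationException.class);
   ```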






Re: [PR] DRAFT - Issue 10275 - Reward support for nulls [iceberg]

2024-09-09 Thread via GitHub


nastra commented on code in PR #10953:
URL: https://github.com/apache/iceberg/pull/10953#discussion_r1749998444


##
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactoryTest.java:
##
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.arrow.vectorized;
+
+import static org.assertj.core.api.Assertions.assertThat;
+import static org.assertj.core.api.Assertions.assertThatThrownBy;
+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.when;
+
+import java.math.BigDecimal;
+import java.util.function.Supplier;
+import org.apache.arrow.vector.IntVector;
+import org.apache.iceberg.types.Types;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.schema.PrimitiveType;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.mockito.InjectMocks;
+import org.mockito.Mock;
+import org.mockito.MockitoAnnotations;
+
+class GenericArrowVectorAccessorFactoryTest {
+  @Mock
+  Supplier> decimalFactorySupplier;
+
+  @Mock Supplier> stringFactorySupplier;
+
+  @Mock
+  Supplier> structChildFactorySupplier;
+
+  @Mock
+  Supplier> arrayFactorySupplier;
+
+  @InjectMocks GenericArrowVectorAccessorFactory genericArrowVectorAccessorFactory;
+
+  @BeforeEach
+  void before() {
+    MockitoAnnotations.openMocks(this);
+  }
+
+  @Test
+  void testGetVectorAccessorWithIntVector() {
+    IntVector vector = mock(IntVector.class);
+    when(vector.get(0)).thenReturn(88);
+
+    Types.NestedField nestedField = Types.NestedField.optional(0, "a1", Types.IntegerType.get());
+    ColumnDescriptor columnDescriptor =
+        new ColumnDescriptor(
+            new String[] {nestedField.name()}, PrimitiveType.PrimitiveTypeName.INT32, 0, 1);
+    NullabilityHolder nullabilityHolder = new NullabilityHolder(1);
+    VectorHolder vectorHolder =
+        new VectorHolder(columnDescriptor, vector, false, null, nullabilityHolder, nestedField);
+    ArrowVectorAccessor actual = genericArrowVectorAccessorFactory.getVectorAccessor(vectorHolder);
+    assertThat(actual).isNotNull();
+    assertThat(actual).isInstanceOf(ArrowVectorAccessor.class);

Review Comment:
   ```suggestion
   assertThat(actual).isNotNull().isInstanceOf(ArrowVectorAccessor.class);
   ```






Re: [PR] TableMetadataBuilder [iceberg-rust]

2024-09-09 Thread via GitHub


c-thiel commented on PR #587:
URL: https://github.com/apache/iceberg-rust/pull/587#issuecomment-2337769487

   @liurenjie1024 thanks for the feedback!
   
   > > My first point of the opening statement: Do we re-write our SortOrder 
and add the schema to PartitionSpec so that we can match on names like Java 
does or not?
   > 
   > To be honest, I don't quite understand the use case. We can ask for background on this in the dev channel, but I think this is not a blocker for this PR; we can always add this later.
   
   The problem with changing it later is that it changes the semantics of the function. Right now we expect source_id to match the `current_schema()` (which is also the reason why I expose it during build). Java doesn't do this; instead, it looks up the name by id in the schema bound to a SortOrder or PartitionSpec, and then searches for the same name in the new schema to use it.
   
   In my opinion ids are much cleaner than names (we might have dropped and re-added a column with the same name in the meantime), so I am OK with going forward. However, moving over to Java semantics will require new endpoints (e.g. `add_migrate_partition_spec` or so) that take a bound partition spec, in contrast to the unbound spec we currently pass in.
   
   Give me a thumbs up if that's OK for you. I'll also open a discussion in the 
dev channel to get some more opinions.
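   
   (A hedged, self-contained sketch of the Java-style "rebind by name" semantics described above - the `Schema` stand-in here is illustrative, not the iceberg-rust type:)
   ```rust
   use std::collections::HashMap;
   
   // Illustrative stand-in for a schema: field id -> field name.
   struct Schema {
       fields: HashMap<i32, String>,
   }
   
   impl Schema {
       fn name_by_id(&self, id: i32) -> Option<&String> {
           self.fields.get(&id)
       }
   
       fn id_by_name(&self, name: &str) -> Option<i32> {
           self.fields.iter().find(|(_, n)| n.as_str() == name).map(|(id, _)| *id)
       }
   }
   
   // Java-style rebinding: resolve the name in the old (bound) schema, then find the
   // field with the same name in the new schema and use its id.
   fn rebind_source_id(old: &Schema, new: &Schema, source_id: i32) -> Option<i32> {
       let name = old.name_by_id(source_id)?;
       new.id_by_name(name)
   }
   ```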





Re: [PR] Feat: Normalize TableMetadata [iceberg-rust]

2024-09-09 Thread via GitHub


c-thiel commented on PR #611:
URL: https://github.com/apache/iceberg-rust/pull/611#issuecomment-2337827094

   > Thank you very much for working on this. I have a few style suggestions, 
but please feel free to disregard them if they don't suit you.
   
   Thanks @Xuanwo, they all make sense! Addressed in [Improve readability & 
comments](https://github.com/apache/iceberg-rust/pull/611/commits/65074a1ea7e3eaaf52ca148bbf4fda7ecbda0c8c)





Re: [I] Case sensitivity is not respected when using IcebergGenerics.ScanBuilder [iceberg]

2024-09-09 Thread via GitHub


mderoy closed issue #8178: Case sensitivity is not respected when using 
IcebergGenerics.ScanBuilder
URL: https://github.com/apache/iceberg/issues/8178





Re: [PR] Spark 3.3, 3.4: Fix incorrect catalog loaded in TestCreateActions [iceberg]

2024-09-09 Thread via GitHub


nastra commented on code in PR #11049:
URL: https://github.com/apache/iceberg/pull/11049#discussion_r1750181394


##
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java:
##
@@ -728,6 +733,8 @@ public void testStructOfThreeLevelLists() throws Exception {
 
   @Test
   public void testTwoLevelList() throws IOException {
+    Assume.assumeTrue("Cannot migrate to a hadoop based catalog", !type.equals("hadoop"));

Review Comment:
   why not use Assumptions from AssertJ?
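   
   (For reference, a sketch of what that could look like - assuming a static import of `org.assertj.core.api.Assumptions.assumeThat`:)
   ```java
   assumeThat(type)
       .as("Cannot migrate to a hadoop based catalog")
       .isNotEqualTo("hadoop");
   ```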






Re: [I] Library public api isolation and import decoupling [iceberg-python]

2024-09-09 Thread via GitHub


ndrluis closed issue #499: Library public api isolation and import decoupling
URL: https://github.com/apache/iceberg-python/issues/499





Re: [I] Library public api isolation and import decoupling [iceberg-python]

2024-09-09 Thread via GitHub


ndrluis commented on issue #499:
URL: https://github.com/apache/iceberg-python/issues/499#issuecomment-2338019734

   I'll close this issue because we added the mypy linter, which solves our 
problem with coupling, and we have #1099 to solve the public API definition.





Re: [PR] Feat: Normalize TableMetadata [iceberg-rust]

2024-09-09 Thread via GitHub


Xuanwo merged PR #611:
URL: https://github.com/apache/iceberg-rust/pull/611





Re: [PR] TableMetadataBuilder [iceberg-rust]

2024-09-09 Thread via GitHub


Xuanwo commented on PR #587:
URL: https://github.com/apache/iceberg-rust/pull/587#issuecomment-2338031867

   I have reviewed most of the PRs that I am confident can be merged. The only one left is https://github.com/apache/iceberg-rust/pull/615, for which I need more input.





Re: [PR] Remove Hive 2 [iceberg]

2024-09-09 Thread via GitHub


nastra commented on code in PR #10996:
URL: https://github.com/apache/iceberg/pull/10996#discussion_r1750198331


##
mr/src/main/java/org/apache/iceberg/mr/hive/serde/objectinspector/IcebergObjectInspector.java:
##
@@ -27,33 +27,23 @@
 import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.common.DynMethods;
-import org.apache.iceberg.hive.HiveVersion;
 import org.apache.iceberg.types.Type;
 import org.apache.iceberg.types.TypeUtil;
 import org.apache.iceberg.types.Types;
 
 public final class IcebergObjectInspector extends TypeUtil.SchemaVisitor {
 
-  // get the correct inspectors depending on whether we're working with Hive2 or Hive3 dependencies
-  // we need to do this because there is a breaking API change in Date/TimestampObjectInspector
-  // between Hive2 and Hive3
   private static final String DATE_INSPECTOR_CLASS =
-      HiveVersion.min(HiveVersion.HIVE_3)
-          ? "org.apache.iceberg.mr.hive.serde.objectinspector.IcebergDateObjectInspectorHive3"
-          : "org.apache.iceberg.mr.hive.serde.objectinspector.IcebergDateObjectInspector";
+      "org.apache.iceberg.mr.hive.serde.objectinspector.IcebergDateObjectInspector";

Review Comment:
   is this the correct one?






Re: [PR] Remove Hive 2 [iceberg]

2024-09-09 Thread via GitHub


nastra commented on code in PR #10996:
URL: https://github.com/apache/iceberg/pull/10996#discussion_r1750200212


##
mr/src/main/java/org/apache/iceberg/mr/hive/serde/objectinspector/IcebergObjectInspector.java:
##
@@ -27,33 +27,23 @@
 import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.common.DynMethods;
-import org.apache.iceberg.hive.HiveVersion;
 import org.apache.iceberg.types.Type;
 import org.apache.iceberg.types.TypeUtil;
 import org.apache.iceberg.types.Types;
 
 public final class IcebergObjectInspector extends TypeUtil.SchemaVisitor {
 
-  // get the correct inspectors depending on whether we're working with Hive2 or Hive3 dependencies
-  // we need to do this because there is a breaking API change in Date/TimestampObjectInspector
-  // between Hive2 and Hive3
   private static final String DATE_INSPECTOR_CLASS =
-      HiveVersion.min(HiveVersion.HIVE_3)
-          ? "org.apache.iceberg.mr.hive.serde.objectinspector.IcebergDateObjectInspectorHive3"
-          : "org.apache.iceberg.mr.hive.serde.objectinspector.IcebergDateObjectInspector";
+      "org.apache.iceberg.mr.hive.serde.objectinspector.IcebergDateObjectInspector";

Review Comment:
   maybe just rename those classes so that they don't have the `Hive` suffix






Re: [PR] scan: fix error when reading an empty table [iceberg-rust]

2024-09-09 Thread via GitHub


sdd commented on PR #608:
URL: https://github.com/apache/iceberg-rust/pull/608#issuecomment-2338064061

   Just to clarify, not having any snapshots is not necessarily the same as not 
having any data. If there is no current snapshot then there can't be any data, 
but someone could delete all data from a table, resulting in there being a 
snapshot, but no data. The existing code would handle this second case just 
fine - we only need to handle the issue of no snapshots. 
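   
   (A minimal sketch of that distinction - the accessor names are assumptions about the iceberg-rust API, not a quote of the fix:)
   ```rust
   // Only the "table has no snapshot at all" case needs special handling: return an
   // empty stream instead of planning. A snapshot whose data was fully deleted still
   // has a current snapshot and flows through the normal planning path.
   fn needs_empty_scan(metadata: &iceberg::spec::TableMetadata) -> bool {
       metadata.current_snapshot().is_none()
   }
   ```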





Re: [PR] Core: Add support for `view-default` property in catalog [iceberg]

2024-09-09 Thread via GitHub


nastra commented on code in PR #11064:
URL: https://github.com/apache/iceberg/pull/11064#discussion_r1750254414


##
open-api/src/testFixtures/java/org/apache/iceberg/rest/RCKUtils.java:
##
@@ -85,7 +85,8 @@ static RESTCatalog initCatalogClient() {
     catalogProperties.putIfAbsent(
         CatalogProperties.URI,
         String.format("http://localhost:%s/", RESTCatalogServer.REST_PORT_DEFAULT));
-    catalogProperties.putIfAbsent(CatalogProperties.WAREHOUSE_LOCATION, "rck_warehouse");

Review Comment:
   is this intentionally being removed?






Re: [PR] Core: Add support for `view-default` property in catalog [iceberg]

2024-09-09 Thread via GitHub


nastra commented on code in PR #11064:
URL: https://github.com/apache/iceberg/pull/11064#discussion_r1750255921


##
core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java:
##
@@ -107,6 +108,7 @@ public void basicCreateView() {
     assertThat(view.currentVersion().operation()).isEqualTo("create");
     assertThat(view.schemas()).hasSize(1).containsKey(0);
     assertThat(view.versions()).hasSize(1).containsExactly(view.currentVersion());
+    assertThat(view.properties()).contains(entry("key1", "catalog-default-key1"));

Review Comment:
   ```suggestion
       assertThat(view.properties()).containsEntry("key1", "catalog-default-key1");
   ```






Re: [PR] Core: fix NPE with HadoopFileIO because FileIOParser doesn't serialize Hadoop configuration [iceberg]

2024-09-09 Thread via GitHub


pvary commented on code in PR #10926:
URL: https://github.com/apache/iceberg/pull/10926#discussion_r1750267503


##
core/src/main/java/org/apache/iceberg/hadoop/HadoopFileIO.java:
##
@@ -63,7 +63,11 @@ public class HadoopFileIO implements HadoopConfigurable, DelegateFileIO {
    * {@link Configuration Hadoop configuration} must be set through {@link
    * HadoopFileIO#setConf(Configuration)}
    */
-  public HadoopFileIO() {}
+  public HadoopFileIO() {
+    // Create a default hadoopConf as it is required for the object to be valid.
+    // E.g. newInputFile would throw NPE with hadoopConf.get() otherwise.
+    this.hadoopConf = new SerializableConfiguration(new Configuration())::get;

Review Comment:
   @stevenzwu: 
   > if I understand correctly, your point is that FileIO probably shouldn't be 
part of the task state for BaseEntriesTable.ManifestReadTask, 
AllManifestsTable.ManifestListReadTask? [..] It seems like a large/challenging 
refactoring. Looking for other folks' take on this.
   
   Yes, this seems strange and problematic to me. I have not been able to find an easy solution yet. I was hoping others with better knowledge might have some ideas.
   
   > Regardless, the issue remains that FileIOParser doesn't serialize HadoopFileIO faithfully. I don't know if REST catalog has any need to use it to JSON serialize FileIO in the future.
   
   My point here is that, since we use `Configuration(false)` in some cases, and given the way the current serialization works, we already don't serialize `HadoopFileIO` faithfully. So if we don't find a solution for getting rid of the FileIO, we might as well write our own "unfaithful" serialization that mimics how the current serialization works.






Re: [PR] Flink: Avoid metaspace memory leak by not registering ShutdownHook for ExecutorService in Flink [iceberg]

2024-09-09 Thread via GitHub


pvary commented on PR #11073:
URL: https://github.com/apache/iceberg/pull/11073#issuecomment-2338174338

   @fengjiajie: Does this mean that you have been submitting 1091 jobs in the span of 2 minutes?
   I see that the default timeout for `MoreExecutors.getExitingExecutorService` is 120s:
   ```
   final ExecutorService getExitingExecutorService(ThreadPoolExecutor executor) {
     return this.getExitingExecutorService(executor, 120L, TimeUnit.SECONDS);
   }
   ```
   
   Or is this a Guava issue, where the application is not cleared even after the timeout is reached?





Re: [PR] AWS: Add configuration and set better defaults for S3 retry behaviour [iceberg]

2024-09-09 Thread via GitHub


nastra commented on code in PR #11052:
URL: https://github.com/apache/iceberg/pull/11052#discussion_r1750335281


##
aws/src/test/java/org/apache/iceberg/aws/s3/TestS3FileIOProperties.java:
##
@@ -491,4 +493,17 @@ public void testApplyUserAgentConfigurations() {
     Mockito.verify(mockS3ClientBuilder)
         .overrideConfiguration(Mockito.any(ClientOverrideConfiguration.class));
   }
+
+  @Test
+  public void testApplyRetryConfiguration() {
+    Map properties = Maps.newHashMap();
+    properties.put(S3FileIOProperties.S3_RETRY_NUM_RETRIES, "999");
+    S3FileIOProperties s3FileIOProperties = new S3FileIOProperties(properties);
+
+    S3ClientBuilder builder = S3Client.builder();
+    s3FileIOProperties.applyRetryConfigurations(builder);
+
+    RetryPolicy retryPolicy = builder.overrideConfiguration().retryPolicy().get();
+    assertThat(retryPolicy.numRetries()).withFailMessage("retries was not set").isEqualTo(999);

Review Comment:
   ```suggestion
       assertThat(retryPolicy.numRetries()).as("retries was not set").isEqualTo(999);
   ```






Re: [I] Spark Streaming Job with multiple queries MERGE INTO the same target table (Runtime file filtering is not possible) [iceberg]

2024-09-09 Thread via GitHub


RussellSpitzer commented on issue #11094:
URL: https://github.com/apache/iceberg/issues/11094#issuecomment-2338247842

   > @eric-maynard I have other examples where I do what you suggest. If I move this inside `start_streaming_query`, the builder will simply return the existing spark session; `getOrCreate` will not create a new spark application.
   > 
   > When I do so I see the same behavior.
   
   You need to create a new session object, not get the active one like so:
   ```scala
   scala> spark.newSession()
   res0: org.apache.spark.sql.SparkSession = 
org.apache.spark.sql.SparkSession@5d643896
   ```





Re: [I] Spark configuration for amazon access key and secret key with glue catalog for apache Iceberg is not honoring [iceberg]

2024-09-09 Thread via GitHub


andythsu commented on issue #10078:
URL: https://github.com/apache/iceberg/issues/10078#issuecomment-2338251099

   @clamar14 @nastra I ended up using different jars and putting everything 
together instead of using `iceberg-aws-bundle` 
   ```
   def get_builder() -> SparkSession.Builder:
       return (
           SparkSession.builder \
           .config(
               "spark.jars",
               (
                   f"{JARS_BASE_PATH}/iceberg-spark-runtime-3.4_2.12-1.6.0.jar"
                   f",{JARS_BASE_PATH}/iceberg-spark-extensions-3.4_2.12-1.6.0.jar"
                   f",{JARS_BASE_PATH}/hadoop-aws-3.3.2.jar"  # needed for hadoop s3 file system
                   f",{JARS_BASE_PATH}/aws-java-sdk-bundle-1.12.769.jar"
                   f",{JARS_BASE_PATH}/protobuf-java-3.2.0.jar"
               )
           ...
       )
   ```





Re: [PR] Kafka Connect: add option to force columns to lowercase [iceberg]

2024-09-09 Thread via GitHub


bryanck commented on PR #11100:
URL: https://github.com/apache/iceberg/pull/11100#issuecomment-2338262871

   Note, this addresses https://github.com/apache/iceberg/issues/11091





Re: [PR] Spark 3.3, 3.4: Fix incorrect catalog loaded in TestCreateActions [iceberg]

2024-09-09 Thread via GitHub


nastra merged PR #11049:
URL: https://github.com/apache/iceberg/pull/11049





Re: [PR] Kafka Connect: add option to force columns to lowercase [iceberg]

2024-09-09 Thread via GitHub


bryanck commented on code in PR #11100:
URL: https://github.com/apache/iceberg/pull/11100#discussion_r1750375359


##
kafka-connect/kafka-connect/src/main/java/org/apache/iceberg/connect/IcebergSinkConfig.java:
##
@@ -79,6 +79,8 @@ public class IcebergSinkConfig extends AbstractConfig {
   "iceberg.tables.schema-force-optional";
   private static final String TABLES_SCHEMA_CASE_INSENSITIVE_PROP =
   "iceberg.tables.schema-case-insensitive";
+  private static final String TABLES_SCHEMA_FORCE_COLUMNS_TO_LOWERCASE_PROP =
+  "iceberg.tables.schema-force-columns-to-lowercase";

Review Comment:
   I think `iceberg.tables.schema-force-lowercase` is sufficient and consistent with `schema-force-optional`. Also, moving this up one to sit below `schema-force-optional` would be preferable.



##
kafka-connect/kafka-connect/src/main/java/org/apache/iceberg/connect/IcebergSinkConfig.java:
##
@@ -162,6 +164,12 @@ private static ConfigDef newConfigDef() {
         false,
         Importance.MEDIUM,
         "Set to true to look up table columns by case-insensitive name, false for case-sensitive");
+    configDef.define(
+        TABLES_SCHEMA_FORCE_COLUMNS_TO_LOWERCASE_PROP,
+        ConfigDef.Type.BOOLEAN,
+        false,
+        Importance.MEDIUM,
+        "Set to true to force table column names to lowercase, false for preserve case");

Review Comment:
   Same note here on name and moving it up one element to be next to "force 
optional".






Re: [I] Kafka Connect: Record projection Index out of bounds error [iceberg]

2024-09-09 Thread via GitHub


bryanck commented on issue #11099:
URL: https://github.com/apache/iceberg/issues/11099#issuecomment-2338355847

   Thanks for reporting this. Does this require a fix, or was it something else?





[I] Modify SQLCatalog initialization so that classes are not always created and update how these classes are created to be more open to other DBs [iceberg-python]

2024-09-09 Thread via GitHub


isc-patrick opened a new issue, #1148:
URL: https://github.com/apache/iceberg-python/issues/1148

   ### Feature Request / Improvement
   
   I think there are 2 issues with the SQLCatalog constructor:
   
   
https://github.com/apache/iceberg-python/blob/d587e6724685744918ecf192724437182ad01abf/pyiceberg/catalog/sql.py#L117
   
   1. The automatic creation of non-existent SQLCatalog tables is great for testing, but not something I would want in code that is deployed and run even in a QA environment. Applying DDL should be part of a separate process. The constructor should at least take a parameter that allows you to not create the tables automatically.
   2. The current method used for creating the tables in `_ensure_tables_exists` relies on exceptions, which means that the library is compatible with SQLite and Postgres simply because the except clause handles the errors these DBs throw when selecting from a non-existent DB. The logic should instead check for the existence of the tables through a standard SQLAlchemy method (see the sketch below), which will also make the SQLCatalog work with most DBs.
   
   I have made these changes on my local branch and can submit a PR, but I wanted to open up the conversation to other opinions.
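   
   (A hedged sketch of point (2) - checking for the catalog tables with SQLAlchemy's inspector instead of relying on dialect-specific exceptions; the table names below are assumptions for illustration:)
   ```python
   from sqlalchemy import create_engine, inspect
   
   # assumed names of the catalog tables, for illustration only
   CATALOG_TABLES = ("iceberg_tables", "iceberg_namespace_properties")
   
   def catalog_tables_exist(uri: str) -> bool:
       """Return True if all catalog tables already exist, without relying on exceptions."""
       inspector = inspect(create_engine(uri))
       return all(inspector.has_table(name) for name in CATALOG_TABLES)
   ```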





Re: [PR] Core: Fix the behavior of IncrementalFileCleanup when expire a snapshot [iceberg]

2024-09-09 Thread via GitHub


amogh-jahagirdar merged PR #10983:
URL: https://github.com/apache/iceberg/pull/10983





Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]

2024-09-09 Thread via GitHub


danielcweeks commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750551470


##
open-api/rest-catalog-open-api.yaml:
##
@@ -3647,6 +4080,105 @@ components:
             type: integer
           description: "List of equality field IDs"
 
+    PlanTableScanRequest:
+      type: object
+      properties:
+        snapshot-id:
+          description:
+            Identifier for the snapshot to scan in a point-in-time scan
+          type: integer
+          format: int64
+        select:
+          description: List of selected schema fields
+          type: array
+          items:
+            $ref: '#/components/schemas/FieldName'
+        filter:
+          description:
+            Expression used to filter the table data
+          $ref: '#/components/schemas/Expression'
+        case-sensitive:
+          description: Enables case sensitive field matching for filter and select
+          type: boolean
+          default: true
+        use-snapshot-schema:
+          description:
+            Whether to use the schema at the time the snapshot was written.
+
+            When time travelling, the snapshot schema should be used (true).

Review Comment:
   iirc, there are two reasons:
   1. We want the client to be explicit about which snapshot they want to read
   2. This simplifies the request because there's only one way to read 
historical data






Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]

2024-09-09 Thread via GitHub


amogh-jahagirdar commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750572264


##
open-api/rest-catalog-open-api.yaml:
##
@@ -3647,6 +4080,105 @@ components:
             type: integer
           description: "List of equality field IDs"
 
+    PlanTableScanRequest:
+      type: object
+      properties:
+        snapshot-id:
+          description:
+            Identifier for the snapshot to scan in a point-in-time scan
+          type: integer
+          format: int64
+        select:
+          description: List of selected schema fields
+          type: array
+          items:
+            $ref: '#/components/schemas/FieldName'
+        filter:
+          description:
+            Expression used to filter the table data
+          $ref: '#/components/schemas/Expression'
+        case-sensitive:
+          description: Enables case sensitive field matching for filter and select
+          type: boolean
+          default: true
+        use-snapshot-schema:
+          description:
+            Whether to use the schema at the time the snapshot was written.
+
+            When time travelling, the snapshot schema should be used (true).

Review Comment:
   Yeah, clients currently specify the snapshot ID; however, there needs to be a mechanism for distinguishing which schema gets used, based on whether it's a time travel by a specific snapshot ID or a time travel by branch. The client has that context, and it's easier for it to determine which schema should be used. As @danielcweeks said, the request input is kept simpler by having just a snapshot ID for time travel rather than a mix of different options.






Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]

2024-09-09 Thread via GitHub


amogh-jahagirdar commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750583821


##
open-api/rest-catalog-open-api.yaml:
##
@@ -541,6 +541,264 @@ paths:
         5XX:
           $ref: '#/components/responses/ServerErrorResponse'
 
+  /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan:
+    parameters:
+      - $ref: '#/components/parameters/prefix'
+      - $ref: '#/components/parameters/namespace'
+      - $ref: '#/components/parameters/table'
+    post:
+      tags:
+        - Catalog API
+      summary: Submit a scan for planning
+      description: >
+        Submits a scan for server-side planning.
+
+
+        Point-in-time scans are planned by passing snapshot-id to identify the
+        table snapshot to scan. Incremental scans are planned by passing both
+        start-snapshot-id and end-snapshot-id. Requests that include both point
+        in time config properties and incremental config properties are
+        invalid. If the request does not include either incremental or
+        point-in-time config properties, scan planning should produce a
+        point-in-time scan of the latest snapshot in the table's main branch.
+
+
+        Responses must include a valid status
+
+        - When "completed" the planning operation has produced plan tasks and
+          file scan tasks that must be returned in the response (not fetched
+          later by calling fetchPlanningResult)
+
+        - When "submitted" the response must include a plan-id used to poll
+          fetchPlanningResult to fetch the planning result when it is ready
+
+        - When "failed" the response must be a valid error response
+
+        - Status "cancelled" is not a valid status from this endpoint
+
+
+        The response for a "completed" planning operation includes two types of
+        tasks (file scan tasks and plan tasks) and both may be included in the
+        response. Tasks must not be included for any other response status.
+
+
+        Responses that include a plan-id indicate that the service is holding
+        state or performing work for the client.
+
+
+        - Clients should use the plan-id to fetch results from
+          fetchPlanningResult when the response status is "submitted"
+
+        - Clients should inform the service if planning results are no longer
+          needed by calling cancelPlanning. Cancellation is not necessary after
+          fetchScanTasks has been used to fetch scan tasks for each plan task.
+      operationId: planTableScan
+      requestBody:
+        content:
+          application/json:
+            schema:
+              $ref: '#/components/schemas/PlanTableScanRequest'
+      responses:
+        200:
+          $ref: '#/components/responses/PlanTableScanResponse'
+        400:
+          $ref: '#/components/responses/BadRequestErrorResponse'
+        401:
+          $ref: '#/components/responses/UnauthorizedResponse'
+        403:
+          $ref: '#/components/responses/ForbiddenResponse'
+        404:
+          description:
+            Not Found
+            - NoSuchTableException, the table does not exist
+            - NoSuchNamespaceException, the namespace does not exist
+          content:
+            application/json:
+              schema:
+                $ref: '#/components/schemas/IcebergErrorResponse'
+              examples:
+                TableDoesNotExist:
+                  $ref: '#/components/examples/NoSuchTableError'
+                NamespaceDoesNotExist:
+                  $ref: '#/components/examples/NoSuchNamespaceError'
+        406:
+          $ref: '#/components/responses/UnsupportedOperationResponse'
+        419:
+          $ref: '#/components/responses/AuthenticationTimeoutResponse'
+        503:
+          $ref: '#/components/responses/ServiceUnavailableResponse'
+        5XX:
+          $ref: '#/components/responses/ServerErrorResponse'
+
+  /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan/{plan-id}:
+    parameters:
+      - $ref: '#/components/parameters/prefix'
+      - $ref: '#/components/parameters/namespace'
+      - $ref: '#/components/parameters/table'
+      - $ref: '#/components/parameters/plan-id'
+
+    get:
+      tags:
+        - Catalog API
+      summary: Fetches the result of scan planning for a plan-id
+      operationId: fetchPlanningResult
+      description: >
+        Fetches the result of scan planning for a plan-id.
+
+
+        Responses must include a valid status
+
+        - When "completed" the planning operation has produced plan-tasks and
+          file-scan-tasks that must be returned in the response
+
+        - When "submitted" the planning operation has not completed; the client
+          should wait to call this endpoint again to fetch a completed response
+
+        - When "failed" the response must be a valid error response
+
+        - When "cancelled" the plan-id is invalid and should be discarded
+
+
+        The respons

Re: [I] Data files which are still useful are mistakenly cleaned up when trying to expire a specified snapshot [iceberg]

2024-09-09 Thread via GitHub


hantangwangd closed issue #10982: Data files which are still useful are 
mistakenly cleaned up when trying to expire a specified snapshot
URL: https://github.com/apache/iceberg/issues/10982





Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]

2024-09-09 Thread via GitHub


flyrain commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750651482


##
open-api/rest-catalog-open-api.yaml:
##
@@ -3647,6 +4080,105 @@ components:
 type: integer
   description: "List of equality field IDs"
 
+PlanTableScanRequest:
+  type: object
+  properties:
+snapshot-id:
+  description:
+Identifier for the snapshot to scan in a point-in-time scan
+  type: integer
+  format: int64
+select:
+  description: List of selected schema fields
+  type: array
+  items:
+$ref: '#/components/schemas/FieldName'
+filter:
+  description:
+Expression used to filter the table data
+  $ref: '#/components/schemas/Expression'
+case-sensitive:
+  description: Enables case sensitive field matching for filter and 
select
+  type: boolean
+  default: true
+use-snapshot-schema:
+  description:
+Whether to use the schema at the time the snapshot was written.
+
+When time travelling, the snapshot schema should be used (true).

Review Comment:
   Thanks @danielcweeks and @amogh-jahagirdar for the input. Sorry, I didn't realize there are differences between a time travel by a specific snapshot ID and one by branch name, in terms of which schema to use.
   
   To help clarify, could we add a link here for further reading, or provide a detailed explanation? For example: 
   ```
   - When scanning with a snapshot ID, clients should use the snapshot schema (true).
   - When scanning with a branch name, clients should use the table schema (false). Note that clients still send the snapshot ID rather than the branch name in the request.
   ```






Re: [PR] DRAFT - Issue 10275 - Reward support for nulls [iceberg]

2024-09-09 Thread via GitHub


slessard commented on code in PR #10953:
URL: https://github.com/apache/iceberg/pull/10953#discussion_r1750566040


##
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java:
##
@@ -262,6 +264,111 @@ public void testReadColumnFilter2() throws Exception {
         scan, NUM_ROWS_PER_MONTH, 12 * NUM_ROWS_PER_MONTH, ImmutableList.of("timestamp"));
   }
 
+  @Test
+  public void testReadColumnThatDoesNotExistInParquetSchema() throws Exception {
+    setMaxStackTraceElementsDisplayed(15);

Review Comment:
   This tests the exact scenario that is the impetus for adding support for 
null vectors. I feel this test is essential. This test is the only test 
exercising this pull request. If you see things differently, please explain.






Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]

2024-09-09 Thread via GitHub


flyrain commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750675084


##
open-api/rest-catalog-open-api.yaml:
##
@@ -541,6 +541,264 @@ paths:
         5XX:
           $ref: '#/components/responses/ServerErrorResponse'
 
+  /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan:
+    parameters:
+      - $ref: '#/components/parameters/prefix'
+      - $ref: '#/components/parameters/namespace'
+      - $ref: '#/components/parameters/table'
+    post:
+      tags:
+        - Catalog API
+      summary: Submit a scan for planning
+      description: >
+        Submits a scan for server-side planning.
+
+
+        Point-in-time scans are planned by passing snapshot-id to identify the
+        table snapshot to scan. Incremental scans are planned by passing both
+        start-snapshot-id and end-snapshot-id. Requests that include both point
+        in time config properties and incremental config properties are
+        invalid. If the request does not include either incremental or
+        point-in-time config properties, scan planning should produce a
+        point-in-time scan of the latest snapshot in the table's main branch.
+
+
+        Responses must include a valid status
+
+        - When "completed" the planning operation has produced plan tasks and
+          file scan tasks that must be returned in the response (not fetched
+          later by calling fetchPlanningResult)
+
+        - When "submitted" the response must include a plan-id used to poll
+          fetchPlanningResult to fetch the planning result when it is ready
+
+        - When "failed" the response must be a valid error response
+
+        - Status "cancelled" is not a valid status from this endpoint
+
+
+        The response for a "completed" planning operation includes two types of
+        tasks (file scan tasks and plan tasks) and both may be included in the
+        response. Tasks must not be included for any other response status.
+
+
+        Responses that include a plan-id indicate that the service is holding
+        state or performing work for the client.
+
+
+        - Clients should use the plan-id to fetch results from
+          fetchPlanningResult when the response status is "submitted"
+
+        - Clients should inform the service if planning results are no longer
+          needed by calling cancelPlanning. Cancellation is not necessary after
+          fetchScanTasks has been used to fetch scan tasks for each plan task.
+      operationId: planTableScan
+      requestBody:
+        content:
+          application/json:
+            schema:
+              $ref: '#/components/schemas/PlanTableScanRequest'
+      responses:
+        200:
+          $ref: '#/components/responses/PlanTableScanResponse'
+        400:
+          $ref: '#/components/responses/BadRequestErrorResponse'
+        401:
+          $ref: '#/components/responses/UnauthorizedResponse'
+        403:
+          $ref: '#/components/responses/ForbiddenResponse'
+        404:
+          description:
+            Not Found
+            - NoSuchTableException, the table does not exist
+            - NoSuchNamespaceException, the namespace does not exist
+          content:
+            application/json:
+              schema:
+                $ref: '#/components/schemas/IcebergErrorResponse'
+              examples:
+                TableDoesNotExist:
+                  $ref: '#/components/examples/NoSuchTableError'
+                NamespaceDoesNotExist:
+                  $ref: '#/components/examples/NoSuchNamespaceError'
+        406:
+          $ref: '#/components/responses/UnsupportedOperationResponse'
+        419:
+          $ref: '#/components/responses/AuthenticationTimeoutResponse'
+        503:
+          $ref: '#/components/responses/ServiceUnavailableResponse'
+        5XX:
+          $ref: '#/components/responses/ServerErrorResponse'
+
+  /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan/{plan-id}:
+    parameters:
+      - $ref: '#/components/parameters/prefix'
+      - $ref: '#/components/parameters/namespace'
+      - $ref: '#/components/parameters/table'
+      - $ref: '#/components/parameters/plan-id'
+
+    get:
+      tags:
+        - Catalog API
+      summary: Fetches the result of scan planning for a plan-id
+      operationId: fetchPlanningResult
+      description: >
+        Fetches the result of scan planning for a plan-id.
+
+
+        Responses must include a valid status
+
+        - When "completed" the planning operation has produced plan-tasks and
+          file-scan-tasks that must be returned in the response
+
+        - When "submitted" the planning operation has not completed; the client
+          should wait to call this endpoint again to fetch a completed response
+
+        - When "failed" the response must be a valid error response
+
+        - When "cancelled" the plan-id is invalid and should be discarded
+
+
+        The response for a "

Re: [I] Modify SQLCatalog initialization so that classes are not always created and update how these classes are created to be more open to other DBs [iceberg-python]

2024-09-09 Thread via GitHub


sungwy commented on issue #1148:
URL: 
https://github.com/apache/iceberg-python/issues/1148#issuecomment-2338726996

   Hi @isc-patrick - thank you for raising this. These are great points.
   
   I'm wondering if we can keep the current approach of running 
`_ensure_table_exists` within the instantiation of the SqlCatalog, but adopt an 
improved way of running the table creation DDL as you suggested in (2). The 
reason is that, as long as the function is idempotent (similar to
`create_table_if_not_exists` or `CREATE TABLE IF NOT EXISTS`), it's simple 
enough to keep in the initialization of the SqlCatalog, which requires these 
tables to execute any of its member functions.
   
   Looking at the SqlAlchemy dialect further, I'm of the impression that we 
could instead just remove the additional try/except handling and simply invoke 
`SqlCatalogBaseTable.metadata.create_all(self.engine)`. According to the 
SqlAlchemy docs, this uses the default flag of `checkfirst=True`, which avoids 
creating the tables if they already exist.
   
   
https://docs.sqlalchemy.org/en/20/core/metadata.html#sqlalchemy.schema.MetaData.create_all
   
   What do you think about this suggestion?
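   
   To make this concrete, a minimal sketch of that simplification (this assumes 
`SqlCatalogBaseTable` stays importable from `pyiceberg.catalog.sql`, and is meant as an 
illustration rather than a tested patch):
   
   ```python
   from sqlalchemy import create_engine

   from pyiceberg.catalog.sql import SqlCatalogBaseTable


   def ensure_tables_exist(uri: str) -> None:
       """Idempotently create the catalog tables if they are missing."""
       engine = create_engine(uri)
       # create_all() defaults to checkfirst=True, so existing tables are left
       # untouched and the extra try/except wrapper becomes unnecessary.
       SqlCatalogBaseTable.metadata.create_all(engine)


   ensure_tables_exist("sqlite:///:memory:")
   ```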


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Iceberg Glue Concurrent Update can result in missing metadata_location [iceberg]

2024-09-09 Thread via GitHub


singhpk234 commented on issue #9411:
URL: https://github.com/apache/iceberg/issues/9411#issuecomment-2338733735

   @shaeqahmed are you seeing this log line too ?
   ```
   LOG.warn(
   "Received unexpected failure when committing to {}, validating 
if commit ended up succeeding.",
   fullTableName,
   persistFailure);
   ```
   I stumbled upon a similar error:
   [1] If we can't see this log line, then it seems to me the retry detector didn't 
work as expected and didn't attempt to reconcile the commit status.
   [2] If we do see the log line, it means we observed an inconsistent state even after 
reconciliation, which will be a bit tricky as I am not sure how Glue behaves in 
this case. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]

2024-09-09 Thread via GitHub


dramaticlly commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750692281


##
open-api/rest-catalog-open-api.yaml:
##
@@ -541,6 +541,264 @@ paths:
 5XX:
   $ref: '#/components/responses/ServerErrorResponse'
 
+  /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan:
+parameters:
+  - $ref: '#/components/parameters/prefix'
+  - $ref: '#/components/parameters/namespace'
+  - $ref: '#/components/parameters/table'
+post:
+  tags:
+- Catalog API
+  summary: Submit a scan for planning
+  description: >
+Submits a scan for server-side planning.
+
+
+Point-in-time scans are planned by passing snapshot-id to identify the
+table snapshot to scan. Incremental scans are planned by passing both
+start-snapshot-id and end-snapshot-id. Requests that include both point
+in time config properties and incremental config properties are
+invalid. If the request does not include either incremental or
+point-in-time config properties, scan planning should produce a
+point-in-time scan of the latest snapshot in the table's main branch.
+
+
+Responses must include a valid status
+
+- When "completed" the planning operation has produced plan tasks and
+  file scan tasks that must be returned in the response (not fetched
+  later by calling fetchPlanningResult)
+
+- When "submitted" the response must include a plan-id used to poll
+  fetchPlanningResult to fetch the planning result when it is ready
+
+- When "failed" the response must be a valid error response
+
+- Status "cancelled" is not a valid status from this endpoint
+
+
+The response for a "completed" planning operation includes two types of
+tasks (file scan tasks and plan tasks) and both may be included in the
+response. Tasks must not be included for any other response status.
+
+
+Responses that include a plan-id indicate that the service is holding
+state or performing work for the client.
+
+
+- Clients should use the plan-id to fetch results from
+  fetchPlanningResult when the response status is "submitted"
+
+- Clients should inform the service if planning results are no longer
+  needed by calling cancelPlanning. Cancellation is not necessary after
+  fetchScanTasks has been used to fetch scan tasks for each plan task.
+  operationId: planTableScan
+  requestBody:
+content:
+  application/json:
+schema:
+  $ref: '#/components/schemas/PlanTableScanRequest'
+  responses:
+200:
+  $ref: '#/components/responses/PlanTableScanResponse'
+400:
+  $ref: '#/components/responses/BadRequestErrorResponse'
+401:
+  $ref: '#/components/responses/UnauthorizedResponse'
+403:
+  $ref: '#/components/responses/ForbiddenResponse'
+404:
+  description:
+Not Found
+- NoSuchTableException, the table does not exist
+- NoSuchNamespaceException, the namespace does not exist
+  content:
+application/json:
+  schema:
+$ref: '#/components/schemas/IcebergErrorResponse'
+  examples:
+TableDoesNotExist:
+  $ref: '#/components/examples/NoSuchTableError'
+NamespaceDoesNotExist:
+  $ref: '#/components/examples/NoSuchNamespaceError'
+406:
+  $ref: '#/components/responses/UnsupportedOperationResponse'
+419:
+  $ref: '#/components/responses/AuthenticationTimeoutResponse'
+503:
+  $ref: '#/components/responses/ServiceUnavailableResponse'
+5XX:
+  $ref: '#/components/responses/ServerErrorResponse'
+
+  /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan/{plan-id}:
+parameters:
+  - $ref: '#/components/parameters/prefix'
+  - $ref: '#/components/parameters/namespace'
+  - $ref: '#/components/parameters/table'
+  - $ref: '#/components/parameters/plan-id'
+
+get:
+  tags:
+- Catalog API
+  summary: Fetches the result of scan planning for a plan-id
+  operationId: fetchPlanningResult
+  description: >
+Fetches the result of scan planning for a plan-id.
+
+
+Responses must include a valid status
+
+- When "completed" the planning operation has produced plan-tasks and
+  file-scan-tasks that must be returned in the response
+
+- When "submitted" the planning operation has not completed; the client
+  should wait to call this endpoint again to fetch a completed response
+
+- When "failed" the response must be a valid error response
+
+- When "cancelled" the plan-id is invalid and should be discarded
+
+
+The response for

Re: [PR] fix: Invert `case_sensitive` logic in StructType [iceberg-python]

2024-09-09 Thread via GitHub


AnthonyLam commented on PR #1147:
URL: https://github.com/apache/iceberg-python/pull/1147#issuecomment-2338752007

   > Hello @AnthonyLam, thank you for your contribution. Could you add a test 
to ensure the expected behavior?
   
   For sure! I've added a test in `test_types.py`. Let me know if it's 
insufficient.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Bump `duckdb` to version `1.1.0` [iceberg-python]

2024-09-09 Thread via GitHub


kevinjqliu opened a new pull request, #1149:
URL: https://github.com/apache/iceberg-python/pull/1149

   Duckdb version 1.1.0 added support for automatic retries when installing 
extensions (https://github.com/duckdb/duckdb/pull/13122).
   This will help resolve the intermittent CI issue observed in #787
   
   ```
   poetry add duckdb@1.1.0 --optional
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] support python 3.12 [iceberg-python]

2024-09-09 Thread via GitHub


kevinjqliu commented on PR #254:
URL: https://github.com/apache/iceberg-python/pull/254#issuecomment-2338806960

   hey @MehulBatra, can I close this in favor of #1068?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Python: Add support for Python 3.12 [iceberg-python]

2024-09-09 Thread via GitHub


kevinjqliu closed pull request #35: Python: Add support for Python 3.12
URL: https://github.com/apache/iceberg-python/pull/35


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Modify SQLCatalog initialization so that classes are not always created and update how these classes are created to be more open to tother DB's [iceberg-python]

2024-09-09 Thread via GitHub


isc-patrick commented on issue #1148:
URL: 
https://github.com/apache/iceberg-python/issues/1148#issuecomment-2338829111

   My thought was to pass an additional argument into init that allowed you to 
turn off creation of tables. My concern is that I don't think you will want 
this auto-create to happen on any system in production: the catalog should 
already exist, and I really want it to fail fast if it does not exist instead of 
creating new tables and failing later, like when the system is live. In 
general, I would never have DDL applied to a DB as part of a dependent 
application's runtime.
   
   So that would mean passing a new boolean into the constructor, say 
create_if_not_exists: bool = True, before the **properties, and then replacing 
the _ensure_table_exists method with the create_all call conditioned on the new bool.
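   
   Roughly the shape I have in mind, reusing the create_if_not_exists name from above 
(a sketch under those assumptions, not the actual SqlCatalog class):
   
   ```python
   from sqlalchemy import create_engine

   from pyiceberg.catalog.sql import SqlCatalogBaseTable


   class SqlCatalogSketch:
       """Illustrative stand-in showing where an opt-out flag could live."""

       def __init__(self, name: str, create_if_not_exists: bool = True, **properties: str):
           self.name = name
           self.properties = properties
           self.engine = create_engine(properties["uri"])
           if create_if_not_exists:
               # DDL runs only when the caller opts in; the default keeps today's behaviour
               SqlCatalogBaseTable.metadata.create_all(self.engine)


   # A deployment that manages DDL separately would opt out:
   catalog = SqlCatalogSketch("default", create_if_not_exists=False, uri="sqlite:///:memory:")
   ```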


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] AWS: Add configuration and set better defaults for S3 retry behaviour [iceberg]

2024-09-09 Thread via GitHub


ookumuso commented on PR #11052:
URL: https://github.com/apache/iceberg/pull/11052#issuecomment-2338851624

   Removed all InputStream-related changes in favor of #10433 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Modify SQLCatalog initialization so that classes are not always created and update how these classes are created to be more open to tother DB's [iceberg-python]

2024-09-09 Thread via GitHub


sungwy commented on issue #1148:
URL: 
https://github.com/apache/iceberg-python/issues/1148#issuecomment-2338853251

   Given that the catalog is instantiated through the `load_catalog` entrypoint 
function, I think we'd actually need to introduce a catalog property as well 
if we decide to introduce a new boolean flag to control this behavior. That 
being said, I'm still not convinced that this is necessary...
   
   I think keeping the approach of creating the tables on catalog 
initialization should result in the following:
   1. if the tables exist, the DDL shouldn't be executed
   2. if the tables don't exist, they are created
   3. if the tables don't exist and the DDL fails to create the tables (e.g. 
because the DB doesn't allow creation of tables through standard DDLs), the DB 
should throw an error that should result in the program failing fast.
   
   > I really want it to fail fast if it does not exist instead of creating new 
tables and failing later, like when the system is live
   
   I'm not sure I follow this point, @isc-patrick. If we disable the 
creation of the tables through the DDL, we will be failing at the same point, or 
even later: instead of failing at catalog initialization (for example 
by failing to create the tables), we will now fail at a later part of the 
code that assumes the tables exist and tries to execute one of the 
SQL queries for scans or commits.
   
   Sorry to probe - could you elaborate on how you expect the proposed approach 
of suppressing table creation would result in the application failing fast?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Impl rest catalog + table updates & requirements [iceberg-go]

2024-09-09 Thread via GitHub


jwtryg opened a new pull request, #146:
URL: https://github.com/apache/iceberg-go/pull/146

   Hi @zeroshade 
   
   I think it's really cool that you are working on a golang-iceberg 
implementation, and I would like to contribute if I can.
   I have tried to finish the rest catalog implementation, and then I have 
added table updates and requirements, using a new MetadataBuilder struct that 
can simplify updating metadata. Like the existing implementation, I have tried 
to stay close to the pyIceberg implementation. 
   
   I thought this could be a good starting point for also implementing 
transactions and table updates. I would love to get your input and 
change/adjust the implementation accordingly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]

2024-09-09 Thread via GitHub


amogh-jahagirdar commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750815966


##
open-api/rest-catalog-open-api.yaml:
##
@@ -3647,6 +4080,105 @@ components:
 type: integer
   description: "List of equality field IDs"
 
+PlanTableScanRequest:
+  type: object
+  properties:
+snapshot-id:
+  description:
+Identifier for the snapshot to scan in a point-in-time scan
+  type: integer
+  format: int64
+select:
+  description: List of selected schema fields
+  type: array
+  items:
+$ref: '#/components/schemas/FieldName'
+filter:
+  description:
+Expression used to filter the table data
+  $ref: '#/components/schemas/Expression'
+case-sensitive:
+  description: Enables case sensitive field matching for filter and 
select
+  type: boolean
+  default: true
+use-snapshot-schema:
+  description:
+Whether to use the schema at the time the snapshot was written.
+
+When time travelling, the snapshot schema should be used (true).
+When scanning a branch, the table schema should be used (false).
+  type: boolean
+  default: false
+start-snapshot-id:
+  description: Starting snapshot ID for an incremental scan (exclusive)
+  type: integer
+  format: int64
+end-snapshot-id:
+  description:
+Ending snapshot ID for an incremental scan (inclusive).
+
+Required when start-snapshot-id is specified.
+  type: integer
+  format: int64
+stats-fields:
+  description:
+List of fields for which the service should send column stats.
+  type: array
+  items:
+$ref: '#/components/schemas/FieldName'
+
+FieldName:
+  description:
+A full field name (including parent field names), such as those passed
+in APIs like Java `Schema#findField(String name)`.
+
+The nested field name follows these rules
+- Nested struct fields are named by concatenating field names at each
+  struct level using dot (`.`) delimiter, e.g.
+  employer.contact_info.address.zip_code
+- Nested fields in a map key are named using the keyword `key`, e.g.
+  employee_address_map.key.first_name
+- Nested fields in a map value are named using the keyword `value`,
+  e.g. employee_address_map.value.zip_code
+- Nested fields in a list are named using the keyword `element`, e.g.
+  employees.element.first_name
+  type: string
+
+FetchScanTasksRequest:
+  type: object
+  required:
+- plan-task
+  properties:
+plan-task:
+  $ref: '#/components/schemas/PlanTask'
+
+PlanTask:
+  description:
+An opaque string provided by the REST server that represents a
+unit of work to produce file scan tasks for scan planning.
+  type: string

Review Comment:
   I think I largely agree with @flyrain's second sentence; that's good to 
clarify. 
   
   The one part I think I'd remove is "`PlanTask` is an opaque string 
provided by the REST server that indexes file scan tasks." I don't think the 
spec should define or even imply that the REST server "indexes" the file scan 
tasks. It should probably be kept open so that implementations can maintain that 
relationship however they want.
   
   I think the current description, "An opaque string provided by the REST 
server that represents a unit of work to produce file scan tasks for scan 
planning.", is good in that it doesn't overspecify anything but describes that 
the string represents an abstract unit of work, which I think is just the right 
amount of specification.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Modify SQLCatalog initialization so that classes are not always created and update how these classes are created to be more open to tother DB's [iceberg-python]

2024-09-09 Thread via GitHub


isc-patrick commented on issue #1148:
URL: 
https://github.com/apache/iceberg-python/issues/1148#issuecomment-2338957758

   I had not yet seen the change introducing load_catalog instead of using the 
SQLCatalog constructor, so I just updated to v0.7.0 - I was away for a couple of weeks.
   
   My point about the creation of the tables comes from work with clients, mostly 
in Financial Services, where there is a formal separation between code that applies 
DDL and code that applies DML. The entire process of database change control is 
separate from any application lifecycle. The idea that an application using a DB 
for reading and writing data would be allowed to change the schema would just never 
be considered as a possibility. Let's just forget that and chalk it up to different 
approaches to SDLC. This can easily be handled by adding a wrapper to load_catalog 
if necessary.
   
   I can make the change to simplify _ensure_tables_exist to simply call 
create_all.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Error: `table_type` missing from table parameters when loading table from Hive metastore [iceberg-python]

2024-09-09 Thread via GitHub


edgarrmondragon commented on issue #1150:
URL: 
https://github.com/apache/iceberg-python/issues/1150#issuecomment-2338970552

   > Who created the table in this case? When PyIceberg creates the table, it 
injects the `table_type` property
   
   I suppose it was created by a third-party and not by 
`HiveCatalog.create_table`. Are only tables created by pyiceberg supported here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Modify SQLCatalog initialization so that classes are not always created and update how these classes are created to be more open to tother DB's [iceberg-python]

2024-09-09 Thread via GitHub


sungwy commented on issue #1148:
URL: 
https://github.com/apache/iceberg-python/issues/1148#issuecomment-2338975789

   I work in financial services as well, and actually have the same separation 
between DDL and DML. But I agree with your point that it would be best not to 
generalize the practices in our industry to other development practices in 
different industries, or at different scales of company (start-ups versus big 
firms).
   
   > I can make the change to simplify the _ensure_tables_exist to simply call 
the create_all.
   
   That would be amazing @isc-patrick . Maybe we can start there, and see if 
there's still a need to make the other proposed change. Thank you again for 
raising this issue and leading this discussion!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Error: `table_type` missing from table parameters when loading table from Hive metastore [iceberg-python]

2024-09-09 Thread via GitHub


kevinjqliu commented on issue #1150:
URL: 
https://github.com/apache/iceberg-python/issues/1150#issuecomment-2338993156

   > Are only tables created by pyiceberg supported here?
   
   Anyone can create an iceberg table using HMS, which can be read by 
PyIceberg. In HMS, the assumption is that iceberg tables have a specific 
property set so that engines can distinguish between hive and iceberg tables.
   
   In this case, the table was created as a "hive table" and not an "iceberg 
table". 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] EPIC: Rust Based Compaction [iceberg-rust]

2024-09-09 Thread via GitHub


kevinjqliu commented on issue #624:
URL: https://github.com/apache/iceberg-rust/issues/624#issuecomment-2339045843

   Thanks for starting this! Linking the relevant issue from the pyiceberg 
side, 
[iceberg-python/#1092](https://github.com/apache/iceberg-python/issues/1092)
   Would be great to collaborate on this! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Bump `duckdb` to version `1.1.0` [iceberg-python]

2024-09-09 Thread via GitHub


sungwy commented on code in PR #1149:
URL: https://github.com/apache/iceberg-python/pull/1149#discussion_r1750905022


##
pyproject.toml:
##
@@ -64,7 +64,7 @@ zstandard = ">=0.13.0,<1.0.0"
 tenacity = ">=8.2.3,<10.0.0"
 pyarrow = { version = ">=14.0.0,<18.0.0", optional = true }
 pandas = { version = ">=1.0.0,<3.0.0", optional = true }
-duckdb = { version = ">=0.5.0,<2.0.0", optional = true }
+duckdb = {version = "1.1.0", optional = true}

Review Comment:
   I'm not sure how I feel about pinning this to a single version 🤔 
   
   Is this necessary to resolve the issues with our CI, given that the new 
version is out, and our upper limit is already <2.0.0?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Bump `duckdb` to version `1.1.0` [iceberg-python]

2024-09-09 Thread via GitHub


kevinjqliu commented on code in PR #1149:
URL: https://github.com/apache/iceberg-python/pull/1149#discussion_r1750910730


##
pyproject.toml:
##
@@ -64,7 +64,7 @@ zstandard = ">=0.13.0,<1.0.0"
 tenacity = ">=8.2.3,<10.0.0"
 pyarrow = { version = ">=14.0.0,<18.0.0", optional = true }
 pandas = { version = ">=1.0.0,<3.0.0", optional = true }
-duckdb = { version = ">=0.5.0,<2.0.0", optional = true }
+duckdb = {version = "1.1.0", optional = true}

Review Comment:
   Good point, not sure about this one. The CI issue can only be resolved by 
v1.1.0 or higher. I'm not sure if CI will automatically pick the newest version 
of the library.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Parallelize manifest writing for many new files [iceberg]

2024-09-09 Thread via GitHub


amogh-jahagirdar commented on code in PR #11086:
URL: https://github.com/apache/iceberg/pull/11086#discussion_r1750840473


##
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##
@@ -554,6 +562,84 @@ protected boolean cleanupAfterCommit() {
 return true;
   }
 
+  protected List writeDataManifests(List files, 
PartitionSpec spec) {
+return writeDataManifests(files, null /* inherit data seq */, spec);
+  }
+
+  protected List writeDataManifests(
+  List files, Long dataSeq, PartitionSpec spec) {
+return writeManifests(files, group -> writeDataFileGroup(group, dataSeq, 
spec));
+  }
+
+  private List writeDataFileGroup(
+  List files, Long dataSeq, PartitionSpec spec) {
+RollingManifestWriter writer = newRollingManifestWriter(spec);
+
+try (RollingManifestWriter closableWriter = writer) {
+  if (dataSeq != null) {
+files.forEach(file -> closableWriter.add(file, dataSeq));
+  } else {
+files.forEach(closableWriter::add);
+  }
+} catch (IOException e) {
+  throw new RuntimeIOException(e, "Failed to write data manifests");
+}
+
+return writer.toManifestFiles();
+  }
+
+  protected List writeDeleteManifests(
+  List files, PartitionSpec spec) {
+return writeManifests(files, group -> writeDeleteFileGroup(group, spec));
+  }
+
+  private List writeDeleteFileGroup(
+  List files, PartitionSpec spec) {
+RollingManifestWriter writer = 
newRollingDeleteManifestWriter(spec);
+
+try (RollingManifestWriter closableWriter = writer) {
+  for (DeleteFileHolder file : files) {
+if (file.dataSequenceNumber() != null) {
+  closableWriter.add(file.deleteFile(), file.dataSequenceNumber());
+} else {
+  closableWriter.add(file.deleteFile());
+}
+  }
+} catch (IOException e) {
+  throw new RuntimeIOException(e, "Failed to write delete manifests");
+}
+
+return writer.toManifestFiles();
+  }
+
+  private static  List writeManifests(
+  List files, Function, List> writeFunc) {
+int groupSize = manifestFileGroupSize(ThreadPools.WORKER_THREAD_POOL_SIZE, 
files.size());
+List> groups = Lists.partition(files, groupSize);
+Queue manifests = Queues.newConcurrentLinkedQueue();
+Tasks.foreach(groups)
+.stopOnFailure()
+.throwFailureWhenFinished()
+.executeWith(ThreadPools.getWorkerPool())
+.run(group -> manifests.addAll(writeFunc.apply(group)));
+return ImmutableList.copyOf(manifests);
+  }
+
+  /**
+   * Calculates how many files can be processed concurrently depending on the 
provided parallelism
+   * without creating too small manifests.
+   *
+   * @param parallelism the max number of concurrent writers
+   * @param fileCount the total number of files to be written
+   * @return the size of each file group, adjusted to balance workload across 
available writers
+   */
+  @VisibleForTesting
+  static int manifestFileGroupSize(int parallelism, int fileCount) {
+int maxParallelism = IntMath.divide(fileCount, MIN_FILE_GROUP_SIZE, 
RoundingMode.HALF_UP);
+int actualParallelism = Math.max(1, Math.min(parallelism, maxParallelism));

Review Comment:
   I'm actually OK with the current naming of maxParallelism and 
actualParallelism. Do we want to just add an inline comment explaining how the 
cases where there's not a full manifest file group are handled? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]

2024-09-09 Thread via GitHub


rahil-c commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750944397


##
open-api/rest-catalog-open-api.yaml:
##
@@ -541,6 +541,264 @@ paths:
 5XX:
   $ref: '#/components/responses/ServerErrorResponse'
 
+  /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan:
+parameters:
+  - $ref: '#/components/parameters/prefix'
+  - $ref: '#/components/parameters/namespace'
+  - $ref: '#/components/parameters/table'
+post:
+  tags:
+- Catalog API
+  summary: Submit a scan for planning
+  description: >
+Submits a scan for server-side planning.
+
+
+Point-in-time scans are planned by passing snapshot-id to identify the
+table snapshot to scan. Incremental scans are planned by passing both
+start-snapshot-id and end-snapshot-id. Requests that include both point
+in time config properties and incremental config properties are
+invalid. If the request does not include either incremental or
+point-in-time config properties, scan planning should produce a
+point-in-time scan of the latest snapshot in the table's main branch.
+
+
+Responses must include a valid status
+
+- When "completed" the planning operation has produced plan tasks and
+  file scan tasks that must be returned in the response (not fetched
+  later by calling fetchPlanningResult)
+
+- When "submitted" the response must include a plan-id used to poll
+  fetchPlanningResult to fetch the planning result when it is ready
+
+- When "failed" the response must be a valid error response
+
+- Status "cancelled" is not a valid status from this endpoint
+
+
+The response for a "completed" planning operation includes two types of
+tasks (file scan tasks and plan tasks) and both may be included in the
+response. Tasks must not be included for any other response status.
+
+
+Responses that include a plan-id indicate that the service is holding
+state or performing work for the client.
+
+
+- Clients should use the plan-id to fetch results from
+  fetchPlanningResult when the response status is "submitted"
+
+- Clients should inform the service if planning results are no longer
+  needed by calling cancelPlanning. Cancellation is not necessary after
+  fetchScanTasks has been used to fetch scan tasks for each plan task.
+  operationId: planTableScan
+  requestBody:
+content:
+  application/json:
+schema:
+  $ref: '#/components/schemas/PlanTableScanRequest'
+  responses:
+200:
+  $ref: '#/components/responses/PlanTableScanResponse'
+400:
+  $ref: '#/components/responses/BadRequestErrorResponse'
+401:
+  $ref: '#/components/responses/UnauthorizedResponse'
+403:
+  $ref: '#/components/responses/ForbiddenResponse'
+404:
+  description:
+Not Found
+- NoSuchTableException, the table does not exist
+- NoSuchNamespaceException, the namespace does not exist
+  content:
+application/json:
+  schema:
+$ref: '#/components/schemas/IcebergErrorResponse'
+  examples:
+TableDoesNotExist:
+  $ref: '#/components/examples/NoSuchTableError'
+NamespaceDoesNotExist:
+  $ref: '#/components/examples/NoSuchNamespaceError'
+406:
+  $ref: '#/components/responses/UnsupportedOperationResponse'
+419:
+  $ref: '#/components/responses/AuthenticationTimeoutResponse'
+503:
+  $ref: '#/components/responses/ServiceUnavailableResponse'
+5XX:
+  $ref: '#/components/responses/ServerErrorResponse'
+
+  /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan/{plan-id}:
+parameters:
+  - $ref: '#/components/parameters/prefix'
+  - $ref: '#/components/parameters/namespace'
+  - $ref: '#/components/parameters/table'
+  - $ref: '#/components/parameters/plan-id'
+
+get:
+  tags:
+- Catalog API
+  summary: Fetches the result of scan planning for a plan-id
+  operationId: fetchPlanningResult
+  description: >
+Fetches the result of scan planning for a plan-id.
+
+
+Responses must include a valid status
+
+- When "completed" the planning operation has produced plan-tasks and
+  file-scan-tasks that must be returned in the response
+
+- When "submitted" the planning operation has not completed; the client
+  should wait to call this endpoint again to fetch a completed response
+
+- When "failed" the response must be a valid error response
+
+- When "cancelled" the plan-id is invalid and should be discarded
+
+
+The response for a "

Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]

2024-09-09 Thread via GitHub


rahil-c commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750946862


##
open-api/rest-catalog-open-api.yaml:
##
@@ -2774,6 +3062,140 @@ components:
   additionalProperties:
 type: string
 
+ScanTasks:
+  type: object
+  description:
+Scan and planning tasks for server-side scan planning
+
+
+- `plan-tasks` contains opaque units of planning work

Review Comment:
   Will add `>` to the description.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]

2024-09-09 Thread via GitHub


flyrain commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750948304


##
open-api/rest-catalog-open-api.yaml:
##
@@ -3647,6 +4080,105 @@ components:
 type: integer
   description: "List of equality field IDs"
 
+PlanTableScanRequest:
+  type: object
+  properties:
+snapshot-id:
+  description:
+Identifier for the snapshot to scan in a point-in-time scan
+  type: integer
+  format: int64
+select:
+  description: List of selected schema fields
+  type: array
+  items:
+$ref: '#/components/schemas/FieldName'
+filter:
+  description:
+Expression used to filter the table data
+  $ref: '#/components/schemas/Expression'
+case-sensitive:
+  description: Enables case sensitive field matching for filter and 
select
+  type: boolean
+  default: true
+use-snapshot-schema:
+  description:
+Whether to use the schema at the time the snapshot was written.
+
+When time travelling, the snapshot schema should be used (true).
+When scanning a branch, the table schema should be used (false).
+  type: boolean
+  default: false
+start-snapshot-id:
+  description: Starting snapshot ID for an incremental scan (exclusive)
+  type: integer
+  format: int64
+end-snapshot-id:
+  description:
+Ending snapshot ID for an incremental scan (inclusive).
+
+Required when start-snapshot-id is specified.
+  type: integer
+  format: int64
+stats-fields:
+  description:
+List of fields for which the service should send column stats.
+  type: array
+  items:
+$ref: '#/components/schemas/FieldName'
+
+FieldName:
+  description:
+A full field name (including parent field names), such as those passed
+in APIs like Java `Schema#findField(String name)`.
+
+The nested field name follows these rules
+- Nested struct fields are named by concatenating field names at each
+  struct level using dot (`.`) delimiter, e.g.
+  employer.contact_info.address.zip_code
+- Nested fields in a map key are named using the keyword `key`, e.g.
+  employee_address_map.key.first_name
+- Nested fields in a map value are named using the keyword `value`,
+  e.g. employee_address_map.value.zip_code
+- Nested fields in a list are named using the keyword `element`, e.g.
+  employees.element.first_name
+  type: string
+
+FetchScanTasksRequest:
+  type: object
+  required:
+- plan-task
+  properties:
+plan-task:
+  $ref: '#/components/schemas/PlanTask'
+
+PlanTask:
+  description:
+An opaque string provided by the REST server that represents a
+unit of work to produce file scan tasks for scan planning.
+  type: string

Review Comment:
   Makes sense to avoid a detailed approach in the spec, but I'd suggest adding 
more detail on why we need a plan task string.
   
   ```
   This allows clients to fetch tasks in batches, which is useful for handling 
large result sets efficiently.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Bump arrow to 53 [iceberg-rust]

2024-09-09 Thread via GitHub


sdd commented on issue #622:
URL: https://github.com/apache/iceberg-rust/issues/622#issuecomment-2339158328

   This is blocked on DataFusion. The most recently released version of 
DataFusion, v41.0.0, depends on arrow-* 52.
   
   Once https://github.com/apache/datafusion/pull/12032 is in a released 
version, we can proceed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]

2024-09-09 Thread via GitHub


amogh-jahagirdar commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750965732


##
open-api/rest-catalog-open-api.yaml:
##
@@ -3647,6 +4080,105 @@ components:
 type: integer
   description: "List of equality field IDs"
 
+PlanTableScanRequest:
+  type: object
+  properties:
+snapshot-id:
+  description:
+Identifier for the snapshot to scan in a point-in-time scan
+  type: integer
+  format: int64
+select:
+  description: List of selected schema fields
+  type: array
+  items:
+$ref: '#/components/schemas/FieldName'
+filter:
+  description:
+Expression used to filter the table data
+  $ref: '#/components/schemas/Expression'
+case-sensitive:
+  description: Enables case sensitive field matching for filter and 
select
+  type: boolean
+  default: true
+use-snapshot-schema:
+  description:
+Whether to use the schema at the time the snapshot was written.
+
+When time travelling, the snapshot schema should be used (true).

Review Comment:
   @flyrain All good! To clarify, the distinction in the schema that gets used 
during time travel is something that got established in the reference 
implementation but is not actually defined in the spec itself. 
   
   
   As for the rationale behind this behavior in the reference implementation 
please refer to
   https://github.com/apache/iceberg/pull/9131/files.
   
   I don't think I'd spec out which ones clients should use in which 
situation, because the schema resolution behavior is not in the table spec 
itself. 
   
   I do think it's probably beneficial to provide some more context as to why 
this field exists, which is to enable clients to align with the reference 
implementation. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]

2024-09-09 Thread via GitHub


amogh-jahagirdar commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750965732


##
open-api/rest-catalog-open-api.yaml:
##
@@ -3647,6 +4080,105 @@ components:
 type: integer
   description: "List of equality field IDs"
 
+PlanTableScanRequest:
+  type: object
+  properties:
+snapshot-id:
+  description:
+Identifier for the snapshot to scan in a point-in-time scan
+  type: integer
+  format: int64
+select:
+  description: List of selected schema fields
+  type: array
+  items:
+$ref: '#/components/schemas/FieldName'
+filter:
+  description:
+Expression used to filter the table data
+  $ref: '#/components/schemas/Expression'
+case-sensitive:
+  description: Enables case sensitive field matching for filter and 
select
+  type: boolean
+  default: true
+use-snapshot-schema:
+  description:
+Whether to use the schema at the time the snapshot was written.
+
+When time travelling, the snapshot schema should be used (true).

Review Comment:
   @flyrain All good! To clarify, the distinction in the schema that gets used 
during time travel is a convention that got established in the reference 
implementation but is not actually defined in the spec itself. 
   
   As for the rationale behind this behavior in the reference implementation 
please refer to
   https://github.com/apache/iceberg/pull/9131/files.
   
   I don't think I'd spec out which ones clients should use in which situation 
because the schema resolution behavior is not defined in the table spec itself.
   
   I do think it's probably beneficial to provide some more context as to why 
this field exists, which is to enable clients to align with the reference 
implementation. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Bump duckdb from 1.0.0 to 1.1.0 [iceberg-python]

2024-09-09 Thread via GitHub


dependabot[bot] opened a new pull request, #1152:
URL: https://github.com/apache/iceberg-python/pull/1152

   Bumps [duckdb](https://github.com/duckdb/duckdb) from 1.0.0 to 1.1.0.
   
   Release notes
   Sourced from [duckdb's releases](https://github.com/duckdb/duckdb/releases).
   
   DuckDB 1.1.0 "Eatoni"
   This release of DuckDB is named "Eatoni" after Eaton's pintail (Anas Eatoni) from 
the southern Indian Ocean.
   Please also refer to the announcement blog post: 
https://duckdb.org/2024/09/09/announcing-duckdb-110
   What's Changed
   
   - Add feature changes back in by @Mytherin in duckdb/duckdb#11146
   - Make MultiFileReader filename configurable by @lnkuiper in duckdb/duckdb#11178
   - [Dev] Fix compilation issues on feature by @Tishj in duckdb/duckdb#11082
   - add query() and query_table() functions by @chrisiou in duckdb/duckdb#10586
   - [Block Size] Move the block allocation size into the block manager by @taniabogatsch in duckdb/duckdb#11176
   - LIMIT pushdown below PROJECT by @jeewonhh in duckdb/duckdb#2
   - BUGFIX: IN () filter with one argument should translate to = filter. by @Tmonster in duckdb/duckdb#11473
   - Regression Script should calculate micro benchmark differences with the correct base branch by @Tmonster in duckdb/duckdb#11762
   - Pushdown filters on window partitions by @Tmonster in duckdb/duckdb#10932
   - Arrow ListView Type by @Tishj in duckdb/duckdb#10766
   - Add scalar function support to the C API by @Mytherin in duckdb/duckdb#11786
   - Add TopN optimization in physical plan mapping by @kryonix in duckdb/duckdb#11290
   - Join-dependent filter derivation by @lnkuiper in duckdb/duckdb#11272
   - Implement ROW_GROUPS_PER_FILE for Parquet by @lnkuiper in duckdb/duckdb#11249
   - Prefer Final projected columns on probe side if cardinalities are similar by @Tmonster in duckdb/duckdb#11109
   - Propagate unused columns to distinct on by @Tmonster in duckdb/duckdb#11006
   - Separate eviction queues by FileBufferType by @lnkuiper in duckdb/duckdb#11417
   - Disable false positive for vector size nightly in test by @taniabogatsch in duckdb/duckdb#11953
   - Rework jemalloc extension by @lnkuiper in duckdb/duckdb#11891
   - Tweak jemalloc config by @lnkuiper in duckdb/duckdb#12034
   - Httpfs test to nightly by @carlopi in duckdb/duckdb#12196
   - Removed three reinterpret casts and some rewriting by @taniabogatsch in duckdb/duckdb#12200
   - Begin Profiling Rework to move towards Modularity by @maiadegraaf in duckdb/duckdb#11101
   - [CLI] Add highlighting + limited auto-complete for shell dot commands by @Mytherin in duckdb/duckdb#12201
   - Skip test to fix block size nightly and add more explicit error checking by @taniabogatsch in duckdb/duckdb#12211
   - Remove BLOCK_ALLOC_SIZE from the column segment files by https://github.c

[PR] Bump pydantic from 2.9.0 to 2.9.1 [iceberg-python]

2024-09-09 Thread via GitHub


dependabot[bot] opened a new pull request, #1154:
URL: https://github.com/apache/iceberg-python/pull/1154

   Bumps [pydantic](https://github.com/pydantic/pydantic) from 2.9.0 to 2.9.1.
   
   Release notes
   Sourced from [pydantic's releases](https://github.com/pydantic/pydantic/releases).
   
   v2.9.1 (2024-09-09)
   What's Changed
   Fixes
   
   - Fix Predicate issue in v2.9.0 by @sydney-runkle in #10321
   - Fixing annotated-types bound to >=0.6.0 by @sydney-runkle in #10327
   - Turn tzdata install requirement into optional timezone dependency by @jakob-keller in #10331
   - Fix IncExc type alias definition by @Viicos in #10339
   - Use correct types namespace when building namedtuple core schemas by @Viicos in #10337
   - Fix evaluation of stringified annotations during namespace inspection by @Viicos in #10347
   - Fix tagged union serialization with alias generators by @sydney-runkle in pydantic/pydantic-core#1442
   
   Full Changelog: https://github.com/pydantic/pydantic/compare/v2.9.0...v2.9.1
   
   Commits
   
   - ecc5275 bump
   - 2c61bfd Fix evaluation of stringified annotations during namespace inspection (#10347)
   - 3d364cb Use correct types namespace when building namedtuple core schemas (#10337)
   - 2746ccb Fix IncEx type alias definition (#10339)
   - b32d410 Turn tzdata install requirement into optional timezone dependency (#10331)
   - 7d857eb Fixing annotated-types bound (#10327)
   - 07cbe50 Fix Predicate issue in v2.9.0 (#10321)
   - See full diff in [compare view](https://github.com/pydantic/pydantic/compare/v2.9.0...v2.9.1)
   
   
   
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=pydantic&package-manager=pip&previous-version=2.9.0&new-version=2.9.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this mino

Re: [PR] Bump `duckdb` to version `1.1.0` [iceberg-python]

2024-09-09 Thread via GitHub


kevinjqliu commented on PR #1149:
URL: https://github.com/apache/iceberg-python/pull/1149#issuecomment-2339300626

   Updated the PR to only change duckdb from v1.0.0 to v1.1.0 in the 
poetry.lock file


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]

2024-09-09 Thread via GitHub


amogh-jahagirdar commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750965732


##
open-api/rest-catalog-open-api.yaml:
##
@@ -3647,6 +4080,105 @@ components:
 type: integer
   description: "List of equality field IDs"
 
+PlanTableScanRequest:
+  type: object
+  properties:
+snapshot-id:
+  description:
+Identifier for the snapshot to scan in a point-in-time scan
+  type: integer
+  format: int64
+select:
+  description: List of selected schema fields
+  type: array
+  items:
+$ref: '#/components/schemas/FieldName'
+filter:
+  description:
+Expression used to filter the table data
+  $ref: '#/components/schemas/Expression'
+case-sensitive:
+  description: Enables case sensitive field matching for filter and 
select
+  type: boolean
+  default: true
+use-snapshot-schema:
+  description:
+Whether to use the schema at the time the snapshot was written.
+
+When time travelling, the snapshot schema should be used (true).

Review Comment:
   @flyrain All good! To clarify, the distinction in the schema that gets used 
during time travel is a convention that got established in the reference 
implementation but is not actually defined in the spec itself. 
   
   As for the rationale behind this behavior in the reference implementation 
please refer to
   https://github.com/apache/iceberg/pull/9131/files.
   
   I don't think I'd spec out which ones clients should use in which situation 
because the schema resolution behavior is not defined in the table spec itself.
   
   I do think it's probably beneficial to provide some more context as to why 
this field exists, which is to enable clients to align with the reference 
implementation. 
   
   Edit:
   A concrete suggestion on a description that provides some context:
   ```
   This flag is useful when a client wants to read from a branch and use the 
table schema or time travel to a specific snapshot and use the snapshot 
schema (aligning with the reference implementation)
   ```
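
   For readers skimming the spec discussion, a hedged illustration of how these fields fit together: a hypothetical `planTableScan` request body for a point-in-time scan, using only field names from the YAML above. The snapshot id and column names are made up, and this is not produced by any existing Iceberg client API.

   ```java
   // Illustrative only: a hypothetical PlanTableScanRequest payload for a
   // point-in-time scan. Field names come from the YAML above; values are placeholders.
   class PlanTableScanRequestExample {
     static final String POINT_IN_TIME_SCAN =
         """
         {
           "snapshot-id": 3051729675574597004,
           "select": ["id", "data"],
           "case-sensitive": true,
           "use-snapshot-schema": true
         }
         """;
   }
   ```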



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Cache Manifest files [iceberg-python]

2024-09-09 Thread via GitHub


corleyma commented on code in PR #787:
URL: https://github.com/apache/iceberg-python/pull/787#discussion_r1751060909


##
pyiceberg/table/snapshots.py:
##
@@ -228,6 +229,13 @@ def __eq__(self, other: Any) -> bool:
 )
 
 
+@lru_cache

Review Comment:
   is 128 big enough?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Aliyun: Migrate tests to junit5 for aliyun client factory [iceberg]

2024-09-09 Thread via GitHub


github-actions[bot] commented on PR #7853:
URL: https://github.com/apache/iceberg/pull/7853#issuecomment-2339368891

   This pull request has been closed due to lack of activity. This is not a 
judgement on the merit of the PR in any way. It is just a way of keeping the PR 
queue manageable. If you think that is incorrect, or the pull request requires 
review, you can revive the PR at any time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Aliyun: Migrate tests to junit5 for aliyun client factory [iceberg]

2024-09-09 Thread via GitHub


github-actions[bot] closed pull request #7853: Aliyun: Migrate tests to junit5 
for aliyun client factory
URL: https://github.com/apache/iceberg/pull/7853


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: Convert Evaluator in expression to use a Comparator [iceberg]

2024-09-09 Thread via GitHub


github-actions[bot] closed pull request #7883: API: Convert Evaluator in 
expression to use a Comparator
URL: https://github.com/apache/iceberg/pull/7883


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: Convert Evaluator in expression to use a Comparator [iceberg]

2024-09-09 Thread via GitHub


github-actions[bot] commented on PR #7883:
URL: https://github.com/apache/iceberg/pull/7883#issuecomment-2339368920

   This pull request has been closed due to lack of activity. This is not a 
judgement on the merit of the PR in any way. It is just a way of keeping the PR 
queue manageable. If you think that is incorrect, or the pull request requires 
review, you can revive the PR at any time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Iceberg Java Api - S3 Session Token - 403 Forbidden exception [iceberg]

2024-09-09 Thread via GitHub


github-actions[bot] commented on issue #8190:
URL: https://github.com/apache/iceberg/issues/8190#issuecomment-2339369252

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: check location for conflict before creating table [iceberg]

2024-09-09 Thread via GitHub


github-actions[bot] commented on PR #8194:
URL: https://github.com/apache/iceberg/pull/8194#issuecomment-2339369273

   This pull request has been marked as stale due to 30 days of inactivity. It 
will be closed in 1 week if no further activity occurs. If you think that’s 
incorrect or this pull request requires a review, please simply write any 
comment. If closed, you can revive the PR at any time and @mention a reviewer 
or discuss it on the d...@iceberg.apache.org list. Thank you for your 
contributions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Request to add KLL Datasketch and hive ColumnStatisticsObj and as standard blob types to puffin file. [iceberg]

2024-09-09 Thread via GitHub


github-actions[bot] commented on issue #8198:
URL: https://github.com/apache/iceberg/issues/8198#issuecomment-2339369316

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Support partial insert in merge into command [iceberg]

2024-09-09 Thread via GitHub


github-actions[bot] commented on issue #8199:
URL: https://github.com/apache/iceberg/issues/8199#issuecomment-2339369339

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Add KLL Datasketch and Hive ColumnStatisticsObj as standard blo… [iceberg]

2024-09-09 Thread via GitHub


github-actions[bot] commented on PR #8202:
URL: https://github.com/apache/iceberg/pull/8202#issuecomment-2339369354

   This pull request has been marked as stale due to 30 days of inactivity. It 
will be closed in 1 week if no further activity occurs. If you think that’s 
incorrect or this pull request requires a review, please simply write any 
comment. If closed, you can revive the PR at any time and @mention a reviewer 
or discuss it on the d...@iceberg.apache.org list. Thank you for your 
contributions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Parallelize manifest writing for many new files [iceberg]

2024-09-09 Thread via GitHub


dramaticlly commented on code in PR #11086:
URL: https://github.com/apache/iceberg/pull/11086#discussion_r1751080154


##
core/src/test/java/org/apache/iceberg/TestSnapshotProducer.java:
##
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+import org.junit.jupiter.api.Test;
+
+public class TestSnapshotProducer {
+  @Test
+  public void testManifestFileGroupSize() {
+assertEquals(
+1_000,
+SnapshotProducer.manifestFileGroupSize(4 /* parallelism */, 1_000 /* 
file count */),
+"Must return file count when it is small");
+
+assertEquals(
+10_000,
+SnapshotProducer.manifestFileGroupSize(4 /* parallelism */, 10_000 /* 
file count */),
+"Must return file count when it matches the min group size");
+
+assertEquals(
+10_001,
+SnapshotProducer.manifestFileGroupSize(4 /* parallelism */, 10_001 /* 
file count */),
+"Must return file count when it is slightly above min group size");
+
+assertEquals(
+12_500,
+SnapshotProducer.manifestFileGroupSize(4 /* parallelism */, 12_500 /* 
file count */),
+"Must return file count when remainder is < 50% of min group size");
+
+assertEquals(
+7_500,
+SnapshotProducer.manifestFileGroupSize(4 /* parallelism */, 15_000 /* 
file count */),
+"Must split into 2 groups when remainder is >= 50% of min group size");
+
+assertEquals(
+16_667,
+SnapshotProducer.manifestFileGroupSize(3 /* parallelism */, 50_000 /* 
file count */),
+"Must split into 3 groups when file count is large");
+
+assertEquals(
+166_667,
+SnapshotProducer.manifestFileGroupSize(3 /* parallelism */, 500_000 /* 
file count */),
+"Provided parallelism must be respected when file count is large");
+
+assertEquals(
+10_000,
+SnapshotProducer.manifestFileGroupSize(32 /* parallelism */, 50_000 /* 
file count */),
+"Max parallelism based on file count must be respected");

Review Comment:
   this reads a bit weird, I am thinking maybe `Provided parallelism will be 
adjusted if not enough files in each group`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[I] RESTSessionCatalog loadTable incorrect set last access time as metadata last update time [iceberg]

2024-09-09 Thread via GitHub


haizhou-zhao opened a new issue, #11103:
URL: https://github.com/apache/iceberg/issues/11103

   ### Apache Iceberg version
   
   1.6.1 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   ## Background
   When using Spark (or likewise execution engines) on top of REST Catalog to 
commit a new snapshot, there are 3 critical timestamps:
   
   1. When Spark invokes `SnapshotProducer` to produce a new snapshot - 
referred to as `snapshotCreationTs`: 
[ref](https://github.com/apache/iceberg/blob/d17a7f1/core/src/main/java/org/apache/iceberg/SnapshotProducer.java#L386)
   2. When the REST catalog receives the update request from Spark and commits the 
new metadata - referred to as `metadataCommitTs`
   3. When the commit has finished and Spark refreshes the table state - referred 
to as `tableAccessTs`: 
[ref](https://github.com/apache/iceberg/blob/afda8be/core/src/main/java/org/apache/iceberg/rest/responses/LoadTableResponse.java#L64)
   
   ## Desired behavior
   `spark.sql("SELECT * from ${db}.${table}.metadata_log_entries")` means the 
user is looking for `metadataCommitTs`; 
   
   while `spark.sql("SELECT * from ${db}.${table}. snapshots")` means the user 
is looking for `snapshotCreationTs`.
   
   
   ## Issue 1
   Traditionally, with Hadoop and Hive Catalog, Spark (using Iceberg client) is 
responsible for both generating new snapshots and committing the new metadata 
files. And those catalogs are capable of enforcing `snapshotCreationTs` and 
`metadataCommitTs` to take on the exact same value for every new snapshot 
committed. However, with REST Catalog, because Spark only controls new snapshot 
generation while REST Catalog Server controls metadata commit, based on REST 
Catalog implementation (which is not controllable by this repo), 
`snapshotCreationTs` and `metadataCommitTs` may or may not be the same.
   
   The current implementation in some cases assumes that the two timestamp 
takes on exact same value. For example, this integration test: 
[ref](https://github.com/apache/iceberg/blob/2240154/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMetadataTables.java#L446).
   
   It would be best to clarify whether a valid REST Catalog implementation 
should always enforce `snapshotCreationTs` and `metadataCommitTs` to take on 
the same value. For the current reference implementation (RESTCatalogAdapter on 
top of JdbcCatalog), the answer is positive (they take on the same value). 
However, it is unclear whether it should be (or can feasibly be) enforced for all 
REST Catalog implementations.
   
   ## Issue 2
   Currently, when loading Table from REST Catalog using LoadTableResponse, the 
`lastUpdatedMillis` attribute of the metadata (which may be taking the value of 
`metadataCommitTs` or `snapshotCreationTs` on REST Catalog side based on impl 
detail) will be incorrectly replaced by `tableAccessTs` 
([ref1](https://github.com/apache/iceberg/blob/afda8be/core/src/main/java/org/apache/iceberg/rest/responses/LoadTableResponse.java#L64),
 
[ref2](https://github.com/apache/iceberg/blob/113c6e7/core/src/main/java/org/apache/iceberg/TableMetadata.java#L938)).
 Because Spark depends on `lastUpdatedMillis` to generate the latest 
`metadataCommitTs` on `metadata_log_entries` 
([ref](https://github.com/apache/iceberg/blob/8a70fe0/core/src/main/java/org/apache/iceberg/MetadataLogEntriesTable.java#L66)),
 there will always be a wrong timestamp on `metadata_log_entries` if REST 
Catalog is used.
   
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Adding RESTCatalog based Spark Smoke Test [iceberg]

2024-09-09 Thread via GitHub


haizhou-zhao commented on issue #11079:
URL: https://github.com/apache/iceberg/issues/11079#issuecomment-2339395130

   Found several issues while attempting to enable integration test based on 
REST Catalog for Spark. Listing them below as I troubleshoot them one by one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]

2024-09-09 Thread via GitHub


rahil-c commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1751106398


##
open-api/rest-catalog-open-api.yaml:
##
@@ -3647,6 +4080,105 @@ components:
 type: integer
   description: "List of equality field IDs"
 
+PlanTableScanRequest:
+  type: object
+  properties:
+snapshot-id:
+  description:
+Identifier for the snapshot to scan in a point-in-time scan
+  type: integer
+  format: int64
+select:
+  description: List of selected schema fields
+  type: array
+  items:
+$ref: '#/components/schemas/FieldName'
+filter:
+  description:
+Expression used to filter the table data
+  $ref: '#/components/schemas/Expression'
+case-sensitive:
+  description: Enables case sensitive field matching for filter and 
select
+  type: boolean
+  default: true
+use-snapshot-schema:
+  description:
+Whether to use the schema at the time the snapshot was written.
+
+When time travelling, the snapshot schema should be used (true).

Review Comment:
   Thanks for your insights @amogh-jahagirdar @flyrain 
   
   However I actually think the current description for this is pretty 
straightforward and do not think we need to explain in the spec the reference 
implementation context around "snapshot schema" .
   
   cc @rdblue @danielcweeks if you think we should clarify this further or keep 
as is. 



##
open-api/rest-catalog-open-api.yaml:
##
@@ -3647,6 +4080,105 @@ components:
 type: integer
   description: "List of equality field IDs"
 
+PlanTableScanRequest:
+  type: object
+  properties:
+snapshot-id:
+  description:
+Identifier for the snapshot to scan in a point-in-time scan
+  type: integer
+  format: int64
+select:
+  description: List of selected schema fields
+  type: array
+  items:
+$ref: '#/components/schemas/FieldName'
+filter:
+  description:
+Expression used to filter the table data
+  $ref: '#/components/schemas/Expression'
+case-sensitive:
+  description: Enables case sensitive field matching for filter and 
select
+  type: boolean
+  default: true
+use-snapshot-schema:
+  description:
+Whether to use the schema at the time the snapshot was written.
+
+When time travelling, the snapshot schema should be used (true).

Review Comment:
   Thanks for your insights @amogh-jahagirdar @flyrain 
   
   However I actually think the current description for this is pretty 
straightforward and do not think we need to explain in the spec the reference 
implementation context around "snapshot schema" .
   
   cc @rdblue @danielcweeks if you think we should clarify this further or keep 
as is. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Parallelize manifest writing for many new files [iceberg]

2024-09-09 Thread via GitHub


karuppayya commented on code in PR #11086:
URL: https://github.com/apache/iceberg/pull/11086#discussion_r1751070837


##
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##
@@ -554,6 +562,84 @@ protected boolean cleanupAfterCommit() {
 return true;
   }
 
+  protected List writeDataManifests(List files, 
PartitionSpec spec) {
+return writeDataManifests(files, null /* inherit data seq */, spec);
+  }
+
+  protected List writeDataManifests(
+  List files, Long dataSeq, PartitionSpec spec) {
+return writeManifests(files, group -> writeDataFileGroup(group, dataSeq, 
spec));
+  }
+
+  private List writeDataFileGroup(
+  List files, Long dataSeq, PartitionSpec spec) {
+RollingManifestWriter writer = newRollingManifestWriter(spec);
+
+try (RollingManifestWriter closableWriter = writer) {
+  if (dataSeq != null) {
+files.forEach(file -> closableWriter.add(file, dataSeq));
+  } else {
+files.forEach(closableWriter::add);
+  }
+} catch (IOException e) {
+  throw new RuntimeIOException(e, "Failed to write data manifests");
+}
+
+return writer.toManifestFiles();
+  }
+
+  protected List writeDeleteManifests(
+  List files, PartitionSpec spec) {
+return writeManifests(files, group -> writeDeleteFileGroup(group, spec));
+  }
+
+  private List writeDeleteFileGroup(
+  List files, PartitionSpec spec) {
+RollingManifestWriter writer = 
newRollingDeleteManifestWriter(spec);
+
+try (RollingManifestWriter closableWriter = writer) {
+  for (DeleteFileHolder file : files) {
+if (file.dataSequenceNumber() != null) {
+  closableWriter.add(file.deleteFile(), file.dataSequenceNumber());
+} else {
+  closableWriter.add(file.deleteFile());
+}
+  }
+} catch (IOException e) {
+  throw new RuntimeIOException(e, "Failed to write delete manifests");
+}
+
+return writer.toManifestFiles();
+  }
+
+  private static  List writeManifests(
+  List files, Function, List> writeFunc) {
+int groupSize = manifestFileGroupSize(ThreadPools.WORKER_THREAD_POOL_SIZE, 
files.size());
+List> groups = Lists.partition(files, groupSize);
+Queue manifests = Queues.newConcurrentLinkedQueue();
+Tasks.foreach(groups)
+.stopOnFailure()
+.throwFailureWhenFinished()
+.executeWith(ThreadPools.getWorkerPool())
+.run(group -> manifests.addAll(writeFunc.apply(group)));
+return ImmutableList.copyOf(manifests);
+  }
+
+  /**
+   * Calculates how many files can be processed concurrently depending on the 
provided parallelism

Review Comment:
   Slight change in behavior from the old code:
   1. I think in the prior code we would have only one manifest (the last) that is not 
sized rightly. With this change, we could have more (equal to the parallelism) 
manifests that are not rightly sized. 
   2. The max number of data files in any manifest is 10_000
   ```private static final int MIN_FILE_GROUP_SIZE = 10_000;```
   
   I guess 1 is ok.
   Should 2 be configurable?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Parallelize manifest writing for many new files [iceberg]

2024-09-09 Thread via GitHub


karuppayya commented on code in PR #11086:
URL: https://github.com/apache/iceberg/pull/11086#discussion_r1751128701


##
core/src/main/java/org/apache/iceberg/SnapshotProducer.java:
##
@@ -554,6 +562,84 @@ protected boolean cleanupAfterCommit() {
 return true;
   }
 
+  protected List writeDataManifests(List files, 
PartitionSpec spec) {
+return writeDataManifests(files, null /* inherit data seq */, spec);
+  }
+
+  protected List writeDataManifests(
+  List files, Long dataSeq, PartitionSpec spec) {
+return writeManifests(files, group -> writeDataFileGroup(group, dataSeq, 
spec));
+  }
+
+  private List writeDataFileGroup(
+  List files, Long dataSeq, PartitionSpec spec) {
+RollingManifestWriter writer = newRollingManifestWriter(spec);
+
+try (RollingManifestWriter closableWriter = writer) {
+  if (dataSeq != null) {
+files.forEach(file -> closableWriter.add(file, dataSeq));
+  } else {
+files.forEach(closableWriter::add);
+  }
+} catch (IOException e) {
+  throw new RuntimeIOException(e, "Failed to write data manifests");
+}
+
+return writer.toManifestFiles();
+  }
+
+  protected List writeDeleteManifests(
+  List files, PartitionSpec spec) {
+return writeManifests(files, group -> writeDeleteFileGroup(group, spec));
+  }
+
+  private List writeDeleteFileGroup(
+  List files, PartitionSpec spec) {
+RollingManifestWriter writer = 
newRollingDeleteManifestWriter(spec);
+
+try (RollingManifestWriter closableWriter = writer) {
+  for (DeleteFileHolder file : files) {
+if (file.dataSequenceNumber() != null) {
+  closableWriter.add(file.deleteFile(), file.dataSequenceNumber());
+} else {
+  closableWriter.add(file.deleteFile());
+}
+  }
+} catch (IOException e) {
+  throw new RuntimeIOException(e, "Failed to write delete manifests");
+}
+
+return writer.toManifestFiles();
+  }
+
+  private static  List writeManifests(
+  List files, Function, List> writeFunc) {
+int groupSize = manifestFileGroupSize(ThreadPools.WORKER_THREAD_POOL_SIZE, 
files.size());
+List> groups = Lists.partition(files, groupSize);
+Queue manifests = Queues.newConcurrentLinkedQueue();
+Tasks.foreach(groups)
+.stopOnFailure()
+.throwFailureWhenFinished()

Review Comment:
   Do we want to do any clean up on failures?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]

2024-09-09 Thread via GitHub


amogh-jahagirdar commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1751149711


##
open-api/rest-catalog-open-api.yaml:
##
@@ -3647,6 +4080,105 @@ components:
 type: integer
   description: "List of equality field IDs"
 
+PlanTableScanRequest:
+  type: object
+  properties:
+snapshot-id:
+  description:
+Identifier for the snapshot to scan in a point-in-time scan
+  type: integer
+  format: int64
+select:
+  description: List of selected schema fields
+  type: array
+  items:
+$ref: '#/components/schemas/FieldName'
+filter:
+  description:
+Expression used to filter the table data
+  $ref: '#/components/schemas/Expression'
+case-sensitive:
+  description: Enables case sensitive field matching for filter and 
select
+  type: boolean
+  default: true
+use-snapshot-schema:
+  description:
+Whether to use the schema at the time the snapshot was written.
+
+When time travelling, the snapshot schema should be used (true).

Review Comment:
   It's OK as is, I do think adding that historical context from the reference 
implementation is valuable for readers because without it the utility of the 
client side flag is unclear without digging through PRs or discussion threads.
   
   Def not a blocker imo, if others think this context is useful I think we can 
address this in a follow on.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]

2024-09-09 Thread via GitHub


amogh-jahagirdar commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1751149711


##
open-api/rest-catalog-open-api.yaml:
##
@@ -3647,6 +4080,105 @@ components:
 type: integer
   description: "List of equality field IDs"
 
+PlanTableScanRequest:
+  type: object
+  properties:
+snapshot-id:
+  description:
+Identifier for the snapshot to scan in a point-in-time scan
+  type: integer
+  format: int64
+select:
+  description: List of selected schema fields
+  type: array
+  items:
+$ref: '#/components/schemas/FieldName'
+filter:
+  description:
+Expression used to filter the table data
+  $ref: '#/components/schemas/Expression'
+case-sensitive:
+  description: Enables case sensitive field matching for filter and 
select
+  type: boolean
+  default: true
+use-snapshot-schema:
+  description:
+Whether to use the schema at the time the snapshot was written.
+
+When time travelling, the snapshot schema should be used (true).

Review Comment:
   It's OK as is, I do think adding that historical context from the reference 
implementation is valuable for readers because without it the utility of the 
client side flag is unclear without digging through PRs or discussion threads.
   
   Def not a blocker imo, if others think this context is useful I think we can 
address this in a follow.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]

2024-09-09 Thread via GitHub


amogh-jahagirdar commented on code in PR #9695:
URL: https://github.com/apache/iceberg/pull/9695#discussion_r1751149711


##
open-api/rest-catalog-open-api.yaml:
##
@@ -3647,6 +4080,105 @@ components:
 type: integer
   description: "List of equality field IDs"
 
+PlanTableScanRequest:
+  type: object
+  properties:
+snapshot-id:
+  description:
+Identifier for the snapshot to scan in a point-in-time scan
+  type: integer
+  format: int64
+select:
+  description: List of selected schema fields
+  type: array
+  items:
+$ref: '#/components/schemas/FieldName'
+filter:
+  description:
+Expression used to filter the table data
+  $ref: '#/components/schemas/Expression'
+case-sensitive:
+  description: Enables case sensitive field matching for filter and 
select
+  type: boolean
+  default: true
+use-snapshot-schema:
+  description:
+Whether to use the schema at the time the snapshot was written.
+
+When time travelling, the snapshot schema should be used (true).

Review Comment:
   It's OK as is, I do think adding that historical context from the reference 
implementation is valuable for readers because without it the utility of the 
client side flag is unclear without digging through PRs or discussion threads.
   
   Def not a blocker imo since it's largely describing context and not 
prescribing behavior, if others think this context is useful I think we can 
address this in a follow-up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] [API + Avro] Add default value APIs and Avro implementation [iceberg]

2024-09-09 Thread via GitHub


manuzhang commented on code in PR #9502:
URL: https://github.com/apache/iceberg/pull/9502#discussion_r1751206372


##
core/src/test/java/org/apache/iceberg/avro/TestReadDefaultValues.java:
##
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.avro;
+
+import static org.apache.iceberg.types.Types.NestedField.optional;
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import org.apache.avro.generic.GenericData.Record;
+import org.apache.iceberg.Files;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.SingleValueParser;
+import org.apache.iceberg.data.GenericRecord;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+import org.junit.Assert;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestReadDefaultValues {
+
+  @Rule public TemporaryFolder temp = new TemporaryFolder();
+
+  private static final Object[][] typesWithDefaults =
+  new Object[][] {
+{Types.BooleanType.get(), "true"},
+{Types.IntegerType.get(), "1"},
+{Types.LongType.get(), "999"},
+{Types.FloatType.get(), "1.23"},
+{Types.DoubleType.get(), "123.456"},
+{Types.DateType.get(), "\"2007-12-03\""},
+{Types.TimeType.get(), "\"10:15:30\""},
+{Types.TimestampType.withoutZone(), "\"2007-12-03T10:15:30\""},
+{Types.TimestampType.withZone(), "\"2007-12-03T10:15:30+00:00\""},
+{Types.StringType.get(), "\"foo\""},
+{Types.UUIDType.get(), "\"eb26bdb1-a1d8-4aa6-990e-da940875492c\""},
+{Types.FixedType.ofLength(2), "\"111f\""},
+{Types.BinaryType.get(), "\"ff\""},
+{Types.DecimalType.of(9, 4), "\"123.4500\""},
+{Types.DecimalType.of(9, 0), "\"2\""},
+// Avro doesn't support negative scale
+// {Types.DecimalType.of(9, -20), "\"2E+20\""},
+{Types.ListType.ofOptional(1, Types.IntegerType.get()), "[1, 2, 3]"},
+{
+  Types.MapType.ofOptional(2, 3, Types.IntegerType.get(), 
Types.StringType.get()),
+  "{\"keys\": [1, 2], \"values\": [\"foo\", \"bar\"]}"
+},
+{
+  Types.StructType.of(
+  required(4, "f1", Types.IntegerType.get()),
+  optional(5, "f2", Types.StringType.get())),
+  "{\"4\": 1, \"5\": \"bar\"}"
+},
+// deeply nested complex types
+{
+  Types.ListType.ofOptional(
+  6,
+  Types.StructType.of(
+  required(7, "f1", Types.IntegerType.get()),
+  optional(8, "f2", Types.StringType.get()))),
+  "[{\"7\": 1, \"8\": \"bar\"}, {\"7\": 2, \"8\": " + "\"foo\"}]"
+},
+{
+  Types.MapType.ofOptional(
+  9,
+  10,
+  Types.IntegerType.get(),
+  Types.StructType.of(
+  required(11, "f1", Types.IntegerType.get()),
+  optional(12, "f2", Types.StringType.get()))),
+  "{\"keys\": [1, 2], \"values\": [{\"11\": 1, \"12\": \"bar\"}, 
{\"11\": 2, \"12\": \"foo\"}]}"
+},
+{
+  Types.StructType.of(
+  required(
+  13,
+  "f1",
+  Types.StructType.of(
+  optional(14, "ff1", Types.IntegerType.get()),
+  optional(15, "ff2", Types.StringType.get()))),
+  optional(
+  16,
+  "f2",
+  Types.StructType.of(
+  optional(17, "ff1", Types.StringType.get()),
+  optional(18, "ff2", Types.IntegerType.get())))),
+  "{\"13\": {\"14\": 1, \"15\": \"bar\"}, \"16\": {\"17\": \"bar\", 
\"18\": 1}}"
+},
+  };
+
+  @Test
+  public void testDefaultValueApplied() throws IOException {
+for (Object[] typeWith

Re: [PR] feat: Reassign field ids for schema [iceberg-rust]

2024-09-09 Thread via GitHub


Xuanwo commented on PR #615:
URL: https://github.com/apache/iceberg-rust/pull/615#issuecomment-2339690333

   cc @liurenjie1024 would you like to take a look too?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core, Kafka, Spark: Use AssertJ instead of JUnit assertions [iceberg]

2024-09-09 Thread via GitHub


nastra merged PR #11102:
URL: https://github.com/apache/iceberg/pull/11102


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support lower_bound&&upper_bound for parquet writer [iceberg-rust]

2024-09-09 Thread via GitHub


xxchan commented on code in PR #383:
URL: https://github.com/apache/iceberg-rust/pull/383#discussion_r1751316492


##
crates/iceberg/src/arrow/schema.rs:
##
@@ -35,6 +35,9 @@ use rust_decimal::prelude::ToPrimitive;
 use std::collections::HashMap;
 use std::sync::Arc;
 
+/// When iceberg map type convert to Arrow map type, the default map field 
name is "key_value".
+pub(crate) const DEFAULT_MAP_FIELD_NAME: &str = "key_value";

Review Comment:
   May I ask what's the reason for changing `entries` to `key_value`? :eyes:



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core, Kafka, Spark: Use AssertJ instead of JUnit assertions [iceberg]

2024-09-09 Thread via GitHub


findepi commented on PR #11102:
URL: https://github.com/apache/iceberg/pull/11102#issuecomment-2339774571

   thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Support Vended Credentials for Azure Data Lake Store [iceberg-python]

2024-09-09 Thread via GitHub


c-thiel commented on issue #1146:
URL: 
https://github.com/apache/iceberg-python/issues/1146#issuecomment-2339797459

   @sungwy in my comment I am using "adls.sas-token", which is exactly what Java and 
Spark expect:
   
https://github.com/apache/iceberg/blob/4873b4b7534de0bcda2e1e0366ffcf83943dc906/azure/src/main/java/org/apache/iceberg/azure/AzureProperties.java#L33
 - and what our catalog provides.
   We can easily send both from the catalog side - but it would be great if we 
wouldn't have to.
   
   Is there a reason for pyiceberg not using the same prefix as Java?
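
   For context, a minimal sketch of the per-account key shape implied by the `AzureProperties` prefix linked above; the account name and SAS token value are placeholders, and the helper class is hypothetical, not part of Iceberg.

   ```java
   // Sketch only: illustrates the "adls.sas-token.<storage-account>" key form
   // referenced above. Values are placeholders, not working credentials.
   import java.util.Map;

   class AdlsSasTokenPropertiesSketch {
     // prefix expected on the Java/Spark side, per the AzureProperties link above
     static final String ADLS_SAS_TOKEN_PREFIX = "adls.sas-token.";

     static Map<String, String> exampleProperties() {
       // one SAS token per storage account, keyed by account name
       return Map.of(ADLS_SAS_TOKEN_PREFIX + "mystorageaccount", "sv=2024-05-04&sig=...");
     }
   }
   ```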


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org