[PR] Core: Add Catalog Transactions API [iceberg]
nastra opened a new pull request, #6948: URL: https://github.com/apache/iceberg/pull/6948 I have written up a design doc, which is available [here](https://docs.google.com/document/d/1UxXifU8iqP_byaW4E2RuKZx1nobxmAvc5urVcWas1B8/edit?usp=sharing) I think eventually we'd want to split this into 2 PRs, so that the introduced APIs can be reviewed independently from the REST-specific things. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: Reassign field ids for schema [iceberg-rust]
c-thiel commented on code in PR #615: URL: https://github.com/apache/iceberg-rust/pull/615#discussion_r1749940779 ## crates/iceberg/src/spec/schema.rs: ## @@ -86,6 +87,16 @@ impl SchemaBuilder { self } +/// Reassign all field-ids (nested) on build. +/// If `start_from` is provided, it will start reassigning from that id (inclusive). +/// If not provided, it will start from 0. +/// +/// All specified aliases and identifier fields will be updated to the new field-ids. +pub fn with_reassigned_field_ids(mut self, start_from: Option<i32>) -> Self { +self.reassign_field_ids_from = Some(start_from.unwrap_or(0)); Review Comment: @Xuanwo my only doubt with the plain i32 is that people might pass 1 because a value appears to be required - and 1 is shorter to type than Default::default(). The Rust tests are quite a good example that this can happen: we have many tests where field numbering starts from 1, so I had to rewrite them to be compatible with the new builder. I am OK with the i32 as well; I just wanted to bring up that an explicit None probably makes the intent clearer. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
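For context, a minimal Rust sketch of the two signatures being weighed. The builder below is a stand-in with illustrative names, not the actual iceberg-rust `SchemaBuilder`:

```rust
// Illustrative only: compares an Option<i32> start value with a plain i32.
#[derive(Default)]
struct SchemaBuilder {
    reassign_field_ids_from: Option<i32>,
}

impl SchemaBuilder {
    // Option-based variant: callers spell out `None` to mean "start from 0",
    // which makes the default explicit at the call site.
    fn with_reassigned_field_ids_opt(mut self, start_from: Option<i32>) -> Self {
        self.reassign_field_ids_from = Some(start_from.unwrap_or(0));
        self
    }

    // Plain-i32 variant: shorter call sites, but a caller unsure of the contract
    // may pass 1 simply because some value is required.
    fn with_reassigned_field_ids(mut self, start_from: i32) -> Self {
        self.reassign_field_ids_from = Some(start_from);
        self
    }
}

fn main() {
    let a = SchemaBuilder::default().with_reassigned_field_ids_opt(None);
    let b = SchemaBuilder::default().with_reassigned_field_ids(0);
    assert_eq!(a.reassign_field_ids_from, b.reassign_field_ids_from);
}
```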
Re: [PR] DRAFT - Issue 10275 - Reward support for nulls [iceberg]
nastra commented on code in PR #10953: URL: https://github.com/apache/iceberg/pull/10953#discussion_r1749992612 ## arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java: ## @@ -140,14 +141,16 @@ public static class ConstantVectorHolder extends VectorHolder { private final int numRows; public ConstantVectorHolder(int numRows) { + super(new NullVector("_dummy_", numRows), null, null); this.numRows = numRows; this.constantValue = null; } public ConstantVectorHolder(Types.NestedField icebergField, int numRows, T constantValue) { - super(icebergField); + super(new NullVector(icebergField.name(), numRows), icebergField, new NullabilityHolder(numRows)); this.numRows = numRows; this.constantValue = constantValue; + this.vector = new NullVector(icebergField.name(), numRows); Review Comment: I don't think we should be creating two instances of `NullVector`. Either use the one you pass via `super(...)` or this one, but not both -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] DRAFT - Issue 10275 - Reward support for nulls [iceberg]
nastra commented on code in PR #10953: URL: https://github.com/apache/iceberg/pull/10953#discussion_r1749997916 ## arrow/src/test/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactoryTest.java: ## @@ -0,0 +1,86 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.arrow.vectorized; + +import static org.assertj.core.api.Assertions.assertThat; +import static org.assertj.core.api.Assertions.assertThatThrownBy; +import static org.mockito.Mockito.mock; +import static org.mockito.Mockito.when; + +import java.math.BigDecimal; +import java.util.function.Supplier; +import org.apache.arrow.vector.IntVector; +import org.apache.iceberg.types.Types; +import org.apache.parquet.column.ColumnDescriptor; +import org.apache.parquet.schema.PrimitiveType; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import org.mockito.InjectMocks; +import org.mockito.Mock; +import org.mockito.MockitoAnnotations; + +class GenericArrowVectorAccessorFactoryTest { + @Mock + Supplier> decimalFactorySupplier; + + @Mock Supplier> stringFactorySupplier; + + @Mock + Supplier> + structChildFactorySupplier; + + @Mock + Supplier> arrayFactorySupplier; + + @InjectMocks GenericArrowVectorAccessorFactory genericArrowVectorAccessorFactory; + + @BeforeEach + void before() { +MockitoAnnotations.openMocks(this); + } + + @Test + void testGetVectorAccessorWithIntVector() { +IntVector vector = mock(IntVector.class); +when(vector.get(0)).thenReturn(88); + +Types.NestedField nestedField = Types.NestedField.optional(0, "a1", Types.IntegerType.get()); +ColumnDescriptor columnDescriptor = +new ColumnDescriptor( +new String[] {nestedField.name()}, PrimitiveType.PrimitiveTypeName.INT32, 0, 1); +NullabilityHolder nullabilityHolder = new NullabilityHolder(1); +VectorHolder vectorHolder = +new VectorHolder(columnDescriptor, vector, false, null, nullabilityHolder, nestedField); +ArrowVectorAccessor actual = genericArrowVectorAccessorFactory.getVectorAccessor(vectorHolder); +assertThat(actual).isNotNull(); +assertThat(actual).isInstanceOf(ArrowVectorAccessor.class); +int intValue = actual.getInt(0); +assertThat(intValue).isEqualTo(88); + } + + @Test + void testGetVectorAccessorWithNullVector() { +assertThatThrownBy( +() -> { Review Comment: the {} can be removed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
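As a small hedged illustration of the style nit (not the code from this PR), a single-expression lambda needs no braces in `assertThatThrownBy`; the example assumes AssertJ is on the classpath:

```java
import static org.assertj.core.api.Assertions.assertThatThrownBy;

class LambdaBracesExample {
  public static void main(String[] args) {
    // Braced form: assertThatThrownBy(() -> { Integer.parseInt("oops"); })
    // Brace-free form, as suggested in the review:
    assertThatThrownBy(() -> Integer.parseInt("oops"))
        .isInstanceOf(NumberFormatException.class);
  }
}
```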
Re: [PR] DRAFT - Issue 10275 - Reward support for nulls [iceberg]
nastra commented on code in PR #10953: URL: https://github.com/apache/iceberg/pull/10953#discussion_r1749998444 ## arrow/src/test/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactoryTest.java: ## @@ -0,0 +1,86 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.arrow.vectorized; + +import static org.assertj.core.api.Assertions.assertThat; +import static org.assertj.core.api.Assertions.assertThatThrownBy; +import static org.mockito.Mockito.mock; +import static org.mockito.Mockito.when; + +import java.math.BigDecimal; +import java.util.function.Supplier; +import org.apache.arrow.vector.IntVector; +import org.apache.iceberg.types.Types; +import org.apache.parquet.column.ColumnDescriptor; +import org.apache.parquet.schema.PrimitiveType; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import org.mockito.InjectMocks; +import org.mockito.Mock; +import org.mockito.MockitoAnnotations; + +class GenericArrowVectorAccessorFactoryTest { + @Mock + Supplier> decimalFactorySupplier; + + @Mock Supplier> stringFactorySupplier; + + @Mock + Supplier> + structChildFactorySupplier; + + @Mock + Supplier> arrayFactorySupplier; + + @InjectMocks GenericArrowVectorAccessorFactory genericArrowVectorAccessorFactory; + + @BeforeEach + void before() { +MockitoAnnotations.openMocks(this); + } + + @Test + void testGetVectorAccessorWithIntVector() { +IntVector vector = mock(IntVector.class); +when(vector.get(0)).thenReturn(88); + +Types.NestedField nestedField = Types.NestedField.optional(0, "a1", Types.IntegerType.get()); +ColumnDescriptor columnDescriptor = +new ColumnDescriptor( +new String[] {nestedField.name()}, PrimitiveType.PrimitiveTypeName.INT32, 0, 1); +NullabilityHolder nullabilityHolder = new NullabilityHolder(1); +VectorHolder vectorHolder = +new VectorHolder(columnDescriptor, vector, false, null, nullabilityHolder, nestedField); +ArrowVectorAccessor actual = genericArrowVectorAccessorFactory.getVectorAccessor(vectorHolder); +assertThat(actual).isNotNull(); +assertThat(actual).isInstanceOf(ArrowVectorAccessor.class); Review Comment: ```suggestion assertThat(actual).isNotNull().isInstanceOf(ArrowVectorAccessor.class); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] TableMetadataBuilder [iceberg-rust]
c-thiel commented on PR #587: URL: https://github.com/apache/iceberg-rust/pull/587#issuecomment-2337769487 @liurenjie1024 thanks for the Feedback! > > My first point of the opening statement: Do we re-write our SortOrder and add the schema to PartitionSpec so that we can match on names like Java does or not? > > To be honest, I don't quite understand the use case. We can ask for background of this in dev channel, but I think this is not a blocker of this pr, we can always add this later. The problem in changing it later is that it changes the semantic of the function. Right now we expect source_id to match the `current_schema()` (which is also the reason why I expose it during build). Java doesn't do this, instead, it looks up the name by id in the schema bound to a SortOrder or PartitionSpec, and then searches for the same name in the new schema to use it. In my opinion ids are much cleaner than names (we might have dropped and re-added a column with the same name in the meantime), so I am OK with going forward. However, moving over to java semantics will require new endpoints (i.e. `add_migrate_partition_spec` or so), which takes a bound partition spec in contrast to the unbound spec we currently pass in. Give me a thumbs up if that's OK for you. I'll also open a discussion in the dev channel to get some more opinions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
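A hedged sketch of the two binding semantics being compared; the types and function names below are illustrative, not iceberg-rust's. The "id" approach trusts `source_id` against the current schema, while the Java-style "name" approach resolves the name in the spec's original schema and re-finds that name in the new schema:

```rust
use std::collections::HashMap;

// Bind by id: the source_id must still exist in the new schema.
fn bind_by_id(new_schema_by_id: &HashMap<i32, String>, source_id: i32) -> Option<i32> {
    new_schema_by_id.contains_key(&source_id).then_some(source_id)
}

// Bind by name (Java-style): look the name up in the old schema, then find the
// same name in the new schema, even if its id changed.
fn bind_by_name(
    old_schema_by_id: &HashMap<i32, String>,
    new_schema_by_name: &HashMap<String, i32>,
    source_id: i32,
) -> Option<i32> {
    let name = old_schema_by_id.get(&source_id)?;
    new_schema_by_name.get(name).copied()
}

fn main() {
    let old: HashMap<i32, String> = HashMap::from([(1, "ts".to_string())]);
    // After dropping and re-adding the column, the same name has a new id.
    let new_by_id: HashMap<i32, String> = HashMap::from([(7, "ts".to_string())]);
    let new_by_name: HashMap<String, i32> = HashMap::from([("ts".to_string(), 7)]);

    assert_eq!(bind_by_id(&new_by_id, 1), None);               // id 1 no longer exists
    assert_eq!(bind_by_name(&old, &new_by_name, 1), Some(7));  // name survives evolution
}
```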
Re: [PR] Feat: Normalize TableMetadata [iceberg-rust]
c-thiel commented on PR #611: URL: https://github.com/apache/iceberg-rust/pull/611#issuecomment-2337827094 > Thank you very much for working on this. I have a few style suggestions, but please feel free to disregard them if they don't suit you. Thanks @Xuanwo, they all make sense! Addressed in [Improve readability & comments](https://github.com/apache/iceberg-rust/pull/611/commits/65074a1ea7e3eaaf52ca148bbf4fda7ecbda0c8c) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Case sensitivity is not respected when using IcebergGenerics.ScanBuilder [iceberg]
mderoy closed issue #8178: Case sensitivity is not respected when using IcebergGenerics.ScanBuilder URL: https://github.com/apache/iceberg/issues/8178 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.3, 3.4: Fix incorrect catalog loaded in TestCreateActions [iceberg]
nastra commented on code in PR #11049: URL: https://github.com/apache/iceberg/pull/11049#discussion_r1750181394 ## spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/actions/TestCreateActions.java: ## @@ -728,6 +733,8 @@ public void testStructOfThreeLevelLists() throws Exception { @Test public void testTwoLevelList() throws IOException { +Assume.assumeTrue("Cannot migrate to a hadoop based catalog", !type.equals("hadoop")); Review Comment: why not use Assumptions from AssertJ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
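A hedged sketch of what the reviewer is suggesting; `type` here stands in for the test's catalog type field and AssertJ is assumed to be on the classpath:

```java
import static org.assertj.core.api.Assumptions.assumeThat;

class AssumptionExample {
  public static void main(String[] args) {
    String type = "hive"; // stand-in for the test's catalog type
    // AssertJ equivalent of Assume.assumeTrue(...): the test is skipped when
    // the assumption fails, i.e. when type is "hadoop".
    assumeThat(type)
        .as("Cannot migrate to a hadoop based catalog")
        .isNotEqualTo("hadoop");
    System.out.println("assumption held, test body would run");
  }
}
```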
Re: [I] Library public api isolation and import decoupling [iceberg-python]
ndrluis closed issue #499: Library public api isolation and import decoupling URL: https://github.com/apache/iceberg-python/issues/499 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Library public api isolation and import decoupling [iceberg-python]
ndrluis commented on issue #499: URL: https://github.com/apache/iceberg-python/issues/499#issuecomment-2338019734 I'll close this issue because we added the mypy linter, which solves our problem with coupling, and we have #1099 to solve the public API definition. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Feat: Normalize TableMetadata [iceberg-rust]
Xuanwo merged PR #611: URL: https://github.com/apache/iceberg-rust/pull/611 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] TableMetadataBuilder [iceberg-rust]
Xuanwo commented on PR #587: URL: https://github.com/apache/iceberg-rust/pull/587#issuecomment-2338031867 I have reviewed most PRs that I am confident can be merged. The only one left is https://github.com/apache/iceberg-rust/pull/615, for which I need more input. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Remove Hive 2 [iceberg]
nastra commented on code in PR #10996: URL: https://github.com/apache/iceberg/pull/10996#discussion_r1750198331 ## mr/src/main/java/org/apache/iceberg/mr/hive/serde/objectinspector/IcebergObjectInspector.java: ## @@ -27,33 +27,23 @@ import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory; import org.apache.iceberg.Schema; import org.apache.iceberg.common.DynMethods; -import org.apache.iceberg.hive.HiveVersion; import org.apache.iceberg.types.Type; import org.apache.iceberg.types.TypeUtil; import org.apache.iceberg.types.Types; public final class IcebergObjectInspector extends TypeUtil.SchemaVisitor { - // get the correct inspectors depending on whether we're working with Hive2 or Hive3 dependencies - // we need to do this because there is a breaking API change in Date/TimestampObjectInspector - // between Hive2 and Hive3 private static final String DATE_INSPECTOR_CLASS = - HiveVersion.min(HiveVersion.HIVE_3) - ? "org.apache.iceberg.mr.hive.serde.objectinspector.IcebergDateObjectInspectorHive3" - : "org.apache.iceberg.mr.hive.serde.objectinspector.IcebergDateObjectInspector"; + "org.apache.iceberg.mr.hive.serde.objectinspector.IcebergDateObjectInspector"; Review Comment: is this the correct one? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Remove Hive 2 [iceberg]
nastra commented on code in PR #10996: URL: https://github.com/apache/iceberg/pull/10996#discussion_r1750200212 ## mr/src/main/java/org/apache/iceberg/mr/hive/serde/objectinspector/IcebergObjectInspector.java: ## @@ -27,33 +27,23 @@ import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory; import org.apache.iceberg.Schema; import org.apache.iceberg.common.DynMethods; -import org.apache.iceberg.hive.HiveVersion; import org.apache.iceberg.types.Type; import org.apache.iceberg.types.TypeUtil; import org.apache.iceberg.types.Types; public final class IcebergObjectInspector extends TypeUtil.SchemaVisitor { - // get the correct inspectors depending on whether we're working with Hive2 or Hive3 dependencies - // we need to do this because there is a breaking API change in Date/TimestampObjectInspector - // between Hive2 and Hive3 private static final String DATE_INSPECTOR_CLASS = - HiveVersion.min(HiveVersion.HIVE_3) - ? "org.apache.iceberg.mr.hive.serde.objectinspector.IcebergDateObjectInspectorHive3" - : "org.apache.iceberg.mr.hive.serde.objectinspector.IcebergDateObjectInspector"; + "org.apache.iceberg.mr.hive.serde.objectinspector.IcebergDateObjectInspector"; Review Comment: maybe just rename those classes so that they don't have the `Hive` suffix -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] scan: fix error when reading an empty table [iceberg-rust]
sdd commented on PR #608: URL: https://github.com/apache/iceberg-rust/pull/608#issuecomment-2338064061 Just to clarify, not having any snapshots is not necessarily the same as not having any data. If there is no current snapshot then there can't be any data, but someone could delete all data from a table, resulting in there being a snapshot, but no data. The existing code would handle this second case just fine - we only need to handle the issue of no snapshots. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
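A hedged sketch of the distinction; the function below is illustrative, not the iceberg-rust scan code. Planning can short-circuit only when there is no current snapshot, while a snapshot with all data deleted goes through normal planning and simply yields no tasks:

```rust
// Illustrative only: `plan_files` stands in for the scan-planning entry point.
fn plan_files(current_snapshot: Option<&str>) -> Vec<String> {
    match current_snapshot {
        // No snapshot at all: nothing has ever been committed, so return no tasks.
        None => Vec::new(),
        // A snapshot exists: it may still produce zero tasks (e.g. all rows deleted),
        // but that case is handled by regular manifest evaluation, not a special path.
        Some(_snapshot) => vec!["file-scan-task (from manifest evaluation)".to_string()],
    }
}

fn main() {
    assert!(plan_files(None).is_empty());
    assert!(!plan_files(Some("snap-1")).is_empty());
}
```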
Re: [PR] Core: Add support for `view-default` property in catalog [iceberg]
nastra commented on code in PR #11064: URL: https://github.com/apache/iceberg/pull/11064#discussion_r1750254414 ## open-api/src/testFixtures/java/org/apache/iceberg/rest/RCKUtils.java: ## @@ -85,7 +85,8 @@ static RESTCatalog initCatalogClient() { catalogProperties.putIfAbsent( CatalogProperties.URI, String.format("http://localhost:%s/";, RESTCatalogServer.REST_PORT_DEFAULT)); -catalogProperties.putIfAbsent(CatalogProperties.WAREHOUSE_LOCATION, "rck_warehouse"); Review Comment: is this intentionally being removed? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Add support for `view-default` property in catalog [iceberg]
nastra commented on code in PR #11064: URL: https://github.com/apache/iceberg/pull/11064#discussion_r1750255921 ## core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java: ## @@ -107,6 +108,7 @@ public void basicCreateView() { assertThat(view.currentVersion().operation()).isEqualTo("create"); assertThat(view.schemas()).hasSize(1).containsKey(0); assertThat(view.versions()).hasSize(1).containsExactly(view.currentVersion()); +assertThat(view.properties()).contains(entry("key1", "catalog-default-key1")); Review Comment: ```suggestion assertThat(view.properties()).containsEntry("key1", "catalog-default-key1"); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: fix NPE with HadoopFileIO because FileIOParser doesn't serialize Hadoop configuration [iceberg]
pvary commented on code in PR #10926: URL: https://github.com/apache/iceberg/pull/10926#discussion_r1750267503 ## core/src/main/java/org/apache/iceberg/hadoop/HadoopFileIO.java: ## @@ -63,7 +63,11 @@ public class HadoopFileIO implements HadoopConfigurable, DelegateFileIO { * {@link Configuration Hadoop configuration} must be set through {@link * HadoopFileIO#setConf(Configuration)} */ - public HadoopFileIO() {} + public HadoopFileIO() { +// Create a default hadoopConf as it is required for the object to be valid. +// E.g. newInputFile would throw NPE with hadoopConf.get() otherwise. +this.hadoopConf = new SerializableConfiguration(new Configuration())::get; Review Comment: @stevenzwu: > if I understand correctly, your point is that FileIO probably shouldn't be part of the task state for BaseEntriesTable.ManifestReadTask, AllManifestsTable.ManifestListReadTask? [..] It seems like a large/challenging refactoring. Looking for other folks' take on this. Yes, this seems strange, and problematic to me. I was not able to find an easy solution yet. I was hoping, others with better knowledge might have some ideas. > Regardless, the issue remains that FileIOParser doesn't serialize HadoopFileIO faithfully. I don't know if REST catalog has any need to use it to JSON serialize FileIO in the future. My point here is that, since we use `Configuration(false)` in some cases, and the way how the current serialization works, we already `doesn't serialize HadoopFileIO faithfully`. So If we don't find a solution for getting rid of the FileIO, we might as well write our own "unfaithful" serialization which mimics the way how the current serialization works. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Avoid metaspace memory leak by not registering ShutdownHook for ExecutorService in Flink [iceberg]
pvary commented on PR #11073: URL: https://github.com/apache/iceberg/pull/11073#issuecomment-2338174338 @fengjiajie: Does this mean that you submitted 1091 jobs in the span of 2 minutes? I see that the default timeout for `MoreExecutors.getExitingExecutorService` is 120s:
```
final ExecutorService getExitingExecutorService(ThreadPoolExecutor executor) {
  return this.getExitingExecutorService(executor, 120L, TimeUnit.SECONDS);
}
```
Or is this a Guava issue, where the application is not cleared even after the timeout is reached? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
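For reference, a small sketch of the Guava call in question. My reading of the Guava behavior, stated here as an assumption worth double-checking, is that the 120s only bounds how long the registered shutdown hook waits for running tasks at JVM exit; the hook itself stays registered for the life of the JVM, which is what ties each executor to the JVM's lifetime:

```java
import com.google.common.util.concurrent.MoreExecutors;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ExitingExecutorExample {
  public static void main(String[] args) {
    ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(2);
    // Registers a JVM shutdown hook that holds a reference to the wrapped executor;
    // the timeout only limits how long that hook waits for termination at exit.
    ExecutorService exiting =
        MoreExecutors.getExitingExecutorService(pool, 120L, TimeUnit.SECONDS);
    exiting.submit(() -> System.out.println("task executed"));
    exiting.shutdown();
  }
}
```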
Re: [PR] AWS: Add configuration and set better defaults for S3 retry behaviour [iceberg]
nastra commented on code in PR #11052: URL: https://github.com/apache/iceberg/pull/11052#discussion_r1750335281 ## aws/src/test/java/org/apache/iceberg/aws/s3/TestS3FileIOProperties.java: ## @@ -491,4 +493,17 @@ public void testApplyUserAgentConfigurations() { Mockito.verify(mockS3ClientBuilder) .overrideConfiguration(Mockito.any(ClientOverrideConfiguration.class)); } + + @Test + public void testApplyRetryConfiguration() { +Map properties = Maps.newHashMap(); +properties.put(S3FileIOProperties.S3_RETRY_NUM_RETRIES, "999"); +S3FileIOProperties s3FileIOProperties = new S3FileIOProperties(properties); + +S3ClientBuilder builder = S3Client.builder(); +s3FileIOProperties.applyRetryConfigurations(builder); + +RetryPolicy retryPolicy = builder.overrideConfiguration().retryPolicy().get(); +assertThat(retryPolicy.numRetries()).withFailMessage("retries was not set").isEqualTo(999); Review Comment: ```suggestion assertThat(retryPolicy.numRetries()).as("retries was not set").isEqualTo(999); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Spark Streaming Job with multiple queries MERGE INTO the same target table (Runtime file filtering is not possible) [iceberg]
RussellSpitzer commented on issue #11094: URL: https://github.com/apache/iceberg/issues/11094#issuecomment-2338247842

> @eric-maynard I have other examples where I do what you suggest. If I move this inside `start_streaming_query` the builder will simply return the existing spark session `getOrCreate` it will not create a new spark application.
>
> When I do so I see the same behavior.

You need to create a new session object, not get the active one like so:
```scala
scala> spark.newSession()
res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@5d643896
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
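A hedged PySpark sketch of the same point, on the assumption that the reporter's `start_streaming_query` helper is Python; the master and app name below are illustrative:

```python
from pyspark.sql import SparkSession

# getOrCreate() hands back the already-active session, so per-query isolation
# needs newSession(): a separate SparkSession sharing the same SparkContext.
base = SparkSession.builder.master("local[1]").appName("demo").getOrCreate()

same = SparkSession.builder.getOrCreate()   # same object as `base`
isolated = base.newSession()                # distinct session, shared SparkContext

print(same is base)       # True
print(isolated is base)   # False
```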
Re: [I] Spark configuration for amazon access key and secret key with glue catalog for apache Iceberg is not honoring [iceberg]
andythsu commented on issue #10078: URL: https://github.com/apache/iceberg/issues/10078#issuecomment-2338251099 @clamar14 @nastra I ended up using different jars and putting everything together instead of using `iceberg-aws-bundle`
```python
def get_builder() -> SparkSession.Builder:
    return (
        SparkSession.builder
        .config(
            "spark.jars",
            (
                f"{JARS_BASE_PATH}/iceberg-spark-runtime-3.4_2.12-1.6.0.jar"
                f",{JARS_BASE_PATH}/iceberg-spark-extensions-3.4_2.12-1.6.0.jar"
                f",{JARS_BASE_PATH}/hadoop-aws-3.3.2.jar"  # needed for hadoop s3 file system
                f",{JARS_BASE_PATH}/aws-java-sdk-bundle-1.12.769.jar"
                f",{JARS_BASE_PATH}/protobuf-java-3.2.0.jar"
            )
        )
        ...
    )
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Kafka Connect: add option to force columns to lowercase [iceberg]
bryanck commented on PR #11100: URL: https://github.com/apache/iceberg/pull/11100#issuecomment-2338262871 Note, this addresses https://github.com/apache/iceberg/issues/11091 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.3, 3.4: Fix incorrect catalog loaded in TestCreateActions [iceberg]
nastra merged PR #11049: URL: https://github.com/apache/iceberg/pull/11049 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Kafka Connect: add option to force columns to lowercase [iceberg]
bryanck commented on code in PR #11100: URL: https://github.com/apache/iceberg/pull/11100#discussion_r1750375359 ## kafka-connect/kafka-connect/src/main/java/org/apache/iceberg/connect/IcebergSinkConfig.java: ## @@ -79,6 +79,8 @@ public class IcebergSinkConfig extends AbstractConfig { "iceberg.tables.schema-force-optional"; private static final String TABLES_SCHEMA_CASE_INSENSITIVE_PROP = "iceberg.tables.schema-case-insensitive"; + private static final String TABLES_SCHEMA_FORCE_COLUMNS_TO_LOWERCASE_PROP = + "iceberg.tables.schema-force-columns-to-lowercase"; Review Comment: I think `iceberg.tables.schema-force-lowercase` is sufficient and consistent with `schema-force-optional`. Also, moving this up one to be below `schema-force optional` would be preferable. ## kafka-connect/kafka-connect/src/main/java/org/apache/iceberg/connect/IcebergSinkConfig.java: ## @@ -162,6 +164,12 @@ private static ConfigDef newConfigDef() { false, Importance.MEDIUM, "Set to true to look up table columns by case-insensitive name, false for case-sensitive"); +configDef.define( +TABLES_SCHEMA_FORCE_COLUMNS_TO_LOWERCASE_PROP, +ConfigDef.Type.BOOLEAN, +false, +Importance.MEDIUM, +"Set to true to force table column names to lowercase, false for preserve case"); Review Comment: Same note here on name and moving it up one element to be next to "force optional". -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Kafka Connect: Record projection Index out of bounds error [iceberg]
bryanck commented on issue #11099: URL: https://github.com/apache/iceberg/issues/11099#issuecomment-2338355847 Thanks for reporting this. Does this require a fix, or was it something else? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[I] Modify SQLCatalog initialization so that classes are not always created and update how these classes are created to be more open to other DBs [iceberg-python]
isc-patrick opened a new issue, #1148: URL: https://github.com/apache/iceberg-python/issues/1148
### Feature Request / Improvement
I think there are 2 issues with the SQLCatalog constructor: https://github.com/apache/iceberg-python/blob/d587e6724685744918ecf192724437182ad01abf/pyiceberg/catalog/sql.py#L117

1. The automatic creation of non-existent SQLCatalog tables is great for testing, but not something I would want in code that is deployed and run, even in a QA environment. Applying DDL should be part of a separate process. The constructor should at least take a parameter that allows you to not create the tables automatically.

2. The current method used for creating the tables in _ensure_tables_exists relies on exceptions, which means the library is compatible with SQLite and Postgres simply because the except clause handles the errors those DBs throw when selecting from a non-existent table. The logic here should just check for the existence of the tables through a standard SQLAlchemy method, which will also make the SQLCatalog work with most DBs.

I have made these changes on my local branch and can submit a PR, but wanted to open up the conversation to other opinions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
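A hedged sketch of point (2): an explicit existence check with SQLAlchemy's inspector instead of relying on dialect-specific exceptions. The table names below are assumptions for illustration, not necessarily PyIceberg's exact DDL:

```python
from sqlalchemy import create_engine, inspect

# Names assumed for illustration; substitute the catalog's real table names.
CATALOG_TABLES = ("iceberg_tables", "iceberg_namespace_properties")

def catalog_tables_exist(engine) -> bool:
    # Inspector.has_table works across dialects, so no SELECT-and-catch is needed.
    inspector = inspect(engine)
    return all(inspector.has_table(name) for name in CATALOG_TABLES)

engine = create_engine("sqlite:///:memory:")
print(catalog_tables_exist(engine))  # False until the DDL has been applied
```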
Re: [PR] Core: Fix the behavior of IncrementalFileCleanup when expire a snapshot [iceberg]
amogh-jahagirdar merged PR #10983: URL: https://github.com/apache/iceberg/pull/10983 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]
danielcweeks commented on code in PR #9695: URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750551470 ## open-api/rest-catalog-open-api.yaml: ## @@ -3647,6 +4080,105 @@ components: type: integer description: "List of equality field IDs" +PlanTableScanRequest: + type: object + properties: +snapshot-id: + description: +Identifier for the snapshot to scan in a point-in-time scan + type: integer + format: int64 +select: + description: List of selected schema fields + type: array + items: +$ref: '#/components/schemas/FieldName' +filter: + description: +Expression used to filter the table data + $ref: '#/components/schemas/Expression' +case-sensitive: + description: Enables case sensitive field matching for filter and select + type: boolean + default: true +use-snapshot-schema: + description: +Whether to use the schema at the time the snapshot was written. + +When time travelling, the snapshot schema should be used (true). Review Comment: iirc, there are two reasons: 1. We want the client to be explicit about which snapshot they want to read 2. This simplifies the request because there's only one way to read historical data -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]
amogh-jahagirdar commented on code in PR #9695: URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750572264 ## open-api/rest-catalog-open-api.yaml: ## @@ -3647,6 +4080,105 @@ components: type: integer description: "List of equality field IDs" +PlanTableScanRequest: + type: object + properties: +snapshot-id: + description: +Identifier for the snapshot to scan in a point-in-time scan + type: integer + format: int64 +select: + description: List of selected schema fields + type: array + items: +$ref: '#/components/schemas/FieldName' +filter: + description: +Expression used to filter the table data + $ref: '#/components/schemas/Expression' +case-sensitive: + description: Enables case sensitive field matching for filter and select + type: boolean + default: true +use-snapshot-schema: + description: +Whether to use the schema at the time the snapshot was written. + +When time travelling, the snapshot schema should be used (true). Review Comment: Yeah, clients currently specify the snapshot ID however there needs to be a mechanism for distinguishing which schema gets used based on if it's a time travel by a specific snapshot ID or if it's a time travel by branch. The client has that context, and it's easier for it to determine which schema should be used. The request input is kept simpler by having just a snapshot ID for time travel as @danielcweeks said rather than having a mix of different options. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]
amogh-jahagirdar commented on code in PR #9695: URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750583821 ## open-api/rest-catalog-open-api.yaml: ## @@ -541,6 +541,264 @@ paths: 5XX: $ref: '#/components/responses/ServerErrorResponse' + /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan: +parameters: + - $ref: '#/components/parameters/prefix' + - $ref: '#/components/parameters/namespace' + - $ref: '#/components/parameters/table' +post: + tags: +- Catalog API + summary: Submit a scan for planning + description: > +Submits a scan for server-side planning. + + +Point-in-time scans are planned by passing snapshot-id to identify the +table snapshot to scan. Incremental scans are planned by passing both +start-snapshot-id and end-snapshot-id. Requests that include both point +in time config properties and incremental config properties are +invalid. If the request does not include either incremental or +point-in-time config properties, scan planning should produce a +point-in-time scan of the latest snapshot in the table's main branch. + + +Responses must include a valid status + +- When "completed" the planning operation has produced plan tasks and + file scan tasks that must be returned in the response (not fetched + later by calling fetchPlanningResult) + +- When "submitted" the response must include a plan-id used to poll + fetchPlanningResult to fetch the planning result when it is ready + +- When "failed" the response must be a valid error response + +- Status "cancelled" is not a valid status from this endpoint + + +The response for a "completed" planning operation includes two types of +tasks (file scan tasks and plan tasks) and both may be included in the +response. Tasks must not be included for any other response status. + + +Responses that include a plan-id indicate that the service is holding +state or performing work for the client. + + +- Clients should use the plan-id to fetch results from + fetchPlanningResult when the response status is "submitted" + +- Clients should inform the service if planning results are no longer + needed by calling cancelPlanning. Cancellation is not necessary after + fetchScanTasks has been used to fetch scan tasks for each plan task. 
+ operationId: planTableScan + requestBody: +content: + application/json: +schema: + $ref: '#/components/schemas/PlanTableScanRequest' + responses: +200: + $ref: '#/components/responses/PlanTableScanResponse' +400: + $ref: '#/components/responses/BadRequestErrorResponse' +401: + $ref: '#/components/responses/UnauthorizedResponse' +403: + $ref: '#/components/responses/ForbiddenResponse' +404: + description: +Not Found +- NoSuchTableException, the table does not exist +- NoSuchNamespaceException, the namespace does not exist + content: +application/json: + schema: +$ref: '#/components/schemas/IcebergErrorResponse' + examples: +TableDoesNotExist: + $ref: '#/components/examples/NoSuchTableError' +NamespaceDoesNotExist: + $ref: '#/components/examples/NoSuchNamespaceError' +406: + $ref: '#/components/responses/UnsupportedOperationResponse' +419: + $ref: '#/components/responses/AuthenticationTimeoutResponse' +503: + $ref: '#/components/responses/ServiceUnavailableResponse' +5XX: + $ref: '#/components/responses/ServerErrorResponse' + + /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan/{plan-id}: +parameters: + - $ref: '#/components/parameters/prefix' + - $ref: '#/components/parameters/namespace' + - $ref: '#/components/parameters/table' + - $ref: '#/components/parameters/plan-id' + +get: + tags: +- Catalog API + summary: Fetches the result of scan planning for a plan-id + operationId: fetchPlanningResult + description: > +Fetches the result of scan planning for a plan-id. + + +Responses must include a valid status + +- When "completed" the planning operation has produced plan-tasks and + file-scan-tasks that must be returned in the response + +- When "submitted" the planning operation has not completed; the client + should wait to call this endpoint again to fetch a completed response + +- When "failed" the response must be a valid error response + +- When "cancelled" the plan-id is invalid and should be discarded + + +The respons
Re: [I] Data files which are still useful are mistakenly cleaned up when trying to expire a specified snapshot [iceberg]
hantangwangd closed issue #10982: Data files which are still useful are mistakenly cleaned up when trying to expire a specified snapshot URL: https://github.com/apache/iceberg/issues/10982 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]
flyrain commented on code in PR #9695: URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750651482 ## open-api/rest-catalog-open-api.yaml: ## @@ -3647,6 +4080,105 @@ components: type: integer description: "List of equality field IDs" +PlanTableScanRequest: + type: object + properties: +snapshot-id: + description: +Identifier for the snapshot to scan in a point-in-time scan + type: integer + format: int64 +select: + description: List of selected schema fields + type: array + items: +$ref: '#/components/schemas/FieldName' +filter: + description: +Expression used to filter the table data + $ref: '#/components/schemas/Expression' +case-sensitive: + description: Enables case sensitive field matching for filter and select + type: boolean + default: true +use-snapshot-schema: + description: +Whether to use the schema at the time the snapshot was written. + +When time travelling, the snapshot schema should be used (true). Review Comment: Thanks @danielcweeks and @amogh-jahagirdar for inputs. Sorry I didn't realized there are differences between a time travel by a specific snapshot ID and by branch name, in terms of which schema to use. To help clarify, could we add a link here for further reading, or provide a detailed explanation? For example: ``` - When scanning with a snapshot ID, clients should use the snapshot schema (true). - When scanning with a branch name, clients should use the table schema (false). Note that clients still send the snapshot ID rather than the branch name in the request. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] DRAFT - Issue 10275 - Reward support for nulls [iceberg]
slessard commented on code in PR #10953: URL: https://github.com/apache/iceberg/pull/10953#discussion_r1750566040 ## arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java: ## @@ -262,6 +264,111 @@ public void testReadColumnFilter2() throws Exception { scan, NUM_ROWS_PER_MONTH, 12 * NUM_ROWS_PER_MONTH, ImmutableList.of("timestamp")); } + @Test + public void testReadColumnThatDoesNotExistInParquetSchema() throws Exception { +setMaxStackTraceElementsDisplayed(15); Review Comment: This tests the exact scenario that is the impetus for adding support for null vectors. I feel this test is essential. This test is the only test exercising this pull request. If you see things differently, please explain. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]
flyrain commented on code in PR #9695: URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750675084 ## open-api/rest-catalog-open-api.yaml: ## @@ -541,6 +541,264 @@ paths: 5XX: $ref: '#/components/responses/ServerErrorResponse' + /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan: +parameters: + - $ref: '#/components/parameters/prefix' + - $ref: '#/components/parameters/namespace' + - $ref: '#/components/parameters/table' +post: + tags: +- Catalog API + summary: Submit a scan for planning + description: > +Submits a scan for server-side planning. + + +Point-in-time scans are planned by passing snapshot-id to identify the +table snapshot to scan. Incremental scans are planned by passing both +start-snapshot-id and end-snapshot-id. Requests that include both point +in time config properties and incremental config properties are +invalid. If the request does not include either incremental or +point-in-time config properties, scan planning should produce a +point-in-time scan of the latest snapshot in the table's main branch. + + +Responses must include a valid status + +- When "completed" the planning operation has produced plan tasks and + file scan tasks that must be returned in the response (not fetched + later by calling fetchPlanningResult) + +- When "submitted" the response must include a plan-id used to poll + fetchPlanningResult to fetch the planning result when it is ready + +- When "failed" the response must be a valid error response + +- Status "cancelled" is not a valid status from this endpoint + + +The response for a "completed" planning operation includes two types of +tasks (file scan tasks and plan tasks) and both may be included in the +response. Tasks must not be included for any other response status. + + +Responses that include a plan-id indicate that the service is holding +state or performing work for the client. + + +- Clients should use the plan-id to fetch results from + fetchPlanningResult when the response status is "submitted" + +- Clients should inform the service if planning results are no longer + needed by calling cancelPlanning. Cancellation is not necessary after + fetchScanTasks has been used to fetch scan tasks for each plan task. 
+ operationId: planTableScan + requestBody: +content: + application/json: +schema: + $ref: '#/components/schemas/PlanTableScanRequest' + responses: +200: + $ref: '#/components/responses/PlanTableScanResponse' +400: + $ref: '#/components/responses/BadRequestErrorResponse' +401: + $ref: '#/components/responses/UnauthorizedResponse' +403: + $ref: '#/components/responses/ForbiddenResponse' +404: + description: +Not Found +- NoSuchTableException, the table does not exist +- NoSuchNamespaceException, the namespace does not exist + content: +application/json: + schema: +$ref: '#/components/schemas/IcebergErrorResponse' + examples: +TableDoesNotExist: + $ref: '#/components/examples/NoSuchTableError' +NamespaceDoesNotExist: + $ref: '#/components/examples/NoSuchNamespaceError' +406: + $ref: '#/components/responses/UnsupportedOperationResponse' +419: + $ref: '#/components/responses/AuthenticationTimeoutResponse' +503: + $ref: '#/components/responses/ServiceUnavailableResponse' +5XX: + $ref: '#/components/responses/ServerErrorResponse' + + /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan/{plan-id}: +parameters: + - $ref: '#/components/parameters/prefix' + - $ref: '#/components/parameters/namespace' + - $ref: '#/components/parameters/table' + - $ref: '#/components/parameters/plan-id' + +get: + tags: +- Catalog API + summary: Fetches the result of scan planning for a plan-id + operationId: fetchPlanningResult + description: > +Fetches the result of scan planning for a plan-id. + + +Responses must include a valid status + +- When "completed" the planning operation has produced plan-tasks and + file-scan-tasks that must be returned in the response + +- When "submitted" the planning operation has not completed; the client + should wait to call this endpoint again to fetch a completed response + +- When "failed" the response must be a valid error response + +- When "cancelled" the plan-id is invalid and should be discarded + + +The response for a "
Re: [I] Modify SQLCatalog initialization so that classes are not always created and update how these classes are created to be more open to other DBs [iceberg-python]
sungwy commented on issue #1148: URL: https://github.com/apache/iceberg-python/issues/1148#issuecomment-2338726996 Hi @isc-patrick - thank you for raising this. These are great points. I'm wondering if we can keep the current approach of running `_ensure_table_exists` within the instantiation of the SqlCatalog, but adopt an improved way of running the table creation DDL as you suggested in (2). The reason is that, as long as the function is idempotent (similar to `create_table_if_not_exists` or `CREATE TABLE IF NOT EXISTS`), it is simple enough to keep in the initialization of the SqlCatalog, which requires these tables in order to execute any of its member functions. Looking at the SqlAlchemy dialect further, I'm of the impression that we could instead just remove the additional try/except handling and just invoke `SqlCatalogBaseTable.metadata.create_all(self.engine)`. According to the SqlAlchemy docs, this uses the default flag of `checkfirst=True`, which avoids creating the tables if they already exist. https://docs.sqlalchemy.org/en/20/core/metadata.html#sqlalchemy.schema.MetaData.create_all What do you think about this suggestion? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
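A minimal sketch of that suggestion; the table definition is a stand-in, not the real SqlCatalogBaseTable metadata. With the default `checkfirst=True`, `create_all` only issues CREATE TABLE for missing tables, so the call stays idempotent without any try/except:

```python
from sqlalchemy import Column, MetaData, String, Table, create_engine

metadata = MetaData()
# Stand-in table; the real catalog schema is defined by SqlCatalogBaseTable.
Table("iceberg_tables", metadata, Column("table_namespace", String(255), primary_key=True))

engine = create_engine("sqlite:///:memory:")
metadata.create_all(engine)  # issues CREATE TABLE for the missing table
metadata.create_all(engine)  # no-op: checkfirst=True skips tables that already exist
```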
Re: [I] Iceberg Glue Concurrent Update can result in missing metadata_location [iceberg]
singhpk234 commented on issue #9411: URL: https://github.com/apache/iceberg/issues/9411#issuecomment-2338733735 @shaeqahmed are you seeing this log line too ? ``` LOG.warn( "Received unexpected failure when committing to {}, validating if commit ended up succeeding.", fullTableName, persistFailure); ``` I stumbled upon similar error : [1] If we can't see this log line, so to me seems like Retry detector didn't work as expected and it didn't attempt to reconcile the status [2] If we see the log line this means we saw inconsistent state even after reconciliation, which will be a bit tricky as i am not sure how glue works in this case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]
dramaticlly commented on code in PR #9695: URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750692281 ## open-api/rest-catalog-open-api.yaml: ## @@ -541,6 +541,264 @@ paths: 5XX: $ref: '#/components/responses/ServerErrorResponse' + /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan: +parameters: + - $ref: '#/components/parameters/prefix' + - $ref: '#/components/parameters/namespace' + - $ref: '#/components/parameters/table' +post: + tags: +- Catalog API + summary: Submit a scan for planning + description: > +Submits a scan for server-side planning. + + +Point-in-time scans are planned by passing snapshot-id to identify the +table snapshot to scan. Incremental scans are planned by passing both +start-snapshot-id and end-snapshot-id. Requests that include both point +in time config properties and incremental config properties are +invalid. If the request does not include either incremental or +point-in-time config properties, scan planning should produce a +point-in-time scan of the latest snapshot in the table's main branch. + + +Responses must include a valid status + +- When "completed" the planning operation has produced plan tasks and + file scan tasks that must be returned in the response (not fetched + later by calling fetchPlanningResult) + +- When "submitted" the response must include a plan-id used to poll + fetchPlanningResult to fetch the planning result when it is ready + +- When "failed" the response must be a valid error response + +- Status "cancelled" is not a valid status from this endpoint + + +The response for a "completed" planning operation includes two types of +tasks (file scan tasks and plan tasks) and both may be included in the +response. Tasks must not be included for any other response status. + + +Responses that include a plan-id indicate that the service is holding +state or performing work for the client. + + +- Clients should use the plan-id to fetch results from + fetchPlanningResult when the response status is "submitted" + +- Clients should inform the service if planning results are no longer + needed by calling cancelPlanning. Cancellation is not necessary after + fetchScanTasks has been used to fetch scan tasks for each plan task. 
+ operationId: planTableScan + requestBody: +content: + application/json: +schema: + $ref: '#/components/schemas/PlanTableScanRequest' + responses: +200: + $ref: '#/components/responses/PlanTableScanResponse' +400: + $ref: '#/components/responses/BadRequestErrorResponse' +401: + $ref: '#/components/responses/UnauthorizedResponse' +403: + $ref: '#/components/responses/ForbiddenResponse' +404: + description: +Not Found +- NoSuchTableException, the table does not exist +- NoSuchNamespaceException, the namespace does not exist + content: +application/json: + schema: +$ref: '#/components/schemas/IcebergErrorResponse' + examples: +TableDoesNotExist: + $ref: '#/components/examples/NoSuchTableError' +NamespaceDoesNotExist: + $ref: '#/components/examples/NoSuchNamespaceError' +406: + $ref: '#/components/responses/UnsupportedOperationResponse' +419: + $ref: '#/components/responses/AuthenticationTimeoutResponse' +503: + $ref: '#/components/responses/ServiceUnavailableResponse' +5XX: + $ref: '#/components/responses/ServerErrorResponse' + + /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan/{plan-id}: +parameters: + - $ref: '#/components/parameters/prefix' + - $ref: '#/components/parameters/namespace' + - $ref: '#/components/parameters/table' + - $ref: '#/components/parameters/plan-id' + +get: + tags: +- Catalog API + summary: Fetches the result of scan planning for a plan-id + operationId: fetchPlanningResult + description: > +Fetches the result of scan planning for a plan-id. + + +Responses must include a valid status + +- When "completed" the planning operation has produced plan-tasks and + file-scan-tasks that must be returned in the response + +- When "submitted" the planning operation has not completed; the client + should wait to call this endpoint again to fetch a completed response + +- When "failed" the response must be a valid error response + +- When "cancelled" the plan-id is invalid and should be discarded + + +The response for
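To make the submit-and-poll flow described in this review thread concrete, here is a rough client sketch against the two paths quoted above (`.../tables/{table}/plan` and `.../plan/{plan-id}`). The base URL, auth header, polling interval, and the JSON field names (`status`, `plan-id`, `plan-tasks`, `file-scan-tasks`) are assumptions for illustration, not normative parts of the spec:

```python
# Hypothetical client for planTableScan + fetchPlanningResult as described above.
# Endpoint paths follow the spec excerpt; everything else is assumed for illustration.
import time

import requests

BASE = "https://catalog.example.com/v1/prod/namespaces/db/tables/events"
HEADERS = {"Authorization": "Bearer <token>"}


def plan_table_scan(snapshot_id=None):
    body = {}
    if snapshot_id is not None:
        body["snapshot-id"] = snapshot_id  # point-in-time scan of a specific snapshot
    resp = requests.post(f"{BASE}/plan", json=body, headers=HEADERS)
    resp.raise_for_status()
    result = resp.json()

    # "completed": tasks are returned inline; "submitted": poll the plan-id until done.
    while result.get("status") == "submitted":
        time.sleep(1)
        poll = requests.get(f"{BASE}/plan/{result['plan-id']}", headers=HEADERS)
        poll.raise_for_status()
        result = poll.json()

    if result.get("status") == "failed":
        raise RuntimeError(f"scan planning failed: {result}")
    return result.get("file-scan-tasks", []), result.get("plan-tasks", [])
```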
Re: [PR] fix: Invert `case_sensitive` logic in StructType [iceberg-python]
AnthonyLam commented on PR #1147: URL: https://github.com/apache/iceberg-python/pull/1147#issuecomment-2338752007 > Hello @AnthonyLam, thank you for your contribution. Could you add a test to ensure the expected behavior? For sure! I've added a test in `test_types.py`. Let me know if it's insufficient. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Bump `duckdb` to version `1.1.0` [iceberg-python]
kevinjqliu opened a new pull request, #1149: URL: https://github.com/apache/iceberg-python/pull/1149 Duckdb version 1.1.0 added support for automatic retries when installing extensions (https://github.com/duckdb/duckdb/pull/13122). This will help resolve the intermittent CI issue observed in #787 ``` poetry add duckdb@1.1.0 --optional ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] support python 3.12 [iceberg-python]
kevinjqliu commented on PR #254: URL: https://github.com/apache/iceberg-python/pull/254#issuecomment-2338806960 hey @MehulBatra, can I close this in favor of #1068? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Python: Add support for Python 3.12 [iceberg-python]
kevinjqliu closed pull request #35: Python: Add support for Python 3.12 URL: https://github.com/apache/iceberg-python/pull/35 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Modify SQLCatalog initialization so that classes are not always created and update how these classes are created to be more open to tother DB's [iceberg-python]
isc-patrick commented on issue #1148: URL: https://github.com/apache/iceberg-python/issues/1148#issuecomment-2338829111 My thought was to pass an additional argument into init that allows you to turn off creation of tables. My concern is that I don't think you will want this auto-create to happen on any system in production, as the catalog should already exist, and I really want it to fail fast if it does not exist instead of creating new tables and failing later, like when the system is live. In general, I would never have DDL applied to a DB as part of a dependent application's runtime. So that would mean passing a new boolean into the constructor, say create_if_not_exists: bool = True, before the **properties, and then replacing the _ensure_table_exists method with the create_all conditioned on the new bool. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
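A hypothetical sketch of the proposed constructor flag follows. This is not the actual pyiceberg `SqlCatalog` signature; the class, flag name, and `uri` property handling are stand-ins to show how the DDL would be gated:

```python
# Hypothetical variant of a SQL-backed catalog with a flag gating DDL at init time.
from typing import Any

from sqlalchemy import MetaData, create_engine

catalog_metadata = MetaData()  # stand-in for pyiceberg's SqlCatalogBaseTable.metadata


class SqlCatalogSketch:
    def __init__(self, name: str, create_tables_if_missing: bool = True, **properties: Any) -> None:
        self.name = name
        self.properties = properties
        self.engine = create_engine(properties.get("uri", "sqlite:///:memory:"))
        if create_tables_if_missing:
            # Idempotent: create_all defaults to checkfirst=True.
            catalog_metadata.create_all(self.engine)
        # When the flag is False, the catalog assumes the tables were provisioned by a
        # separate DDL/change-control process and will fail later if they are absent.


catalog = SqlCatalogSketch("default", create_tables_if_missing=False, uri="sqlite:///catalog.db")
```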
Re: [PR] AWS: Add configuration and set better defaults for S3 retry behaviour [iceberg]
ookumuso commented on PR #11052: URL: https://github.com/apache/iceberg/pull/11052#issuecomment-2338851624 Removed all InputStream-related changes in favor of #10433 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Modify SQLCatalog initialization so that classes are not always created and update how these classes are created to be more open to tother DB's [iceberg-python]
sungwy commented on issue #1148: URL: https://github.com/apache/iceberg-python/issues/1148#issuecomment-2338853251 Given that the catalog is instantiated through `load_catalog` entrypoint function, I think we'd actually need to introduce a catalog property as well, if we decide to introduce a new boolean flag to control this behavior. That being said, I'm still not convinced that this is necessary... I think keeping the approach of creating the tables on catalog initialization should result in the following: 1. if the tables exist, the DDL shouldn't be executed 2. if the tables don't exist, they are created 3. if the tables don't exist and the DDL fails to create the tables (e.g. because the DB doesn't allow creation of tables through standard DDLs), the DB should throw an error that should result in the program failing fast. > I really want it to fail fast if it does not exist instead of creating new tables and failing later, like when the system is live I'm not sure if I follow this point @isc-patrick . If we disable the creation of the table through the DDL, we will be failing at the same point, or even later. Since instead of failing at the Catalog initialization (for example by failing to create the tables), we will now be failing at a later part of the code that has assumed that the tables exist, and tries to execute one of the SQL queries for scans or commits. Sorry to probe - could you elaborate on how you expect the proposed approach of suppressing table creation would result in the application failing fast? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Impl rest catalog + table updates & requirements [iceberg-go]
jwtryg opened a new pull request, #146: URL: https://github.com/apache/iceberg-go/pull/146 Hi @zeroshade I think it's really cool that you are working on a golang-iceberg implementation, and I would like to contribute if I can. I have tried to finish the REST catalog implementation, and then I have added table updates and requirements, using a new MetadataBuilder struct that can simplify updating metadata. Like the existing implementation, I have tried to stay close to the PyIceberg implementation. I thought this could be a good starting point for also implementing transactions and table updates. I would love to get your input and change/adjust the implementation accordingly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]
amogh-jahagirdar commented on code in PR #9695: URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750815966 ## open-api/rest-catalog-open-api.yaml: ## @@ -3647,6 +4080,105 @@ components: type: integer description: "List of equality field IDs" +PlanTableScanRequest: + type: object + properties: +snapshot-id: + description: +Identifier for the snapshot to scan in a point-in-time scan + type: integer + format: int64 +select: + description: List of selected schema fields + type: array + items: +$ref: '#/components/schemas/FieldName' +filter: + description: +Expression used to filter the table data + $ref: '#/components/schemas/Expression' +case-sensitive: + description: Enables case sensitive field matching for filter and select + type: boolean + default: true +use-snapshot-schema: + description: +Whether to use the schema at the time the snapshot was written. + +When time travelling, the snapshot schema should be used (true). +When scanning a branch, the table schema should be used (false). + type: boolean + default: false +start-snapshot-id: + description: Starting snapshot ID for an incremental scan (exclusive) + type: integer + format: int64 +end-snapshot-id: + description: +Ending snapshot ID for an incremental scan (inclusive). + +Required when start-snapshot-id is specified. + type: integer + format: int64 +stats-fields: + description: +List of fields for which the service should send column stats. + type: array + items: +$ref: '#/components/schemas/FieldName' + +FieldName: + description: +A full field name (including parent field names), such as those passed +in APIs like Java `Schema#findField(String name)`. + +The nested field name follows these rules +- Nested struct fields are named by concatenating field names at each + struct level using dot (`.`) delimiter, e.g. + employer.contact_info.address.zip_code +- Nested fields in a map key are named using the keyword `key`, e.g. + employee_address_map.key.first_name +- Nested fields in a map value are named using the keyword `value`, + e.g. employee_address_map.value.zip_code +- Nested fields in a list are named using the keyword `element`, e.g. + employees.element.first_name + type: string + +FetchScanTasksRequest: + type: object + required: +- plan-task + properties: +plan-task: + $ref: '#/components/schemas/PlanTask' + +PlanTask: + description: +An opaque string provided by the REST server that represents a +unit of work to produce file scan tasks for scan planning. + type: string Review Comment: I think I largely agree with @flyrain second sentence, that's good to clarify. The one part I think I'd remove is the ``PlanTask` is an opaque string provided by the REST server that indexes file scan tasks.` I don't think the spec should define or even imply that the REST server "indexes" the file scan tasks. It should probably be kept open and implementations can maintain that relationship how they want. I think the current description "An opaque string provided by the REST server that represents a unit of work to produce file scan tasks for scan planning." is good in that it doesn't overspecify anything but describes that the string represents an abstract unit of work which I think is just the right amount of specification. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Modify SQLCatalog initialization so that classes are not always created and update how these classes are created to be more open to tother DB's [iceberg-python]
isc-patrick commented on issue #1148: URL: https://github.com/apache/iceberg-python/issues/1148#issuecomment-2338957758 I had not yet seen the change introducing load_catalog instead of using the SQLCatalog constructor, so I just updated to v0.70 - I was away for a couple of weeks. My point about the creation of the tables comes from work with clients, mostly in Financial Services, where there is a formal separation of code that applies DDL vs DML. The entire process of database change control is separate from any application lifecycle. The idea that an application using a DB for reading and writing data would be allowed to change the schema would just never be considered a possibility. Let's just forget that and chalk it up to different approaches to SDLC. This can easily be handled by adding a wrapper to load_catalog if necessary. I can make the change to simplify the _ensure_tables_exist to simply call the create_all. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
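A sketch of the "wrapper to load_catalog" idea mentioned above, which fails fast when the catalog tables were not provisioned by a separate DDL process. The table name `iceberg_tables` and the presence of a `uri` property are assumptions for illustration, not pyiceberg guarantees:

```python
# Hypothetical fail-fast wrapper around load_catalog for environments where DDL
# is managed outside the application. Table and property names are assumed.
from pyiceberg.catalog import load_catalog
from sqlalchemy import create_engine, inspect


def load_sql_catalog_strict(name: str, **properties: str):
    engine = create_engine(properties["uri"])
    existing = set(inspect(engine).get_table_names())
    if "iceberg_tables" not in existing:
        raise RuntimeError(
            f"Catalog tables missing in {properties['uri']}; "
            "run the catalog DDL through your change-control process first."
        )
    return load_catalog(name, **properties)
```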
Re: [I] Error: `table_type` missing from table parameters when loading table from Hive metastore [iceberg-python]
edgarrmondragon commented on issue #1150: URL: https://github.com/apache/iceberg-python/issues/1150#issuecomment-2338970552 > Who created the table in this case? When PyIceberg creates the table, it injects the `table_type` property I suppose it was created by a third-party and not by `HiveCatalog.create_table`. Are only tables created by pyiceberg supported here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Modify SQLCatalog initialization so that classes are not always created and update how these classes are created to be more open to tother DB's [iceberg-python]
sungwy commented on issue #1148: URL: https://github.com/apache/iceberg-python/issues/1148#issuecomment-2338975789 I work in financial services as well, and actually have the same separation between DDL and DML. But I agree with your point that it would be best not to generalize our industry's practices to development practices in other industries, or at other scales of company (start-ups versus big firms). > I can make the change to simplify the _ensure_tables_exist to simply call the create_all. That would be amazing @isc-patrick. Maybe we can start there, and see if there's still a need to make the other proposed change. Thank you again for raising this issue and leading this discussion! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Error: `table_type` missing from table parameters when loading table from Hive metastore [iceberg-python]
kevinjqliu commented on issue #1150: URL: https://github.com/apache/iceberg-python/issues/1150#issuecomment-2338993156 > Are only tables created by pyiceberg supported here? Anyone can create an iceberg table using HMS, which can be read by PyIceberg. In HMS, the assumption is that iceberg tables have a specific property set so that engines can distinguish between hive and iceberg tables. In this case, the table was created as a "hive table" and not an "iceberg table". -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
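A minimal sketch of the distinction being described, checking whether the HMS table parameters mark a table as Iceberg. The `parameters` dict stands in for the parameter map returned by the Hive metastore, and the case-insensitive comparison is an assumption for illustration:

```python
# Sketch of the table_type check discussed above: a table created as a plain Hive
# table lacks the property, which matches the error reported in this issue.
def is_iceberg_table(parameters: dict) -> bool:
    table_type = parameters.get("table_type", "")
    return table_type.lower() == "iceberg"


hive_style = {"EXTERNAL": "TRUE"}  # created as a plain Hive table
iceberg_style = {"table_type": "ICEBERG", "metadata_location": "s3://bucket/metadata.json"}

assert not is_iceberg_table(hive_style)
assert is_iceberg_table(iceberg_style)
```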
Re: [I] EPIC: Rust Based Compaction [iceberg-rust]
kevinjqliu commented on issue #624: URL: https://github.com/apache/iceberg-rust/issues/624#issuecomment-2339045843 Thanks for starting this! Linking the relevant issue from the pyiceberg side, [iceberg-python/#1092](https://github.com/apache/iceberg-python/issues/1092) Would be great to collaborate on this! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Bump `duckdb` to version `1.1.0` [iceberg-python]
sungwy commented on code in PR #1149: URL: https://github.com/apache/iceberg-python/pull/1149#discussion_r1750905022 ## pyproject.toml: ## @@ -64,7 +64,7 @@ zstandard = ">=0.13.0,<1.0.0" tenacity = ">=8.2.3,<10.0.0" pyarrow = { version = ">=14.0.0,<18.0.0", optional = true } pandas = { version = ">=1.0.0,<3.0.0", optional = true } -duckdb = { version = ">=0.5.0,<2.0.0", optional = true } +duckdb = {version = "1.1.0", optional = true} Review Comment: I'm not sure how I feel about pinning this to a single version 🤔 Is this necessary to resolve the issues with our CI, given that the new version is out, and our upper limit is already <2.0.0? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Bump `duckdb` to version `1.1.0` [iceberg-python]
kevinjqliu commented on code in PR #1149: URL: https://github.com/apache/iceberg-python/pull/1149#discussion_r1750910730 ## pyproject.toml: ## @@ -64,7 +64,7 @@ zstandard = ">=0.13.0,<1.0.0" tenacity = ">=8.2.3,<10.0.0" pyarrow = { version = ">=14.0.0,<18.0.0", optional = true } pandas = { version = ">=1.0.0,<3.0.0", optional = true } -duckdb = { version = ">=0.5.0,<2.0.0", optional = true } +duckdb = {version = "1.1.0", optional = true} Review Comment: Good point, not sure about this one. The CI issue can only be resolved by v1.1.0 or higher. I'm not sure if CI will automatically pick the newest version of the library. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Parallelize manifest writing for many new files [iceberg]
amogh-jahagirdar commented on code in PR #11086: URL: https://github.com/apache/iceberg/pull/11086#discussion_r1750840473 ## core/src/main/java/org/apache/iceberg/SnapshotProducer.java: ## @@ -554,6 +562,84 @@ protected boolean cleanupAfterCommit() { return true; } + protected List writeDataManifests(List files, PartitionSpec spec) { +return writeDataManifests(files, null /* inherit data seq */, spec); + } + + protected List writeDataManifests( + List files, Long dataSeq, PartitionSpec spec) { +return writeManifests(files, group -> writeDataFileGroup(group, dataSeq, spec)); + } + + private List writeDataFileGroup( + List files, Long dataSeq, PartitionSpec spec) { +RollingManifestWriter writer = newRollingManifestWriter(spec); + +try (RollingManifestWriter closableWriter = writer) { + if (dataSeq != null) { +files.forEach(file -> closableWriter.add(file, dataSeq)); + } else { +files.forEach(closableWriter::add); + } +} catch (IOException e) { + throw new RuntimeIOException(e, "Failed to write data manifests"); +} + +return writer.toManifestFiles(); + } + + protected List writeDeleteManifests( + List files, PartitionSpec spec) { +return writeManifests(files, group -> writeDeleteFileGroup(group, spec)); + } + + private List writeDeleteFileGroup( + List files, PartitionSpec spec) { +RollingManifestWriter writer = newRollingDeleteManifestWriter(spec); + +try (RollingManifestWriter closableWriter = writer) { + for (DeleteFileHolder file : files) { +if (file.dataSequenceNumber() != null) { + closableWriter.add(file.deleteFile(), file.dataSequenceNumber()); +} else { + closableWriter.add(file.deleteFile()); +} + } +} catch (IOException e) { + throw new RuntimeIOException(e, "Failed to write delete manifests"); +} + +return writer.toManifestFiles(); + } + + private static List writeManifests( + List files, Function, List> writeFunc) { +int groupSize = manifestFileGroupSize(ThreadPools.WORKER_THREAD_POOL_SIZE, files.size()); +List> groups = Lists.partition(files, groupSize); +Queue manifests = Queues.newConcurrentLinkedQueue(); +Tasks.foreach(groups) +.stopOnFailure() +.throwFailureWhenFinished() +.executeWith(ThreadPools.getWorkerPool()) +.run(group -> manifests.addAll(writeFunc.apply(group))); +return ImmutableList.copyOf(manifests); + } + + /** + * Calculates how many files can be processed concurrently depending on the provided parallelism + * without creating too small manifests. + * + * @param parallelism the max number of concurrent writers + * @param fileCount the total number of files to be written + * @return the size of each file group, adjusted to balance workload across available writers + */ + @VisibleForTesting + static int manifestFileGroupSize(int parallelism, int fileCount) { +int maxParallelism = IntMath.divide(fileCount, MIN_FILE_GROUP_SIZE, RoundingMode.HALF_UP); +int actualParallelism = Math.max(1, Math.min(parallelism, maxParallelism)); Review Comment: I'm actually OK with the current naming of maxParallelism and actualParallelism. Do we want to just inline comment the handling of the cases where there's not a full manifest file group? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
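To illustrate the grouping math under review in the diff quoted above, a small Python transcription follows. The `MIN_FILE_GROUP_SIZE` value and the final even split of files across writers are assumptions about the parts of `manifestFileGroupSize` not shown in the excerpt:

```python
# Rough transcription of the parallelism math quoted in the diff above.
# MIN_FILE_GROUP_SIZE and the final group-size derivation are assumptions here.
import math

MIN_FILE_GROUP_SIZE = 1000  # placeholder; the real constant is not shown in this excerpt


def manifest_file_group_size(parallelism: int, file_count: int) -> int:
    # Half-up rounding, mirroring IntMath.divide(fileCount, MIN_FILE_GROUP_SIZE, HALF_UP).
    max_parallelism = int(file_count / MIN_FILE_GROUP_SIZE + 0.5)
    # Cap by the available writers, but never drop below a single group.
    actual_parallelism = max(1, min(parallelism, max_parallelism))
    # Assumed: split the files evenly across the chosen number of writers.
    return math.ceil(file_count / actual_parallelism)


print(manifest_file_group_size(parallelism=8, file_count=12_500))  # -> 1563
```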
Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]
rahil-c commented on code in PR #9695: URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750944397 ## open-api/rest-catalog-open-api.yaml: ## @@ -541,6 +541,264 @@ paths: 5XX: $ref: '#/components/responses/ServerErrorResponse' + /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan: +parameters: + - $ref: '#/components/parameters/prefix' + - $ref: '#/components/parameters/namespace' + - $ref: '#/components/parameters/table' +post: + tags: +- Catalog API + summary: Submit a scan for planning + description: > +Submits a scan for server-side planning. + + +Point-in-time scans are planned by passing snapshot-id to identify the +table snapshot to scan. Incremental scans are planned by passing both +start-snapshot-id and end-snapshot-id. Requests that include both point +in time config properties and incremental config properties are +invalid. If the request does not include either incremental or +point-in-time config properties, scan planning should produce a +point-in-time scan of the latest snapshot in the table's main branch. + + +Responses must include a valid status + +- When "completed" the planning operation has produced plan tasks and + file scan tasks that must be returned in the response (not fetched + later by calling fetchPlanningResult) + +- When "submitted" the response must include a plan-id used to poll + fetchPlanningResult to fetch the planning result when it is ready + +- When "failed" the response must be a valid error response + +- Status "cancelled" is not a valid status from this endpoint + + +The response for a "completed" planning operation includes two types of +tasks (file scan tasks and plan tasks) and both may be included in the +response. Tasks must not be included for any other response status. + + +Responses that include a plan-id indicate that the service is holding +state or performing work for the client. + + +- Clients should use the plan-id to fetch results from + fetchPlanningResult when the response status is "submitted" + +- Clients should inform the service if planning results are no longer + needed by calling cancelPlanning. Cancellation is not necessary after + fetchScanTasks has been used to fetch scan tasks for each plan task. 
+ operationId: planTableScan + requestBody: +content: + application/json: +schema: + $ref: '#/components/schemas/PlanTableScanRequest' + responses: +200: + $ref: '#/components/responses/PlanTableScanResponse' +400: + $ref: '#/components/responses/BadRequestErrorResponse' +401: + $ref: '#/components/responses/UnauthorizedResponse' +403: + $ref: '#/components/responses/ForbiddenResponse' +404: + description: +Not Found +- NoSuchTableException, the table does not exist +- NoSuchNamespaceException, the namespace does not exist + content: +application/json: + schema: +$ref: '#/components/schemas/IcebergErrorResponse' + examples: +TableDoesNotExist: + $ref: '#/components/examples/NoSuchTableError' +NamespaceDoesNotExist: + $ref: '#/components/examples/NoSuchNamespaceError' +406: + $ref: '#/components/responses/UnsupportedOperationResponse' +419: + $ref: '#/components/responses/AuthenticationTimeoutResponse' +503: + $ref: '#/components/responses/ServiceUnavailableResponse' +5XX: + $ref: '#/components/responses/ServerErrorResponse' + + /v1/{prefix}/namespaces/{namespace}/tables/{table}/plan/{plan-id}: +parameters: + - $ref: '#/components/parameters/prefix' + - $ref: '#/components/parameters/namespace' + - $ref: '#/components/parameters/table' + - $ref: '#/components/parameters/plan-id' + +get: + tags: +- Catalog API + summary: Fetches the result of scan planning for a plan-id + operationId: fetchPlanningResult + description: > +Fetches the result of scan planning for a plan-id. + + +Responses must include a valid status + +- When "completed" the planning operation has produced plan-tasks and + file-scan-tasks that must be returned in the response + +- When "submitted" the planning operation has not completed; the client + should wait to call this endpoint again to fetch a completed response + +- When "failed" the response must be a valid error response + +- When "cancelled" the plan-id is invalid and should be discarded + + +The response for a "
Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]
rahil-c commented on code in PR #9695: URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750946862 ## open-api/rest-catalog-open-api.yaml: ## @@ -2774,6 +3062,140 @@ components: additionalProperties: type: string +ScanTasks: + type: object + description: +Scan and planning tasks for server-side scan planning + + +- `plan-tasks` contains opaque units of planning work Review Comment: Will add `>` to the description. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]
flyrain commented on code in PR #9695: URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750948304 ## open-api/rest-catalog-open-api.yaml: ## @@ -3647,6 +4080,105 @@ components: type: integer description: "List of equality field IDs" +PlanTableScanRequest: + type: object + properties: +snapshot-id: + description: +Identifier for the snapshot to scan in a point-in-time scan + type: integer + format: int64 +select: + description: List of selected schema fields + type: array + items: +$ref: '#/components/schemas/FieldName' +filter: + description: +Expression used to filter the table data + $ref: '#/components/schemas/Expression' +case-sensitive: + description: Enables case sensitive field matching for filter and select + type: boolean + default: true +use-snapshot-schema: + description: +Whether to use the schema at the time the snapshot was written. + +When time travelling, the snapshot schema should be used (true). +When scanning a branch, the table schema should be used (false). + type: boolean + default: false +start-snapshot-id: + description: Starting snapshot ID for an incremental scan (exclusive) + type: integer + format: int64 +end-snapshot-id: + description: +Ending snapshot ID for an incremental scan (inclusive). + +Required when start-snapshot-id is specified. + type: integer + format: int64 +stats-fields: + description: +List of fields for which the service should send column stats. + type: array + items: +$ref: '#/components/schemas/FieldName' + +FieldName: + description: +A full field name (including parent field names), such as those passed +in APIs like Java `Schema#findField(String name)`. + +The nested field name follows these rules +- Nested struct fields are named by concatenating field names at each + struct level using dot (`.`) delimiter, e.g. + employer.contact_info.address.zip_code +- Nested fields in a map key are named using the keyword `key`, e.g. + employee_address_map.key.first_name +- Nested fields in a map value are named using the keyword `value`, + e.g. employee_address_map.value.zip_code +- Nested fields in a list are named using the keyword `element`, e.g. + employees.element.first_name + type: string + +FetchScanTasksRequest: + type: object + required: +- plan-task + properties: +plan-task: + $ref: '#/components/schemas/PlanTask' + +PlanTask: + description: +An opaque string provided by the REST server that represents a +unit of work to produce file scan tasks for scan planning. + type: string Review Comment: Makes sense to avoid detailed approach in the spec, but I'd suggest to add more details on why we need a plan task string. ``` This allows clients to fetch tasks in batches, which is useful for handling large result sets efficiently. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Bump arrow to 53 [iceberg-rust]
sdd commented on issue #622: URL: https://github.com/apache/iceberg-rust/issues/622#issuecomment-2339158328 This is blocked on DataFusion. The most recently released version of DataFusion, v41.0.0, depends on arrow-* 52. Once https://github.com/apache/datafusion/pull/12032 is in a released version, we can proceed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]
amogh-jahagirdar commented on code in PR #9695: URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750965732 ## open-api/rest-catalog-open-api.yaml: ## @@ -3647,6 +4080,105 @@ components: type: integer description: "List of equality field IDs" +PlanTableScanRequest: + type: object + properties: +snapshot-id: + description: +Identifier for the snapshot to scan in a point-in-time scan + type: integer + format: int64 +select: + description: List of selected schema fields + type: array + items: +$ref: '#/components/schemas/FieldName' +filter: + description: +Expression used to filter the table data + $ref: '#/components/schemas/Expression' +case-sensitive: + description: Enables case sensitive field matching for filter and select + type: boolean + default: true +use-snapshot-schema: + description: +Whether to use the schema at the time the snapshot was written. + +When time travelling, the snapshot schema should be used (true). Review Comment: @flyrain All good! To clarify, the distinction in the schema that gets used during time travel is something that got established in the reference implementation but is not actually defined in the spec itself. As for the rationale behind this behavior in the reference implementation please refer to https://github.com/apache/iceberg/pull/9131/files. I don't think I'd spec'ing out which ones clients should use in which situation because the schema resolution behavior is not in the table spec itself. I do think it's probably beneficial to provide some more context as to why this field exists, which is to enable clients to align with the reference implementation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Bump duckdb from 1.0.0 to 1.1.0 [iceberg-python]
dependabot[bot] opened a new pull request, #1152: URL: https://github.com/apache/iceberg-python/pull/1152 Bumps [duckdb](https://github.com/duckdb/duckdb) from 1.0.0 to 1.1.0. Release notes Sourced from https://github.com/duckdb/duckdb/releases";>duckdb's releases. DuckDB 1.1.0 "Eatoni" This release of DuckDB is named "Eatoni" after Eaton's pintail (Anas Eatoni) from the southern Indian Ocean. Please also refer to the announcement blog post: https://duckdb.org/2024/09/09/announcing-duckdb-110";>https://duckdb.org/2024/09/09/announcing-duckdb-110 What's Changed Add feature changes back in by https://github.com/Mytherin";>@Mytherin in https://redirect.github.com/duckdb/duckdb/pull/11146";>duckdb/duckdb#11146 Make MultiFileReader filename configurable by https://github.com/lnkuiper";>@lnkuiper in https://redirect.github.com/duckdb/duckdb/pull/11178";>duckdb/duckdb#11178 [Dev] Fix compilation issues on feature by https://github.com/Tishj";>@Tishj in https://redirect.github.com/duckdb/duckdb/pull/11082";>duckdb/duckdb#11082 add query() and query_table() functions by https://github.com/chrisiou";>@chrisiou in https://redirect.github.com/duckdb/duckdb/pull/10586";>duckdb/duckdb#10586 [Block Size] Move the block allocation size into the block manager by https://github.com/taniabogatsch";>@taniabogatsch in https://redirect.github.com/duckdb/duckdb/pull/11176";>duckdb/duckdb#11176 LIMIT pushdown below PROJECT by https://github.com/jeewonhh";>@jeewonhh in https://redirect.github.com/duckdb/duckdb/pull/2";>duckdb/duckdb#2 BUGFIX: IN () filter with one argument should translate to = filter. by https://github.com/Tmonster";>@Tmonster in https://redirect.github.com/duckdb/duckdb/pull/11473";>duckdb/duckdb#11473 Regression Script should calculate micro benchmark differences with the correct base branch by https://github.com/Tmonster";>@Tmonster in https://redirect.github.com/duckdb/duckdb/pull/11762";>duckdb/duckdb#11762 Pushdown filters on window partitions by https://github.com/Tmonster";>@Tmonster in https://redirect.github.com/duckdb/duckdb/pull/10932";>duckdb/duckdb#10932 Arrow ListView Type by https://github.com/Tishj";>@Tishj in https://redirect.github.com/duckdb/duckdb/pull/10766";>duckdb/duckdb#10766 Add scalar function support to the C API by https://github.com/Mytherin";>@Mytherin in https://redirect.github.com/duckdb/duckdb/pull/11786";>duckdb/duckdb#11786 Add TopN optimization in physical plan mapping by https://github.com/kryonix";>@kryonix in https://redirect.github.com/duckdb/duckdb/pull/11290";>duckdb/duckdb#11290 Join-dependent filter derivation by https://github.com/lnkuiper";>@lnkuiper in https://redirect.github.com/duckdb/duckdb/pull/11272";>duckdb/duckdb#11272 Implement ROW_GROUPS_PER_FILE for Parquet by https://github.com/lnkuiper";>@lnkuiper in https://redirect.github.com/duckdb/duckdb/pull/11249";>duckdb/duckdb#11249 Prefer Final projected columns on probe side if cardinalities are similar by https://github.com/Tmonster";>@Tmonster in https://redirect.github.com/duckdb/duckdb/pull/11109";>duckdb/duckdb#11109 Propagate unused columns to distinct on by https://github.com/Tmonster";>@Tmonster in https://redirect.github.com/duckdb/duckdb/pull/11006";>duckdb/duckdb#11006 Separate eviction queues by FileBufferType by https://github.com/lnkuiper";>@lnkuiper in https://redirect.github.com/duckdb/duckdb/pull/11417";>duckdb/duckdb#11417 Disable false positive for vector size nightly in test by https://github.com/taniabogatsch";>@taniabogatsch in 
https://redirect.github.com/duckdb/duckdb/pull/11953";>duckdb/duckdb#11953 Rework jemalloc extension by https://github.com/lnkuiper";>@lnkuiper in https://redirect.github.com/duckdb/duckdb/pull/11891";>duckdb/duckdb#11891 Tweak jemalloc config by https://github.com/lnkuiper";>@lnkuiper in https://redirect.github.com/duckdb/duckdb/pull/12034";>duckdb/duckdb#12034 Httpfs test to nightly by https://github.com/carlopi";>@carlopi in https://redirect.github.com/duckdb/duckdb/pull/12196";>duckdb/duckdb#12196 Removed three reinterpret casts and some rewriting by https://github.com/taniabogatsch";>@taniabogatsch in https://redirect.github.com/duckdb/duckdb/pull/12200";>duckdb/duckdb#12200 Begin Profiling Rework to move towards Modularity by https://github.com/maiadegraaf";>@maiadegraaf in https://redirect.github.com/duckdb/duckdb/pull/11101";>duckdb/duckdb#11101 [CLI] Add highlighting + limited auto-complete for shell dot commands by https://github.com/Mytherin";>@Mytherin in https://redirect.github.com/duckdb/duckdb/pull/12201";>duckdb/duckdb#12201 Skip test to fix block size nightly and add more explicit error checking by https://github.com/taniabogatsch";>@taniabogatsch in https://redirect.github.com/duckdb/duckdb/pull/12211";>duckdb/duckdb#12211 Remove BLOCK_ALLOC_SIZE from the column segment files by https://github.c
[PR] Bump pydantic from 2.9.0 to 2.9.1 [iceberg-python]
dependabot[bot] opened a new pull request, #1154: URL: https://github.com/apache/iceberg-python/pull/1154 Bumps [pydantic](https://github.com/pydantic/pydantic) from 2.9.0 to 2.9.1. Release notes Sourced from https://github.com/pydantic/pydantic/releases";>pydantic's releases. v2.9.1 (2024-09-09) What's Changed Fixes Fix Predicate issue in v2.9.0 by https://github.com/sydney-runkle";>@sydney-runkle in https://redirect.github.com/pydantic/pydantic/pull/10321";>#10321 Fixing annotated-types bound to >=0.6.0 by https://github.com/sydney-runkle";>@sydney-runkle in https://redirect.github.com/pydantic/pydantic/pull/10327";>#10327 Turn tzdata install requirement into optional timezone dependency by https://github.com/jakob-keller";>@jakob-keller in https://redirect.github.com/pydantic/pydantic/pull/10331";>#10331 Fix IncExc type alias definition by https://github.com/Viicos";>@Viicos in https://redirect.github.com/pydantic/pydantic/pull/10339";>#10339 Use correct types namespace when building namedtuple core schemas by https://github.com/Viicos";>@Viicos in https://redirect.github.com/pydantic/pydantic/pull/10337";>#10337 Fix evaluation of stringified annotations during namespace inspection by https://github.com/Viicos";>@Viicos in https://redirect.github.com/pydantic/pydantic/pull/10347";>#10347 Fix tagged union serialization with alias generators by https://github.com/sydney-runkle";>@sydney-runkle in https://redirect.github.com/pydantic/pydantic-core/pull/1442";>pydantic/pydantic-core#1442 Full Changelog: https://github.com/pydantic/pydantic/compare/v2.9.0...v2.9.1";>https://github.com/pydantic/pydantic/compare/v2.9.0...v2.9.1 Commits https://github.com/pydantic/pydantic/commit/ecc5275d01e3d8de15c3641d35eb5151f5778833";>ecc5275 bump https://github.com/pydantic/pydantic/commit/2c61bfda43e67b8308f86c77ae4121f447f134dd";>2c61bfd Fix evaluation of stringified annotations during namespace inspection (https://redirect.github.com/pydantic/pydantic/issues/10347";>#10347) https://github.com/pydantic/pydantic/commit/3d364cbf994bc6676b8419b8ad588d4d49ab2f29";>3d364cb Use correct types namespace when building namedtuple core schemas (https://redirect.github.com/pydantic/pydantic/issues/10337";>#10337) https://github.com/pydantic/pydantic/commit/2746ccba230b47d279ed5aa4e4831bbdba60ad70";>2746ccb Fix IncEx type alias definition (https://redirect.github.com/pydantic/pydantic/issues/10339";>#10339) https://github.com/pydantic/pydantic/commit/b32d4109675316912b99a7f4fc56dcbf2c73840c";>b32d410 Turn tzdata install requirement into optional timezone dependency (https://redirect.github.com/pydantic/pydantic/issues/10331";>#10331) https://github.com/pydantic/pydantic/commit/7d857eb89c4f3c0389f8e12d83f14c89fab75f37";>7d857eb Fixing annotated-types bound (https://redirect.github.com/pydantic/pydantic/issues/10327";>#10327) https://github.com/pydantic/pydantic/commit/07cbe50fa0a7d217d8382f79c43d02201d25a4fe";>07cbe50 Fix Predicate issue in v2.9.0 (https://redirect.github.com/pydantic/pydantic/issues/10321";>#10321) See full diff in https://github.com/pydantic/pydantic/compare/v2.9.0...v2.9.1";>compare view [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. 
[//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this mino
Re: [PR] Bump `duckdb` to version `1.1.0` [iceberg-python]
kevinjqliu commented on PR #1149: URL: https://github.com/apache/iceberg-python/pull/1149#issuecomment-2339300626 Updated the PR to only change duckdb from v1.0.0 to v1.1.0 in the poetry.lock file -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]
amogh-jahagirdar commented on code in PR #9695: URL: https://github.com/apache/iceberg/pull/9695#discussion_r1750965732 ## open-api/rest-catalog-open-api.yaml: ## @@ -3647,6 +4080,105 @@ components: type: integer description: "List of equality field IDs" +PlanTableScanRequest: + type: object + properties: +snapshot-id: + description: +Identifier for the snapshot to scan in a point-in-time scan + type: integer + format: int64 +select: + description: List of selected schema fields + type: array + items: +$ref: '#/components/schemas/FieldName' +filter: + description: +Expression used to filter the table data + $ref: '#/components/schemas/Expression' +case-sensitive: + description: Enables case sensitive field matching for filter and select + type: boolean + default: true +use-snapshot-schema: + description: +Whether to use the schema at the time the snapshot was written. + +When time travelling, the snapshot schema should be used (true). Review Comment: @flyrain All good! To clarify, the distinction in the schema that gets used during time travel is a convention that got established in the reference implementation but is not actually defined in the spec itself. As for the rationale behind this behavior in the reference implementation please refer to https://github.com/apache/iceberg/pull/9131/files. I don't think I'd spec out which ones clients should use in which situation because the schema resolution behavior is not defined in the table spec itself. I do think it's probably beneficial to provide some more context as to why this field exists, which is to enable clients to align with the reference implementation. Edit: A concrete suggestion on a description that provides some context: ``` This flag is useful when a client wants to read from a branch and use the table schema or time travel to a specific a snapshot and use the snapshot schema (aligning with the reference implementation) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Cache Manifest files [iceberg-python]
corleyma commented on code in PR #787: URL: https://github.com/apache/iceberg-python/pull/787#discussion_r1751060909 ## pyiceberg/table/snapshots.py: ## @@ -228,6 +229,13 @@ def __eq__(self, other: Any) -> bool: ) +@lru_cache Review Comment: is 128 big enough? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
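For reference on the question above: a bare `@lru_cache` uses the functools default of `maxsize=128`, which is the 128 being asked about, and the bound can be raised or removed by passing `maxsize` explicitly. A minimal standalone sketch (not pyiceberg code) showing the difference:

```python
from functools import lru_cache


@lru_cache  # defaults to maxsize=128: the 129th distinct key evicts the least recently used entry
def parse_default(value: str) -> str:
    return value.upper()


@lru_cache(maxsize=None)  # unbounded: nothing is evicted, memory grows with distinct keys
def parse_unbounded(value: str) -> str:
    return value.upper()


for i in range(200):
    parse_default(str(i))
    parse_unbounded(str(i))

print(parse_default.cache_info())    # hits=0, misses=200, maxsize=128, currsize=128
print(parse_unbounded.cache_info())  # hits=0, misses=200, maxsize=None, currsize=200
```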
Re: [PR] Aliyun: Migrate tests to junit5 for aliyun client factory [iceberg]
github-actions[bot] commented on PR #7853: URL: https://github.com/apache/iceberg/pull/7853#issuecomment-2339368891 This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Aliyun: Migrate tests to junit5 for aliyun client factory [iceberg]
github-actions[bot] closed pull request #7853: Aliyun: Migrate tests to junit5 for aliyun client factory URL: https://github.com/apache/iceberg/pull/7853 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: Convert Evaluator in expression to use a Comparator [iceberg]
github-actions[bot] closed pull request #7883: API: Convert Evaluator in expression to use a Comparator URL: https://github.com/apache/iceberg/pull/7883 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API: Convert Evaluator in expression to use a Comparator [iceberg]
github-actions[bot] commented on PR #7883: URL: https://github.com/apache/iceberg/pull/7883#issuecomment-2339368920 This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Iceberg Java Api - S3 Session Token - 403 Forbidden exception [iceberg]
github-actions[bot] commented on issue #8190: URL: https://github.com/apache/iceberg/issues/8190#issuecomment-2339369252 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: check location for conflict before creating table [iceberg]
github-actions[bot] commented on PR #8194: URL: https://github.com/apache/iceberg/pull/8194#issuecomment-2339369273 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Request to add KLL Datasketch and hive ColumnStatisticsObj and as standard blob types to puffin file. [iceberg]
github-actions[bot] commented on issue #8198: URL: https://github.com/apache/iceberg/issues/8198#issuecomment-2339369316 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Support partial insert in merge into command [iceberg]
github-actions[bot] commented on issue #8199: URL: https://github.com/apache/iceberg/issues/8199#issuecomment-2339369339 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Add KLL Datasketch and Hive ColumnStatisticsObj as standard blo… [iceberg]
github-actions[bot] commented on PR #8202: URL: https://github.com/apache/iceberg/pull/8202#issuecomment-2339369354 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Parallelize manifest writing for many new files [iceberg]
dramaticlly commented on code in PR #11086: URL: https://github.com/apache/iceberg/pull/11086#discussion_r1751080154 ## core/src/test/java/org/apache/iceberg/TestSnapshotProducer.java: ## @@ -0,0 +1,68 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +import org.junit.jupiter.api.Test; + +public class TestSnapshotProducer { + @Test + public void testManifestFileGroupSize() { +assertEquals( +1_000, +SnapshotProducer.manifestFileGroupSize(4 /* parallelism */, 1_000 /* file count */), +"Must return file count when it is small"); + +assertEquals( +10_000, +SnapshotProducer.manifestFileGroupSize(4 /* parallelism */, 10_000 /* file count */), +"Must return file count when it matches the min group size"); + +assertEquals( +10_001, +SnapshotProducer.manifestFileGroupSize(4 /* parallelism */, 10_001 /* file count */), +"Must return file count when it is slightly above min group size"); + +assertEquals( +12_500, +SnapshotProducer.manifestFileGroupSize(4 /* parallelism */, 12_500 /* file count */), +"Must return file count when remainder is < 50% of min group size"); + +assertEquals( +7_500, +SnapshotProducer.manifestFileGroupSize(4 /* parallelism */, 15_000 /* file count */), +"Must split into 2 groups when remainder is >= 50% of min group size"); + +assertEquals( +16_667, +SnapshotProducer.manifestFileGroupSize(3 /* parallelism */, 50_000 /* file count */), +"Must split into 3 groups when file count is large"); + +assertEquals( +166_667, +SnapshotProducer.manifestFileGroupSize(3 /* parallelism */, 500_000 /* file count */), +"Provided parallelism must be respected when file count is large"); + +assertEquals( +10_000, +SnapshotProducer.manifestFileGroupSize(32 /* parallelism */, 50_000 /* file count */), +"Max parallelism based on file count must be respected"); Review Comment: this read a bit weird, I am thinking maybe `Provided parallelism will be adjusted if not enough files in each group` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[I] RESTSessionCatalog loadTable incorrect set last access time as metadata last update time [iceberg]
haizhou-zhao opened a new issue, #11103: URL: https://github.com/apache/iceberg/issues/11103 ### Apache Iceberg version 1.6.1 (latest release) ### Query engine Spark ### Please describe the bug 🐞 ## Background When using Spark (or a similar execution engine) on top of a REST Catalog to commit a new snapshot, there are 3 critical timestamps: 1. When Spark invokes `SnapshotProducer` to produce a new snapshot - referred to as `snapshotCreationTs`: [ref](https://github.com/apache/iceberg/blob/d17a7f1/core/src/main/java/org/apache/iceberg/SnapshotProducer.java#L386) 2. When the REST catalog receives the update request from Spark and commits new metadata - referred to as `metadataCommitTs` 3. When the commit is finished and Spark refreshes the table state - referred to as `tableAccessTs`: [ref](https://github.com/apache/iceberg/blob/afda8be/core/src/main/java/org/apache/iceberg/rest/responses/LoadTableResponse.java#L64) ## Desired behavior `spark.sql("SELECT * from ${db}.${table}.metadata_log_entries")` means the user is looking for `metadataCommitTs`, while `spark.sql("SELECT * from ${db}.${table}.snapshots")` means the user is looking for `snapshotCreationTs`. ## Issue 1 Traditionally, with the Hadoop and Hive Catalogs, Spark (using the Iceberg client) is responsible for both generating new snapshots and committing the new metadata files, and those catalogs are capable of enforcing that `snapshotCreationTs` and `metadataCommitTs` take on the exact same value for every new snapshot committed. However, with a REST Catalog, Spark only controls new snapshot generation while the REST Catalog server controls the metadata commit, so depending on the REST Catalog implementation (which is not controllable by this repo), `snapshotCreationTs` and `metadataCommitTs` may or may not be the same. The current implementation in some cases assumes that the two timestamps take on the exact same value, for example in this integration test: [ref](https://github.com/apache/iceberg/blob/2240154/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMetadataTables.java#L446). It would be best to clarify whether a valid REST Catalog implementation should always enforce that `snapshotCreationTs` and `metadataCommitTs` take on the same value. For the current reference implementation (RESTCatalogAdapter on top of JdbcCatalog), the answer is positive (they take on the same value). However, should it be (or can it feasibly be) enforced for all REST Catalog implementations? ## Issue 2 Currently, when loading a table from a REST Catalog using LoadTableResponse, the `lastUpdatedMillis` attribute of the metadata (which may take the value of `metadataCommitTs` or `snapshotCreationTs` on the REST Catalog side, depending on implementation details) is incorrectly replaced by `tableAccessTs` ([ref1](https://github.com/apache/iceberg/blob/afda8be/core/src/main/java/org/apache/iceberg/rest/responses/LoadTableResponse.java#L64), [ref2](https://github.com/apache/iceberg/blob/113c6e7/core/src/main/java/org/apache/iceberg/TableMetadata.java#L938)). Because Spark depends on `lastUpdatedMillis` to generate the latest `metadataCommitTs` in `metadata_log_entries` ([ref](https://github.com/apache/iceberg/blob/8a70fe0/core/src/main/java/org/apache/iceberg/MetadataLogEntriesTable.java#L66)), there will always be a wrong timestamp in `metadata_log_entries` if a REST Catalog is used.
### Willingness to contribute - [ ] I can contribute a fix for this bug independently - [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community - [ ] I cannot contribute a fix for this bug at this time -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Adding RESTCatalog based Spark Smoke Test [iceberg]
haizhou-zhao commented on issue #11079: URL: https://github.com/apache/iceberg/issues/11079#issuecomment-2339395130 Found several issues while attempting to enable integration test based on REST Catalog for Spark. Listing them below as I troubleshoot them one by one. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]
rahil-c commented on code in PR #9695: URL: https://github.com/apache/iceberg/pull/9695#discussion_r1751106398 ## open-api/rest-catalog-open-api.yaml: ## @@ -3647,6 +4080,105 @@ components: type: integer description: "List of equality field IDs" +PlanTableScanRequest: + type: object + properties: +snapshot-id: + description: +Identifier for the snapshot to scan in a point-in-time scan + type: integer + format: int64 +select: + description: List of selected schema fields + type: array + items: +$ref: '#/components/schemas/FieldName' +filter: + description: +Expression used to filter the table data + $ref: '#/components/schemas/Expression' +case-sensitive: + description: Enables case sensitive field matching for filter and select + type: boolean + default: true +use-snapshot-schema: + description: +Whether to use the schema at the time the snapshot was written. + +When time travelling, the snapshot schema should be used (true). Review Comment: Thanks for your insights@amogh-jahagirdar @flyrain However I actually think the current description for this is pretty straightforward and do not think we need to explain in the spec the reference implementation context around "snapshot schema" . cc @rdblue @danielcweeks if you think we should clarify this further or keep as is. ## open-api/rest-catalog-open-api.yaml: ## @@ -3647,6 +4080,105 @@ components: type: integer description: "List of equality field IDs" +PlanTableScanRequest: + type: object + properties: +snapshot-id: + description: +Identifier for the snapshot to scan in a point-in-time scan + type: integer + format: int64 +select: + description: List of selected schema fields + type: array + items: +$ref: '#/components/schemas/FieldName' +filter: + description: +Expression used to filter the table data + $ref: '#/components/schemas/Expression' +case-sensitive: + description: Enables case sensitive field matching for filter and select + type: boolean + default: true +use-snapshot-schema: + description: +Whether to use the schema at the time the snapshot was written. + +When time travelling, the snapshot schema should be used (true). Review Comment: Thanks for your insights @amogh-jahagirdar @flyrain However I actually think the current description for this is pretty straightforward and do not think we need to explain in the spec the reference implementation context around "snapshot schema" . cc @rdblue @danielcweeks if you think we should clarify this further or keep as is. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Parallelize manifest writing for many new files [iceberg]
karuppayya commented on code in PR #11086: URL: https://github.com/apache/iceberg/pull/11086#discussion_r1751070837 ## core/src/main/java/org/apache/iceberg/SnapshotProducer.java: ## @@ -554,6 +562,84 @@ protected boolean cleanupAfterCommit() { return true; } + protected List writeDataManifests(List files, PartitionSpec spec) { +return writeDataManifests(files, null /* inherit data seq */, spec); + } + + protected List writeDataManifests( + List files, Long dataSeq, PartitionSpec spec) { +return writeManifests(files, group -> writeDataFileGroup(group, dataSeq, spec)); + } + + private List writeDataFileGroup( + List files, Long dataSeq, PartitionSpec spec) { +RollingManifestWriter writer = newRollingManifestWriter(spec); + +try (RollingManifestWriter closableWriter = writer) { + if (dataSeq != null) { +files.forEach(file -> closableWriter.add(file, dataSeq)); + } else { +files.forEach(closableWriter::add); + } +} catch (IOException e) { + throw new RuntimeIOException(e, "Failed to write data manifests"); +} + +return writer.toManifestFiles(); + } + + protected List writeDeleteManifests( + List files, PartitionSpec spec) { +return writeManifests(files, group -> writeDeleteFileGroup(group, spec)); + } + + private List writeDeleteFileGroup( + List files, PartitionSpec spec) { +RollingManifestWriter writer = newRollingDeleteManifestWriter(spec); + +try (RollingManifestWriter closableWriter = writer) { + for (DeleteFileHolder file : files) { +if (file.dataSequenceNumber() != null) { + closableWriter.add(file.deleteFile(), file.dataSequenceNumber()); +} else { + closableWriter.add(file.deleteFile()); +} + } +} catch (IOException e) { + throw new RuntimeIOException(e, "Failed to write delete manifests"); +} + +return writer.toManifestFiles(); + } + + private static List writeManifests( + List files, Function, List> writeFunc) { +int groupSize = manifestFileGroupSize(ThreadPools.WORKER_THREAD_POOL_SIZE, files.size()); +List> groups = Lists.partition(files, groupSize); +Queue manifests = Queues.newConcurrentLinkedQueue(); +Tasks.foreach(groups) +.stopOnFailure() +.throwFailureWhenFinished() +.executeWith(ThreadPools.getWorkerPool()) +.run(group -> manifests.addAll(writeFunc.apply(group))); +return ImmutableList.copyOf(manifests); + } + + /** + * Calculates how many files can be processed concurrently depending on the provided parallelism Review Comment: Slight change in behavior from old code 1. I think it is the prior code we would have only one(the last) that is not sized rightly. In this case, we would have more(equal to the parallelism) of not rightly sized manifest. 2. The max number of datafile in any manifest is 10_000 ```private static final int MIN_FILE_GROUP_SIZE = 10_000;``` I guess 1 is ok Should 2 be configurable? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
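To make the grouping behavior under discussion easier to follow, here is a small Python sketch that reproduces the arithmetic implied by the quoted test expectations (a 10,000-file minimum group size, with the effective parallelism capped by the file count). It is a reconstruction from the test values, not the PR's actual Java code, which may differ in detail.

```python
def manifest_file_group_size(parallelism: int, file_count: int, min_group_size: int = 10_000) -> int:
    """Rough reconstruction of the group-size logic implied by TestSnapshotProducer."""
    # How many groups the file count can support: round half-up, never below 1.
    max_parallelism = max(1, (2 * file_count + min_group_size) // (2 * min_group_size))
    group_count = min(parallelism, max_parallelism)
    # Spread files evenly over the groups, again rounding half-up.
    return (2 * file_count + group_count) // (2 * group_count)


# Values taken from the quoted test cases
assert manifest_file_group_size(4, 1_000) == 1_000     # small counts stay in one group
assert manifest_file_group_size(4, 12_500) == 12_500   # remainder < 50% of min group size
assert manifest_file_group_size(4, 15_000) == 7_500    # remainder >= 50%, split into 2 groups
assert manifest_file_group_size(3, 50_000) == 16_667   # provided parallelism respected
assert manifest_file_group_size(32, 50_000) == 10_000  # parallelism capped by file count
```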
Re: [PR] Core: Parallelize manifest writing for many new files [iceberg]
karuppayya commented on code in PR #11086: URL: https://github.com/apache/iceberg/pull/11086#discussion_r1751128701 ## core/src/main/java/org/apache/iceberg/SnapshotProducer.java: ## @@ -554,6 +562,84 @@ protected boolean cleanupAfterCommit() { return true; } + protected List writeDataManifests(List files, PartitionSpec spec) { +return writeDataManifests(files, null /* inherit data seq */, spec); + } + + protected List writeDataManifests( + List files, Long dataSeq, PartitionSpec spec) { +return writeManifests(files, group -> writeDataFileGroup(group, dataSeq, spec)); + } + + private List writeDataFileGroup( + List files, Long dataSeq, PartitionSpec spec) { +RollingManifestWriter writer = newRollingManifestWriter(spec); + +try (RollingManifestWriter closableWriter = writer) { + if (dataSeq != null) { +files.forEach(file -> closableWriter.add(file, dataSeq)); + } else { +files.forEach(closableWriter::add); + } +} catch (IOException e) { + throw new RuntimeIOException(e, "Failed to write data manifests"); +} + +return writer.toManifestFiles(); + } + + protected List writeDeleteManifests( + List files, PartitionSpec spec) { +return writeManifests(files, group -> writeDeleteFileGroup(group, spec)); + } + + private List writeDeleteFileGroup( + List files, PartitionSpec spec) { +RollingManifestWriter writer = newRollingDeleteManifestWriter(spec); + +try (RollingManifestWriter closableWriter = writer) { + for (DeleteFileHolder file : files) { +if (file.dataSequenceNumber() != null) { + closableWriter.add(file.deleteFile(), file.dataSequenceNumber()); +} else { + closableWriter.add(file.deleteFile()); +} + } +} catch (IOException e) { + throw new RuntimeIOException(e, "Failed to write delete manifests"); +} + +return writer.toManifestFiles(); + } + + private static List writeManifests( + List files, Function, List> writeFunc) { +int groupSize = manifestFileGroupSize(ThreadPools.WORKER_THREAD_POOL_SIZE, files.size()); +List> groups = Lists.partition(files, groupSize); +Queue manifests = Queues.newConcurrentLinkedQueue(); +Tasks.foreach(groups) +.stopOnFailure() +.throwFailureWhenFinished() Review Comment: Do we want to do any clean up on failures? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Scan Planning Endpoints to open api spec [iceberg]
amogh-jahagirdar commented on code in PR #9695: URL: https://github.com/apache/iceberg/pull/9695#discussion_r1751149711 ## open-api/rest-catalog-open-api.yaml: ## @@ -3647,6 +4080,105 @@ components: type: integer description: "List of equality field IDs" +PlanTableScanRequest: + type: object + properties: +snapshot-id: + description: +Identifier for the snapshot to scan in a point-in-time scan + type: integer + format: int64 +select: + description: List of selected schema fields + type: array + items: +$ref: '#/components/schemas/FieldName' +filter: + description: +Expression used to filter the table data + $ref: '#/components/schemas/Expression' +case-sensitive: + description: Enables case sensitive field matching for filter and select + type: boolean + default: true +use-snapshot-schema: + description: +Whether to use the schema at the time the snapshot was written. + +When time travelling, the snapshot schema should be used (true). Review Comment: It's OK as is, I do think adding that historical context from the reference implementation is valuable for readers because without it the utility of the client side flag is unclear without digging through PRs or discussion threads. Def not a blocker imo since it's largely describing context and not prescribing behavior, if others think this context is useful I think we can address this in a follow. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] [API + Avro] Add default value APIs and Avro implementation [iceberg]
manuzhang commented on code in PR #9502: URL: https://github.com/apache/iceberg/pull/9502#discussion_r1751206372 ## core/src/test/java/org/apache/iceberg/avro/TestReadDefaultValues.java: ## @@ -0,0 +1,237 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.avro; + +import static org.apache.iceberg.types.Types.NestedField.optional; +import static org.apache.iceberg.types.Types.NestedField.required; + +import java.io.File; +import java.io.IOException; +import java.util.List; +import java.util.Map; +import org.apache.avro.generic.GenericData.Record; +import org.apache.iceberg.Files; +import org.apache.iceberg.Schema; +import org.apache.iceberg.SingleValueParser; +import org.apache.iceberg.data.GenericRecord; +import org.apache.iceberg.io.FileAppender; +import org.apache.iceberg.relocated.com.google.common.collect.Lists; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.types.Type; +import org.apache.iceberg.types.Types; +import org.junit.Assert; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.TemporaryFolder; + +public class TestReadDefaultValues { + + @Rule public TemporaryFolder temp = new TemporaryFolder(); + + private static final Object[][] typesWithDefaults = + new Object[][] { +{Types.BooleanType.get(), "true"}, +{Types.IntegerType.get(), "1"}, +{Types.LongType.get(), "999"}, +{Types.FloatType.get(), "1.23"}, +{Types.DoubleType.get(), "123.456"}, +{Types.DateType.get(), "\"2007-12-03\""}, +{Types.TimeType.get(), "\"10:15:30\""}, +{Types.TimestampType.withoutZone(), "\"2007-12-03T10:15:30\""}, +{Types.TimestampType.withZone(), "\"2007-12-03T10:15:30+00:00\""}, +{Types.StringType.get(), "\"foo\""}, +{Types.UUIDType.get(), "\"eb26bdb1-a1d8-4aa6-990e-da940875492c\""}, +{Types.FixedType.ofLength(2), "\"111f\""}, +{Types.BinaryType.get(), "\"ff\""}, +{Types.DecimalType.of(9, 4), "\"123.4500\""}, +{Types.DecimalType.of(9, 0), "\"2\""}, +// Avro doesn't support negative scale +// {Types.DecimalType.of(9, -20), "\"2E+20\""}, +{Types.ListType.ofOptional(1, Types.IntegerType.get()), "[1, 2, 3]"}, +{ + Types.MapType.ofOptional(2, 3, Types.IntegerType.get(), Types.StringType.get()), + "{\"keys\": [1, 2], \"values\": [\"foo\", \"bar\"]}" +}, +{ + Types.StructType.of( + required(4, "f1", Types.IntegerType.get()), + optional(5, "f2", Types.StringType.get())), + "{\"4\": 1, \"5\": \"bar\"}" +}, +// deeply nested complex types +{ + Types.ListType.ofOptional( + 6, + Types.StructType.of( + required(7, "f1", Types.IntegerType.get()), + optional(8, "f2", Types.StringType.get(, + "[{\"7\": 1, \"8\": \"bar\"}, {\"7\": 2, \"8\": " + "\"foo\"}]" +}, +{ + Types.MapType.ofOptional( + 9, + 10, + Types.IntegerType.get(), + Types.StructType.of( + required(11, 
"f1", Types.IntegerType.get()), + optional(12, "f2", Types.StringType.get(, + "{\"keys\": [1, 2], \"values\": [{\"11\": 1, \"12\": \"bar\"}, {\"11\": 2, \"12\": \"foo\"}]}" +}, +{ + Types.StructType.of( + required( + 13, + "f1", + Types.StructType.of( + optional(14, "ff1", Types.IntegerType.get()), + optional(15, "ff2", Types.StringType.get(, + optional( + 16, + "f2", + Types.StructType.of( + optional(17, "ff1", Types.StringType.get()), + optional(18, "ff2", Types.IntegerType.get(), + "{\"13\": {\"14\": 1, \"15\": \"bar\"}, \"16\": {\"17\": \"bar\", \"18\": 1}}" +}, + }; + + @Test + public void testDefaultValueApplied() throws IOException { +for (Object[] typeWith
Re: [PR] feat: Reassign field ids for schema [iceberg-rust]
Xuanwo commented on PR #615: URL: https://github.com/apache/iceberg-rust/pull/615#issuecomment-2339690333 cc @liurenjie1024 would you like to take a look too? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core, Kafka, Spark: Use AssertJ instead of JUnit assertions [iceberg]
nastra merged PR #11102: URL: https://github.com/apache/iceberg/pull/11102 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: support lower_bound&&upper_bound for parquet writer [iceberg-rust]
xxchan commented on code in PR #383: URL: https://github.com/apache/iceberg-rust/pull/383#discussion_r1751316492 ## crates/iceberg/src/arrow/schema.rs: ## @@ -35,6 +35,9 @@ use rust_decimal::prelude::ToPrimitive; use std::collections::HashMap; use std::sync::Arc; +/// When iceberg map type convert to Arrow map type, the default map field name is "key_value". +pub(crate) const DEFAULT_MAP_FIELD_NAME: &str = "key_value"; Review Comment: May I ask what's the reason of changing `entries` to `key_value`? :eyes: -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core, Kafka, Spark: Use AssertJ instead of JUnit assertions [iceberg]
findepi commented on PR #11102: URL: https://github.com/apache/iceberg/pull/11102#issuecomment-2339774571 thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Support Vended Credentials for Azure Data Lake Store [iceberg-python]
c-thiel commented on issue #1146: URL: https://github.com/apache/iceberg-python/issues/1146#issuecomment-2339797459 @sungwy in my comment I am using "adls.sas-token", which is exactly what Java and Spark expect: https://github.com/apache/iceberg/blob/4873b4b7534de0bcda2e1e0366ffcf83943dc906/azure/src/main/java/org/apache/iceberg/azure/AzureProperties.java#L33 - and what our catalog provides. We can easily send both from the catalog side - but it would be great if we didn't have to. Is there a reason for pyiceberg not using the same prefix as Java? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
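As a rough illustration of the "send both" option mentioned above, a catalog could mirror the Java-style `adls.sas-token.<account>` properties under whatever prefix a given client expects. The helper below is purely hypothetical: the example key, the account suffix format, and the target prefix are assumptions for illustration, not pyiceberg or catalog APIs.

```python
# Hypothetical helper: copy vended SAS-token properties from the Java-style
# "adls.sas-token.<account>" keys to a client-specific prefix as well.
def mirror_sas_tokens(config: dict, target_prefix: str) -> dict:
    mirrored = dict(config)
    for key, value in config.items():
        if key.startswith("adls.sas-token."):
            account = key[len("adls.sas-token."):]
            mirrored[target_prefix + account] = value
    return mirrored


# Example values are made up; check the client's docs for the prefix it expects.
vended = {"adls.sas-token.myaccount.dfs.core.windows.net": "sv=2024-01-01&sig=..."}
print(mirror_sas_tokens(vended, "adlfs.sas-token."))
```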