Re: [PR] Tests: Unify the test catalog setting [iceberg-python]
frankliee commented on code in PR #609: URL: https://github.com/apache/iceberg-python/pull/609#discussion_r1566818851 ## tests/conftest.py: ## @@ -2144,3 +2144,31 @@ def arrow_table_with_only_nulls(pa_schema: "pa.Schema") -> "pa.Table": import pyarrow as pa return pa.Table.from_pylist([{}, {}], schema=pa_schema) + + +@pytest.fixture() +def catalog_rest() -> Catalog: +return load_catalog( +"local", +**{ +"type": "rest", +"uri": "http://localhost:8181", +"s3.endpoint": "http://localhost:9000", +"s3.access-key-id": "admin", +"s3.secret-access-key": "password", +}, +) + + +@pytest.fixture() +def catalog_hive() -> Catalog: Review Comment: > @frankliee Thanks for working on this! We also have `session_catalog` and `session_hive_catalog` in `conftest` > > https://github.com/apache/iceberg-python/blob/aa4654368e54bf84933279179519b299b5910493/tests/conftest.py#L1997-L2022 > > > and I think they can be unified with these too (using scope `session`). WDYT? Nice suggestion, I will unify all of them as `catalog_rest` and `catalog_hive` in this PR after a while. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
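For reference, a unified session-scoped version of these fixtures could look like the sketch below. This is only an illustration of the "scope `session`" suggestion: the connection properties mirror the diff above, and the Hive URI/port is an assumption, not the merged conftest.py.

```python
# Hypothetical sketch of unified session-scoped catalog fixtures.
# Connection settings mirror the diff above; the Hive URI is an assumption.
import pytest
from pyiceberg.catalog import Catalog, load_catalog


@pytest.fixture(scope="session")
def catalog_rest() -> Catalog:
    # One REST catalog shared by the whole test session.
    return load_catalog(
        "local",
        **{
            "type": "rest",
            "uri": "http://localhost:8181",
            "s3.endpoint": "http://localhost:9000",
            "s3.access-key-id": "admin",
            "s3.secret-access-key": "password",
        },
    )


@pytest.fixture(scope="session")
def catalog_hive() -> Catalog:
    # Hive catalog for the same docker-compose stack (thrift port assumed).
    return load_catalog(
        "local",
        **{
            "type": "hive",
            "uri": "thrift://localhost:9083",
            "s3.endpoint": "http://localhost:9000",
            "s3.access-key-id": "admin",
            "s3.secret-access-key": "password",
        },
    )
```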
Re: [PR] Flink: Adds support for Flink 1.19 version [iceberg]
nastra commented on code in PR #10112: URL: https://github.com/apache/iceberg/pull/10112#discussion_r1566850598 ## flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/FlinkCatalog.java: ## @@ -391,17 +391,16 @@ public void createTable(ObjectPath tablePath, CatalogBaseTable table, boolean ig + "an iceberg catalog, Please create table with 'connector'='iceberg' property in a non-iceberg catalog or " + "create table without 'connector'='iceberg' related properties in an iceberg table."); } - -createIcebergTable(tablePath, table, ignoreIfExists); +Preconditions.checkArgument(table instanceof ResolvedCatalogTable, "table should be resolved"); Review Comment: is this new code that's required for Flink 1.19? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Adds support for Flink 1.19 version [iceberg]
nastra commented on code in PR #10112: URL: https://github.com/apache/iceberg/pull/10112#discussion_r1566852990 ## flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/FlinkCatalog.java: ## @@ -391,17 +391,16 @@ public void createTable(ObjectPath tablePath, CatalogBaseTable table, boolean ig + "an iceberg catalog, Please create table with 'connector'='iceberg' property in a non-iceberg catalog or " + "create table without 'connector'='iceberg' related properties in an iceberg table."); } - -createIcebergTable(tablePath, table, ignoreIfExists); +Preconditions.checkArgument(table instanceof ResolvedCatalogTable, "table should be resolved"); Review Comment: it's not clear to me why there's a diff on `FlinkCatalog` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Adds support for Flink 1.19 version [iceberg]
nastra commented on code in PR #10112: URL: https://github.com/apache/iceberg/pull/10112#discussion_r1566855387 ## flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/FlinkCatalog.java: ## @@ -391,17 +391,16 @@ public void createTable(ObjectPath tablePath, CatalogBaseTable table, boolean ig + "an iceberg catalog, Please create table with 'connector'='iceberg' property in a non-iceberg catalog or " + "create table without 'connector'='iceberg' related properties in an iceberg table."); } - -createIcebergTable(tablePath, table, ignoreIfExists); +Preconditions.checkArgument(table instanceof ResolvedCatalogTable, "table should be resolved"); Review Comment: I think the problem is that the diff is still from a previous attempt where changes were done based on Flink 1.16. What about removing Flink 1.16 building from the gradle files but doing the actual remove of Flink 1.16 folders in a separate PR? I think that would avoid these diffs here -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Adds support for Flink 1.19 version [iceberg]
nastra commented on PR #10112: URL: https://github.com/apache/iceberg/pull/10112#issuecomment-2058425533 > @nastra: Any comments? I would like to merge this soon, as any merge to Flink code path will make this PR stale, and @rodmeneses needs to recreate the whole PR. > > Thanks, I think there's still an issue as there are a bunch of files/diffs that are because Flink 1.16 is being removed and git detects it as a move (with some additional changes). This can also be seen when looking at the file path, where a Flink 1.16 file is moved to a Flink 1.19 file, while also adding some diffs where it's not clear why the diff is there in the first place. My suggestion would be to do the actual removal of the 1.16 directory as a separate PR in an immediate follow-up. This would mean to skip tests 8 + 9 from the PR description, but it's fine to update gradle files to not build 1.16 anymore. Thoughts on the suggestion? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Add `BoundPredicateVisitor` (alternate version) [iceberg-rust]
sdd opened a new pull request, #334: URL: https://github.com/apache/iceberg-rust/pull/334 Alternative implementation for https://github.com/apache/iceberg-rust/pull/320, using per-operator visitor methods -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Add property to disable table initialization for JdbcCatalog [iceberg]
nastra commented on PR #10124: URL: https://github.com/apache/iceberg/pull/10124#issuecomment-2058435530 @mrcnc looks like there's a merge conflict. Can you rebase please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add `BoundPredicateVisitor` trait [iceberg-rust]
sdd commented on code in PR #320: URL: https://github.com/apache/iceberg-rust/pull/320#discussion_r1566871505 ## crates/iceberg/src/expr/visitors/bound_predicate_visitor.rs: ## @@ -0,0 +1,317 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use crate::expr::{BoundPredicate, BoundReference, PredicateOperator}; +use crate::spec::Datum; +use crate::Result; +use fnv::FnvHashSet; + +pub(crate) enum OpLiteral<'a> { +Single(&'a Datum), +Set(&'a FnvHashSet<Datum>), +} + +/// A visitor for [`BoundPredicate`]s. Visits in post-order. +pub trait BoundPredicateVisitor { +/// The return type of this visitor +type T; + +/// Called after an `AlwaysTrue` predicate is visited +fn always_true(&mut self) -> Result<Self::T>; + +/// Called after an `AlwaysFalse` predicate is visited +fn always_false(&mut self) -> Result<Self::T>; + +/// Called after an `And` predicate is visited +fn and(&mut self, lhs: Self::T, rhs: Self::T) -> Result<Self::T>; + +/// Called after an `Or` predicate is visited +fn or(&mut self, lhs: Self::T, rhs: Self::T) -> Result<Self::T>; + +/// Called after a `Not` predicate is visited +fn not(&mut self, inner: Self::T) -> Result<Self::T>; + +/// Called after visiting a UnaryPredicate, BinaryPredicate, +/// or SetPredicate. Passes the predicate's operator in all cases, +/// as well as the term and literals in the case of binary and set +/// predicates. +fn op( Review Comment: Ok - here's another PR with the original design. I'll let you guys choose which one to merge :-) https://github.com/apache/iceberg-rust/pull/334. Here's how the inclusive projection would end up looking with this design too: https://github.com/apache/iceberg-rust/pull/335 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Support partial deletes [iceberg-python]
Fokko commented on code in PR #569: URL: https://github.com/apache/iceberg-python/pull/569#discussion_r1566870388 ## pyiceberg/io/pyarrow.py: ## @@ -1912,3 +1920,55 @@ def _get_parquet_writer_kwargs(table_properties: Properties) -> Dict[str, Any]: default=TableProperties.PARQUET_PAGE_ROW_LIMIT_DEFAULT, ), } + + +def _dataframe_to_data_files( Review Comment: I've moved this one to `pyarrow.py` since it is Arrow-specific. This also explains the local imports above to avoid circular imports. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] add `InclusiveProjection` Visitor (alternate version) [iceberg-rust]
sdd opened a new pull request, #335: URL: https://github.com/apache/iceberg-rust/pull/335 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Create iceberg table from existsing parquet files with slightly different schemas (schemas merge is possible). [iceberg-python]
sergun commented on issue #601: URL: https://github.com/apache/iceberg-python/issues/601#issuecomment-2058445375 @kevinjqliu It is strange to me that PyArrow has [pa.unify_schemas(...)](https://arrow.apache.org/docs/python/generated/pyarrow.unify_schemas.html), which is able (I double-checked) to unify **nested schemas** (even with type promotions), but there is no "dual" functionality to cast, concat, read, or otherwise align the corresponding **data** in a `pa.Table`. None of e.g. [pa.Table.cast(...)](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.cast), [pa.concat_tables(...)](https://arrow.apache.org/docs/python/generated/pyarrow.concat_tables.html), or [pa.parquet.read_table(...)](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html) works. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
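To make the gap concrete, here is a minimal sketch of the approach under discussion: unify the schemas of several Parquet files, then try to align each table's data to the unified schema. All calls are standard PyArrow APIs; the file names are placeholders, and per the report above the cast step is exactly where nested-struct differences can still fail.

```python
# Sketch: unify schemas across Parquet files, then try to align the data.
import pyarrow as pa
import pyarrow.parquet as pq

paths = ["part-0.parquet", "part-1.parquet"]  # placeholder file names
tables = [pq.read_table(p) for p in paths]

# Schema unification handles nested structs and type promotions.
unified = pa.unify_schemas([t.schema for t in tables])

# Aligning the data is the missing "dual": casting works for simple
# flat differences, but tables whose nested structs differ (added or
# promoted nested fields) can still be rejected here.
aligned = [t.cast(unified) for t in tables]
result = pa.concat_tables(aligned)
```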
Re: [PR] Core, Spark: Use 'delete' if RowDelta only has delete files [iceberg]
nastra merged PR #10123: URL: https://github.com/apache/iceberg/pull/10123 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Delete using Merge-on-Read sets `OVERWRITE` while `DELETE` is expected [iceberg]
nastra closed issue #10122: Delete using Merge-on-Read sets `OVERWRITE` while `DELETE` is expected URL: https://github.com/apache/iceberg/issues/10122 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Create iceberg table from existsing parquet files with slightly different schemas (schemas merge is possible). [iceberg-python]
sergun commented on issue #601: URL: https://github.com/apache/iceberg-python/issues/601#issuecomment-2058456159 > One thing I wonder is if PyIceberg can handle schema evolution of nested structs. Looks like it can. From https://py.iceberg.apache.org/api/#add-column:
```
with table.update_schema() as update:
    update.add_column("retries", IntegerType(), "Number of retries to place the bid")
    # In a struct
    update.add_column("details.confirmed_by", StringType(), "Name of the exchange")
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add `BoundPredicateVisitor` trait [iceberg-rust]
sdd commented on code in PR #320: URL: https://github.com/apache/iceberg-rust/pull/320#discussion_r1566902353 ## crates/iceberg/src/expr/visitors/bound_predicate_visitor.rs: ## @@ -0,0 +1,317 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use crate::expr::{BoundPredicate, BoundReference, PredicateOperator}; +use crate::spec::Datum; +use crate::Result; +use fnv::FnvHashSet; + +pub(crate) enum OpLiteral<'a> { +Single(&'a Datum), +Set(&'a FnvHashSet<Datum>), +} + +/// A visitor for [`BoundPredicate`]s. Visits in post-order. +pub trait BoundPredicateVisitor { +/// The return type of this visitor +type T; + +/// Called after an `AlwaysTrue` predicate is visited +fn always_true(&mut self) -> Result<Self::T>; + +/// Called after an `AlwaysFalse` predicate is visited +fn always_false(&mut self) -> Result<Self::T>; + +/// Called after an `And` predicate is visited +fn and(&mut self, lhs: Self::T, rhs: Self::T) -> Result<Self::T>; + +/// Called after an `Or` predicate is visited +fn or(&mut self, lhs: Self::T, rhs: Self::T) -> Result<Self::T>; + +/// Called after a `Not` predicate is visited +fn not(&mut self, inner: Self::T) -> Result<Self::T>; + +/// Called after visiting a UnaryPredicate, BinaryPredicate, +/// or SetPredicate. Passes the predicate's operator in all cases, +/// as well as the term and literals in the case of binary and set +/// predicates. +fn op( Review Comment: My preference is the original design, which is (confusingly) in the "alternative" PRs 😅 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core, Spark: Use 'delete' if RowDelta only has delete files [iceberg]
nastra commented on PR #10123: URL: https://github.com/apache/iceberg/pull/10123#issuecomment-2058468333 thanks for the reviews @amogh-jahagirdar and @aokolnychyi -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Flink: FlinkFileIO implementation [iceberg]
pvary opened a new pull request, #10151: URL: https://github.com/apache/iceberg/pull/10151 FileIO implementation which uses Flink embedded filesystem implementations. This allows the user to use the Flink pluggable FileSystems: https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/deployment/filesystems/overview/ Some advantages: - Same configuration is used for checkpoints and Iceberg tables - Automatic Delegation Token support -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Create iceberg table from existsing parquet files with slightly different schemas (schemas merge is possible). [iceberg-python]
sergun commented on issue #601: URL: https://github.com/apache/iceberg-python/issues/601#issuecomment-2058470428 BTW: Found some explanation of why merging Arrow tables with different schemas is not possible: https://github.com/apache/arrow/issues/35424 The reason looks weird, but yes, as I remember, e.g. Spark dataframes may have columns with duplicated names. It is probably possible to implement table merge in PyArrow after checking that there are no duplicated column names in each struct and at the root level. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
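A hedged sketch of that precondition is below: walk the schema (root level and nested structs) and reject it if any level repeats a field name. The helper names are made up for illustration; only standard PyArrow type APIs are used.

```python
# Sketch: check that no level of the schema has duplicate field names
# before attempting a merge. Helper names are illustrative.
import pyarrow as pa


def _type_has_duplicate_names(dtype: pa.DataType) -> bool:
    # Recurse into structs (and list element types) looking for repeats.
    if pa.types.is_struct(dtype):
        names = [dtype.field(i).name for i in range(dtype.num_fields)]
        if len(names) != len(set(names)):
            return True
        return any(_type_has_duplicate_names(dtype.field(i).type) for i in range(dtype.num_fields))
    if pa.types.is_list(dtype) or pa.types.is_large_list(dtype):
        return _type_has_duplicate_names(dtype.value_type)
    return False


def schema_is_mergeable(schema: pa.Schema) -> bool:
    # Root-level duplicates are legal in Arrow but would make a merge ambiguous.
    if len(schema.names) != len(set(schema.names)):
        return False
    return not any(_type_has_duplicate_names(field.type) for field in schema)
```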
Re: [I] flink autoscaler: how set write-parallelism ? [iceberg]
sannaroby commented on issue #10147: URL: https://github.com/apache/iceberg/issues/10147#issuecomment-2058495395 Hi @pvary, thanks for your reply. We're using the HASH distribution mode and this is an extract from our flink job:
```
SingleOutputStreamOperator mainFunction = env.addSource(kinesisSource)
    .keyBy(inputMessage -> inputMessage.getKey())
    .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
    .process(new InputMessageToRowProcessFunction(), rowTypeInfo);

FlinkSink.forRow(mainFunction, loadedTableSchema)
    .tableLoader(tableLoader)
    .table(icebergTable)
    .equalityFieldColumns(List.of("messageKey", "messageTimestamp"))
    .distributionMode(DistributionMode.HASH)
    .append();
```
The autoscaler didn't change the upstream operator parallelism (the "InputMessageToRowProcessFunction"), because it judged it to be unnecessary. What do you mean by "rebalance step"? Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Pagination To List Apis [iceberg]
nastra commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1566934826 ## core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java: ## @@ -2329,6 +2374,57 @@ public void multipleDiffsAgainstMultipleTablesLastFails() { assertThat(schema2.columns()).hasSize(1); } + @Test + public void testInvalidRestPageSize() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); +RESTCatalog catalog = +new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) -> adapter); +org.assertj.core.api.Assertions.assertThatThrownBy( Review Comment: ```suggestion Assertions.assertThatThrownBy( ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Pagination To List Apis [iceberg]
nastra commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1566941616 ## core/src/test/java/org/apache/iceberg/rest/responses/TestListNamespacesResponse.java: ## @@ -83,9 +83,32 @@ public void testBuilderDoesNotCreateInvalidObjects() { .hasMessage("Invalid namespace: null"); } + @Test + public void testWithNullPaginationToken() throws JsonProcessingException { +String jsonWithNullPageToken = + "{\"namespaces\":[[\"accounting\"],[\"tax\"]],\"next-page-token\":null}"; +ListNamespacesResponse response = + ListNamespacesResponse.builder().addAll(NAMESPACES).nextPageToken(null).build(); +assertRoundTripSerializesEquallyFrom(jsonWithNullPageToken, response); +Assertions.assertThat(response.nextPageToken()).isNull(); +Assertions.assertThat(response.namespaces()).isEqualTo(NAMESPACES); + } + + @Test + public void testWithPaginationToken() throws JsonProcessingException { +String pageToken = "token"; +String jsonWithPageToken = + "{\"namespaces\":[[\"accounting\"],[\"tax\"]],\"next-page-token\":\"token\"}"; +ListNamespacesResponse response = + ListNamespacesResponse.builder().addAll(NAMESPACES).nextPageToken(pageToken).build(); +assertRoundTripSerializesEquallyFrom(jsonWithPageToken, response); +Assertions.assertThat(response.nextPageToken()).isNotNull(); Review Comment: ```suggestion Assertions.assertThat(response.nextPageToken()).isEqualTo("token"); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Pagination To List Apis [iceberg]
nastra commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1566941917 ## core/src/test/java/org/apache/iceberg/rest/responses/TestListTablesResponse.java: ## @@ -105,9 +105,32 @@ public void testBuilderDoesNotCreateInvalidObjects() { .hasMessage("Invalid table identifier: null"); } + @Test + public void testWithNullPaginationToken() throws JsonProcessingException { +String jsonWithNullPageToken = + "{\"identifiers\":[{\"namespace\":[\"accounting\",\"tax\"],\"name\":\"paid\"}],\"next-page-token\":null}"; +ListTablesResponse response = + ListTablesResponse.builder().addAll(IDENTIFIERS).nextPageToken(null).build(); +assertRoundTripSerializesEquallyFrom(jsonWithNullPageToken, response); +Assertions.assertThat(response.nextPageToken()).isNull(); +Assertions.assertThat(response.identifiers()).isEqualTo(IDENTIFIERS); + } + + @Test + public void testWithPaginationToken() throws JsonProcessingException { +String pageToken = "token"; +String jsonWithPageToken = + "{\"identifiers\":[{\"namespace\":[\"accounting\",\"tax\"],\"name\":\"paid\"}],\"next-page-token\":\"token\"}"; +ListTablesResponse response = + ListTablesResponse.builder().addAll(IDENTIFIERS).nextPageToken(pageToken).build(); +assertRoundTripSerializesEquallyFrom(jsonWithPageToken, response); +Assertions.assertThat(response.nextPageToken()).isNotNull(); Review Comment: ```suggestion Assertions.assertThat(response.nextPageToken()).isEqualTo("token"); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.5: Check table existence to determine which catalog for drop table [iceberg]
wForget commented on code in PR #10128: URL: https://github.com/apache/iceberg/pull/10128#discussion_r1566943822 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSessionCatalog.java: ## @@ -275,18 +275,20 @@ public Table alterTable(Identifier ident, TableChange... changes) throws NoSuchT @Override public boolean dropTable(Identifier ident) { -// no need to check table existence to determine which catalog to use. if a table doesn't exist Review Comment: > This PR requires tests that would show the exact problem that was faced and how the new logic behaves now. thanks, added new unit test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Pagination To List Apis [iceberg]
nastra commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1566944912 ## core/src/test/java/org/apache/iceberg/rest/TestRESTViewCatalog.java: ## @@ -144,6 +151,36 @@ public void closeCatalog() throws Exception { } } + @Test + public void testPaginationForListViews() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); +RESTCatalog catalog = +new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) -> adapter); +catalog.initialize("test", ImmutableMap.of(RESTSessionCatalog.REST_PAGE_SIZE, "10")); + +int numberOfItems = 100; +String namespaceName = "newdb"; +String tableName = "newtable"; +String viewName = "newview"; + +// create initial namespace +catalog().createNamespace(Namespace.of(namespaceName)); + +// create several views under namespace, based off a table for listing and verify +for (int i = 0; i < numberOfItems; i++) { + TableIdentifier viewIndentifier = TableIdentifier.of(namespaceName, viewName + i); + catalog + .buildView(viewIndentifier) + .withSchema(SCHEMA) + .withDefaultNamespace(viewIndentifier.namespace()) + .withQuery("spark", "select * from " + namespaceName + "." + tableName) Review Comment: ```suggestion .withQuery("spark", "select * from ns.tbl") ``` and then you can remove the `tableName` variable -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Pagination To List Apis [iceberg]
nastra commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1566933302 ## core/src/test/java/org/apache/iceberg/rest/responses/TestListNamespacesResponse.java: ## @@ -34,7 +34,7 @@ public class TestListNamespacesResponse extends RequestResponseTestBase<ListNamespacesResponse>
Re: [PR] Add Pagination To List Apis [iceberg]
nastra commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1566937039 ## core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java: ## @@ -2329,6 +2374,57 @@ public void multipleDiffsAgainstMultipleTablesLastFails() { assertThat(schema2.columns()).hasSize(1); } + @Test + public void testInvalidRestPageSize() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); +RESTCatalog catalog = +new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) -> adapter); +org.assertj.core.api.Assertions.assertThatThrownBy( +() -> catalog.initialize("test", ImmutableMap.of(RESTSessionCatalog.REST_PAGE_SIZE, "-1"))) +.isInstanceOf(IllegalArgumentException.class) +.hasMessage("Invalid value for pageSize, must be a positive integer"); + } + + @Test + public void testPaginationForListNamespaces() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); +RESTCatalog catalog = +new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) -> adapter); +catalog.initialize("test", ImmutableMap.of(RESTSessionCatalog.REST_PAGE_SIZE, "10")); +int numberOfItems = 100; +String namespaceName = "newdb"; + +// create several namespaces for listing and verify +for (int i = 0; i < numberOfItems; i++) { + String nameSpaceName = namespaceName + i; + catalog.createNamespace(Namespace.of(nameSpaceName)); +} + +List results = catalog.listNamespaces(); +assertThat(results).hasSize(numberOfItems); Review Comment: nit: can be inlined ```suggestion assertThat(catalog.listNamespaces()).hasSize(numberOfItems); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Pagination To List Apis [iceberg]
nastra commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1566939950 ## core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java: ## @@ -2329,6 +2374,57 @@ public void multipleDiffsAgainstMultipleTablesLastFails() { assertThat(schema2.columns()).hasSize(1); } + @Test + public void testInvalidRestPageSize() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); +RESTCatalog catalog = +new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) -> adapter); +org.assertj.core.api.Assertions.assertThatThrownBy( +() -> catalog.initialize("test", ImmutableMap.of(RESTSessionCatalog.REST_PAGE_SIZE, "-1"))) +.isInstanceOf(IllegalArgumentException.class) +.hasMessage("Invalid value for pageSize, must be a positive integer"); + } + + @Test + public void testPaginationForListNamespaces() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); +RESTCatalog catalog = +new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) -> adapter); +catalog.initialize("test", ImmutableMap.of(RESTSessionCatalog.REST_PAGE_SIZE, "10")); +int numberOfItems = 100; +String namespaceName = "newdb"; + +// create several namespaces for listing and verify +for (int i = 0; i < numberOfItems; i++) { + String nameSpaceName = namespaceName + i; + catalog.createNamespace(Namespace.of(nameSpaceName)); +} + +List results = catalog.listNamespaces(); +assertThat(results).hasSize(numberOfItems); + } + + @Test + public void testPaginationForListTables() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); +RESTCatalog catalog = +new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) -> adapter); +catalog.initialize("test", ImmutableMap.of(RESTSessionCatalog.REST_PAGE_SIZE, "10")); +int numberOfItems = 100; +String namespaceName = "newdb"; +String tableName = "newtable"; +catalog.createNamespace(Namespace.of(namespaceName)); + +// create several tables under namespace for listing and verify +for (int i = 0; i < numberOfItems; i++) { + TableIdentifier tableIdentifier = TableIdentifier.of(namespaceName, tableName + i); + catalog.createTable(tableIdentifier, SCHEMA); +} + +List tables = catalog.listTables(Namespace.of(namespaceName)); +assertThat(tables).hasSize(numberOfItems); Review Comment: ```suggestion assertThat(catalog.listTables(Namespace.of(namespaceName))).hasSize(numberOfItems); ``` ## core/src/test/java/org/apache/iceberg/rest/TestRESTViewCatalog.java: ## @@ -144,6 +151,36 @@ public void closeCatalog() throws Exception { } } + @Test + public void testPaginationForListViews() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); Review Comment: same as above -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Pagination To List Apis [iceberg]
nastra commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1566939429 ## core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java: ## @@ -2329,6 +2374,57 @@ public void multipleDiffsAgainstMultipleTablesLastFails() { assertThat(schema2.columns()).hasSize(1); } + @Test + public void testInvalidRestPageSize() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); +RESTCatalog catalog = +new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) -> adapter); +org.assertj.core.api.Assertions.assertThatThrownBy( +() -> catalog.initialize("test", ImmutableMap.of(RESTSessionCatalog.REST_PAGE_SIZE, "-1"))) +.isInstanceOf(IllegalArgumentException.class) +.hasMessage("Invalid value for pageSize, must be a positive integer"); + } + + @Test + public void testPaginationForListNamespaces() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); Review Comment: no need for `Mockito.spy()` here as you're not doing any verifications on the `adapter` in this test (and in the other tests) ## core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java: ## @@ -2329,6 +2374,57 @@ public void multipleDiffsAgainstMultipleTablesLastFails() { assertThat(schema2.columns()).hasSize(1); } + @Test + public void testInvalidRestPageSize() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); +RESTCatalog catalog = +new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) -> adapter); +org.assertj.core.api.Assertions.assertThatThrownBy( +() -> catalog.initialize("test", ImmutableMap.of(RESTSessionCatalog.REST_PAGE_SIZE, "-1"))) +.isInstanceOf(IllegalArgumentException.class) +.hasMessage("Invalid value for pageSize, must be a positive integer"); + } + + @Test + public void testPaginationForListNamespaces() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); +RESTCatalog catalog = +new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) -> adapter); +catalog.initialize("test", ImmutableMap.of(RESTSessionCatalog.REST_PAGE_SIZE, "10")); +int numberOfItems = 100; +String namespaceName = "newdb"; + +// create several namespaces for listing and verify +for (int i = 0; i < numberOfItems; i++) { + String nameSpaceName = namespaceName + i; + catalog.createNamespace(Namespace.of(nameSpaceName)); +} + +List results = catalog.listNamespaces(); +assertThat(results).hasSize(numberOfItems); + } + + @Test + public void testPaginationForListTables() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); Review Comment: same as above -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Pagination To List Apis [iceberg]
nastra commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1566946785 ## core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java: ## @@ -1796,6 +1799,48 @@ public void testCatalogExpiredBearerTokenIsRefreshedWithCredential(String oauth2 any()); } + + @Test + public void testCatalogWithPagaintionTokenIssue() { +//TODO remove this test, Used to highlight issue with namespaces Review Comment: I don't think we should remove this test. We actually want a test that makes sure we get the expected behavior -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Core: Fix JDBC Catalog table commit when migrating from schema V0 to V1 (#101111) [iceberg]
jbonofre opened a new pull request, #10152: URL: https://github.com/apache/iceberg/pull/10152 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Fix JDBC Catalog table commit when migrating from schema V0 to V1 (#101111) [iceberg]
jbonofre commented on PR #10152: URL: https://github.com/apache/iceberg/pull/10152#issuecomment-2058580254 @nastra this is the backport to the `1.5.x` branch (in preparation for 1.5.1 release). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]
Fokko commented on issue #329: URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2058636547 @marvinlanhenke Sorry for being late to the party here. Appending a file is rather straightforward, but all the conditions must be met. This is the high-level way of appending a file: - Write a Parquet file with the field IDs populated. - Collect the metrics to populate the statistics in the manifest file. We do this in PyIceberg [here](https://github.com/apache/iceberg-python/blob/49ac3a27794fc12cfb67b29502ba92b429396201/pyiceberg/io/pyarrow.py#L1433-L1496). - Write the snapshot following the concept of a fast-append. A normal append will append the new files to an existing manifest, and a fast-append will write a new manifest file with the new entries. This is much easier to implement, since you don't have to worry about [sequence-number inheritance and such](https://iceberg.apache.org/spec/#sequence-number-inheritance). - Rewrite the manifest-list to add the newly created manifest. - Generate a snapshot summary - Update the metadata. When you are using a traditional catalog like Glue and Hive, this can be a bit of work. If you use the Iceberg REST catalog, this is much easier since it is the responsibility of the REST catalog. > calling the writer to write the DataFile > I think this is also what the python implementation does. In Transaction.append, it calls _dataframe_to_data_files to generate DataFiles based on the pa.Table. In [PyIceberg we have `_dataframe_to_data_files`](https://github.com/apache/iceberg-python/blob/49ac3a27794fc12cfb67b29502ba92b429396201/pyiceberg/table/__init__.py#L2683) that writes out the Arrow table to one or more Parquet files. Then we collect all the statistics and return a Datafile that can be appended to the table. I hope in the future that we can push this down to iceberg-rust :) > If any error happens during generating metadata relation info like manifest etc., as the writer already wrote DataFiles, should we go to delete the written DataFiles? Iceberg Java does this best effort. If it fails, it tries to clean it up, but it is always possible that this won't happen (Looking at you OOMs). This is where the maintenance tasks kick in, as @sdd already pointed out. Talking about prioritization: Things can happen in parallel. For example, something simpler like updating table properties will make sure that the commit path is in place. The Snapshot summary generation can be a PR. The same goes for collecting the column metrics. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
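For comparison, the user-facing PyIceberg flow that triggers the steps listed above looks roughly like the sketch below; the catalog name and table identifier are placeholders, and `append()` is what performs the data-file write, metric collection, manifest/manifest-list rewrite, summary generation, and metadata commit described above.

```python
# Rough user-side view of the fast-append path in PyIceberg.
# Catalog name and table identifier are placeholders.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # e.g. a REST catalog from ~/.pyiceberg.yaml
table = catalog.load_table("db.events")  # hypothetical identifier

df = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int32()),
    "name": pa.array(["a", "b", "c"]),
})

# Writes Parquet files with field IDs, collects column metrics,
# creates a new manifest + manifest list, and commits a new snapshot.
table.append(df)
```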
Re: [I] Doubts about the types supported by Iceberg, Not in line with expectations [iceberg]
Fokko commented on issue #10153: URL: https://github.com/apache/iceberg/issues/10153#issuecomment-2058638829 @madeirak Thanks for reaching out here. Which version of Spark are you using? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] flink:FlinkSink support dynamically changed schema [iceberg]
Ruees commented on issue #4190: URL: https://github.com/apache/iceberg/issues/4190#issuecomment-2058647346 > @leichangqing You can refer to the last two commits of my branch https://github.com/lintingbin2009/iceberg/tree/flink-sink-dynamically-change. We have put this part of the code in our own production environment and executed it for half a year, and it seems that there is no problem so far. https://user-images.githubusercontent.com/5699014/224219390-eac0d0e2-1578-44d8-a6d5-6ba97362e02e.png Corresponding to 1, 2, 3 in the above picture, you need to prepare: > > 1. There needs to be a broadcast node that can subscribe to your schema changes. > 2. The data processing node can generate RowData according to the latest schema when processing data. > 3. This is based on the above code modification so that the iceberg writer node can receive the latest schema and apply it. I tried to pull the FlinkSink-related modification code from the first commit and added a column at the end using the Java API in the map operator, but it did not work: even though the data was inserted successfully, the trailing column was still empty. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Doubts about the types supported by Iceberg, Not in line with expectations [iceberg]
madeirak closed issue #10153: Doubts about the types supported by Iceberg, Not in line with expectations URL: https://github.com/apache/iceberg/issues/10153 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Can Iceberg old partition with no data be deleted? [iceberg]
ajantha-bhat commented on issue #10121: URL: https://github.com/apache/iceberg/issues/10121#issuecomment-2058678837 how is the table partitioned? Can you do describe table or show create table to show the partition scheme? Also, was there any partition evolution? Are you using V1 Iceberg table or v2? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Pagination To List Apis [iceberg]
nastra commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1567071915 ## core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java: ## @@ -1796,6 +1799,48 @@ public void testCatalogExpiredBearerTokenIsRefreshedWithCredential(String oauth2 any()); } + + @Test + public void testCatalogWithPagaintionTokenIssue() { +//TODO remove this test, Used to highlight issue with namespaces +String token = + "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyLCJleHAiOjE5OTk5OTk5OTk5fQ._3k92KJi2NTyTG6V1s2mzJ__GiQtL36DnzsZSkBdYPw"; +Map catalogHeaders = ImmutableMap.of("Authorization", "Bearer " + token); + +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); + +String credential = "catalog:12345"; +Map contextCredentials = +ImmutableMap.of("token", token, "credential", credential); +SessionCatalog.SessionContext context = +new SessionCatalog.SessionContext( +UUID.randomUUID().toString(), "user", contextCredentials, ImmutableMap.of()); + +RESTCatalog catalog = new RESTCatalog(context, (config) -> adapter); +catalog.initialize("prod", ImmutableMap.of(CatalogProperties.URI, "ignored", "token", token)); + +Mockito.verify(adapter) +.execute( +eq(HTTPMethod.GET), +eq("v1/config"), +any(), +any(), +eq(ConfigResponse.class), +eq(catalogHeaders), +any()); + +Mockito.verify(adapter) Review Comment: @rahil-c this fails because you're verifying that the namespaces endpoint has been called but you're actually not calling `catalog.listNamespaces(..)` in this test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Pagination To List Apis [iceberg]
nastra commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1567077122 ## core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java: ## @@ -1796,6 +1799,48 @@ public void testCatalogExpiredBearerTokenIsRefreshedWithCredential(String oauth2 any()); } + + @Test + public void testCatalogWithPagaintionTokenIssue() { +//TODO remove this test, Used to highlight issue with namespaces Review Comment: alternatively you can also add the mockito verifications on `adapter` to the other tests that you added (`testPaginationForListNamespaces()` / `testPaginationForListTables()` / `testPaginationForListViews()`) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Pagination To List Apis [iceberg]
nastra commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1567077122 ## core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java: ## @@ -1796,6 +1799,48 @@ public void testCatalogExpiredBearerTokenIsRefreshedWithCredential(String oauth2 any()); } + + @Test + public void testCatalogWithPagaintionTokenIssue() { +//TODO remove this test, Used to highlight issue with namespaces Review Comment: alternatively you can also add the mockito verifications on `adapter` to the other tests that you added (`testPaginationForListNamespaces()` / `testPaginationForListTables()` / `testPaginationForListViews()`). For `testPaginationForListNamespaces()` this test would do some additional checks after creating all namespaces and listing them. Those verification checks would be: // verify config endpoint was called (which is e.g. what you already have in L1822ff) // verify create namespace was called X times // verify list namespaces was called. Here is where you would verify that the pagination token is properly exchanged via query params -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Pagination To List Apis [iceberg]
nastra commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1566940267 ## core/src/test/java/org/apache/iceberg/rest/TestRESTViewCatalog.java: ## @@ -144,6 +151,36 @@ public void closeCatalog() throws Exception { } } + @Test + public void testPaginationForListViews() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); Review Comment: same as above -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Pagination To List Apis [iceberg]
nastra commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1567097286 ## core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java: ## @@ -278,14 +286,26 @@ public void setConf(Object newConf) { @Override public List listTables(SessionContext context, Namespace ns) { checkNamespaceIsValid(ns); +Map queryParams = Maps.newHashMap(); +List tables = Lists.newArrayList(); +String pageToken = ""; +if (restPageSize != null) { + queryParams.put("pageToken", pageToken); + queryParams.put("pageSize", restPageSize); +} +do { + ListTablesResponse response = + client.get( + paths.tables(ns), + queryParams, + ListTablesResponse.class, + headers(context), + ErrorHandlers.namespaceErrorHandler()); + pageToken = response.nextPageToken(); Review Comment: I've replied in https://github.com/apache/iceberg/pull/9782/files#r1567071915 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
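For readers following along, a hedged sketch of the full client-side pagination loop under discussion. Variable names follow the diff above; the loop-exit condition (stop when no next-page token is returned) is an assumption about how the token is signalled, not a quote from the PR:
```java
// Sketch only: accumulate identifiers page by page until no next-page token is returned.
List<TableIdentifier> tables = Lists.newArrayList();
Map<String, String> queryParams = Maps.newHashMap();
if (restPageSize != null) {
  queryParams.put("pageSize", restPageSize);
}

String pageToken = null;
do {
  if (pageToken != null) {
    queryParams.put("pageToken", pageToken);
  }

  ListTablesResponse response =
      client.get(
          paths.tables(ns),
          queryParams,
          ListTablesResponse.class,
          headers(context),
          ErrorHandlers.namespaceErrorHandler());

  tables.addAll(response.identifiers());
  pageToken = response.nextPageToken();
} while (pageToken != null && !pageToken.isEmpty());

return tables;
```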
Re: [I] Remove extraneous trailing slash in table location [iceberg-python]
Vitalii0-o commented on issue #606: URL: https://github.com/apache/iceberg-python/issues/606#issuecomment-2058724052 Traceback (most recent call last): File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/destinations/sql_client.py", line 242, in _wrap_gen return (yield from f(self, *args, **kwargs)) ^^^ File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/destinations/impl/athena/athena.py", line 296, in execute_query cursor.execute(query_line, db_args) File "/usr/local/airflow/.local/lib/python3.11/site-packages/pyathena/cursor.py", line 108, in execute raise OperationalError(query_execution.state_change_reason) pyathena.error.OperationalError: GENERIC_INTERNAL_ERROR: io.trino.hdfs.s3.TrinoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 123; S3 Extended Request ID: 123=; Proxy: null), S3 Extended Request ID: 123= (Bucket: bucket, Key: facebook/123/bronze_facebook_test1/_dlt_pipeline_state/metadata/123.metadata.json) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 699, in sync_destination remote_state = self._restore_state_from_destination() ^^ File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 1420, in _restore_state_from_destination state = load_pipeline_state_from_destination(self.pipeline_name, job_client) File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/pipeline/state_sync.py", line 139, in load_pipeline_state_from_destination state = client.get_stored_state(pipeline_name) ^^ File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/destinations/job_client_impl.py", line 368, in get_stored_state with self.sql_client.execute_query(query, pipeline_name) as cur: File "/usr/local/lib/python3.11/contextlib.py", line 137, in __enter__ return next(self.gen) ^^ File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/destinations/sql_client.py", line 244, in _wrap_gen raise self._make_database_exception(ex) dlt.destinations.exceptions.DatabaseTerminalException: GENERIC_INTERNAL_ERROR: io.trino.hdfs.s3.TrinoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. 
(Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 123; S3 Extended Request ID: 123=; Proxy: null), S3 Extended Request ID: 123= (Bucket: bucket, Key: facebook/123/bronze_facebook_test1/_dlt_pipeline_state/metadata/123.metadata.json) The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 433, in _execute_task result = execute_callable(context=context, **execute_callable_kwargs) File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/operators/python.py", line 199, in execute return_value = self.execute_callable() ^^^ File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/operators/python.py", line 216, in execute_callable return self.python_callable(*self.op_args, **self.op_kwargs) ^ File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/helpers/airflow_helper.py", line 273, in _run for attempt in self.retry_policy.copy( File "/usr/local/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 347, in __iter__ do = self.iter(retry_state=retry_state) ^^ File "/usr/local/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter return fut.result() File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 449, in result return self.__get_result() ^^^ File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result raise self._exception File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/helpers/airflow_helper.py", line 283, in _run load_info = task_pipeline.run(
Re: [I] Remove extraneous trailing slash in table location [iceberg-python]
Vitalii0-o commented on issue #606: URL: https://github.com/apache/iceberg-python/issues/606#issuecomment-2058726484 This error started appearing for me about a week ago, even though nothing changed on our side -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Can Iceberg old partition with no data be deleted? [iceberg]
madeirak commented on issue #10121: URL: https://github.com/apache/iceberg/issues/10121#issuecomment-2058730627 > how is the table partitioned? Can you do describe table or show create table to show the partition scheme? Also, was there any partition evolution? Are you using V1 Iceberg table or v2? Thanks for the reply. This table has indeed gone through partition evolution. Previously, show create table on this table returned "PARTITIONED BY (old_xxx, old_yyy)", with no data in the table. The partition fields were then changed to "PARTITIONED BY (new_partition_1, new_partition_2)". I know this layout lets data written under the old and new partition specs coexist, with the old partition field = null. But in my case the old partition fields hold no data, and I am worried that the long path will hurt READ and WRITE efficiency, **so I wonder whether there is a way to delete the old, empty partition path, or what impact a long path has on READ and WRITE efficiency**? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Can Iceberg old partition with no data be deleted? [iceberg]
ajantha-bhat commented on issue #10121: URL: https://github.com/apache/iceberg/issues/10121#issuecomment-2058743091 I think Iceberg v1 tables have the concept of a void transform, and after partition evolution the dropped partition field is kept as a void transform. Try changing the table's default format-version property to 2, redo the partition evolution from scratch, and then run rewrite data files again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
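For illustration, a minimal Spark SQL sketch of the suggestion above; the catalog, database, and table names are placeholders, and `spark` is assumed to be an existing SparkSession configured with the Iceberg extensions:
```java
// Sketch only: upgrade the table to format version 2, then compact with rewrite_data_files.
spark.sql("ALTER TABLE my_catalog.db.tbl SET TBLPROPERTIES ('format-version' = '2')");

// After redoing the partition evolution, rewrite the data files so they are
// laid out under the new partition spec.
spark.sql("CALL my_catalog.system.rewrite_data_files(table => 'db.tbl')");
```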
Re: [I] Can Iceberg old partition with no data be deleted? [iceberg]
ajantha-bhat commented on issue #10121: URL: https://github.com/apache/iceberg/issues/10121#issuecomment-2058745161 Lastly I think there is no way to delete empty folders in Iceberg. Only files are tracked by Iceberg metadata. So, GC operations only clean up the files. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Iceberg Spark Extensions conflict with Paimon [iceberg]
ajantha-bhat commented on issue #10143: URL: https://github.com/apache/iceberg/issues/10143#issuecomment-2058746453 can you close this if it is a duplicate of https://github.com/apache/paimon/issues/3212 ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add `BoundPredicateVisitor` trait [iceberg-rust]
Fokko commented on code in PR #320: URL: https://github.com/apache/iceberg-rust/pull/320#discussion_r1567138774 ## crates/iceberg/src/expr/visitors/bound_predicate_visitor.rs: ## @@ -0,0 +1,317 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use crate::expr::{BoundPredicate, BoundReference, PredicateOperator}; +use crate::spec::Datum; +use crate::Result; +use fnv::FnvHashSet; + +pub(crate) enum OpLiteral<'a> { +Single(&'a Datum), +Set(&'a FnvHashSet), +} + +/// A visitor for [`BoundPredicate`]s. Visits in post-order. +pub trait BoundPredicateVisitor { +/// The return type of this visitor +type T; + +/// Called after an `AlwaysTrue` predicate is visited +fn always_true(&mut self) -> Result; + +/// Called after an `AlwaysFalse` predicate is visited +fn always_false(&mut self) -> Result; + +/// Called after an `And` predicate is visited +fn and(&mut self, lhs: Self::T, rhs: Self::T) -> Result; + +/// Called after an `Or` predicate is visited +fn or(&mut self, lhs: Self::T, rhs: Self::T) -> Result; + +/// Called after a `Not` predicate is visited +fn not(&mut self, inner: Self::T) -> Result; + +/// Called after visiting a UnaryPredicate, BinaryPredicate, +/// or SetPredicate. Passes the predicate's operator in all cases, +/// as well as the term and literals in the case of binary and aet +/// predicates. +fn op( Review Comment: I'm not sold on the iterator idea either, it looks very fancy, but iterating over a set feels awkward to me. > If you're not happy with this approach @Fokko, I'll wait until the three of you are in agreement and fix up to whatever you all decide on. I'm happy with the approach as-is after switching my original design to the approach proposed by @liurenjie1024 and @marvinlanhenke. Well, I'm not the one to decide here, just weighing in my opinion :D I have a strong preference for the original approach as suggested in https://github.com/apache/iceberg-rust/pull/334 I think this leverages the visitor as much as possible, and also makes sure that all the operations are covered. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Flaky spark-3.3-scala-2.13-java-8 tests due to `No space left on device` [iceberg]
manuzhang commented on issue #10040: URL: https://github.com/apache/iceberg/issues/10040#issuecomment-2058792129 Can this be related to https://issues.apache.org/jira/browse/SPARK-39195? @aokolnychyi -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Can Iceberg old partition with no data be deleted? [iceberg]
madeirak commented on issue #10121: URL: https://github.com/apache/iceberg/issues/10121#issuecomment-2058796187 > Lastly I think there is no way to delete empty folders in Iceberg. Only files are tracked by Iceberg metadata. So, GC operations only clean up the files. Thanks! Setting the table format version to 2 does fix my problem 😘 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Can Iceberg old partition with no data be deleted? [iceberg]
madeirak closed issue #10121: Can Iceberg old partition with no data be deleted? URL: https://github.com/apache/iceberg/issues/10121 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Tests: Unify the test catalog setting [iceberg-python]
frankliee commented on code in PR #609: URL: https://github.com/apache/iceberg-python/pull/609#discussion_r1566818851 ## tests/conftest.py: ## @@ -2144,3 +2144,31 @@ def arrow_table_with_only_nulls(pa_schema: "pa.Schema") -> "pa.Table": import pyarrow as pa return pa.Table.from_pylist([{}, {}], schema=pa_schema) + + +@pytest.fixture() +def catalog_rest() -> Catalog: +return load_catalog( +"local", +**{ +"type": "rest", +"uri": "http://localhost:8181";, +"s3.endpoint": "http://localhost:9000";, +"s3.access-key-id": "admin", +"s3.secret-access-key": "password", +}, +) + + +@pytest.fixture() +def catalog_hive() -> Catalog: Review Comment: > @frankliee Thanks for working on this! We also have `session_catalog` and `session_hive_catalog` in `conftest` > > https://github.com/apache/iceberg-python/blob/aa4654368e54bf84933279179519b299b5910493/tests/conftest.py#L1997-L2022 > > > and I think they can be unified with these too (using scope `session`). WDYT? Nice suggestion, I will unify all of them as `session_catalog` and `session_hive_catalog` in this PR after a while. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Tests: Unify the test catalog setting [iceberg-python]
frankliee commented on code in PR #609: URL: https://github.com/apache/iceberg-python/pull/609#discussion_r1566818851 ## tests/conftest.py: ## @@ -2144,3 +2144,31 @@ def arrow_table_with_only_nulls(pa_schema: "pa.Schema") -> "pa.Table": import pyarrow as pa return pa.Table.from_pylist([{}, {}], schema=pa_schema) + + +@pytest.fixture() +def catalog_rest() -> Catalog: +return load_catalog( +"local", +**{ +"type": "rest", +"uri": "http://localhost:8181";, +"s3.endpoint": "http://localhost:9000";, +"s3.access-key-id": "admin", +"s3.secret-access-key": "password", +}, +) + + +@pytest.fixture() +def catalog_hive() -> Catalog: Review Comment: > @frankliee Thanks for working on this! We also have `session_catalog` and `session_hive_catalog` in `conftest` > > https://github.com/apache/iceberg-python/blob/aa4654368e54bf84933279179519b299b5910493/tests/conftest.py#L1997-L2022 > > > and I think they can be unified with these too (using scope `session`). WDYT? Nice suggestion, I have unified all of them as `session_catalog` and `session_hive_catalog` in this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Hive: turn off the stats gathering when iceberg.hive.keep.stats is false [iceberg]
deniskuzZ commented on PR #10148: URL: https://github.com/apache/iceberg/pull/10148#issuecomment-2058916040 AFAIK, stats autogathering doesn't even work in Hive. After some operations like insert, we issue an extra stats update task that persists column stats changes either to the HMS or to a Puffin file ("hive.iceberg.stats.source", "iceberg") -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] spark.table() raises warn: Unclosed S3FileIO instance in HadoopTableOperations [iceberg]
KingLommel commented on issue #10145: URL: https://github.com/apache/iceberg/issues/10145#issuecomment-2058941436 Thanks @ajantha-bhat. As soon as I have time to dive deep into this I will give it a try! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add `BoundPredicateVisitor` trait [iceberg-rust]
liurenjie1024 commented on code in PR #320: URL: https://github.com/apache/iceberg-rust/pull/320#discussion_r1567303582 ## crates/iceberg/src/expr/visitors/bound_predicate_visitor.rs: ## @@ -0,0 +1,317 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use crate::expr::{BoundPredicate, BoundReference, PredicateOperator}; +use crate::spec::Datum; +use crate::Result; +use fnv::FnvHashSet; + +pub(crate) enum OpLiteral<'a> { +Single(&'a Datum), +Set(&'a FnvHashSet), +} + +/// A visitor for [`BoundPredicate`]s. Visits in post-order. +pub trait BoundPredicateVisitor { +/// The return type of this visitor +type T; + +/// Called after an `AlwaysTrue` predicate is visited +fn always_true(&mut self) -> Result; + +/// Called after an `AlwaysFalse` predicate is visited +fn always_false(&mut self) -> Result; + +/// Called after an `And` predicate is visited +fn and(&mut self, lhs: Self::T, rhs: Self::T) -> Result; + +/// Called after an `Or` predicate is visited +fn or(&mut self, lhs: Self::T, rhs: Self::T) -> Result; + +/// Called after a `Not` predicate is visited +fn not(&mut self, inner: Self::T) -> Result; + +/// Called after visiting a UnaryPredicate, BinaryPredicate, +/// or SetPredicate. Passes the predicate's operator in all cases, +/// as well as the term and literals in the case of binary and aet +/// predicates. +fn op( Review Comment: I'm also +1 for the original approach after comparing these two approaches. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add `BoundPredicateVisitor` (alternate version) [iceberg-rust]
liurenjie1024 commented on PR #334: URL: https://github.com/apache/iceberg-rust/pull/334#issuecomment-2059023496 I'll merge this given we have reached some consensus [here](https://github.com/apache/iceberg-rust/pull/320#discussion_r1565780931), thanks @Fokko for review, and @sdd for this effort! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add `BoundPredicateVisitor` (alternate version) [iceberg-rust]
liurenjie1024 merged PR #334: URL: https://github.com/apache/iceberg-rust/pull/334 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add `BoundPredicateVisitor` trait [iceberg-rust]
liurenjie1024 closed pull request #320: Add `BoundPredicateVisitor` trait URL: https://github.com/apache/iceberg-rust/pull/320 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add `BoundPredicateVisitor` trait [iceberg-rust]
liurenjie1024 commented on PR #320: URL: https://github.com/apache/iceberg-rust/pull/320#issuecomment-2059024512 Close by #334 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: Glue Catalog - table operations (3/3) [iceberg-rust]
liurenjie1024 commented on code in PR #314: URL: https://github.com/apache/iceberg-rust/pull/314#discussion_r1565832014 ## crates/catalog/glue/src/utils.rs: ## @@ -151,6 +205,65 @@ pub(crate) fn validate_namespace(namespace: &NamespaceIdent) -> Result { Ok(name) } +/// Get default table location from `Namespace` properties +pub(crate) fn get_default_table_location( +namespace: &Namespace, +table_name: impl AsRef, +warehouse: impl AsRef, +) -> String { +let properties = namespace.properties(); + +let location = match properties.get(LOCATION) { Review Comment: This seems incorrect to me, see [python implementation](https://github.com/apache/iceberg-python/blob/a892309936effa7ec575195ad3be70193e82d704/pyiceberg/catalog/__init__.py#L762). 1. The `LOCATION` property is database location, if it exists, it should be "{LOCATION}/{tablename}" 2. Otherwise, we should use "{warehouse}/{database}.db/{table}" ## crates/catalog/glue/src/catalog.rs: ## @@ -310,31 +327,282 @@ impl Catalog for GlueCatalog { Ok(table_list) } +/// Creates a new table within a specified namespace using the provided +/// table creation settings. +/// +/// # Returns +/// A `Result` wrapping a `Table` object representing the newly created +/// table. +/// +/// # Errors +/// This function may return an error in several cases, including invalid +/// namespace identifiers, failure to determine a default storage location, +/// issues generating or writing table metadata, and errors communicating +/// with the Glue Catalog. async fn create_table( &self, -_namespace: &NamespaceIdent, -_creation: TableCreation, +namespace: &NamespaceIdent, +creation: TableCreation, ) -> Result { -todo!() +let db_name = validate_namespace(namespace)?; +let table_name = creation.name.clone(); + +let location = match &creation.location { +Some(location) => location.clone(), +None => { +let ns = self.get_namespace(namespace).await?; +get_default_table_location(&ns, &table_name, &self.config.warehouse) +} +}; + +let metadata = TableMetadataBuilder::from_table_creation(creation)?.build()?; +let metadata_location = create_metadata_location(&location, 0)?; + +let mut file = self +.file_io +.new_output(&metadata_location)? +.writer() +.await?; +file.write_all(&serde_json::to_vec(&metadata)?).await?; +file.shutdown().await?; + +let glue_table = convert_to_glue_table( +&table_name, +metadata_location.clone(), +&metadata, +metadata.properties(), +None, +)?; + +let builder = self +.client +.0 +.create_table() +.database_name(&db_name) +.table_input(glue_table); +let builder = with_catalog_id!(builder, self.config); + +builder.send().await.map_err(from_aws_sdk_error)?; + +let table = Table::builder() +.file_io(self.file_io()) +.metadata_location(metadata_location) +.metadata(metadata) +.identifier(TableIdent::new(NamespaceIdent::new(db_name), table_name)) +.build(); + +Ok(table) } -async fn load_table(&self, _table: &TableIdent) -> Result { -todo!() +/// Loads a table from the Glue Catalog and constructs a `Table` object +/// based on its metadata. +/// +/// # Returns +/// A `Result` wrapping a `Table` object that represents the loaded table. +/// +/// # Errors +/// This function may return an error in several scenarios, including: +/// - Failure to validate the namespace. +/// - Failure to retrieve the table from the Glue Catalog. +/// - Absence of metadata location information in the table's properties. +/// - Issues reading or deserializing the table's metadata file. 
+async fn load_table(&self, table: &TableIdent) -> Result { +let db_name = validate_namespace(table.namespace())?; +let table_name = table.name(); + +let builder = self +.client +.0 +.get_table() +.database_name(&db_name) +.name(table_name); +let builder = with_catalog_id!(builder, self.config); + +let glue_table_output = builder.send().await.map_err(from_aws_sdk_error)?; + +match glue_table_output.table() { +None => Err(Error::new( +ErrorKind::Unexpected, +format!( +"'Table' object for database: {} and table: {} does not exist", +db_name, table_name +), +)), +Some(table) => { +let
Re: [I] Remove `unwrap()` in `ManifestListWriter.close()` [iceberg-rust]
liurenjie1024 commented on issue #177: URL: https://github.com/apache/iceberg-rust/issues/177#issuecomment-2059068941 > Is this issue still valid given #185 ? This seems addressed. Apologies for the noise, surfing good first issues for something to get started on. @tabmatfournier Yeah, I think it's closed, thanks for reporting this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Remove `unwrap()` in `ManifestListWriter.close()` [iceberg-rust]
liurenjie1024 closed issue #177: Remove `unwrap()` in `ManifestListWriter.close()` URL: https://github.com/apache/iceberg-rust/issues/177 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Lazily compute & cache hashCode in CharSequenceWrapper [iceberg]
nastra commented on code in PR #10023: URL: https://github.com/apache/iceberg/pull/10023#discussion_r1567311373 ## api/src/main/java/org/apache/iceberg/util/CharSequenceWrapper.java: ## @@ -29,13 +29,16 @@ public static CharSequenceWrapper wrap(CharSequence seq) { } private CharSequence wrapped; + // lazily computed & cached hashCode + private transient int hashCode = 0; Review Comment: ah good point, thanks for finding that. Instead of making this an `Integer` I actually introduced a boolean flag. The `String` implementation does the same thing so that the hash isn't re-calculated when the hash is actually 0 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
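For illustration, a minimal sketch of the lazy hashCode caching pattern described above, with a boolean flag so that a computed hash of 0 is not recomputed. The class name and the hash function are placeholders; the actual CharSequenceWrapper implementation may hash the characters differently:
```java
// Sketch only: cache the hash lazily; hashIsZero distinguishes "not computed yet" from "really 0".
public final class LazyHashedWrapper {
  private final CharSequence wrapped;
  private int hashCode = 0;
  private boolean hashIsZero = false;

  LazyHashedWrapper(CharSequence wrapped) {
    this.wrapped = wrapped;
  }

  @Override
  public int hashCode() {
    int hash = hashCode;
    if (hash == 0 && !hashIsZero) {
      // placeholder polynomial hash over the characters
      for (int i = 0; i < wrapped.length(); i++) {
        hash = 31 * hash + wrapped.charAt(i);
      }
      if (hash == 0) {
        hashIsZero = true; // remember that the real hash is 0, so we never recompute it
      } else {
        hashCode = hash;
      }
    }
    return hash;
  }
}
```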
Re: [PR] feat: Glue Catalog - table operations (3/3) [iceberg-rust]
marvinlanhenke commented on code in PR #314: URL: https://github.com/apache/iceberg-rust/pull/314#discussion_r1567401176 ## crates/catalog/glue/src/utils.rs: ## @@ -151,6 +205,65 @@ pub(crate) fn validate_namespace(namespace: &NamespaceIdent) -> Result { Ok(name) } +/// Get default table location from `Namespace` properties +pub(crate) fn get_default_table_location( +namespace: &Namespace, +table_name: impl AsRef, +warehouse: impl AsRef, +) -> String { +let properties = namespace.properties(); + +let location = match properties.get(LOCATION) { Review Comment: Yes, you're right. It seems that I missed `{database}.db` which would, in our case, equate to the `db_name` (NamespaceIdent(Vec)). Should be easy to fix. Thanks for spotting this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: Glue Catalog - table operations (3/3) [iceberg-rust]
marvinlanhenke commented on code in PR #314: URL: https://github.com/apache/iceberg-rust/pull/314#discussion_r1567401176 ## crates/catalog/glue/src/utils.rs: ## @@ -151,6 +205,65 @@ pub(crate) fn validate_namespace(namespace: &NamespaceIdent) -> Result { Ok(name) } +/// Get default table location from `Namespace` properties +pub(crate) fn get_default_table_location( +namespace: &Namespace, +table_name: impl AsRef, +warehouse: impl AsRef, +) -> String { +let properties = namespace.properties(); + +let location = match properties.get(LOCATION) { Review Comment: Yes, you're right. It seems that I missed `{database}.db` which would, in our case, equate to the `db_name`. Should be easy to fix. Thanks for spotting this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: Glue Catalog - table operations (3/3) [iceberg-rust]
marvinlanhenke commented on code in PR #314: URL: https://github.com/apache/iceberg-rust/pull/314#discussion_r1567403891 ## crates/catalog/glue/src/catalog.rs: ## @@ -310,31 +327,282 @@ impl Catalog for GlueCatalog { Ok(table_list) } +/// Creates a new table within a specified namespace using the provided +/// table creation settings. +/// +/// # Returns +/// A `Result` wrapping a `Table` object representing the newly created +/// table. +/// +/// # Errors +/// This function may return an error in several cases, including invalid +/// namespace identifiers, failure to determine a default storage location, +/// issues generating or writing table metadata, and errors communicating +/// with the Glue Catalog. async fn create_table( &self, -_namespace: &NamespaceIdent, -_creation: TableCreation, +namespace: &NamespaceIdent, +creation: TableCreation, ) -> Result { -todo!() +let db_name = validate_namespace(namespace)?; +let table_name = creation.name.clone(); + +let location = match &creation.location { +Some(location) => location.clone(), +None => { +let ns = self.get_namespace(namespace).await?; +get_default_table_location(&ns, &table_name, &self.config.warehouse) +} +}; + +let metadata = TableMetadataBuilder::from_table_creation(creation)?.build()?; +let metadata_location = create_metadata_location(&location, 0)?; + +let mut file = self +.file_io +.new_output(&metadata_location)? +.writer() +.await?; +file.write_all(&serde_json::to_vec(&metadata)?).await?; +file.shutdown().await?; + +let glue_table = convert_to_glue_table( +&table_name, +metadata_location.clone(), +&metadata, +metadata.properties(), +None, +)?; + +let builder = self +.client +.0 +.create_table() +.database_name(&db_name) +.table_input(glue_table); +let builder = with_catalog_id!(builder, self.config); + +builder.send().await.map_err(from_aws_sdk_error)?; + +let table = Table::builder() +.file_io(self.file_io()) +.metadata_location(metadata_location) +.metadata(metadata) +.identifier(TableIdent::new(NamespaceIdent::new(db_name), table_name)) +.build(); + +Ok(table) } -async fn load_table(&self, _table: &TableIdent) -> Result { -todo!() +/// Loads a table from the Glue Catalog and constructs a `Table` object +/// based on its metadata. +/// +/// # Returns +/// A `Result` wrapping a `Table` object that represents the loaded table. +/// +/// # Errors +/// This function may return an error in several scenarios, including: +/// - Failure to validate the namespace. +/// - Failure to retrieve the table from the Glue Catalog. +/// - Absence of metadata location information in the table's properties. +/// - Issues reading or deserializing the table's metadata file. +async fn load_table(&self, table: &TableIdent) -> Result { +let db_name = validate_namespace(table.namespace())?; +let table_name = table.name(); + +let builder = self +.client +.0 +.get_table() +.database_name(&db_name) +.name(table_name); +let builder = with_catalog_id!(builder, self.config); + +let glue_table_output = builder.send().await.map_err(from_aws_sdk_error)?; + +match glue_table_output.table() { +None => Err(Error::new( +ErrorKind::Unexpected, +format!( +"'Table' object for database: {} and table: {} does not exist", Review Comment: Probably wanted to remove both single quotes here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: Glue Catalog - table operations (3/3) [iceberg-rust]
marvinlanhenke commented on code in PR #314: URL: https://github.com/apache/iceberg-rust/pull/314#discussion_r1567406955 ## crates/catalog/glue/src/catalog.rs: ## @@ -310,31 +327,282 @@ impl Catalog for GlueCatalog { Ok(table_list) } +/// Creates a new table within a specified namespace using the provided +/// table creation settings. +/// +/// # Returns +/// A `Result` wrapping a `Table` object representing the newly created +/// table. +/// +/// # Errors +/// This function may return an error in several cases, including invalid +/// namespace identifiers, failure to determine a default storage location, +/// issues generating or writing table metadata, and errors communicating +/// with the Glue Catalog. async fn create_table( &self, -_namespace: &NamespaceIdent, -_creation: TableCreation, +namespace: &NamespaceIdent, +creation: TableCreation, ) -> Result { -todo!() +let db_name = validate_namespace(namespace)?; +let table_name = creation.name.clone(); + +let location = match &creation.location { +Some(location) => location.clone(), +None => { +let ns = self.get_namespace(namespace).await?; +get_default_table_location(&ns, &table_name, &self.config.warehouse) +} +}; + +let metadata = TableMetadataBuilder::from_table_creation(creation)?.build()?; +let metadata_location = create_metadata_location(&location, 0)?; + +let mut file = self +.file_io +.new_output(&metadata_location)? +.writer() +.await?; +file.write_all(&serde_json::to_vec(&metadata)?).await?; +file.shutdown().await?; + +let glue_table = convert_to_glue_table( +&table_name, +metadata_location.clone(), +&metadata, +metadata.properties(), +None, +)?; + +let builder = self +.client +.0 +.create_table() +.database_name(&db_name) +.table_input(glue_table); +let builder = with_catalog_id!(builder, self.config); + +builder.send().await.map_err(from_aws_sdk_error)?; + +let table = Table::builder() +.file_io(self.file_io()) +.metadata_location(metadata_location) +.metadata(metadata) +.identifier(TableIdent::new(NamespaceIdent::new(db_name), table_name)) +.build(); + +Ok(table) } -async fn load_table(&self, _table: &TableIdent) -> Result { -todo!() +/// Loads a table from the Glue Catalog and constructs a `Table` object +/// based on its metadata. +/// +/// # Returns +/// A `Result` wrapping a `Table` object that represents the loaded table. +/// +/// # Errors +/// This function may return an error in several scenarios, including: +/// - Failure to validate the namespace. +/// - Failure to retrieve the table from the Glue Catalog. +/// - Absence of metadata location information in the table's properties. +/// - Issues reading or deserializing the table's metadata file. +async fn load_table(&self, table: &TableIdent) -> Result { +let db_name = validate_namespace(table.namespace())?; +let table_name = table.name(); + +let builder = self +.client +.0 +.get_table() +.database_name(&db_name) +.name(table_name); +let builder = with_catalog_id!(builder, self.config); + +let glue_table_output = builder.send().await.map_err(from_aws_sdk_error)?; + +match glue_table_output.table() { +None => Err(Error::new( +ErrorKind::Unexpected, +format!( +"'Table' object for database: {} and table: {} does not exist", +db_name, table_name +), +)), +Some(table) => { +let metadata_location = get_metadata_location(&table.parameters)?; Review Comment: Yes, I kind of 'ignored' this on purpose - same check is missing for the hive catalog as well; I guess we can handle those in separate PRs. -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: Glue Catalog - table operations (3/3) [iceberg-rust]
marvinlanhenke commented on PR #314: URL: https://github.com/apache/iceberg-rust/pull/314#issuecomment-2059160129 > Thanks for @marvinlanhenke for this great pr, it looks great! Sorry for late reply, I have been busy lately. @liurenjie1024 No worries, and thanks for the review. I'll most likely fix those remaining suggestions tomorrow. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Fix JDBC Catalog table commit when migrating from schema V0 to V1 (#101111) [iceberg]
nastra merged PR #10152: URL: https://github.com/apache/iceberg/pull/10152 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]
liurenjie1024 commented on issue #329: URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2059202270 I think implementing data file appends involves two main tasks: 1. Implement the transaction API to append data files 2. Implement the file writer to write record batches to parquet files and generate data file structs. Currently there is no design or plan for 1, and @ZENOTME is working on 2. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Sql catalog [iceberg-rust]
liurenjie1024 commented on PR #229: URL: https://github.com/apache/iceberg-rust/pull/229#issuecomment-2059212759 cc @JanKaul Is this PR ready for review, or do you need to make more updates? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]
liurenjie1024 commented on code in PR #295: URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1567468384 ## crates/iceberg/src/arrow/reader.rs: ## @@ -186,4 +216,399 @@ impl ArrowReader { Ok(ProjectionMask::leaves(parquet_schema, indices)) } } + +fn get_row_filter(&self, parquet_schema: &SchemaDescriptor) -> Result> { +if let Some(predicates) = &self.predicates { +let field_id_map = self.build_field_id_map(parquet_schema)?; + +// Collect Parquet column indices from field ids +let mut collector = CollectFieldIdVisitor { field_ids: vec![] }; +visit_predicate(&mut collector, predicates).unwrap(); Review Comment: We don't need to do this for every file? ## crates/iceberg/src/arrow/reader.rs: ## @@ -186,4 +216,399 @@ impl ArrowReader { Ok(ProjectionMask::leaves(parquet_schema, indices)) } } + +fn get_row_filter(&self, parquet_schema: &SchemaDescriptor) -> Result> { +if let Some(predicates) = &self.predicates { +let field_id_map = self.build_field_id_map(parquet_schema)?; + +// Collect Parquet column indices from field ids +let mut collector = CollectFieldIdVisitor { field_ids: vec![] }; +visit_predicate(&mut collector, predicates).unwrap(); +let column_indices = collector +.field_ids +.iter() +.map(|field_id| { +field_id_map.get(field_id).cloned().ok_or_else(|| { +Error::new(ErrorKind::DataInvalid, "Field id not found in schema") +}) +}) +.collect::>>()?; + +// Convert BoundPredicates to ArrowPredicates +let mut converter = PredicateConverter { +columns: &column_indices, +projection_mask: ProjectionMask::leaves(parquet_schema, column_indices.clone()), +parquet_schema, +column_map: &field_id_map, +}; +let arrow_predicate = visit_predicate(&mut converter, predicates)?; +Ok(Some(RowFilter::new(vec![arrow_predicate]))) +} else { +Ok(None) +} +} + +/// Build the map of field id to Parquet column index in the schema. +fn build_field_id_map(&self, parquet_schema: &SchemaDescriptor) -> Result> { Review Comment: ```suggestion fn build_field_id_map(parquet_schema: &SchemaDescriptor) -> Result> { ``` I think we don't need `self` parameter? ## crates/iceberg/src/arrow/reader.rs: ## @@ -186,4 +216,399 @@ impl ArrowReader { Ok(ProjectionMask::leaves(parquet_schema, indices)) } } + +fn get_row_filter(&self, parquet_schema: &SchemaDescriptor) -> Result> { +if let Some(predicates) = &self.predicates { +let field_id_map = self.build_field_id_map(parquet_schema)?; + +// Collect Parquet column indices from field ids +let mut collector = CollectFieldIdVisitor { field_ids: vec![] }; +visit_predicate(&mut collector, predicates).unwrap(); +let column_indices = collector +.field_ids +.iter() +.map(|field_id| { +field_id_map.get(field_id).cloned().ok_or_else(|| { +Error::new(ErrorKind::DataInvalid, "Field id not found in schema") +}) +}) +.collect::>>()?; + +// Convert BoundPredicates to ArrowPredicates +let mut converter = PredicateConverter { +columns: &column_indices, +projection_mask: ProjectionMask::leaves(parquet_schema, column_indices.clone()), +parquet_schema, +column_map: &field_id_map, +}; +let arrow_predicate = visit_predicate(&mut converter, predicates)?; +Ok(Some(RowFilter::new(vec![arrow_predicate]))) +} else { +Ok(None) +} +} + +/// Build the map of field id to Parquet column index in the schema. +fn build_field_id_map(&self, parquet_schema: &SchemaDescriptor) -> Result> { +let mut column_map = HashMap::new(); +for (idx, field) in parquet_schema.columns().iter().enumerate() { +let field_type = field.self_type(); +match field_type { +ParquetType::PrimitiveType { basic_info, .. 
} => { +if !basic_info.has_id() { +return Err(Error::new( +ErrorKind::DataInvalid, +format!( +"Leave column {:?} in schema doesn't have field id", +field_type +), +)); +} +
Re: [PR] Flink: Adds support for Flink 1.19 version [iceberg]
rodmeneses commented on PR #10112: URL: https://github.com/apache/iceberg/pull/10112#issuecomment-2059306340 > > @nastra: Any comments? I would like to merge this soon, as any merge to Flink code path will make this PR stale, and @rodmeneses needs to recreate the whole PR. > > Thanks, > > I think there's still an issue as there are a bunch of files/diffs that are because Flink 1.16 is being removed and git detects it as a move (with some additional changes). This can also be seen when looking at the file path, where a Flink 1.16 file is moved to a Flink 1.19 file, while also adding some diffs where it's not clear why the diff is there in the first place. > > My suggestion would be to do the actual removal of the 1.16 directory as a separate PR in an immediate follow-up. This would mean to skip tests 8 + 9 from the PR description, but it's fine to update gradle files to not build 1.16 anymore. Thoughts on the suggestion? Hi @nastra, thanks for your review and comments. > I think there's still an issue as there are a bunch of files/diffs that are because Flink 1.16 is being removed and git detects it as a move (with some additional changes). This is because you are looking at the changes in the whole PR. If you review each of the 4 commits individually, you'll find that everything makes sense. I took the approach of not deleting v1.16 and updated this PR, but if you look at the changes as a whole PR it is now even worse than before, because we don't see the history properly. Given this, I'd suggest moving forward with deleting v1.16 in this same PR. Thoughts? @nastra @pvary -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Adds support for Flink 1.19 version [iceberg]
pvary commented on PR #10112: URL: https://github.com/apache/iceberg/pull/10112#issuecomment-2059345426 > Why do we remove Flink 1.1.6 in this PR? @manuzhang: This is how we usually do these changes. We support the last 3 versions of Flink, so when we add a new version, we remove the oldest one. We also do the changes this way to keep the history of the main directory (in our case, 1.18->1.19). Old PRs: - #9211 - #7254 - #6092 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Adds support for Flink 1.19 version [iceberg]
pvary merged PR #10112: URL: https://github.com/apache/iceberg/pull/10112 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Tests: Unify the test catalog setting [iceberg-python]
HonahX merged PR #609: URL: https://github.com/apache/iceberg-python/pull/609 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Adds support for Flink 1.19 version [iceberg]
pvary commented on PR #10112: URL: https://github.com/apache/iceberg/pull/10112#issuecomment-2059391305 Merged to main. Thanks for the PR @rodmeneses and @nastra for the review! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Removes Flink version 1.16 [iceberg]
rodmeneses commented on PR #10154: URL: https://github.com/apache/iceberg/pull/10154#issuecomment-2059418597 cc: @pvary @nastra please take a look. Thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] JDBC Catalog:Change SQL like escape character from '\\' to '!' [iceberg]
nastra commented on PR #9407: URL: https://github.com/apache/iceberg/pull/9407#issuecomment-2059442673 Fixes https://github.com/apache/iceberg/issues/10056. @xuchuanqiu can you please rebase this one to fix the merge conflicts? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] JDBC Catalog:Change SQL like escape character from '\\' to '!' [iceberg]
jbonofre commented on code in PR #9407: URL: https://github.com/apache/iceberg/pull/9407#discussion_r1567618729 ## core/src/main/java/org/apache/iceberg/jdbc/JdbcUtil.java: ## @@ -136,7 +136,7 @@ final class JdbcUtil { + TABLE_NAMESPACE + " = ? OR " + TABLE_NAMESPACE - + " LIKE ? ESCAPE '\\' " + + " LIKE ? ESCAPE '!' " Review Comment: Does it work with any backend, like PostgreSQL? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
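For context, a rough sketch of how an escape character interacts with a `LIKE` prefix match when listing namespaces. This is illustrative only: the helper, table, and column names below are placeholders, not the actual JdbcUtil code. Any literal `%`, `_`, or escape character in the prefix must be escaped with the chosen character before binding the parameter, and that character has to be accepted by every supported backend (SQLite, MySQL, PostgreSQL, etc.).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Illustrative only; table/column names are placeholders, not the real JDBC catalog schema.
class LikeEscapeSketch {

  // Escapes LIKE wildcards in a namespace prefix using '!' as the escape character.
  // The escape character itself is escaped first so later replacements stay literal.
  static String escapePrefix(String prefix) {
    return prefix.replace("!", "!!").replace("_", "!_").replace("%", "!%");
  }

  static void listChildNamespaces(Connection conn, String parent) throws Exception {
    String sql = "SELECT table_namespace FROM iceberg_tables WHERE table_namespace LIKE ? ESCAPE '!'";
    try (PreparedStatement stmt = conn.prepareStatement(sql)) {
      // Match everything under the parent prefix; the trailing '%' is the only
      // unescaped wildcard left in the bound pattern.
      stmt.setString(1, escapePrefix(parent + ".") + "%");
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          System.out.println(rs.getString(1));
        }
      }
    }
  }
}
```

Whether `ESCAPE '!'` behaves identically across all target databases is exactly the question raised above; the standard `LIKE ... ESCAPE` clause is supported by SQLite, MySQL, and PostgreSQL, but verifying each backend is still worthwhile.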
Re: [PR] JDBC Catalog:Change SQL like escape character from '\\' to '!' [iceberg]
jbonofre commented on PR #9407: URL: https://github.com/apache/iceberg/pull/9407#issuecomment-2059448849 If @xuchuanqiu doesn't have the bandwidth, I can help on this one. If it works with MySQL, SQLite and PostgreSQL, it's a nice fix to include in Iceberg 1.5.1. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Removes Flink version 1.16 [iceberg]
pvary merged PR #10154: URL: https://github.com/apache/iceberg/pull/10154 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Don't fail to serialize IcebergSourceSplit when there is too many delete files [iceberg]
stevenzwu commented on code in PR #9464: URL: https://github.com/apache/iceberg/pull/9464#discussion_r1567629192 ## flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/source/split/SerializerHelper.java: ## @@ -0,0 +1,186 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.iceberg.flink.source.split; + +import java.io.IOException; +import java.io.Serializable; +import java.io.UTFDataFormatException; +import org.apache.flink.core.memory.DataInputDeserializer; +import org.apache.flink.core.memory.DataOutputSerializer; + +/** + * Helper class to serialize and deserialize strings longer than 65K. The inspiration is mostly + * taken from the class org.apache.flink.core.memory.DataInputSerializer.readUTF and + * org.apache.flink.core.memory.DataOutputSerializer.writeUTF. + */ +public class SerializerHelper implements Serializable { + + /** + * Similar to {@link DataOutputSerializer#writeUTF(String)}. The size is only limited by the + * maximum java array size of the buffer. + * + * @param out the output stream to write the string to. + * @param str the string value to be written. + * @deprecated This method is deprecated because there will be a method within the {@link + * DataOutputSerializer} already which does the same thing, so use that one instead once that + * is released on Flink version 1.20. + * See * https://issues.apache.org/jira/browse/FLINK-34228";>FLINK-34228 * https://github.com/apache/flink/pull/24191";>https://github.com/apache/flink/pull/24191 + */ + @Deprecated + public static void writeLongUTF(DataOutputSerializer out, String str) throws IOException { +int strlen = str.length(); +long utflen = 0; +int c; + +/* use charAt instead of copying String to char array */ +for (int i = 0; i < strlen; i++) { + c = str.charAt(i); + utflen += getUTFBytesSize(c); + + if (utflen > Integer.MAX_VALUE) { +throw new UTFDataFormatException("Encoded string reached maximum length: " + utflen); + } +} +if (utflen > Integer.MAX_VALUE - 4) { + throw new UTFDataFormatException("Encoded string is too long: " + utflen); +} + Review Comment: @pvary you are right. I guess the resize call from the [Flink PR](https://github.com/apache/flink/pull/24191/files) is not necessary then. this is where @javrasya had the question for this thread. but I guess @javrasya 's original comment is not 100% inaccurate. ``` if (utflen > Integer.MAX_VALUE - 4) { throw new UTFDataFormatException("Encoded string is too long: " + utflen); } else if (this.position > this.buffer.length - utflen - 2) { resize((int) utflen + 4); } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
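For reference, a small sketch of the length accounting being discussed, assuming the standard modified-UTF-8 size rules used by `DataOutput.writeUTF`-style encoders; it is not the Flink or Iceberg helper under review. Because a single `char` can take up to three bytes, the encoded length can exceed the 64 KB limit of the classic `writeUTF`, which is why a "long UTF" variant tracks the length in a `long` and rejects anything above `Integer.MAX_VALUE`.

```java
import java.io.UTFDataFormatException;

// Sketch only; mirrors the well-known modified-UTF-8 size rules, not the exact helper under review.
class Utf8LengthSketch {

  // Byte size of one char in modified UTF-8 (as used by DataOutput.writeUTF).
  static int utfBytesSize(char c) {
    if (c >= 0x0001 && c <= 0x007F) {
      return 1; // ASCII except NUL: 1 byte
    } else if (c > 0x07FF) {
      return 3; // everything above U+07FF (including surrogates): 3 bytes
    } else {
      return 2; // NUL and U+0080..U+07FF: 2 bytes
    }
  }

  // Total encoded length, tracked as a long so the overflow check itself cannot overflow.
  static long encodedLength(String str) throws UTFDataFormatException {
    long utflen = 0;
    for (int i = 0; i < str.length(); i++) {
      utflen += utfBytesSize(str.charAt(i));
      if (utflen > Integer.MAX_VALUE) {
        throw new UTFDataFormatException("Encoded string reached maximum length: " + utflen);
      }
    }

    return utflen;
  }
}
```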
Re: [PR] Removes Flink version 1.16 [iceberg]
pvary commented on PR #10154: URL: https://github.com/apache/iceberg/pull/10154#issuecomment-2059456318 Thanks for the PR @rodmeneses and @nastra for the review! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: Don't fail to serialize IcebergSourceSplit when there is too many delete files [iceberg]
stevenzwu commented on code in PR #9464: URL: https://github.com/apache/iceberg/pull/9464#discussion_r1567629965 ## flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/source/split/SerializerHelper.java: ## @@ -0,0 +1,200 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.iceberg.flink.source.split; + +import java.io.IOException; +import java.io.Serializable; +import java.io.UTFDataFormatException; +import org.apache.flink.core.memory.DataInputDeserializer; +import org.apache.flink.core.memory.DataOutputSerializer; + +/** + * Helper class to serialize and deserialize strings longer than 65K. The inspiration is mostly + * taken from the class org.apache.flink.core.memory.DataInputSerializer.readUTF and + * org.apache.flink.core.memory.DataOutputSerializer.writeUTF. + */ +public class SerializerHelper implements Serializable { + + /** + * Similar to {@link DataOutputSerializer#writeUTF(String)}. Except this supports larger payloads + * which is up to max integer value. + * + * @param out the output stream to write the string to. + * @param str the string value to be written. + * @deprecated This method is deprecated because there will be a method within the {@link + * DataOutputSerializer} already which does the same thing, so use that one instead once that + * is released on Flink version 1.20. + * See * https://issues.apache.org/jira/browse/FLINK-34228";>FLINK-34228 * https://github.com/apache/flink/pull/24191";>https://github.com/apache/flink/pull/24191 + */ + @Deprecated + public static void writeLongUTF(DataOutputSerializer out, String str) throws IOException { +int strlen = str.length(); +long utflen = 0; +int c; + +/* use charAt instead of copying String to char array */ +for (int i = 0; i < strlen; i++) { + c = str.charAt(i); + utflen += getUTFBytesSize(c); + + if (utflen > Integer.MAX_VALUE) { +throw new UTFDataFormatException("Encoded string reached maximum length: " + utflen); + } +} +if (utflen > Integer.MAX_VALUE - 4) { Review Comment: nit: Iceberg style leave a blank line after the close of a block (for loop) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Removes Flink version 1.16 [iceberg]
Fokko commented on PR #10154: URL: https://github.com/apache/iceberg/pull/10154#issuecomment-2059523299 Thanks @rodmeneses for working on this. Can you [update the table here](https://github.com/apache/iceberg/blob/main/site/docs/multi-engine-support.md#apache-flink) as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Removes Flink version 1.16 [iceberg]
rodmeneses commented on PR #10154: URL: https://github.com/apache/iceberg/pull/10154#issuecomment-2059543391 @Fokko Thanks for the reminder. Here's the PR: https://github.com/apache/iceberg/pull/10155 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Updates flink versioning information in our docs [iceberg]
Fokko commented on code in PR #10155: URL: https://github.com/apache/iceberg/pull/10155#discussion_r1567700280 ## site/docs/multi-engine-support.md: ## @@ -83,15 +83,16 @@ Users should continuously upgrade their Flink version to stay up-to-date. | Version | Lifecycle Stage | Initial Iceberg Support | Latest Iceberg Support | Latest Runtime Jar | -| --- | --- | --- ||--| +| --- |-|-||--| | 1.11| End of Life | 0.9.0 | 0.12.1 | [iceberg-flink-runtime](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime/0.12.1/iceberg-flink-runtime-0.12.1.jar) | | 1.12| End of Life | 0.12.0 | 0.13.1 | [iceberg-flink-runtime-1.12](https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.12/0.13.2/iceberg-flink-runtime-1.12-0.13.2.jar) [3] | | 1.13| End of Life | 0.13.0 | 1.0.0 | [iceberg-flink-runtime-1.13](https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.13/1.2.0/iceberg-flink-runtime-1.13-1.0.0.jar) | | 1.14| End of Life | 0.13.0 | 1.2.0 | [iceberg-flink-runtime-1.14](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.14/1.2.0/iceberg-flink-runtime-1.14-1.2.0.jar) | | 1.15| End of Life | 0.14.0 | 1.4.3 | [iceberg-flink-runtime-1.15](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.15/1.4.3/iceberg-flink-runtime-1.15-1.4.3.jar) | -| 1.16| Deprecated | 1.1.0 | {{ icebergVersion }} | [iceberg-flink-runtime-1.16](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.16/{{ icebergVersion }}/iceberg-flink-runtime-1.16-{{ icebergVersion }}.jar) | -| 1.17| Maintained | 1.3.0 | {{ icebergVersion }} | [iceberg-flink-runtime-1.17](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.17/{{ icebergVersion }}/iceberg-flink-runtime-1.17-{{ icebergVersion }}.jar) | +| 1.16| End of Life | 1.1.0 | {{ icebergVersion }} | [iceberg-flink-runtime-1.16](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.16/{{ icebergVersion }}/iceberg-flink-runtime-1.16-{{ icebergVersion }}.jar) | Review Comment: ```suggestion | 1.16| End of Life | 1.1.0 | 1.5.0 | [iceberg-flink-runtime-1.16](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-flink-runtime-1.16/1.5.0/iceberg-flink-runtime-1.16-1.5.0.jar) | ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[I] Iceberg Spark streaming skips rows of data [iceberg]
cccs-jc opened a new issue, #10156: URL: https://github.com/apache/iceberg/issues/10156

### Apache Iceberg version

1.5.0 (latest release)

### Query engine

Spark

### Please describe the bug 🐞

When using Spark readStream, the option `stream-from-timestamp` is used to position the read at the specified timestamp. The query below uses the current time to read from the head of the queue.

```python
ts = int(time.time() * 1000)

df = (
    spark.readStream.format("iceberg")
    .option("streaming-skip-delete-snapshots", True)
    .option("streaming-skip-overwrite-snapshots", True)
    .option("streaming-max-files-per-micro-batch", max_files)
    .option("streaming-max-rows-per-micro-batch", max_rows)
    .option("stream-from-timestamp", ts)
    .load(source_table)
)
```

You can kill your streaming job and wait 10 minutes, then start it again. The readStream will load the checkpointed offset from disk and is supposed to read from that offset. However, there is a bug that causes it to skip the commits that occurred in those 10 minutes; instead, the readStream reads from the latest commit.

I can work around this bug by not specifying `stream-from-timestamp` if the query uses the checkpointed offset.

```python
ts = int(time.time() * 1000)
use_ts = checkpoint_dir_exists(checkpoint_location) == False

df = (
    get_spark().readStream.format("iceberg")
    .option("streaming-skip-delete-snapshots", True)
    .option("streaming-skip-overwrite-snapshots", True)
    .option("streaming-max-files-per-micro-batch", max_files)
    .option("streaming-max-rows-per-micro-batch", max_rows)
)

if use_ts:
    df = df.option("stream-from-timestamp", ts)

df = df.load(source_table)
```

But this is error prone. As a user I expect the readStream to continue from the last checkpointed offset.

I suspect the issue might be here: https://github.com/apache/iceberg/blob/fc5b2b336c774b0b8b032f7d87a1fb21e76b3f20/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java#L326

This seems to short-circuit the saved `startOffset` by checking if the last snapshot in the table is older than the requested `stream-from-timestamp`. @singhpk234 I have not stepped through the code to be sure. Is my theory possible?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
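A minimal sketch of the behaviour the reporter suspects; the class, method, and offset handling below are placeholders for illustration, not the actual `SparkMicroBatchStream` internals. The idea: if the newest snapshot is older than the configured `stream-from-timestamp`, the checkpointed offset would be bypassed and the stream would jump to the table head.

```java
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

// Hypothetical illustration only; not the real SparkMicroBatchStream code.
class StartOffsetSketch {

  // Returns the snapshot id the stream would start reading from, or null if the
  // table has no snapshots and no checkpointed offset exists.
  static Long resolveStartSnapshot(Table table, long streamFromTimestampMs, Long checkpointedSnapshotId) {
    Snapshot latest = table.currentSnapshot();
    if (latest == null) {
      return checkpointedSnapshotId;
    }

    // Suspected short-circuit: if the newest snapshot predates the configured
    // stream-from-timestamp, the checkpointed offset is ignored and the stream
    // starts from the table head, skipping commits made while the job was down.
    if (latest.timestampMillis() < streamFromTimestampMs) {
      return latest.snapshotId();
    }

    // Expected behaviour: honour the checkpointed offset when one exists.
    return checkpointedSnapshotId != null ? checkpointedSnapshotId : latest.snapshotId();
  }
}
```

Under this reading, restarting the job with a fresh `stream-from-timestamp` that is newer than the last committed snapshot would always take the short-circuit branch, which matches the reported behaviour of skipping the commits made while the job was down.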
Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]
viirya commented on code in PR #295: URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1567725081 ## crates/iceberg/src/arrow/reader.rs: ## @@ -186,4 +216,399 @@ impl ArrowReader { Ok(ProjectionMask::leaves(parquet_schema, indices)) } } + +fn get_row_filter(&self, parquet_schema: &SchemaDescriptor) -> Result> { +if let Some(predicates) = &self.predicates { +let field_id_map = self.build_field_id_map(parquet_schema)?; + +// Collect Parquet column indices from field ids +let mut collector = CollectFieldIdVisitor { field_ids: vec![] }; +visit_predicate(&mut collector, predicates).unwrap(); +let column_indices = collector +.field_ids +.iter() +.map(|field_id| { +field_id_map.get(field_id).cloned().ok_or_else(|| { +Error::new(ErrorKind::DataInvalid, "Field id not found in schema") +}) +}) +.collect::>>()?; + +// Convert BoundPredicates to ArrowPredicates +let mut converter = PredicateConverter { +columns: &column_indices, +projection_mask: ProjectionMask::leaves(parquet_schema, column_indices.clone()), +parquet_schema, +column_map: &field_id_map, +}; +let arrow_predicate = visit_predicate(&mut converter, predicates)?; +Ok(Some(RowFilter::new(vec![arrow_predicate]))) +} else { +Ok(None) +} +} + +/// Build the map of field id to Parquet column index in the schema. +fn build_field_id_map(&self, parquet_schema: &SchemaDescriptor) -> Result> { +let mut column_map = HashMap::new(); +for (idx, field) in parquet_schema.columns().iter().enumerate() { +let field_type = field.self_type(); +match field_type { +ParquetType::PrimitiveType { basic_info, .. } => { +if !basic_info.has_id() { +return Err(Error::new( +ErrorKind::DataInvalid, +format!( +"Leave column {:?} in schema doesn't have field id", +field_type +), +)); +} +column_map.insert(basic_info.id(), idx); +} +ParquetType::GroupType { .. } => { +return Err(Error::new( +ErrorKind::DataInvalid, +format!( +"Leave column in schema should be primitive type but got {:?}", +field_type +), +)); +} +}; +} + +Ok(column_map) +} +} + +/// A visitor to collect field ids from bound predicates. +struct CollectFieldIdVisitor { +field_ids: Vec, +} + +impl BoundPredicateVisitor for CollectFieldIdVisitor { Review Comment: Okay, let me take a look, not sure if it is completely suitable for the usage here, especially for `PredicateConverter`. But let me try it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Pagination To List Apis [iceberg]
rahil-c commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1567730841 ## core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java: ## @@ -278,14 +286,26 @@ public void setConf(Object newConf) { @Override public List listTables(SessionContext context, Namespace ns) { checkNamespaceIsValid(ns); +Map queryParams = Maps.newHashMap(); +List tables = Lists.newArrayList(); +String pageToken = ""; +if (restPageSize != null) { + queryParams.put("pageToken", pageToken); + queryParams.put("pageSize", restPageSize); +} +do { + ListTablesResponse response = + client.get( + paths.tables(ns), + queryParams, + ListTablesResponse.class, + headers(context), + ErrorHandlers.namespaceErrorHandler()); + pageToken = response.nextPageToken(); Review Comment: Will take a look thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
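For reference, a compact sketch of the client-side pagination pattern the diff above implements: keep requesting pages and accumulating results until the server stops returning a next-page token. The `PagedClient`/`PagedResponse` types here are placeholders, not the Iceberg REST classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of client-side pagination; types are placeholders, not the Iceberg REST API.
class PaginationSketch {

  interface PagedResponse<T> {
    List<T> items();

    String nextPageToken(); // null or empty when there are no more pages
  }

  interface PagedClient<T> {
    PagedResponse<T> list(String pageToken, String pageSize);
  }

  // Accumulates every item across all pages.
  static <T> List<T> listAll(PagedClient<T> client, String pageSize) {
    List<T> results = new ArrayList<>();
    String pageToken = "";
    do {
      PagedResponse<T> response = client.list(pageToken, pageSize);
      results.addAll(response.items());
      pageToken = response.nextPageToken();
    } while (pageToken != null && !pageToken.isEmpty());

    return results;
  }
}
```

The loop terminates only when the returned token is null or empty, so how `nextPageToken()` is populated on the final page is the detail the review comment is asking about.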
Re: [PR] Add Pagination To List Apis [iceberg]
rahil-c commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1567734951 ## core/src/test/java/org/apache/iceberg/rest/responses/TestListNamespacesResponse.java: ## @@ -34,7 +34,7 @@ public class TestListNamespacesResponse extends RequestResponseTestBase
Re: [PR] add `InclusiveProjection` Visitor [iceberg-rust]
sdd closed pull request #321: add `InclusiveProjection` Visitor URL: https://github.com/apache/iceberg-rust/pull/321 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] add `InclusiveProjection` Visitor [iceberg-rust]
sdd commented on PR #321: URL: https://github.com/apache/iceberg-rust/pull/321#issuecomment-2059636629 Closing this one in favour of the alternate PR based on the BoundPredicateVisitor design that got merged (https://github.com/apache/iceberg-rust/pull/335) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Pagination To List Apis [iceberg]
rahil-c commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1567745943 ## core/src/test/java/org/apache/iceberg/rest/responses/TestListTablesResponse.java: ## @@ -36,7 +36,7 @@ public class TestListTablesResponse extends RequestResponseTestBase
Re: [PR] Add Pagination To List Apis [iceberg]
rahil-c commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1567748849 ## core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java: ## @@ -2329,6 +2374,57 @@ public void multipleDiffsAgainstMultipleTablesLastFails() { assertThat(schema2.columns()).hasSize(1); } + @Test + public void testInvalidRestPageSize() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); +RESTCatalog catalog = +new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) -> adapter); +org.assertj.core.api.Assertions.assertThatThrownBy( Review Comment: will fix ## core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java: ## @@ -2329,6 +2374,57 @@ public void multipleDiffsAgainstMultipleTablesLastFails() { assertThat(schema2.columns()).hasSize(1); } + @Test + public void testInvalidRestPageSize() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); +RESTCatalog catalog = +new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) -> adapter); +org.assertj.core.api.Assertions.assertThatThrownBy( +() -> catalog.initialize("test", ImmutableMap.of(RESTSessionCatalog.REST_PAGE_SIZE, "-1"))) +.isInstanceOf(IllegalArgumentException.class) +.hasMessage("Invalid value for pageSize, must be a positive integer"); + } + + @Test + public void testPaginationForListNamespaces() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); +RESTCatalog catalog = +new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) -> adapter); +catalog.initialize("test", ImmutableMap.of(RESTSessionCatalog.REST_PAGE_SIZE, "10")); +int numberOfItems = 100; +String namespaceName = "newdb"; + +// create several namespaces for listing and verify +for (int i = 0; i < numberOfItems; i++) { + String nameSpaceName = namespaceName + i; + catalog.createNamespace(Namespace.of(nameSpaceName)); +} + +List results = catalog.listNamespaces(); +assertThat(results).hasSize(numberOfItems); Review Comment: will fix -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Pagination To List Apis [iceberg]
rahil-c commented on code in PR #9782: URL: https://github.com/apache/iceberg/pull/9782#discussion_r1567755548 ## core/src/test/java/org/apache/iceberg/rest/TestRESTCatalog.java: ## @@ -2329,6 +2374,57 @@ public void multipleDiffsAgainstMultipleTablesLastFails() { assertThat(schema2.columns()).hasSize(1); } + @Test + public void testInvalidRestPageSize() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); +RESTCatalog catalog = +new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) -> adapter); +org.assertj.core.api.Assertions.assertThatThrownBy( +() -> catalog.initialize("test", ImmutableMap.of(RESTSessionCatalog.REST_PAGE_SIZE, "-1"))) +.isInstanceOf(IllegalArgumentException.class) +.hasMessage("Invalid value for pageSize, must be a positive integer"); + } + + @Test + public void testPaginationForListNamespaces() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); +RESTCatalog catalog = +new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) -> adapter); +catalog.initialize("test", ImmutableMap.of(RESTSessionCatalog.REST_PAGE_SIZE, "10")); +int numberOfItems = 100; +String namespaceName = "newdb"; + +// create several namespaces for listing and verify +for (int i = 0; i < numberOfItems; i++) { + String nameSpaceName = namespaceName + i; + catalog.createNamespace(Namespace.of(nameSpaceName)); +} + +List results = catalog.listNamespaces(); +assertThat(results).hasSize(numberOfItems); + } + + @Test + public void testPaginationForListTables() { +RESTCatalogAdapter adapter = Mockito.spy(new RESTCatalogAdapter(backendCatalog)); +RESTCatalog catalog = +new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) -> adapter); +catalog.initialize("test", ImmutableMap.of(RESTSessionCatalog.REST_PAGE_SIZE, "10")); +int numberOfItems = 100; +String namespaceName = "newdb"; +String tableName = "newtable"; +catalog.createNamespace(Namespace.of(namespaceName)); + +// create several tables under namespace for listing and verify +for (int i = 0; i < numberOfItems; i++) { + TableIdentifier tableIdentifier = TableIdentifier.of(namespaceName, tableName + i); + catalog.createTable(tableIdentifier, SCHEMA); +} + +List tables = catalog.listTables(Namespace.of(namespaceName)); +assertThat(tables).hasSize(numberOfItems); Review Comment: will fix -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org