[PR] Revert "feat: Introduce snapshot summary properties (#1336)" [iceberg-rust]
liurenjie1024 opened a new pull request, #1390: URL: https://github.com/apache/iceberg-rust/pull/1390 ## Which issue does this PR close? This reverts commit 8bb9a882c7e5f6cfd2a9762136e874421cd7fb59. See https://github.com/apache/iceberg-rust/pull/1336#issuecomment-2915710134
Re: [I] Add a method in TableCommit to apply changes to TableMetadata to create a new TableMetadata [iceberg-rust]
liurenjie1024 commented on issue #1386: URL: https://github.com/apache/iceberg-rust/issues/1386#issuecomment-2915785708 cc @CTTY
Re: [PR] Change `FieldSummary` `{upper,lower}_bound` to `ByteBuf` [iceberg-rust]
liurenjie1024 commented on code in PR #1369: URL: https://github.com/apache/iceberg-rust/pull/1369#discussion_r2111508289 ## crates/iceberg/src/expr/visitors/manifest_evaluator.rs: ## @@ -154,9 +155,19 @@ impl BoundPredicateVisitor for ManifestFilterVisitor<'_> { _predicate: &BoundPredicate, ) -> crate::Result { let field = self.field_summary_for_reference(reference); + match &field.lower_bound { -Some(bound) if datum <= bound => ROWS_CANNOT_MATCH, -Some(_) => ROWS_MIGHT_MATCH, +Some(bound_bytes) => { +let bound = ManifestFilterVisitor::bytes_to_datum( Review Comment: The cost depends on the type, but I think it's fine for now since it's not expected to be very expensive.
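For readers unfamiliar with the change under discussion: the filter now decodes the serialized bound into a typed value before comparing, so the per-field cost is one deserialization whose price depends on the column type. A rough Java-side analogue using the Java library's `Conversions.fromByteBuffer`; the predicate logic here is a simplified sketch, not the actual visitor:

```java
import java.nio.ByteBuffer;
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Types;

class BoundPruning {
    // For a `col < literal` predicate: if every value in the manifest is
    // already >= literal (i.e. lower bound >= literal), rows cannot match.
    static boolean rowsMightMatchLessThan(ByteBuffer lowerBound, long literal) {
        if (lowerBound == null) {
            return true; // no stats recorded: we cannot prune
        }
        // One decode per manifest/field; the cost depends on the type decoded.
        Long decoded = Conversions.fromByteBuffer(Types.LongType.get(), lowerBound);
        return decoded < literal;
    }
}
```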
Re: [PR] Change `FieldSummary` `{upper,lower}_bound` to `ByteBuf` [iceberg-rust]
liurenjie1024 merged PR #1369: URL: https://github.com/apache/iceberg-rust/pull/1369
Re: [PR] feat: add retryable property for Error [iceberg-rust]
liurenjie1024 commented on code in PR #1383: URL: https://github.com/apache/iceberg-rust/pull/1383#discussion_r2111516594 ## crates/iceberg/src/error.rs: ## @@ -134,6 +134,8 @@ pub struct Error { source: Option, backtrace: Backtrace, + +retryable: bool, Review Comment: This is quite confusing. Whether an error is retryable should be determined by the consumer from its `ErrorKind`.
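To illustrate the design point (consumer-side retry decisions keyed off the error kind, rather than a `retryable` flag stored in the error itself), here is a minimal Java sketch; the enum values are illustrative stand-ins, not the actual iceberg-rust `ErrorKind` variants:

```java
class RetryPolicy {
    // Illustrative stand-in for iceberg-rust's ErrorKind.
    enum ErrorKind { DATA_INVALID, FEATURE_UNSUPPORTED, CATALOG_COMMIT_CONFLICTS, UNEXPECTED }

    // The library reports what went wrong; the consumer decides whether to retry.
    static boolean shouldRetry(ErrorKind kind) {
        switch (kind) {
            case CATALOG_COMMIT_CONFLICTS:
                return true;  // optimistic-concurrency conflict: retrying can succeed
            default:
                return false; // invalid data or logic errors: retrying won't help
        }
    }
}
```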
Re: [PR] Flink: Migrate Flink TableSchema for IcebergSource [iceberg]
mxm commented on PR #13072: URL: https://github.com/apache/iceberg/pull/13072#issuecomment-2915875085 What are the obstacles to completely getting rid of TableSchema?
Re: [I] Feature request: support to update snapshot summary [iceberg-rust]
Xuanwo closed issue #1329: Feature request: support to update snapshot summary URL: https://github.com/apache/iceberg-rust/issues/1329
Re: [PR] Hive: Throw exception for when listing a non-existing namespace [iceberg]
pvary commented on code in PR #13130: URL: https://github.com/apache/iceberg/pull/13130#discussion_r2110933234 ## hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCatalog.java: ## @@ -1209,11 +1208,4 @@ public void testDatabaseLocationWithSlashInWarehouseDir() { assertThat(database.getLocationUri()).isEqualTo("s3://bucket/database.db"); } - - @Test - @Override - @Disabled("Hive currently returns an empty list instead of throwing a NoSuchNamespaceException") - public void testListNonExistingNamespace() { -super.testListNonExistingNamespace(); - } Review Comment: @stevenzwu: Since you'll be the next release manager, do you know what needs to be done to add a note about the change to the release notes?
Re: [I] Running MERGE INTO with more than one WHEN condition fails if the number of columns in the target table is > 321 [iceberg]
andreaschiappacasse commented on issue #10294: URL: https://github.com/apache/iceberg/issues/10294#issuecomment-2915516882 +1 on the issue, since it is still a problem in our workloads. @nastra do you think it would be possible to re-open it? Even though Trino solved the upstream issue, the problem still persists on AWS via Athena.
Re: [PR] SPARK: Remove dependency on hadoop's filesystem class from remove orphan files [iceberg]
pvary commented on code in PR #12254: URL: https://github.com/apache/iceberg/pull/12254#discussion_r2111329707 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveOrphanFilesAction.java: ## @@ -107,9 +115,12 @@ public abstract class TestRemoveOrphanFilesAction extends TestBase { protected Map properties; @Parameter private int formatVersion; - @Parameters(name = "formatVersion = {0}") + @Parameter(index = 1) + private boolean usePrefixListing; + + @Parameters(name = "formatVersion = {0}, usePrefixListing = {1}") protected static List parameters() { -return Arrays.asList(2, 3); +return Arrays.asList(new Object[] {2, true}, new Object[] {2, false}, new Object[] {3, false}); Review Comment: I believe @RussellSpitzer thought about something like this:

```java
return Arrays.stream(TestHelpers.ALL_VERSIONS)
    .filter(version -> version > 1)
    .flatMap(version -> Stream.of(
        new Object[] {version, true},
        new Object[] {version, false}))
    .collect(Collectors.toList());
```
[I] Running MERGE INTO with more than one WHEN condition fails if the number of columns in the target table is > 321 [iceberg]
andreaschiappacasse opened a new issue, #10294: URL: https://github.com/apache/iceberg/issues/10294

### Apache Iceberg version

None

### Query engine

Athena (engine v3)

### Please describe the bug 🐞

Hello everyone, today my team ran into a very strange bug using Iceberg via Athena. I'll describe the steps we used to reproduce the error below:

**1. We create an Iceberg table with an "id" column and 321 other columns with random strings - in the example below we use awswrangler to create the table, but the same happens when the table is created using Athena directly.**

``` python
import awswrangler as wr
import pandas as pd
import random, string

NUM_COLS = 322

def get_random_string(length):
    letters = string.ascii_lowercase
    result_str = ''.join(random.choice(letters) for i in range(length))
    return result_str

columns = ['id'] + [get_random_string(5) for i in range(NUM_COLS - 1)]
data = pd.DataFrame(data=[columns], columns=columns)

wr.athena.to_iceberg(
    data,
    workgroup="my-workgroup",
    database="my_database",
    table="iceberg_limits_322",
    table_location="s3://my_bucket/iceberg_limits",
)
```

**2. We then run the following query in Athena to insert a random value**

``` sql
MERGE INTO my_database.iceberg_limits_322 as existing
using (
    SELECT 'something' as id
) as new
on existing.id = new.id
WHEN NOT MATCHED THEN INSERT (id) VALUES (new.id)
WHEN MATCHED THEN DELETE
```

**3. Which results in the error:**

`[ErrorCode: INTERNAL_ERROR_QUERY_ENGINE] Amazon Athena experienced an internal error while executing this query. Please contact AWS support for further assistance. You will not be charged for this query. We apologize for the inconvenience.`

Notice that the error **only occurs when multiple WHEN clauses are used in the MERGE INTO query!** - in case one WHEN clause is used (just to insert or to delete records) everything works fine, and the table can be used normally. We can replicate this behaviour on multiple AWS accounts and with different tables/databases/s3 locations. After trying with different numbers of columns, we consistently found that 321 is the maximum number of columns the table can have; everything works fine below this threshold.
Re: [PR] feat: Introduce snapshot summary properties [iceberg-rust]
liurenjie1024 commented on PR #1336: URL: https://github.com/apache/iceberg-rust/pull/1336#issuecomment-2915710134 Hi, @Xuanwo @dentiny We need to revert this PR; the newly added update doesn't exist in the OpenAPI spec: https://github.com/apache/iceberg/blob/cab0decbb0e32bf314039e30807eb033c50665d5/open-api/rest-catalog-open-api.yaml#L3058
Re: [PR] feat: Introduce snapshot summary properties [iceberg-rust]
liurenjie1024 commented on PR #1336: URL: https://github.com/apache/iceberg-rust/pull/1336#issuecomment-2915721862 Also, I don't think we should allow users to update snapshot summaries alone; it's dangerous.
Re: [PR] feat: Introduce snapshot summary properties [iceberg-rust]
dentiny commented on PR #1336: URL: https://github.com/apache/iceberg-rust/pull/1336#issuecomment-2915720658 > Hi, @Xuanwo @dentiny We need to revert this PR; the newly added update doesn't exist in the OpenAPI spec: https://github.com/apache/iceberg/blob/cab0decbb0e32bf314039e30807eb033c50665d5/open-api/rest-catalog-open-api.yaml#L3058 Hi @liurenjie1024 , there's a `SnapshotUpdate` action in the Java implementation, which supports setting summary properties. https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/SnapshotUpdate.java
Re: [PR] feat: Introduce snapshot summary properties [iceberg-rust]
dentiny commented on PR #1336: URL: https://github.com/apache/iceberg-rust/pull/1336#issuecomment-2915729660 > Also, I don't think we should allow users to update snapshot summaries alone; it's dangerous. Hi @liurenjie1024, curious whether you have other suggestions for updating the snapshot summary? My feature request is to write customized metadata for the snapshot. The Java impl I linked above seems to fulfill the requirement, and it is exposed as public? Would like to know your thoughts :)
Re: [PR] Revert "feat: Introduce snapshot summary properties (#1336)" [iceberg-rust]
liurenjie1024 merged PR #1390: URL: https://github.com/apache/iceberg-rust/pull/1390
[I] Data race in `inclusiveMetricsEval` [iceberg-go]
psarna opened a new issue, #443: URL: https://github.com/apache/iceberg-go/issues/443

### Apache Iceberg version

None

### Please describe the bug 🐞

When running the `TestMergeManifests` test with the `-race` flag, a data race report is produced (attached at the end of this message). I only skimmed the code, but it looks like the culprit is the concurrent fetching of manifest entries [here](https://github.com/apache/iceberg-go/blob/85f700808061fcea3655412245ef5d59f1784b69/table/scanner.go#L356). All concurrent requests write to the same structure [here](https://github.com/apache/iceberg-go/blob/85f700808061fcea3655412245ef5d59f1784b69/table/evaluators.go#L754-L756). I confirmed that setting the concurrency here to `1` makes all the tests pass the data race check.

```
$ go test -race -run TestTableWriting/TestMergeManifests
==================
WARNING: DATA RACE
Write at 0x00c000f7dc20 by goroutine 353:
  github.com/apache/iceberg-go/table.(*inclusiveMetricsEval).Eval()
      /dev/repo/iceberg-go/table/evaluators.go:754 +0xb9
  github.com/apache/iceberg-go/table.(*inclusiveMetricsEval).Eval-fm()
      :1 +0x47
  github.com/apache/iceberg-go/table.openManifest()
      /dev/repo/iceberg-go/table/scanner.go:142 +0x1b2
  github.com/apache/iceberg-go/table.(*Scan).collectManifestEntries.func1()
      /dev/repo/iceberg-go/table/scanner.go:358 +0xd0
  golang.org/x/sync/errgroup.(*Group).add.func1()
      /dev/go/pkg/mod/golang.org/x/sync@v0.14.0/errgroup/errgroup.go:130 +0x141

Previous write at 0x00c000f7dc20 by goroutine 354:
  github.com/apache/iceberg-go/table.(*inclusiveMetricsEval).Eval()
      /dev/repo/iceberg-go/table/evaluators.go:754 +0xb9
  github.com/apache/iceberg-go/table.(*inclusiveMetricsEval).Eval-fm()
      :1 +0x47
  github.com/apache/iceberg-go/table.openManifest()
      /dev/repo/iceberg-go/table/scanner.go:142 +0x1b2
  github.com/apache/iceberg-go/table.(*Scan).collectManifestEntries.func1()
      /dev/repo/iceberg-go/table/scanner.go:358 +0xd0
  golang.org/x/sync/errgroup.(*Group).add.func1()
      /dev/go/pkg/mod/golang.org/x/sync@v0.14.0/errgroup/errgroup.go:130 +0x141

Goroutine 353 (running) created at:
  golang.org/x/sync/errgroup.(*Group).add()
      /dev/go/pkg/mod/golang.org/x/sync@v0.14.0/errgroup/errgroup.go:98 +0xe4
  golang.org/x/sync/errgroup.(*Group).Go()
      /dev/go/pkg/mod/golang.org/x/sync@v0.14.0/errgroup/errgroup.go:93 +0x791
  github.com/apache/iceberg-go/table.(*Scan).collectManifestEntries()
      /dev/repo/iceberg-go/table/scanner.go:356 +0x531
  github.com/apache/iceberg-go/table.(*Scan).PlanFiles()
      /dev/repo/iceberg-go/table/scanner.go:399 +0xb1
  github.com/apache/iceberg-go/table.(*Scan).ToArrowRecords()
      /dev/repo/iceberg-go/table/scanner.go:441 +0x64
  github.com/apache/iceberg-go/table.(*Scan).ToArrowTable()
      /dev/repo/iceberg-go/table/scanner.go:474 +0x8c
  github.com/apache/iceberg-go/table_test.(*TableWritingTestSuite).TestMergeManifests()
      /dev/repo/iceberg-go/table/table_test.go:1125 +0x21f0
  github.com/apache/iceberg-go/table_test.arrowTableWithNull()
      /dev/repo/iceberg-go/table/table_test.go:974 +0x14b
  github.com/apache/iceberg-go/table_test.(*TableWritingTestSuite).TestMergeManifests()
      /dev/repo/iceberg-go/table/table_test.go:1067 +0x7d1
  runtime.call16()
      /dev/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.linux-amd64/src/runtime/asm_amd64.s:775 +0x42
  reflect.Value.Call()
      /dev/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.linux-amd64/src/reflect/value.go:368 +0xb5
  github.com/stretchr/testify/suite.Run.func1()
      /dev/go/pkg/mod/github.com/stretchr/testify@v1.10.0/suite/suite.go:202 +0x6ed
  testing.tRunner()
      /dev/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.linux-amd64/src/testing/testing.go:1792 +0x225
  testing.(*T).Run.gowrap1()
      /dev/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.linux-amd64/src/testing/testing.go:1851 +0x44

Goroutine 354 (finished) created at:
  golang.org/x/sync/errgroup.(*Group).add()
      /dev/go/pkg/mod/golang.org/x/sync@v0.14.0/errgroup/errgroup.go:98 +0xe4
  golang.org/x/sync/errgroup.(*Group).Go()
      /dev/go/pkg/mod/golang.org/x/sync@v0.14.0/errgroup/errgroup.go:93 +0x791
  github.com/apache/iceberg-go/table.(*Scan).collectManifestEntries()
      /dev/repo/iceberg-go/table/scanner.go:356 +0x531
  github.com/apache/iceberg-go/table.(*Scan).PlanFiles()
      /dev/repo/iceberg-go/table/scanner.go:399 +0xb1
  github.com/apache/iceberg-go/table.(*Scan).ToArrowRecords()
      /dev/repo/iceberg-go/table/scanner.go:441 +0x64
  github.com/apache/iceberg-go/table.(*Scan).ToArrowTable()
      /dev/repo/iceberg-go/table/scanner.go:474 +0x8c
  github.com/apache/iceberg-go/table_test.(*TableWritingTestS
```
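The trace shows every errgroup worker sharing one `inclusiveMetricsEval` value whose `Eval` method writes to the same field. The standard remedies are to make the evaluator stateless or to give each worker its own instance. A minimal Java sketch of the per-worker-instance fix (hypothetical names; the real code is Go, this only shows the shape of the change):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerWorkerEvaluator {
    // Stand-in for an evaluator that mutates internal state during eval().
    static class MetricsEvaluator {
        private String scratch; // mutable state: unsafe to share across threads

        boolean eval(String manifestEntry) {
            this.scratch = manifestEntry; // corresponds to the racy write in the report
            return !scratch.isEmpty();
        }
    }

    // Fix: build a fresh evaluator inside each task instead of sharing one.
    static List<Boolean> planFiles(List<String> entries) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Boolean>> futures = new ArrayList<>();
            for (String entry : entries) {
                futures.add(pool.submit(() -> new MetricsEvaluator().eval(entry)));
            }
            List<Boolean> results = new ArrayList<>();
            for (Future<Boolean> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```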
Re: [PR] Docs: add column descriptions for entries metadata table [iceberg]
manuzhang commented on code in PR #13104: URL: https://github.com/apache/iceberg/pull/13104#discussion_r2111095523 ## docs/docs/spark-queries.md: ## @@ -301,6 +301,16 @@ SELECT * FROM prod.db.table.entries; | -- | -- | -- | -- | -- | -- | | 2 | 57897183625154 | 0 | 0 | {"content":0,"file_path":"s3:/.../table/data/00047-25-833044d0-127b-415c-b874-038a4f978c29-00612.parquet","file_format":"PARQUET","spec_id":0,"record_count":15,"file_size_in_bytes":473,"column_sizes":{1:103},"value_counts":{1:15},"null_value_counts":{1:0},"nan_value_counts":{},"lower_bounds":{1:},"upper_bounds":{1:},"key_metadata":null,"split_offsets":[4],"equality_ids":null,"sort_order_id":0} | {"c1":{"column_size":103,"value_count":15,"null_value_count":0,"nan_value_count":null,"lower_bound":1,"upper_bound":3}} | +Note: + +1. The columns of the `entries` table correspond to the fields of the `manifest_entry` struct (see the [manifest file schema](../../spec.md#manifests) for the full definition): +- `status`: Used to track additions and deletions +- `snapshot_id`: The ID of the snapshot in which the file was added or removed +- `sequence_number`: Used for ordering changes across snapshots +- `file_sequence_number`: Indicates when the file was added +- `data_file`: Metadata about the data file Review Comment: The issue is we don't have anchors for `manifest_entry` and `data_file` to link to.
[I] Switch metadata pointer to new location for different catalogs (sql, hive, glue, etc) [iceberg-rust]
liurenjie1024 opened a new issue, #1389: URL: https://github.com/apache/iceberg-rust/issues/1389 (no comment)
[I] Glue catalog: avoid reloading table in `commit_table` [iceberg-python]
psavalle opened a new issue, #2051: URL: https://github.com/apache/iceberg-python/issues/2051 ### Feature Request / Improvement The Glue catalog implementation [always reloads the table](https://github.com/apache/iceberg-python/blob/main/pyiceberg/catalog/glue.py#L505) in `GlueCatalog.commit_table`, even though a `Table` object is passed in. It would be useful to have the opportunity to skip that for use cases that write frequently and already have an up-to-date `Table` object, as the extra table reload can take some time (esp. on tables with lots of checkpoints).
Re: [PR] feat: add asan and ubsan support to cmake [iceberg-cpp]
wgtmac commented on code in PR #107: URL: https://github.com/apache/iceberg-cpp/pull/107#discussion_r2111250416 ## .github/workflows/sanitizer_test.yml: ## @@ -0,0 +1,63 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +name: ASAN and UBSAN Tests + +on: + push: +branches: + - '**' + - '!dependabot/**' +tags: + - '**' + pull_request: + +concurrency: + group: ${{ github.repository }}-${{ github.head_ref || github.sha }}-${{ github.workflow }} + cancel-in-progress: true + +permissions: + contents: read + +jobs: + sanitizer-test: +name: "ASAN and UBSAN Tests" +runs-on: ubuntu-24.04 +strategy: + fail-fast: false +steps: + - name: Checkout iceberg-cpp +uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 + - name: Configure and Build with ASAN & UBSAN +run: | + mkdir build && cd build + cmake .. -DCMAKE_BUILD_TYPE=Debug -DICEBERG_ENABLE_ASAN=ON -DICEBERG_ENABLE_UBSAN=ON + cmake --build . --verbose + - name: Run Tests +working-directory: build +env: + ASAN_OPTIONS: log_path=out.log:detect_leaks=1:symbolize=1:strict_string_checks=1:halt_on_error=0:detect_container_overflow=0 Review Comment: ```suggestion ASAN_OPTIONS: log_path=out.log:detect_leaks=1:symbolize=1:strict_string_checks=1:halt_on_error=1:detect_container_overflow=0 ``` @raulcd Perhaps we need to enable `halt_on_error` to avoid the errors being swallowed by the log files. ## .github/workflows/sanitizer_test.yml: ## @@ -0,0 +1,63 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +name: ASAN and UBSAN Tests + +on: + push: +branches: + - '**' + - '!dependabot/**' +tags: + - '**' + pull_request: + +concurrency: + group: ${{ github.repository }}-${{ github.head_ref || github.sha }}-${{ github.workflow }} + cancel-in-progress: true + +permissions: + contents: read + +jobs: + sanitizer-test: +name: "ASAN and UBSAN Tests" +runs-on: ubuntu-24.04 +strategy: + fail-fast: false +steps: + - name: Checkout iceberg-cpp +uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 + - name: Configure and Build with ASAN & UBSAN +run: | + mkdir build && cd build + cmake .. -DCMAKE_BUILD_TYPE=Debug -DICEBERG_ENABLE_ASAN=ON -DICEBERG_ENABLE_UBSAN=ON + cmake --build . --verbose + - name: Run Tests +working-directory: build +env: + ASAN_OPTIONS: log_path=out.log:detect_leaks=1:symbolize=1:strict_string_checks=1:halt_on_error=0:detect_container_overflow=0 + LSAN_OPTIONS: suppressions=${{ github.workspace }}/.github/lsan-suppressions.txt + UBSAN_OPTIONS: log_path=out.log:print_stacktrace=1:suppressions=${{ github.workspace }}/.github/ubsan-suppressions.txt Review Comment: ```suggestion UBSAN_OPTIONS: log_path=out.log:halt_on_error=1:print_stacktrace=1:suppressions=${{ github.workspace }}/.github/ubsan-suppressions.txt ```
Re: [PR] feat: add asan and ubsan support to cmake [iceberg-cpp]
raulcd commented on PR #107: URL: https://github.com/apache/iceberg-cpp/pull/107#issuecomment-2915302693 I am slightly confused: the log output from the job (see the artifact uploaded here: https://github.com/apache/iceberg-cpp/actions/runs/15293677546?pr=107) suggests there are several leaks, for example:

```
Indirect leak of 31 byte(s) in 1 object(s) allocated from:
    #0 0x7f1fcbcfe548 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:95
    #1 0x557a45d3e428 in std::__new_allocator::allocate(unsigned long, void const*) /usr/include/c++/13/bits/new_allocator.h:151
    #2 0x557a45cf56e5 in std::allocator::allocate(unsigned long) /usr/include/c++/13/bits/allocator.h:198
    #3 0x557a45cf56e5 in std::allocator_traits >::allocate(std::allocator&, unsigned long) /usr/include/c++/13/bits/alloc_traits.h:482
    #4 0x557a45cf56e5 in std::__cxx11::basic_string, std::allocator >::_S_allocate(std::allocator&, unsigned long) /usr/include/c++/13/bits/basic_string.h:126
    #5 0x557a45cf3657 in std::__cxx11::basic_string, std::allocator >::_M_create(unsigned long&, unsigned long) /usr/include/c++/13/bits/basic_string.tcc:159
    #6 0x557a45cf5196 in std::__cxx11::basic_string, std::allocator >::_M_mutate(unsigned long, unsigned long, char const*, unsigned long) /usr/include/c++/13/bits/basic_string.tcc:332
    #7 0x557a45cddb0c in std::__cxx11::basic_string, std::allocator >::_M_replace_aux(unsigned long, unsigned long, unsigned long, char) /usr/include/c++/13/bits/basic_string.tcc:468
    #8 0x557a45e0eefd in std::__cxx11::basic_string, std::allocator >::append(unsigned long, char) /usr/include/c++/13/bits/basic_string.h:1488
    #9 0x557a45e1ac3a in std::__cxx11::basic_string, std::allocator >::resize(unsigned long, char) /usr/include/c++/13/bits/basic_string.tcc:405
    #10 0x557a45e1737f in std::__cxx11::basic_string, std::allocator >::resize(unsigned long) /usr/include/c++/13/bits/basic_string.h:1114
    #11 0x557a465e9c49 in EncodeMetadata /home/runner/work/iceberg-cpp/iceberg-cpp/build/_deps/vendoredarrow-src/cpp/src/arrow/c/bridge.cc:154
    #12 0x557a465eb370 in ExportMetadata /home/runner/work/iceberg-cpp/iceberg-cpp/build/_deps/vendoredarrow-src/cpp/src/arrow/c/bridge.cc:296
    #13 0x557a465ea162 in ExportField /home/runner/work/iceberg-cpp/iceberg-cpp/build/_deps/vendoredarrow-src/cpp/src/arrow/c/bridge.cc:189
    #14 0x557a465eb1b2 in ExportChildren /home/runner/work/iceberg-cpp/iceberg-cpp/build/_deps/vendoredarrow-src/cpp/src/arrow/c/bridge.cc:283
    #15 0x557a465ea5cb in ExportSchema /home/runner/work/iceberg-cpp/iceberg-cpp/build/_deps/vendoredarrow-src/cpp/src/arrow/c/bridge.cc:209
    #16 0x557a465ed9db in arrow::ExportSchema(arrow::Schema const&, ArrowSchema*) /home/runner/work/iceberg-cpp/iceberg-cpp/build/_deps/vendoredarrow-src/cpp/src/arrow/c/bridge.cc:523
    #17 0x557a45c92944 in iceberg::FromArrowSchemaTest_StructType_Test::TestBody() /home/runner/work/iceberg-cpp/iceberg-cpp/test/arrow_test.cc:357
    #18 0x557a46fb1d22 in void testing::internal::HandleSehExceptionsInMethodIfSupported(testing::Test*, void (testing::Test::*)(), char const*) /home/runner/work/iceberg-cpp/iceberg-cpp/build/_deps/googletest-src/googletest/src/gtest.cc:2638
    #19 0x557a46faac32 in void testing::internal::HandleExceptionsInMethodIfSupported(testing::Test*, void (testing::Test::*)(), char const*) /home/runner/work/iceberg-cpp/iceberg-cpp/build/_deps/googletest-src/googletest/src/gtest.cc:2674
    #20 0x557a46f85b37 in testing::Test::Run() /home/runner/work/iceberg-cpp/iceberg-cpp/build/_deps/googletest-src/googletest/src/gtest.cc:2713
    #21 0x557a46f865f5 in testing::TestInfo::Run() /home/runner/work/iceberg-cpp/iceberg-cpp/build/_deps/googletest-src/googletest/src/gtest.cc:2859
    #22 0x557a46f86fbd in testing::TestSuite::Run() /home/runner/work/iceberg-cpp/iceberg-cpp/build/_deps/googletest-src/googletest/src/gtest.cc:3037
    #23 0x557a46f96fe3 in testing::internal::UnitTestImpl::RunAllTests() /home/runner/work/iceberg-cpp/iceberg-cpp/build/_deps/googletest-src/googletest/src/gtest.cc:5967
    #24 0x557a46fb31a7 in bool testing::internal::HandleSehExceptionsInMethodIfSupported(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/runner/work/iceberg-cpp/iceberg-cpp/build/_deps/googletest-src/googletest/src/gtest.cc:2638
    #25 0x557a46fabeda in bool testing::internal::HandleExceptionsInMethodIfSupported(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/runner/work/iceberg-cpp/iceberg-cpp/build/_deps/googletest-src/googletest/src/gtest.cc:2674
    #26 0x557a46f954cf in testing::UnitTest::Run() /home/runner/work/iceberg-cpp/iceberg-cpp/build/_deps/googletest-src/googletest/src/gtest.cc:5546
    #27 0x557a45e51ea1 in RUN_ALL_TESTS() /home/runner
```
Re: [PR] Flink: Dynamic Iceberg Sink Contribution [iceberg]
b-rick commented on code in PR #12424: URL: https://github.com/apache/iceberg/pull/12424#discussion_r2111204802 ## flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/sink/dynamic/RowDataEvolver.java: ## @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.flink.sink.dynamic; + +import java.time.LocalDate; +import java.time.LocalDateTime; +import java.time.LocalTime; +import java.util.List; +import java.util.Map; +import org.apache.flink.table.data.ArrayData; +import org.apache.flink.table.data.DecimalData; +import org.apache.flink.table.data.GenericArrayData; +import org.apache.flink.table.data.GenericMapData; +import org.apache.flink.table.data.GenericRowData; +import org.apache.flink.table.data.MapData; +import org.apache.flink.table.data.RowData; +import org.apache.flink.table.data.TimestampData; +import org.apache.flink.table.types.logical.ArrayType; +import org.apache.flink.table.types.logical.DecimalType; +import org.apache.flink.table.types.logical.LogicalType; +import org.apache.flink.table.types.logical.MapType; +import org.apache.flink.table.types.logical.RowType; +import org.apache.iceberg.Schema; +import org.apache.iceberg.flink.FlinkSchemaUtil; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; + +public class RowDataEvolver { + private RowDataEvolver() {} + + public static RowData convert(RowData sourceData, Schema sourceSchema, Schema targetSchema) { +return convertStruct( +sourceData, FlinkSchemaUtil.convert(sourceSchema), FlinkSchemaUtil.convert(targetSchema)); Review Comment: Indeed, but the case where the record schema does not match the table schema is very common. E.g. a producer that only ever published a schema with a single column "a", and now the schema evolves and the producer only publishes column "b". The table schema will be {"a", "b"} (with a unioning strategy), and we expect to continue publishing without fault. But because of the schema difference, we hit this slow code path for every single record.
[PR] feat: implement avro file reader [iceberg-cpp]
wgtmac opened a new pull request, #113: URL: https://github.com/apache/iceberg-cpp/pull/113 (no comment)
Re: [PR] feat: Introduce snapshot summary properties [iceberg-rust]
liurenjie1024 commented on code in PR #1336: URL: https://github.com/apache/iceberg-rust/pull/1336#discussion_r2111438158 ## crates/iceberg/src/catalog/mod.rs: ## @@ -495,6 +495,12 @@ pub enum TableUpdate { /// Schema IDs to remove. schema_ids: Vec, }, +/// Add snapshot summary properties. +#[serde(rename_all = "kebab-case")] +AddSnapshotSummaryProperties { Review Comment: This table update doesn't exist in the spec's list of table updates; see https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml
Re: [PR] feat: Introduce snapshot summary properties [iceberg-rust]
liurenjie1024 commented on PR #1336: URL: https://github.com/apache/iceberg-rust/pull/1336#issuecomment-2915765649 > > Also, I don't think we should allow users to update snapshot summaries alone; it's dangerous. > > Hi @liurenjie1024, curious whether you have other suggestions for updating the snapshot summary? My feature request is to write customized metadata for the snapshot. > > The Java impl I linked above seems to fulfill the requirement, and it is exposed as public? Would like to know your thoughts :) `SnapshotUpdate` is a super-interface of the tx actions that can produce a new snapshot, so in theory you could use any tx action that produces a new snapshot to do that. Currently, you could use `FastAppend`.
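For comparison, this is what that pattern looks like in the Java library: any action implementing `SnapshotUpdate` (such as the `AppendFiles` returned by `newFastAppend()`) exposes `set(property, value)`, and the property lands in the summary of the snapshot the action commits. A sketch assuming a `Table` and `DataFile` already exist:

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;

class AppendWithSummary {
    // Attach a custom summary property to the snapshot produced by a fast append.
    static void appendWithSummary(Table table, DataFile dataFile) {
        table.newFastAppend()
            .appendFile(dataFile)
            .set("my-app.ingest-batch", "batch-42") // SnapshotUpdate#set(String, String)
            .commit();
    }
}
```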
Re: [PR] Revert "feat: Introduce snapshot summary properties (#1336)" [iceberg-rust]
liurenjie1024 commented on PR #1390: URL: https://github.com/apache/iceberg-rust/pull/1390#issuecomment-2915775716 cc @Xuanwo @sdd @Fokko @kevinjqliu
Re: [I] [EPIC] Support for appending data to iceberg table. [iceberg-rust]
liurenjie1024 commented on issue #1382: URL: https://github.com/apache/iceberg-rust/issues/1382#issuecomment-2915783715 > Hi [@liurenjie1024](https://github.com/liurenjie1024) , thanks for having this! > > Would it be ok if I take up the commit path work? I can start working on a POC tomorrow Cool, thanks!
Re: [I] [EPIC] Support for appending data to iceberg table. [iceberg-rust]
liurenjie1024 commented on issue #1382: URL: https://github.com/apache/iceberg-rust/issues/1382#issuecomment-2915783358 > I don't get why we need this. Is this used by the catalog implementation, e.g. SqlCatalog::update_table? Could you elaborate more on the commit process and where we need this? Yes. Catalogs use this new method to generate new table metadata, and then point the table to the new metadata. #964 duplicates this issue.
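A rough sketch of the commit path being described, with hypothetical types standing in for the iceberg-rust ones: the catalog checks the commit's requirements against the current metadata, applies the updates to build new metadata, and then persisting the file and switching the pointer (issue #1389) is catalog-specific.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-ins for iceberg-rust's TableMetadata / TableCommit.
record TableMetadata(String location, int version) {}

interface Requirement {
    void check(TableMetadata base); // throws on conflict
}

record TableCommit(List<Requirement> requirements,
                   List<UnaryOperator<TableMetadata>> updates) {

    // The method issue #1386 asks for: fold the updates into new metadata.
    TableMetadata applyTo(TableMetadata base) {
        TableMetadata result = base;
        for (UnaryOperator<TableMetadata> update : updates) {
            result = update.apply(result);
        }
        return result;
    }
}

class CatalogCommit {
    TableMetadata commit(TableCommit commitRequest, TableMetadata base) {
        // 1. Fail fast if the table changed under us (optimistic concurrency).
        commitRequest.requirements().forEach(req -> req.check(base));
        // 2. Produce the new metadata from the accumulated updates.
        TableMetadata updated = commitRequest.applyTo(base);
        // 3. Writing the metadata file and atomically switching the pointer is
        //    left to each catalog implementation (sql, hive, glue, ...).
        return updated;
    }
}
```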
Re: [PR] SPARK: Remove dependency on hadoop's filesystem class from remove orphan files [iceberg]
liziyan-lzy commented on code in PR #12254: URL: https://github.com/apache/iceberg/pull/12254#discussion_r237312 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java: ## @@ -303,39 +310,89 @@ private Dataset listedFileDS() { List subDirs = Lists.newArrayList(); List matchingFiles = Lists.newArrayList(); -Predicate predicate = file -> file.getModificationTime() < olderThanTimestamp; PathFilter pathFilter = PartitionAwareHiddenPathFilter.forSpecs(table.specs()); -// list at most MAX_DRIVER_LISTING_DEPTH levels and only dirs that have -// less than MAX_DRIVER_LISTING_DIRECT_SUB_DIRS direct sub dirs on the driver -listDirRecursively( -location, -predicate, -hadoopConf.value(), -MAX_DRIVER_LISTING_DEPTH, -MAX_DRIVER_LISTING_DIRECT_SUB_DIRS, -subDirs, -pathFilter, -matchingFiles); - -JavaRDD matchingFileRDD = sparkContext().parallelize(matchingFiles, 1); - -if (subDirs.isEmpty()) { +if (usePrefixListing) { + Preconditions.checkArgument( + table.io() instanceof SupportsPrefixOperations, + "Table file io should prefix operations when using prefix list."); Review Comment: Thank you very much for identifying this issue. I’ve corrected it now. ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java: ## @@ -303,39 +310,89 @@ private Dataset listedFileDS() { List subDirs = Lists.newArrayList(); List matchingFiles = Lists.newArrayList(); -Predicate predicate = file -> file.getModificationTime() < olderThanTimestamp; PathFilter pathFilter = PartitionAwareHiddenPathFilter.forSpecs(table.specs()); -// list at most MAX_DRIVER_LISTING_DEPTH levels and only dirs that have -// less than MAX_DRIVER_LISTING_DIRECT_SUB_DIRS direct sub dirs on the driver -listDirRecursively( -location, -predicate, -hadoopConf.value(), -MAX_DRIVER_LISTING_DEPTH, -MAX_DRIVER_LISTING_DIRECT_SUB_DIRS, -subDirs, -pathFilter, -matchingFiles); - -JavaRDD matchingFileRDD = sparkContext().parallelize(matchingFiles, 1); - -if (subDirs.isEmpty()) { +if (usePrefixListing) { + Preconditions.checkArgument( + table.io() instanceof SupportsPrefixOperations, + "Table file io should prefix operations when using prefix list."); + + Predicate predicate = + fileInfo -> fileInfo.createdAtMillis() < olderThanTimestamp; + listDirRecursivelyWithFileIO( + (SupportsPrefixOperations) table.io(), location, predicate, pathFilter, matchingFiles); + + JavaRDD matchingFileRDD = sparkContext().parallelize(matchingFiles, 1); return spark().createDataset(matchingFileRDD.rdd(), Encoders.STRING()); +} else { + Predicate predicate = file -> file.getModificationTime() < olderThanTimestamp; + // list at most MAX_DRIVER_LISTING_DEPTH levels and only dirs that have + // less than MAX_DRIVER_LISTING_DIRECT_SUB_DIRS direct sub dirs on the driver + listDirRecursivelyWithHadoop( + location, + predicate, + hadoopConf.value(), + MAX_DRIVER_LISTING_DEPTH, + MAX_DRIVER_LISTING_DIRECT_SUB_DIRS, + subDirs, + pathFilter, + matchingFiles); + + JavaRDD matchingFileRDD = sparkContext().parallelize(matchingFiles, 1); + + if (subDirs.isEmpty()) { +return spark().createDataset(matchingFileRDD.rdd(), Encoders.STRING()); + } + + int parallelism = Math.min(subDirs.size(), listingParallelism); + JavaRDD subDirRDD = sparkContext().parallelize(subDirs, parallelism); + + Broadcast conf = sparkContext().broadcast(hadoopConf); + ListDirsRecursively listDirs = new ListDirsRecursively(conf, olderThanTimestamp, pathFilter); + JavaRDD matchingLeafFileRDD = subDirRDD.mapPartitions(listDirs); + + JavaRDD 
completeMatchingFileRDD = matchingFileRDD.union(matchingLeafFileRDD); + return spark().createDataset(completeMatchingFileRDD.rdd(), Encoders.STRING()); +} + } + + private static void listDirRecursivelyWithFileIO( + SupportsPrefixOperations io, + String dir, + Predicate predicate, + PathFilter pathFilter, + List matchingFiles) { +String listPath = dir; +if (!dir.endsWith("/")) { + listPath = dir + "/"; +} + +Iterable files = io.listPrefix(listPath); +for (org.apache.iceberg.io.FileInfo file : files) { + Path path = new Path(file.location()); Review Comment: Thanks for highlighting this. I'll verify the issue to ensure proper path handling for S3, etc., and will follow up if it is an issue.
Re: [PR] SPARK: Remove dependency on hadoop's filesystem class from remove orphan files [iceberg]
liziyan-lzy commented on code in PR #12254: URL: https://github.com/apache/iceberg/pull/12254#discussion_r237789 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveOrphanFilesAction.java: ## @@ -344,6 +374,8 @@ public void testWapFilesAreKept() { // wap write df.select("c1", "c2", "c3").write().format("iceberg").mode("append").save(tableLocation); +spark.conf().unset(SparkSQLProperties.WAP_ID); Review Comment: This configuration may cause test failures if it stays enabled across multiple test runs; this line ensures it is removed after each test completes. ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveOrphanFilesAction.java: ## @@ -1067,6 +1148,38 @@ public void testRemoveOrphanFileActionWithDeleteMode() { DeleteOrphanFiles.PrefixMismatchMode.DELETE); } + @TestTemplate + public void testDefaultToHadoopListing() { +assumeThat(usePrefixListing) +.as("Should not test both prefix listing and Hadoop file listing (redundant)") Review Comment: Fixed. Thanks for pointing this out.
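The cleanup concern generalizes: a session-level Spark conf set by one test leaks into the next unless it is unset, even when the test body fails. A small Java sketch of the usual try/finally shape (the helper name is illustrative, not part of the PR):

```java
import org.apache.spark.sql.SparkSession;

class WapTestUtil {
    // Ensure a session-level conf never outlives the test body, even on failure.
    static void withWapId(SparkSession spark, String wapId, Runnable body) {
        spark.conf().set("spark.wap.id", wapId); // SparkSQLProperties.WAP_ID
        try {
            body.run();
        } finally {
            spark.conf().unset("spark.wap.id");
        }
    }
}
```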
Re: [PR] AWS: KeyManagementClient implementation that works with AWS KMS [iceberg]
jbonofre commented on PR #13136: URL: https://github.com/apache/iceberg/pull/13136#issuecomment-2915277304 I will take a look as well. Thanks!
Re: [PR] AWS: KeyManagementClient implementation that works with AWS KMS [iceberg]
jbonofre commented on PR #13136: URL: https://github.com/apache/iceberg/pull/13136#issuecomment-2915276780 @nastra @Fokko ^^
Re: [PR] SPARK: Remove dependency on hadoop's filesystem class from remove orphan files [iceberg]
liziyan-lzy commented on code in PR #12254: URL: https://github.com/apache/iceberg/pull/12254#discussion_r245839 ## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveOrphanFilesAction.java: ## @@ -107,9 +115,12 @@ public abstract class TestRemoveOrphanFilesAction extends TestBase { protected Map properties; @Parameter private int formatVersion; - @Parameters(name = "formatVersion = {0}") + @Parameter(index = 1) + private boolean usePrefixListing; + + @Parameters(name = "formatVersion = {0}, usePrefixListing = {1}") protected static List parameters() { -return Arrays.asList(2, 3); +return Arrays.asList(new Object[] {2, true}, new Object[] {2, false}, new Object[] {3, false}); Review Comment: Thanks for your input. I've now updated it to ensure tests are run across all versions to cover everything thoroughly.
Re: [PR] Flink: Dynamic Iceberg Sink Contribution [iceberg]
mxm commented on code in PR #12424: URL: https://github.com/apache/iceberg/pull/12424#discussion_r265456 ## flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/sink/dynamic/RowDataEvolver.java: ## @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.flink.sink.dynamic; + +import java.time.LocalDate; +import java.time.LocalDateTime; +import java.time.LocalTime; +import java.util.List; +import java.util.Map; +import org.apache.flink.table.data.ArrayData; +import org.apache.flink.table.data.DecimalData; +import org.apache.flink.table.data.GenericArrayData; +import org.apache.flink.table.data.GenericMapData; +import org.apache.flink.table.data.GenericRowData; +import org.apache.flink.table.data.MapData; +import org.apache.flink.table.data.RowData; +import org.apache.flink.table.data.TimestampData; +import org.apache.flink.table.types.logical.ArrayType; +import org.apache.flink.table.types.logical.DecimalType; +import org.apache.flink.table.types.logical.LogicalType; +import org.apache.flink.table.types.logical.MapType; +import org.apache.flink.table.types.logical.RowType; +import org.apache.iceberg.Schema; +import org.apache.iceberg.flink.FlinkSchemaUtil; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; + +public class RowDataEvolver { + private RowDataEvolver() {} + + public static RowData convert(RowData sourceData, Schema sourceSchema, Schema targetSchema) { +return convertStruct( +sourceData, FlinkSchemaUtil.convert(sourceSchema), FlinkSchemaUtil.convert(targetSchema)); Review Comment: Thanks for pointing this out. This code path will only be called under certain conditions, i.e. when the input schema changes in a way that lets us adapt the input records to match the table schema; for example, when a field in the input schema has type int but the table type is long. Also, we can reorder fields without changing the table schema. It is worthwhile to consider removing this functionality for simplicity, or adding extensive caching like we have for the other hot paths.
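The caching mentioned here would typically memoize the expensive schema conversion on the (source, target) schema pair, so the per-record hot path reduces to a map lookup. A hedged sketch with hypothetical names (not the actual sink code); it assumes the schema type has sensible equals/hashCode semantics:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;
import java.util.function.UnaryOperator;

// Cache converters keyed by the (source, target) schema pair.
class ConverterCache<R, S> {
    private record Key<T>(T source, T target) {}

    private final Map<Key<S>, UnaryOperator<R>> cache = new ConcurrentHashMap<>();

    // `build` performs the expensive conversion (e.g. resolving logical types);
    // it runs at most once per distinct schema pair.
    UnaryOperator<R> converterFor(S source, S target, BiFunction<S, S, UnaryOperator<R>> build) {
        return cache.computeIfAbsent(new Key<>(source, target),
            key -> build.apply(key.source(), key.target()));
    }
}
```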
Re: [PR] Core: Fix incremental compute of partition stats with COW deletes [iceberg]
ajantha-bhat commented on code in PR #13163: URL: https://github.com/apache/iceberg/pull/13163#discussion_r2111663422 ## core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java: ## @@ -336,16 +336,29 @@ private static PartitionMap computeStatsDiff( Sets.newHashSet( SnapshotUtil.ancestorIdsBetween( toSnapshot.snapshotId(), fromSnapshot.snapshotId(), table::snapshot)); -Predicate manifestFilePredicate = -manifestFile -> snapshotIdsRange.contains(manifestFile.snapshotId()); -return computeStats(table, toSnapshot, manifestFilePredicate, true /* incremental */); +return computeStats(table, toSnapshot, snapshotIdsRange); } private static PartitionMap computeStats( - Table table, Snapshot snapshot, Predicate predicate, boolean incremental) { + Table table, Snapshot snapshot, Set snapshotIdsRange) { StructType partitionType = Partitioning.partitionType(table); -List manifests = - snapshot.allManifests(table.io()).stream().filter(predicate).collect(Collectors.toList()); +boolean incremental = !snapshotIdsRange.isEmpty(); + +List manifests; +if (incremental) { + // DELETED manifest entries are not carried over to subsequent snapshots. + // So, for incremental computation, gather the manifests added by each snapshot + // instead of relying solely on those from the latest snapshot. + manifests = + snapshotIdsRange.stream() + .flatMap( + id -> + table.snapshot(id).allManifests(table.io()).stream() + .filter(file -> file.snapshotId().equals(id))) + .collect(Collectors.toList()); Review Comment: Also note that, because of the snapshot id filter, each snapshot's added manifest files are considered only once in the computation, so reused manifests won't be counted again. If manifests are rewritten, entries will be marked as EXISTING and won't be considered for incremental compute by the existing logic in `collectStatsForManifest`. So, IMO it works for all the scenarios now, and we have test cases covering all of them.
Re: [PR] feat: Introduce snapshot summary properties [iceberg-rust]
Xuanwo merged PR #1336: URL: https://github.com/apache/iceberg-rust/pull/1336 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Incrementally computing partition stats can miss deleted files [iceberg]
ajantha-bhat commented on issue #13155: URL: https://github.com/apache/iceberg/issues/13155#issuecomment-2916012247 I found a clean way. The existing testcase was failing because I included the same manifest twice when it was reused. Now, I added a snapshot id check, so each snapshot only considers the manifests it added. The PR is ready for review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Fix incremental compute of partition stats [iceberg]
ajantha-bhat commented on code in PR #13163: URL: https://github.com/apache/iceberg/pull/13163#discussion_r2111640567 ## core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java: ## @@ -336,16 +336,22 @@ private static PartitionMap<PartitionStats> computeStatsDiff( Sets.newHashSet( SnapshotUtil.ancestorIdsBetween( toSnapshot.snapshotId(), fromSnapshot.snapshotId(), table::snapshot)); -Predicate<ManifestFile> manifestFilePredicate = -manifestFile -> snapshotIdsRange.contains(manifestFile.snapshotId()); -return computeStats(table, toSnapshot, manifestFilePredicate, true /* incremental */); +return computeStats(table, toSnapshot, snapshotIdsRange, true /* incremental */); } private static PartitionMap<PartitionStats> computeStats( - Table table, Snapshot snapshot, Predicate<ManifestFile> predicate, boolean incremental) { + Table table, Snapshot snapshot, Set<Long> snapshotIdsRange, boolean incremental) { StructType partitionType = Partitioning.partitionType(table); + +List<ManifestFile> manifests; +if (incremental) { Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Fix incremental compute of partition stats [iceberg]
ajantha-bhat commented on code in PR #13163: URL: https://github.com/apache/iceberg/pull/13163#discussion_r2111642709 ## core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java: ## @@ -276,7 +275,8 @@ private static Collection<PartitionStats> computeAndMergeStatsIncremental( PartitionMap<PartitionStats> statsMap = PartitionMap.create(table.specs()); // read previous stats, note that partition field will be read as GenericRecord try (CloseableIterable<PartitionStats> oldStats = -readPartitionStatsFile(schema(partitionType), Files.localInput(previousStatsFile.path()))) { +readPartitionStatsFile( +schema(partitionType), table.io().newInputFile(previousStatsFile.path()))) { Review Comment: Just using the FileIO, as pointed out in the PR. It's unrelated to this bug fix, but related to this feature. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Fix incremental compute of partition stats with COW deletes [iceberg]
ajantha-bhat commented on code in PR #13163: URL: https://github.com/apache/iceberg/pull/13163#discussion_r2111647311 ## core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java: ## @@ -336,16 +336,29 @@ private static PartitionMap<PartitionStats> computeStatsDiff( Sets.newHashSet( SnapshotUtil.ancestorIdsBetween( toSnapshot.snapshotId(), fromSnapshot.snapshotId(), table::snapshot)); -Predicate<ManifestFile> manifestFilePredicate = -manifestFile -> snapshotIdsRange.contains(manifestFile.snapshotId()); -return computeStats(table, toSnapshot, manifestFilePredicate, true /* incremental */); +return computeStats(table, toSnapshot, snapshotIdsRange); } private static PartitionMap<PartitionStats> computeStats( - Table table, Snapshot snapshot, Predicate<ManifestFile> predicate, boolean incremental) { + Table table, Snapshot snapshot, Set<Long> snapshotIdsRange) { StructType partitionType = Partitioning.partitionType(table); -List<ManifestFile> manifests = - snapshot.allManifests(table.io()).stream().filter(predicate).collect(Collectors.toList()); +boolean incremental = !snapshotIdsRange.isEmpty(); + +List<ManifestFile> manifests; +if (incremental) { + // DELETED manifest entries are not carried over to subsequent snapshots. + // So, for incremental computation, gather the manifests added by each snapshot + // instead of relying solely on those from the latest snapshot. + manifests = + snapshotIdsRange.stream() + .flatMap( + id -> + table.snapshot(id).allManifests(table.io()).stream() + .filter(file -> file.snapshotId().equals(id))) + .collect(Collectors.toList()); Review Comment: I also checked that if snapshots are expired, we cannot find previous stats for the table in the caller. So, it will fall back to full compute. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
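A sketch of the caller-side fallback being described (method name assumed; `PartitionStatisticsFile` and `Table#snapshot` are the real Iceberg APIs):

```java
import org.apache.iceberg.PartitionStatisticsFile;
import org.apache.iceberg.Table;

final class IncrementalCheckSketch {
  private IncrementalCheckSketch() {}

  // If the snapshot that produced the previous stats file has been expired,
  // an incremental diff is impossible and the caller falls back to a full compute.
  static boolean canComputeIncrementally(Table table, PartitionStatisticsFile previous) {
    return previous != null && table.snapshot(previous.snapshotId()) != null;
  }
}
```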
Re: [PR] Flink 1.19, 1.20: Remove the MiniClusterWithClientResource dependency [iceberg]
JeonDaehong commented on PR #13165: URL: https://github.com/apache/iceberg/pull/13165#issuecomment-2916040410 @nastra Hello, I’ve completed the backport. Please review and merge it when you get a chance! Also, on a separate topic — after my previous PR was merged, I was expecting to see my name listed on the [contributors page](https://github.com/apache/iceberg/graphs/contributors), but it hasn’t shown up yet. Is merging into main not enough for that? Is there anything else I need to do to be listed? I’m looking forward to continuing to contribute to the Iceberg community and actively helping resolve issues! Thanks always 🙇 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Rest Catalog and Writing data to Minio Raises `OSError: When initiating multiple part upload` [iceberg-python]
hasanozciftcii commented on issue #974: URL: https://github.com/apache/iceberg-python/issues/974#issuecomment-2916096063 > As a workaround, you can define the S3 configurations (endpoint, secret access key, and access key ID) in the load_rest method. > > catalog = load_rest( > name="rest", > conf={ > "uri": "http://localhost:8181/", > "s3.endpoint": "http://localhost:9000", > "s3.access-key-id": "admin", > "s3.secret-access-key": "password", > }, > ) > For some reason, the iceberg-rest-image does not return the overrides in the getConfig endpoint. > > If you test with Tabular or other catalogs, you will not have this problem. this works on my end, thanks for the help -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Fix incremental compute of partition stats with COW deletes [iceberg]
pvary commented on code in PR #13163: URL: https://github.com/apache/iceberg/pull/13163#discussion_r2111741154 ## core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java: ## @@ -336,16 +336,29 @@ private static PartitionMap<PartitionStats> computeStatsDiff( Sets.newHashSet( SnapshotUtil.ancestorIdsBetween( toSnapshot.snapshotId(), fromSnapshot.snapshotId(), table::snapshot)); -Predicate<ManifestFile> manifestFilePredicate = -manifestFile -> snapshotIdsRange.contains(manifestFile.snapshotId()); -return computeStats(table, toSnapshot, manifestFilePredicate, true /* incremental */); +return computeStats(table, toSnapshot, snapshotIdsRange); } private static PartitionMap<PartitionStats> computeStats( - Table table, Snapshot snapshot, Predicate<ManifestFile> predicate, boolean incremental) { + Table table, Snapshot snapshot, Set<Long> snapshotIdsRange) { Review Comment: I think the parametrization of this method is hard to understand now, especially with the hidden "empty" `snapshotIdsRange`. What if: - Move the manifest list calculation outside of this method, and use a `List<ManifestFile> manifests` input parameter. - Keep the `boolean incremental` parameter -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
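A sketch of what the suggested shape could look like (the exact parameter list of the private `collectStatsForManifest` helper is assumed):

```java
// Suggested shape: manifests are resolved by the callers (full vs. incremental),
// and the flag is passed explicitly instead of being inferred from an empty set.
private static PartitionMap<PartitionStats> computeStats(
    Table table, List<ManifestFile> manifests, boolean incremental) {
  StructType partitionType = Partitioning.partitionType(table);
  PartitionMap<PartitionStats> statsMap = PartitionMap.create(table.specs());
  for (ManifestFile manifest : manifests) {
    collectStatsForManifest(table, manifest, partitionType, statsMap, incremental);
  }
  return statsMap;
}
```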
[PR] Fix(catalog): Only remove metadata files if TableOp created a new one [iceberg]
fivetran-raulpersa opened a new pull request, #13169: URL: https://github.com/apache/iceberg/pull/13169 Fixes #13014 In multiple Table ops (Glue, Dynamo, BigQuery and Nessie), a failing operation will clean up the metadata file. However, it should only do so if it was the one that created the metadata file in the first place. Otherwise, an operation like `registerTable` could corrupt other catalogs that rely on the existence of the metadata file -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
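A hedged sketch of the guard the PR describes (all names illustrative, not the exact TableOperations code):

```java
// Track whether this commit wrote a new metadata file; on failure, only
// delete the file if this commit created it, so registerTable's
// pre-existing metadata is never removed.
abstract class CommitSketch {
  abstract String writeNewMetadata();          // assumed helper
  abstract void persistPointer(String loc);    // assumed helper
  abstract void deleteFile(String loc);        // assumed helper

  void commit(boolean requiresNewMetadata, String existingMetadataLocation) {
    String metadataLocation = existingMetadataLocation; // e.g. registerTable
    boolean createdNewMetadata = false;
    if (requiresNewMetadata) {
      metadataLocation = writeNewMetadata();
      createdNewMetadata = true;
    }
    try {
      persistPointer(metadataLocation);
    } catch (RuntimeException e) {
      if (createdNewMetadata) {
        deleteFile(metadataLocation); // safe: this commit created the file
      }
      throw e;
    }
  }
}
```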
Re: [PR] Flink 1.19, 1.20: Remove the MiniClusterWithClientResource dependency [iceberg]
pvary merged PR #13165: URL: https://github.com/apache/iceberg/pull/13165 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] feat: set snapshot summary property in append action [iceberg-rust]
dentiny opened a new pull request, #1391: URL: https://github.com/apache/iceberg-rust/pull/1391 ## Which issue does this PR close? - Closes https://github.com/apache/iceberg-rust/issues/1329 ## What changes are included in this PR? This PR tries to do the same thing as the reverted [PR](https://github.com/apache/iceberg-rust/pull/1336), which added the capability to set snapshot summary properties. Following up on the thread https://github.com/apache/iceberg-rust/pull/1336#issuecomment-2915765649, in this PR I took a different approach, exposing the interfaces on the fast append action. ## Are these changes tested? Yes, I added new unit tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
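For context, the Java API models this the same way: the summary property is set on the snapshot-producing action itself via `SnapshotUpdate#set`. A minimal Java example:

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;

final class SummarySketch {
  private SummarySketch() {}

  static void appendWithSummary(Table table, DataFile dataFile) {
    table
        .newFastAppend()
        .appendFile(dataFile)
        .set("custom-key", "custom-value") // lands in the snapshot summary
        .commit();
  }
}
```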
[PR] fix(table): fix race condition in scanner [iceberg-go]
zeroshade opened a new pull request, #445: URL: https://github.com/apache/iceberg-go/pull/445 fixes #443 Eliminates potential race condition in scanning caused by the metrics evaluator used for manifest pruning. Since the stats are only used for a given run of Eval, we don't need to modify the bound version of the `inclusiveMetricsEval` object and can instead copy the type and expression (cheap) to populate for the call. This avoids the race condition without giving up performance (no locking needed). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
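An illustration (in Java, with stand-in types) of the pattern the fix uses: keep the shared evaluator immutable and give each eval call its own short-lived scratch object, so no locking is needed:

```java
import java.util.Map;

// Toy illustration: the shared object holds only immutable fields; mutable
// per-run state lives in a fresh object per call, so concurrent eval()
// calls cannot race.
final class MetricsEvaluator {
  private final String boundExpression; // stand-in for the bound predicate

  MetricsEvaluator(String boundExpression) {
    this.boundExpression = boundExpression;
  }

  boolean eval(Map<String, Long> fileStats) {
    return new EvalRun(boundExpression, fileStats).run();
  }

  private static final class EvalRun {
    private final String expr;
    private final Map<String, Long> stats; // per-run state, never shared

    EvalRun(String expr, Map<String, Long> stats) {
      this.expr = expr;
      this.stats = stats;
    }

    boolean run() {
      // real code would walk the bound predicate against the stats here
      return stats.containsKey(expr);
    }
  }
}
```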
[PR] fix(catalog/glue): Fix nil map assignment issue in glue catalog [iceberg-go]
lliangyu-lin opened a new pull request, #446: URL: https://github.com/apache/iceberg-go/pull/446 ### Description Calling NewCatalog on the glue catalog without any AWS properties causes the catalog properties to be nil ``` props: iceberg.Properties(glueOps.awsProperties), ``` Hence, when creating a staged table that needs to clone the catalog properties in https://github.com/apache/iceberg-go/blob/d51f8b8c5b8249a706467910d42ee51eb6032ccd/catalog/internal/utils.go#L137-L138, it results in a nil map assignment. The same happens in ``` func prepareProperties(icebergProperties iceberg.Properties, newMetadataLocation string) iceberg.Properties { glueProperties := maps.Clone(icebergProperties) ``` where, if the staged table has no properties in its metadata, the clone returns nil, again causing a nil map assignment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
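The defensive pattern the fix needs, shown as a Java-flavored sketch (in Go the failure mode is a runtime panic on assignment into a nil map; the Java analogue would be an NPE on a null map):

```java
import java.util.HashMap;
import java.util.Map;

final class PropsSketch {
  private PropsSketch() {}

  // Cloning a possibly-null property map must yield a usable, non-null map
  // before anything is assigned into it.
  static Map<String, String> cloneProps(Map<String, String> props) {
    return props == null ? new HashMap<>() : new HashMap<>(props);
  }
}
```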
Re: [PR] feat: support decompress gzip metadata [iceberg-cpp]
Fokko merged PR #108: URL: https://github.com/apache/iceberg-cpp/pull/108 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: support decompress gzip metadata [iceberg-cpp]
Fokko commented on PR #108: URL: https://github.com/apache/iceberg-cpp/pull/108#issuecomment-2917339595 LGTM @dongxiao1198, thanks for the review @wgtmac, @lidavidm and @yingcai-cy -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[I] Append-only option for Spark incremental append read that throws [iceberg]
smaheshwar-pltr opened a new issue, #13179: URL: https://github.com/apache/iceberg/issues/13179 ### Feature Request / Improvement Spark's [incremental append reading](https://iceberg.apache.org/docs/latest/spark-queries/#incremental-read) is useful for append-only incremental workflows. It's documented as only reading snapshots with "append" snapshot types - it ignores any snapshots in the range that don't have this type, e.g. `DELETE` snapshots. I think this is sensible default behaviour, but I'm wondering what the community thinks about allowing an `append-only` option to be set that *throws* if there are non-append snapshots in the range, to prevent the data integrity of a downstream table from being compromised in strictly append-only workflows. (It would be unset by default.) ### Query engine None ### Willingness to contribute - [x] I can contribute this improvement/feature independently - [ ] I would be willing to contribute this improvement/feature with guidance from the Iceberg community - [ ] I cannot contribute this improvement/feature at this time -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
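How the proposed option might sit next to the existing incremental-read options (`append-only` is the hypothetical new flag; the two snapshot-id options already exist in the Spark docs linked above):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

final class IncrementalReadSketch {
  private IncrementalReadSketch() {}

  static Dataset<Row> readAppendsOnly(SparkSession spark) {
    return spark.read()
        .format("iceberg")
        .option("start-snapshot-id", "10963874102873") // existing option (exclusive)
        .option("end-snapshot-id", "63874143573109")   // existing option (inclusive)
        .option("append-only", "true")                 // proposed: throw on non-append snapshots
        .load("db.table");
  }
}
```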
Re: [PR] Incremental Append Scan [iceberg-python]
smaheshwar-pltr commented on PR #2031: URL: https://github.com/apache/iceberg-python/pull/2031#issuecomment-2917350349 Put up https://github.com/apache/iceberg/issues/13179 re an append-only option -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Append-only option for Spark incremental append read that throws [iceberg]
smaheshwar-pltr commented on issue #13179: URL: https://github.com/apache/iceberg/issues/13179#issuecomment-2917349624 Not sure who to ping here so CC-ing folks I see on previous Spark incremental read work 😄: @aokolnychyi @huaxingao @szehon-ho @RussellSpitzer, please let me know your thoughts! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[I] Build: Fix errorprone warnings [iceberg]
ajantha-bhat opened a new issue, #13178: URL: https://github.com/apache/iceberg/issues/13178 ### Apache Iceberg version None ### Query engine None ### Please describe the bug 🐞 `./gradlew clean build -x test -x integrationTest --no-build-cache` A default-profile build with the above command results in a few errorprone warnings. (Screenshot: https://github.com/user-attachments/assets/77a1912d-8b36-4910-a707-d0f1fc64a4ee) Good to keep the build green. ### Willingness to contribute - [ ] I can contribute a fix for this bug independently - [ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community - [ ] I cannot contribute a fix for this bug at this time -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: set snapshot summary property in append action [iceberg-rust]
dentiny commented on code in PR #1391: URL: https://github.com/apache/iceberg-rust/pull/1391#discussion_r2112436407 ## Cargo.lock: ## @@ -3712,7 +3712,7 @@ dependencies = [ "iceberg-catalog-rest", "iceberg-datafusion", "iceberg_test_utils", - "ordered-float 4.6.0", + "ordered-float 2.10.1", Review Comment: I didn't make the change; `cargo test` did -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] build(deps): bump the gomod_updates group with 3 updates [iceberg-go]
zeroshade merged PR #440: URL: https://github.com/apache/iceberg-go/pull/440 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Feature request: expose `DataFile` builder as public [iceberg-rust]
dentiny closed issue #1392: Feature request: expose `DataFile` builder as public URL: https://github.com/apache/iceberg-rust/issues/1392 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Feature request: expose `DataFile` builder as public [iceberg-rust]
dentiny commented on issue #1392: URL: https://github.com/apache/iceberg-rust/issues/1392#issuecomment-2917202154 Sorry, I got fooled by GPT and DeepSeek; the default visibility is public. Ref: https://docs.rs/derive_builder/latest/derive_builder/#setter-visibility -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Build: Fix errorprone warnings [iceberg]
TomasDarquier commented on issue #13178: URL: https://github.com/apache/iceberg/issues/13178#issuecomment-2917209543 - [x] I can contribute a fix for this bug independently Hi, I’d like to work on this issue — it seems like a great fit for my first contribution. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Build: Fix errorprone warnings [iceberg]
ajantha-bhat commented on issue #13178: URL: https://github.com/apache/iceberg/issues/13178#issuecomment-2917115167 Good first issue for new contributors. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[I] Feature request: expose `DataFile` builder as public [iceberg-rust]
dentiny opened a new issue, #1392: URL: https://github.com/apache/iceberg-rust/issues/1392 ### Is your feature request related to a problem or challenge? My workflow looks like: - Accumulate arrow batches in-memory, flush to the local filesystem when they reach a certain threshold - Periodically import the local parquet files into the iceberg table iceberg-rust doesn't provide a util function or class to generate `DataFile`, which is needed for the fast append action (apart from `DataFileWriter`), so I have to read back the whole arrow batches and write them with `DataFileWriter`, which is definitely inefficient. I would like to expose `DataFileBuilder` as a public struct, so I could construct these fields myself. ### Describe the solution you'd like Expose `DataFileBuilder` as public ### Willingness to contribute I can contribute to this feature independently -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
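For comparison, the Java API exposes exactly this kind of public builder (`org.apache.iceberg.DataFiles`), which lets already-written files be registered without reading them back:

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Table;

final class ImportSketch {
  private ImportSketch() {}

  // Build a DataFile from file-level metadata collected at write time,
  // then register it with a fast append; the parquet data is never re-read.
  static void importParquetFile(Table table, String path, long sizeInBytes, long recordCount) {
    DataFile file =
        DataFiles.builder(table.spec())
            .withPath(path)
            .withFormat(FileFormat.PARQUET)
            .withFileSizeInBytes(sizeInBytes)
            .withRecordCount(recordCount)
            .build();
    table.newFastAppend().appendFile(file).commit();
  }
}
```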
Re: [PR] Flink 1.19, 1.20: Remove the MiniClusterWithClientResource dependency [iceberg]
pvary commented on PR #13165: URL: https://github.com/apache/iceberg/pull/13165#issuecomment-2916219167 Merged to main. Thanks @JeonDaehong for the backport and @nastra for the review. >After my previous PR was merged, I was expecting to see my name listed on the [contributors page](https://github.com/apache/iceberg/graphs/contributors), but it hasn’t shown up yet. > > Is merging into main not enough for that? > Is there anything else I need to do to be listed? I see people with 7-8 commits on the list. Maybe that's the threshold? > I’m looking forward to continuing to contribute to the Iceberg community and actively helping resolve issues! We are looking forward to your contributions! Welcome to the team! 😄 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: add retryable property for Error [iceberg-rust]
ZENOTME commented on code in PR #1383: URL: https://github.com/apache/iceberg-rust/pull/1383#discussion_r2111914945 ## crates/iceberg/src/error.rs: ## @@ -134,6 +134,8 @@ pub struct Error { source: Option<anyhow::Error>, backtrace: Backtrace, + +retryable: bool, Review Comment: > This is quite confusing. Whether it's retryable should be determined by consumer by ErrorKind. `temporary: bool` +1. I think the basic idea here is to keep the retryable or temporary property orthogonal to ErrorKind. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
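The two designs under discussion, sketched in Java (kind names are illustrative, not the iceberg-rust enum):

```java
enum ErrorKind { CATALOG_COMMIT_CONFLICTS, DATA_INVALID, UNEXPECTED }

final class RetrySketch {
  private RetrySketch() {}

  // (a) the consumer derives retryability from the kind alone
  static boolean retryableByKind(ErrorKind kind) {
    return kind == ErrorKind.CATALOG_COMMIT_CONFLICTS;
  }
}

// (b) the producer marks each error; the flag stays orthogonal to the kind,
// so two errors of the same kind can differ in retryability
final class MarkedError extends RuntimeException {
  final ErrorKind kind;
  final boolean retryable;

  MarkedError(ErrorKind kind, boolean retryable, String message) {
    super(message);
    this.kind = kind;
    this.retryable = retryable;
  }
}
```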
Re: [PR] Core: Fix incremental compute of partition stats with COW deletes [iceberg]
ajantha-bhat commented on code in PR #13163: URL: https://github.com/apache/iceberg/pull/13163#discussion_r2111918206 ## core/src/main/java/org/apache/iceberg/PartitionStatsHandler.java: ## @@ -336,16 +336,29 @@ private static PartitionMap<PartitionStats> computeStatsDiff( Sets.newHashSet( SnapshotUtil.ancestorIdsBetween( toSnapshot.snapshotId(), fromSnapshot.snapshotId(), table::snapshot)); -Predicate<ManifestFile> manifestFilePredicate = -manifestFile -> snapshotIdsRange.contains(manifestFile.snapshotId()); -return computeStats(table, toSnapshot, manifestFilePredicate, true /* incremental */); +return computeStats(table, toSnapshot, snapshotIdsRange); } private static PartitionMap<PartitionStats> computeStats( - Table table, Snapshot snapshot, Predicate<ManifestFile> predicate, boolean incremental) { + Table table, Snapshot snapshot, Set<Long> snapshotIdsRange) { Review Comment: Done. I had considered it earlier as well, but decided to leave it as-is since I felt I might be overthinking the refactoring of this private method. But it makes sense, as it is very clean now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Support commit retrie [iceberg-rust]
ZENOTME closed issue #964: Support commit retrie URL: https://github.com/apache/iceberg-rust/issues/964 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink 1.19, 1.20: Remove the MiniClusterWithClientResource dependency [iceberg]
JeonDaehong commented on PR #13165: URL: https://github.com/apache/iceberg/pull/13165#issuecomment-2916382580 > Merged to main. Thanks @JeonDaehong for the backport and @nastra for the review. > > > After my previous PR was merged, I was expecting to see my name listed on the [contributors page](https://github.com/apache/iceberg/graphs/contributors), but it hasn’t shown up yet. > > Is merging into main not enough for that? > > Is there anything else I need to do to be listed? > > I see people with 7-8 commits on the list. Maybe that's the threshold? > > > I’m looking forward to continuing to contribute to the Iceberg community and actively helping resolve issues! > > We are looking forward to your contributions! Welcome to the team! 😄 Thank you for merging! I’ll continue to do my best and stay active! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Proposal: IRC Events endpoint [iceberg]
c-thiel commented on code in PR #12584: URL: https://github.com/apache/iceberg/pull/12584#discussion_r2112113418 ## open-api/rest-catalog-open-api.yaml: ## @@ -3964,6 +4172,302 @@ components: metadata: $ref: '#/components/schemas/TableMetadata' +EventsResponse: + type: object + required: +- highest-processed-timestamp-ms +- events + properties: +next-page-token: + $ref: "#/components/schemas/PageToken" +highest-processed-timestamp-ms: + description: > +The highest timestamp processed by the server when generating this response. +This may not necessarily appear in the returned changes if it was filtered out. + +Clients can use this value as the `after-timestamp-ms` parameter in subsequent +requests to continue retrieving changes after this point. + type: integer + format: int64 +events: + type: array + items: +$ref: "#/components/schemas/Event" + +Event: + type: object + required: +- event-id +- request-id +- event-count +- timestamp-ms +- operation + properties: +event-id: + type: string + description: Unique ID of this event. Clients should perform deduplication based on this ID. +request-id: + description: ID of the request this change belongs to. + type: string +event-count: + type: integer + description: Number of events in the request / batch of events +timestamp-ms: + type: integer + format: int64 + description: > +Timestamp when this transaction occurred (epoch milliseconds). +Timestamps are not guaranteed to be unique. Typically all events in +a transaction will have the same timestamp. +actor-chain: + type: array + items: +$ref: "#/components/schemas/Actor" + description: > +An ordered list of actors involved in the operation, with the most direct actor +(the one who actually performed the operation) first, followed by delegating actors +in order. For example, if a service account (actor[0]) performed an operation +on behalf of a role (actor[1]) assumed by a user (actor[2]), the chain represents +this delegation path.
+operation: + type: object + discriminator: +propertyName: operation-type +mapping: + create-table: "#/components/schemas/CreateTableOperation" + register-table: "#/components/schemas/RegisterTableOperation" + drop-table: "#/components/schemas/DropTableOperation" + update-table: "#/components/schemas/UpdateTableOperation" + rename-table: "#/components/schemas/RenameTableOperation" + create-view: "#/components/schemas/CreateViewOperation" + drop-view: "#/components/schemas/DropViewOperation" + replace-view: "#/components/schemas/ReplaceViewOperation" + rename-view: "#/components/schemas/RenameTableOperation" + create-namespace: "#/components/schemas/CreateNamespaceOperation" + update-namespace-properties: "#/components/schemas/UpdateNamespacePropertiesOperation" + drop-namespace: "#/components/schemas/DropNamespaceOperation" + custom: "#/components/schemas/CustomOperation" + oneOf: +- $ref: "#/components/schemas/CreateTableOperation" +- $ref: "#/components/schemas/RegisterTableOperation" +- $ref: "#/components/schemas/DropTableOperation" +- $ref: "#/components/schemas/UpdateTableOperation" +- $ref: "#/components/schemas/RenameTableOperation" +- $ref: "#/components/schemas/CreateViewOperation" +- $ref: "#/components/schemas/DropViewOperation" +- $ref: "#/components/schemas/ReplaceViewOperation" +- $ref: "#/components/schemas/RenameTableOperation" +- $ref: "#/components/schemas/CreateNamespaceOperation" +- $ref: "#/components/schemas/UpdateNamespacePropertiesOperation" +- $ref: "#/components/schemas/DropNamespaceOperation" +- $ref: "#/components/schemas/CustomOperation" + +CreateTableOperation: + required: +- operation-type +- identifier +- table-uuid +- metadata + properties: +operation-type: + $ref: "#/components/schemas/OperationType" + const: "create-table" +identifier: + $ref: "#/components/schemas/TableIdentifier" +table-uuid: + type: string + format: uuid +metadata: + $ref: "#/components/schemas/TableMetadata" + +RegisterTableOperation
Re: [PR] Proposal: IRC Events endpoint [iceberg]
c-thiel commented on code in PR #12584: URL: https://github.com/apache/iceberg/pull/12584#discussion_r2112129175 ## open-api/rest-catalog-open-api.py: ## @@ -1389,6 +1558,101 @@ class CommitTableResponse(BaseModel): metadata: TableMetadata +class EventsResponse(BaseModel): +next_page_token: Optional[PageToken] = Field(None, alias='next-page-token') +highest_processed_timestamp_ms: int = Field( +..., +alias='highest-processed-timestamp-ms', +description='The highest timestamp processed by the server when generating this response. This may not necessarily appear in the returned changes if it was filtered out.\nClients can use this value as the `after-timestamp-ms` parameter in subsequent requests to continue retrieving changes after this point.\n', +) +events: List[Event] + + +class Event(BaseModel): +event_id: str = Field( +..., +alias='event-id', +description='Unique ID of this event. Clients should perform deduplication based on this ID.', +) +request_id: str = Field( +..., alias='request-id', description='ID of the request this change belongs to.' +) +event_count: int = Field( +..., +alias='event-count', +description='Number of events in the request / batch of events', +) +timestamp_ms: int = Field( +..., +alias='timestamp-ms', +description='Timestamp when this transaction occurred (epoch milliseconds). Timestamps are not guaranteed to be unique. Typically all events in a transaction will have the same timestamp.\n', +) +actor_chain: Optional[List[Actor]] = Field( +None, +alias='actor-chain', +description='An ordered list of actors involved in the operation, with the most direct actor (the one who actually performed the operation) first, followed by delegating actors in order. For example, if a service account (actor[0]) performed an operation on behalf of a role (actor[1]) assumed by a user (actor[2]), the chain represents this delegation path.\n', +) +operation: Union[ +CreateTableOperation, +RegisterTableOperation, +DropTableOperation, +UpdateTableOperation, +RenameTableOperation, +CreateViewOperation, +DropViewOperation, +ReplaceViewOperation, +CreateNamespaceOperation, +UpdateNamespacePropertiesOperation, +DropNamespaceOperation, +CustomOperation, +] = Field(..., discriminator='operation_type') + + +class CreateTableOperation(BaseModel): +operation_type: Literal['create-table'] = Field( +..., alias='operation-type', const=True +) +identifier: TableIdentifier +table_uuid: UUID = Field(..., alias='table-uuid') +metadata: TableMetadata + + +class RegisterTableOperation(BaseModel): +operation_type: Literal['register-table'] = Field( +..., alias='operation-type', const=True +) +identifier: TableIdentifier +table_uuid: UUID = Field(..., alias='table-uuid') +metadata: TableMetadata Review Comment: Metadata is only returned for the initial `createTable` event. Other events only contain incremental updates. As this event is emitted at the time of table creation, TableMetadata is typically quite short, so I don't see a big problem storing it. REST Catalogs already store Metadata somewhere internally, and this is certainly one of the smaller ones as it is the first in the lifecycle of the table. Use-cases are outlined in the docs for this PR; I do see value especially for Federation & Workflow triggers. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Flink: port range distribution to v2 iceberg sink [iceberg]
rodmeneses commented on code in PR #12071: URL: https://github.com/apache/iceberg/pull/12071#discussion_r2112131384 ## flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergSink.java: ## @@ -645,72 +711,135 @@ private DataStream distributeDataStream(DataStream input) { DistributionMode mode = flinkWriteConf.distributionMode(); Schema schema = table.schema(); PartitionSpec spec = table.spec(); +SortOrder sortOrder = table.sortOrder(); + LOG.info("Write distribution mode is '{}'", mode.modeName()); switch (mode) { case NONE: -if (equalityFieldIds.isEmpty()) { - return input; -} else { - LOG.info("Distribute rows by equality fields, because there are equality fields set"); - return input.keyBy(new EqualityFieldKeySelector(schema, flinkRowType, equalityFieldIds)); -} - +return distributeDataStreamByNoneDistributionMode(input, schema); case HASH: -if (equalityFieldIds.isEmpty()) { - if (table.spec().isUnpartitioned()) { -LOG.warn( -"Fallback to use 'none' distribution mode, because there are no equality fields set " -+ "and table is unpartitioned"); -return input; - } else { -if (BucketPartitionerUtil.hasOneBucketField(spec)) { - return input.partitionCustom( - new BucketPartitioner(spec), - new BucketPartitionKeySelector(spec, schema, flinkRowType)); -} else { - return input.keyBy(new PartitionKeySelector(spec, schema, flinkRowType)); -} - } -} else { - if (spec.isUnpartitioned()) { -LOG.info( -"Distribute rows by equality fields, because there are equality fields set " -+ "and table is unpartitioned"); -return input.keyBy( -new EqualityFieldKeySelector(schema, flinkRowType, equalityFieldIds)); - } else { -for (PartitionField partitionField : spec.fields()) { - Preconditions.checkState( - equalityFieldIds.contains(partitionField.sourceId()), - "In 'hash' distribution mode with equality fields set, partition field '%s' " - + "should be included in equality fields: '%s'", - partitionField, - equalityFieldColumns); -} -return input.keyBy(new PartitionKeySelector(spec, schema, flinkRowType)); - } -} - +return distributeDataStreamByHashDistributionMode(input, schema, spec); case RANGE: -if (equalityFieldIds.isEmpty()) { - LOG.warn( - "Fallback to use 'none' distribution mode, because there are no equality fields set " - + "and {}=range is not supported yet in flink", - WRITE_DISTRIBUTION_MODE); - return input; -} else { - LOG.info( - "Distribute rows by equality fields, because there are equality fields set " - + "and{}=range is not supported yet in flink", - WRITE_DISTRIBUTION_MODE); - return input.keyBy(new EqualityFieldKeySelector(schema, flinkRowType, equalityFieldIds)); -} - +return distributeDataStreamByRangeDistributionMode(input, schema, spec, sortOrder); default: throw new RuntimeException("Unrecognized " + WRITE_DISTRIBUTION_MODE + ": " + mode); } } + private DataStream distributeDataStreamByNoneDistributionMode( + DataStream input, Schema iSchema) { +if (equalityFieldIds.isEmpty()) { + return input; +} else { + LOG.info("Distribute rows by equality fields, because there are equality fields set"); + return input.keyBy(new EqualityFieldKeySelector(iSchema, flinkRowType, equalityFieldIds)); +} + } + + private DataStream distributeDataStreamByHashDistributionMode( + DataStream input, Schema iSchema, PartitionSpec partitionSpec) { +if (equalityFieldIds.isEmpty()) { + if (partitionSpec.isUnpartitioned()) { +LOG.warn( +"Fallback to use 'none' distribution mode, because there are no equality fields set " ++ "and table is unpartitioned"); +return input; + } else { +return 
input.keyBy(new PartitionKeySelector(partitionSpec, iSchema, flinkRowType)); + } +} else { + if (partitionSpec.isUnpartitioned()) { +LOG.info( +"Distribute rows by equality fields, because there are equality fields set " ++ "and table is unpartitioned"); +return input.keyBy(new EqualityFieldKeySelector(iSchema, flinkRowType, equalityFieldIds)); + } else { +for (PartitionField partitionField : partitionSpec.fields()) { + Preconditions.checkState( + equalityFieldIds.c
Re: [PR] Proposal: IRC Events endpoint [iceberg]
c-thiel commented on code in PR #12584: URL: https://github.com/apache/iceberg/pull/12584#discussion_r2112110143 ## open-api/rest-catalog-open-api.yaml: ## @@ -3405,6 +3488,131 @@ components: allOf: - $ref: '#/components/schemas/ScanTasks' +Actor: + type: object + description: Represents an actor that performed operations in the catalog + required: +- type +- actor-id + properties: +type: + type: string + description: The type of actor (e.g., "user", "principal", "role") +actor-id: + type: string + description: The identifier of the actor +metadata: + type: object + description: Additional metadata about the actor that may vary between implementations + additionalProperties: +type: string + example: +type: "role" +actor-id: "1234" +metadata: + permissions: "read-write" + expiration: "2025-06-07T12:00:00Z" + name: "Admin Role" +assumed-by: + type: "user" + actor-id: "4567" + metadata: +name: "Peter Cold" +email: "pe...@example.com" + +GetEventsRequest: + type: object + properties: +next-page-token: + $ref: "#/components/schemas/PageToken" +page-size: + type: integer + format: int32 + description: > +The maximum number of events to return in a single response. +If not provided, the server may choose a default page size. +after-timestamp-ms: + type: integer + format: int64 + description: > +The (server) timestamp in milliseconds to start consuming events from (inclusive). +If not provided, the first available timestamp is used. Review Comment: We do have the `next-page-token` for pagination with opaque tokens. This filter here is meant merely as a rough parameter to limit events obtained in the first request. Do you think a more precise description is missing? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
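A sketch of the client loop this design implies (the client and response classes are hypothetical; the field names follow the draft spec quoted above):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Poll the events endpoint, deduplicating on event-id as the spec requires,
// and resume from the server-reported high-water mark between calls.
final class EventsPollerSketch {
  private final Set<String> seenEventIds = new HashSet<>();
  private long afterTimestampMs;

  void pollOnce(EventsClient client) {           // hypothetical client
    String pageToken = null;
    do {
      EventsPage page = client.getEvents(afterTimestampMs, pageToken);
      for (EventsPage.Event event : page.events()) {
        if (seenEventIds.add(event.eventId())) { // dedupe on event-id
          handle(event);
        }
      }
      afterTimestampMs = page.highestProcessedTimestampMs();
      pageToken = page.nextPageToken();
    } while (pageToken != null);
  }

  private void handle(EventsPage.Event event) { /* apply to the downstream system */ }

  interface EventsClient {
    EventsPage getEvents(long afterTimestampMs, String pageToken);
  }

  interface EventsPage {
    List<Event> events();
    long highestProcessedTimestampMs();
    String nextPageToken();

    interface Event {
      String eventId();
    }
  }
}
```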
Re: [I] INT96 timestamp is read as OffsetDateTime, not LocalDateTime [iceberg]
anesterenok commented on issue #12266: URL: https://github.com/apache/iceberg/issues/12266#issuecomment-2916672959 While I'm quite in favor of deprecating int96 storage, I must say that it's only a projection of Iceberg types. If one uses the Iceberg `timestamp` type, the value must be considered a local timestamp and processed as `LocalDateTime`. If one uses the Iceberg `timestamptz` type, the value must be considered an instant and processed as `OffsetDateTime` (well, you'd be better off using `Instant` actually, as no zone offset could really be stored for `timestamptz`). Hence, instead of a single TimestampInt96Reader returning `OffsetDateTime`, there should actually have been 2 readers: -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
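A sketch of what those two readers could look like (class and method names assumed; the Parquet INT96 layout is 8 little-endian bytes of nanos-of-day followed by 4 bytes of Julian day, stored relative to UTC):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

final class Int96Readers {
  private static final int UNIX_EPOCH_JULIAN_DAY = 2_440_588;

  private Int96Readers() {}

  // Iceberg `timestamp`: surface the stored fields directly as a local value.
  static LocalDateTime readAsTimestamp(byte[] int96) {
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong();
    int julianDay = buf.getInt();
    return LocalDateTime.of(
        LocalDate.ofEpochDay(julianDay - UNIX_EPOCH_JULIAN_DAY),
        LocalTime.ofNanoOfDay(nanosOfDay));
  }

  // Iceberg `timestamptz`: the same instant, anchored to UTC.
  static OffsetDateTime readAsTimestamptz(byte[] int96) {
    return readAsTimestamp(int96).atOffset(ZoneOffset.UTC);
  }
}
```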
Re: [PR] Spark: Add 'skip_file_list' option to RewriteTablePathProcedure for optional file-list generation [iceberg]
manuzhang commented on code in PR #12844: URL: https://github.com/apache/iceberg/pull/12844#discussion_r2112148631 ## api/src/main/java/org/apache/iceberg/actions/RewriteTablePath.java: ## @@ -86,6 +86,16 @@ public interface RewriteTablePath extends Action
Re: [PR] Docs, Flink: Fix equality fields requirement in upsert mode [iceberg]
pvary merged PR #13127: URL: https://github.com/apache/iceberg/pull/13127 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Docs, Flink: Fix equality fields requirement in upsert mode [iceberg]
pvary commented on PR #13127: URL: https://github.com/apache/iceberg/pull/13127#issuecomment-2916483984 Merged to main. Thanks @manuzhang for the PR and @mxm for the review! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: add asan and ubsan support to cmake [iceberg-cpp]
raulcd commented on code in PR #107: URL: https://github.com/apache/iceberg-cpp/pull/107#discussion_r2112029921 ## .github/workflows/sanitizer_test.yml: ## @@ -0,0 +1,63 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +name: ASAN and UBSAN Tests + +on: + push: +branches: + - '**' + - '!dependabot/**' +tags: + - '**' + pull_request: + +concurrency: + group: ${{ github.repository }}-${{ github.head_ref || github.sha }}-${{ github.workflow }} + cancel-in-progress: true + +permissions: + contents: read + +jobs: + sanitizer-test: +name: "ASAN and UBSAN Tests" +runs-on: ubuntu-24.04 +strategy: + fail-fast: false +steps: + - name: Checkout iceberg-cpp +uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 + - name: Configure and Build with ASAN & UBSAN +run: | + mkdir build && cd build + cmake .. -DCMAKE_BUILD_TYPE=Debug -DICEBERG_ENABLE_ASAN=ON -DICEBERG_ENABLE_UBSAN=ON + cmake --build . --verbose + - name: Run Tests +working-directory: build +env: + ASAN_OPTIONS: log_path=out.log:detect_leaks=1:symbolize=1:strict_string_checks=1:halt_on_error=0:detect_container_overflow=0 Review Comment: I'll let @wgtmac and yourself decide but I would rather merge a failing CI job on this case, knowing that it has to be fixed on a subsequent PR, than a false positive. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Proposal: IRC Events endpoint [iceberg]
c-thiel commented on code in PR #12584: URL: https://github.com/apache/iceberg/pull/12584#discussion_r2112105575

## open-api/rest-catalog-open-api.yaml: ##
@@ -3041,6 +3096,34 @@ components:
         - $ref: '#/components/schemas/AddViewVersionUpdate'
         - $ref: '#/components/schemas/SetCurrentViewVersionUpdate'
 
+    OperationType:
+      type: string
+      description: >
+        Defines the type of operation, either a standard operation defined by the Iceberg REST API
+        specification or a custom operation specific to an implementation.
+      anyOf:
+        - enum:
+            - create-table

Review Comment: The matrix approach that you propose can lead to invalid combinations; already today, "register" wouldn't be easy to distinguish from "create". Also, "replace" is only valid for views, not for namespaces, for example. Because of this, I would prefer to stick to the current long list. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
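To make the invalid-combination concern concrete, here is a hedged sketch of what a matrix-style shape could look like; the field names below are hypothetical and do not come from the proposal. Crossing a resource dimension with an action dimension admits pairs the spec never defines, which is why the flat enum names only valid operations:

```yaml
# Hypothetical matrix-style alternative (NOT what the PR proposes):
OperationType:
  type: object
  properties:
    resource:
      type: string
      enum: [table, view, namespace]
    action:
      type: string
      enum: [create, register, replace, drop]
# Invalid or ambiguous combinations the cross product allows:
#   {resource: namespace, action: replace}  # replace applies only to views
#   {resource: table, action: register}     # hard to distinguish from
#   {resource: table, action: create}       #   plain create, as noted above
# The flat list (create-table, register-table, ...) enumerates only the
# operations the spec actually defines.
```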
Re: [I] Spark, Avro and Iceberg Timestamp, Timestamp NTZ definition need more clarifications [iceberg]
RussellSpitzer commented on issue #12751: URL: https://github.com/apache/iceberg/issues/12751#issuecomment-2916840926 Is this an issue for data files written with Avro instead of Parquet? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Infra: Set 1.9.1 to Latest Release [iceberg]
RussellSpitzer merged PR #13177: URL: https://github.com/apache/iceberg/pull/13177 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Fix a cast that is too narrow [iceberg]
angelo-DNAStack commented on PR #12743: URL: https://github.com/apache/iceberg/pull/12743#issuecomment-2917511977 Checking in on this PR again; please let me know if any additional changes are required. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Docs: Adds 1.9.1 Versioned JavaDocs [iceberg]
RussellSpitzer commented on PR #13173: URL: https://github.com/apache/iceberg/pull/13173#issuecomment-2917518749 Thanks @ajantha-bhat and @danielcweeks for reviewing! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Docs: Adds 1.9.1 Versioned JavaDocs [iceberg]
RussellSpitzer merged PR #13173: URL: https://github.com/apache/iceberg/pull/13173 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Infra: Set 1.9.1 to Latest Release [iceberg]
RussellSpitzer commented on PR #13177: URL: https://github.com/apache/iceberg/pull/13177#issuecomment-2917517509 Thanks @ajantha-bhat, @flyrain + @singhpk234 for the reviews! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Site: Updates for 1.9.1 Release [iceberg]
RussellSpitzer merged PR #13176: URL: https://github.com/apache/iceberg/pull/13176 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Site: Updates for 1.9.1 Release [iceberg]
RussellSpitzer commented on PR #13176: URL: https://github.com/apache/iceberg/pull/13176#issuecomment-2917516161 Thanks @ajantha-bhat + @danielcweeks for review! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Site: Add Versioned Docs for 1.9.1 [iceberg]
RussellSpitzer merged PR #13172: URL: https://github.com/apache/iceberg/pull/13172 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Site: Add Versioned Docs for 1.9.1 [iceberg]
RussellSpitzer commented on PR #13172: URL: https://github.com/apache/iceberg/pull/13172#issuecomment-2917520514 Thanks @ajantha-bhat + @danielcweeks for reviewing! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] SPARK: Remove dependency on hadoop's filesystem class from remove orphan files [iceberg]
RussellSpitzer commented on code in PR #12254: URL: https://github.com/apache/iceberg/pull/12254#discussion_r2112716340

## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveOrphanFilesAction.java: ##
@@ -107,9 +115,12 @@ public abstract class TestRemoveOrphanFilesAction extends TestBase {
   protected Map properties;
   @Parameter private int formatVersion;
 
-  @Parameters(name = "formatVersion = {0}")
+  @Parameter(index = 1)
+  private boolean usePrefixListing;
+
+  @Parameters(name = "formatVersion = {0}, usePrefixListing = {1}")
   protected static List parameters() {
-    return Arrays.asList(2, 3);
+    return Arrays.asList(new Object[] {2, true}, new Object[] {2, false}, new Object[] {3, false});
   }

Review Comment: I mostly just don't want to have another explicit "version number". I'm fine even if we only test on the default version number, really. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] SPARK: Remove dependency on hadoop's filesystem class from remove orphan files [iceberg]
RussellSpitzer commented on code in PR #12254: URL: https://github.com/apache/iceberg/pull/12254#discussion_r2112725064

## spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveOrphanFilesAction.java: ##
@@ -344,6 +374,8 @@ public void testWapFilesAreKept() {
     // wap write
     df.select("c1", "c2", "c3").write().format("iceberg").mode("append").save(tableLocation);
 
+    spark.conf().unset(SparkSQLProperties.WAP_ID);

Review Comment: We should be using withSQLConf for this test then, like:
```java
withSQLConf(Map.of(SparkSQLProperties.WAP_ID, "1"), () -> {
  // wap write
  df.select("c1", "c2", "c3").write().format("iceberg").mode("append").save(tableLocation);

  Dataset<Row> resultDF = spark.read().format("iceberg").load(tableLocation);
  List<ThreeColumnRecord> actualRecords =
      resultDF.as(Encoders.bean(ThreeColumnRecord.class)).collectAsList();

  // TODO: currently fails because DVs delete stuff from WAP branch
  assertThat(actualRecords)
      .as("Should not return data from the staged snapshot")
      .isEqualTo(records);
});
```
or a cloned Spark Session. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] fix(Catalog): Handle NotFound exception for missing metadata file [iceberg]
nastra commented on code in PR #13143: URL: https://github.com/apache/iceberg/pull/13143#discussion_r2112280095

## core/src/test/java/org/apache/iceberg/catalog/CatalogTests.java: ##
@@ -959,6 +959,16 @@ public void testLoadMissingTable() {
         .hasMessageStartingWith("Table does not exist: ns.tbl");
   }
 
+  @Test
+  public void testLoadTableWithMissingMetadatafile() {
+    C catalog = catalog();
+
+    assertThat(catalog.tableExists(TBL)).as("Table should not exist").isFalse();
+    assertThatThrownBy(() -> catalog.loadTable(TBL))
+        .isInstanceOf(NoSuchTableException.class)

Review Comment: the test doesn't reproduce the issue you're describing, and it passes even without the fix that is being proposed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Docs: Adds 1.9.1 Versioned JavaDocs [iceberg]
RussellSpitzer opened a new pull request, #13173: URL: https://github.com/apache/iceberg/pull/13173 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.4: streaming-skip-overwrite-snapshots fix [iceberg]
huaxingao merged PR #13168: URL: https://github.com/apache/iceberg/pull/13168 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Spark 3.4: streaming-skip-overwrite-snapshots fix [iceberg]
huaxingao commented on PR #13168: URL: https://github.com/apache/iceberg/pull/13168#issuecomment-2916961579 Merged. Thanks @wypoon for the PR! Thanks @singhpk234 for the review! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Site: Add Versioned Docs for 1.9.1 [iceberg]
RussellSpitzer opened a new pull request, #13172: URL: https://github.com/apache/iceberg/pull/13172 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[I] ERROR: Procedure system.refresh_table not found [iceberg]
vishwanthp12 opened a new issue, #13171: URL: https://github.com/apache/iceberg/issues/13171

Command I'm using in spark-sql: `CALL iceberg_catalog.system.refresh_table('iceberg_catalog.$db.$table');`

I'm using a JDBC catalog configured on MariaDB (MySQL). Here is my spark-defaults.conf:

```
# Iceberg extensions
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

# JDBC catalog for MariaDB
spark.sql.catalog.iceberg_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_catalog.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog
spark.sql.catalog.iceberg_catalog.uri=jdbc:mysql://localhost:3306/iceberg_catalog
spark.sql.catalog.iceberg_catalog.jdbc.user=iceberg
spark.sql.catalog.iceberg_catalog.jdbc.password=@xxx
spark.sql.catalog.iceberg_catalog.warehouse=s3://xxx-xxx-x/spark-test/iceberg-warehouse
spark.sql.catalog.iceberg_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
```

The error I get from the CALL command is: ERROR: Procedure system.refresh_table not found

What am I missing? I also tried switching to a Hadoop catalog, with the same issue. Can someone share their input on this, please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Add basic classes for writing table format-version 4 [iceberg]
ajantha-bhat commented on code in PR #13123: URL: https://github.com/apache/iceberg/pull/13123#discussion_r2112311313

## api/src/test/java/org/apache/iceberg/TestHelpers.java: ##
@@ -54,7 +54,7 @@ public class TestHelpers {
   private TestHelpers() {}
 
-  public static final int MAX_FORMAT_VERSION = 3;
+  public static final int MAX_FORMAT_VERSION = 4;

Review Comment: I think it's worth focusing on what's technically correct rather than solely relying on what has been done previously. Since we publish nightly builds, there's a chance users could start creating v4 tables before any spec is officially finalized. With the 1.10.0 discussions already underway, this change could potentially be included in a release without any approved v4 spec. My suggestion was simply to finalize at least one v4 spec change before beginning v4 development on the main branch, so that users have a clear understanding of what features come with enabling v4. That said, I may be overthinking this or being too cautious 😅. Please feel free to go with the broader consensus from the community on this PR (other reviewers' opinions). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Spec: Mark version 3 as completed [iceberg]
ajantha-bhat opened a new pull request, #13175: URL: https://github.com/apache/iceberg/pull/13175 Voted on the mailing list: https://lists.apache.org/thread/9oncq0j0222nrm6scvm89xdq3gozg24p -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org