[PR] Kafka Connect: Add config to route to tables using topic name [iceberg]
xiasongh opened a new pull request, #11313: URL: https://github.com/apache/iceberg/pull/11313 Add a new config `iceberg.tables.route-pattern` to dynamically route records to Iceberg tables using the Kafka topic name. Closes #11163
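For illustration, a sink configured with the new option might look like the snippet below. The `{topic}` placeholder comes from the PR's implementation; the property value and namespace prefix here are made-up examples, not taken from the PR:

```properties
# Hypothetical example: route records from each topic to a table whose name
# embeds the topic name via the {topic} placeholder.
iceberg.tables.route-pattern=db.events_{topic}
```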
Re: [PR] Kafka Connect: Add config to route to tables using topic name [iceberg]
xiasongh commented on code in PR #11313: URL: https://github.com/apache/iceberg/pull/11313#discussion_r1798594599

## kafka-connect/kafka-connect/src/main/java/org/apache/iceberg/connect/data/SinkWriter.java: ## @@ -133,6 +145,20 @@ private String extractRouteValue(Object recordValue, String routeField) { return routeValue == null ? null : routeValue.toString(); } + private String formatRoutePattern(SinkRecord record, String routePattern) { +if (routePattern == null) { + return null; +} + +String topicName = record.topic(); +if (topicName == null) { + return null; +} + +// replace topic namespace separator +return routePattern.replace("{topic}", topicName.replace(".", "_"));

Review Comment: > topicName.replace(".", "_")

I use the AWS Glue catalog, which doesn't support nested namespaces. Topic names with more than one `.` would produce invalid table names, so one thing we can do is replace all the `.` characters. The Debezium JDBC sink connector [0] also does this, so it's not totally unheard of. It probably makes the most sense to turn this into its own config, maybe something like `iceberg.tables.route-pattern.namespace-separator`? Thoughts?

[0] https://debezium.io/documentation/reference/stable/connectors/jdbc.html#jdbc-property-table-naming-strategy
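A minimal sketch of the separator idea floated above, assuming the still-hypothetical `iceberg.tables.route-pattern.namespace-separator` property is wired through to this method; this is an illustration, not the PR's actual code:

```java
import org.apache.kafka.connect.sink.SinkRecord;

class RoutePatternSketch {
  // "separator" would come from the proposed (hypothetical)
  // iceberg.tables.route-pattern.namespace-separator config, e.g. "_".
  static String formatRoutePattern(SinkRecord record, String routePattern, String separator) {
    if (routePattern == null || record.topic() == null) {
      return null;
    }
    // Replace the topic namespace separator so catalogs without nested
    // namespaces (such as AWS Glue) still get a valid table name.
    return routePattern.replace("{topic}", record.topic().replace(".", separator));
  }
}
```

With `separator` set to `"_"`, a record from topic `cdc.public.users` and pattern `{topic}` would route to `cdc_public_users`.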
Re: [I] Kafka Connect: route to table using topic name [iceberg]
xiasongh commented on issue #11163: URL: https://github.com/apache/iceberg/issues/11163#issuecomment-2409142288 @bryanck Sorry for the delay; I was able to give this a shot. Please have a look. I'm not at all familiar with Java, so forgive me if I did anything silly.
[I] Flink: Add RowConverter for Iceberg Source [iceberg]
abharath9 opened a new issue, #11312: URL: https://github.com/apache/iceberg/issues/11312

### Feature Request / Improvement

Currently we can't create views on top of IcebergSource DataStreams directly; we need to convert the RowData to Row explicitly using a map function (sketched below). I thought creating a RowConverter to convert RowData to Row and return an IcebergSource would be a good idea. This approach enables the use of the Iceberg schema to create a DataStream and further create a view on top of it. What are your thoughts?

### Query engine

Flink

### Willingness to contribute

- [X] I can contribute this improvement/feature independently
- [ ] I would be willing to contribute this improvement/feature with guidance from the Iceberg community
- [ ] I cannot contribute this improvement/feature at this time
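For context, a sketch of today's workaround: an explicit map from RowData to Row for an assumed two-column schema (`id BIGINT`, `name STRING`). The proposed RowConverter would derive this conversion from the Iceberg schema instead; all names below are illustrative:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.flink.types.Row;

class RowDataToRowSketch {
  // Hand-written per-field conversion for an assumed schema; this is the
  // boilerplate a generic RowConverter would eliminate.
  static DataStream<Row> toRows(DataStream<RowData> source) {
    return source
        .map(rd -> Row.of(rd.getLong(0), rd.getString(1).toString()))
        .returns(Types.ROW(Types.LONG, Types.STRING));
  }
}
```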
Re: [I] Flink: Add RowConverter for Iceberg Source [iceberg]
abharath9 commented on issue #11312: URL: https://github.com/apache/iceberg/issues/11312#issuecomment-2409092087 Implemented this change and created a PR. Can I get a review on this? https://github.com/apache/iceberg/pull/11301
Re: [PR] build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.61.2 to 1.65.2 [iceberg-go]
dependabot[bot] commented on PR #166: URL: https://github.com/apache/iceberg-go/pull/166#issuecomment-2408876440 Superseded by #170.
Re: [PR] build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.61.2 to 1.65.2 [iceberg-go]
dependabot[bot] closed pull request #166: build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.61.2 to 1.65.2 URL: https://github.com/apache/iceberg-go/pull/166
[PR] build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.61.2 to 1.65.3 [iceberg-go]
dependabot[bot] opened a new pull request, #170: URL: https://github.com/apache/iceberg-go/pull/170

Bumps [github.com/aws/aws-sdk-go-v2/service/s3](https://github.com/aws/aws-sdk-go-v2) from 1.61.2 to 1.65.3.

Commits:
- [071b493](https://github.com/aws/aws-sdk-go-v2/commit/071b493afc547a04084be261af44ba204e97c612) Release 2024-10-11
- [c70d011](https://github.com/aws/aws-sdk-go-v2/commit/c70d0118c74a13c807b16b45fcbc8b82e061da30) Regenerated Clients
- [f98b7e1](https://github.com/aws/aws-sdk-go-v2/commit/f98b7e121460ce1c7e29f916c60d1f3f9f8895e8) Update API model
- [10c8fe2](https://github.com/aws/aws-sdk-go-v2/commit/10c8fe26fbe46b3abd5ee66d7ecbbabae4c95b46) Remove requirement of internal tool to check for version on AWS models (#2832)
- [28d943f](https://github.com/aws/aws-sdk-go-v2/commit/28d943f7f66c7095685c8d57dea18944fc3b5c22) S3 ReplicationRuleFilter and LifecycleRuleFilter shapes are being changed fro...
- [b34ecd4](https://github.com/aws/aws-sdk-go-v2/commit/b34ecd46bb2e14f2786934ef34ed7747c5fe89a8) Release 2024-10-10
- [ead7ba3](https://github.com/aws/aws-sdk-go-v2/commit/ead7ba38611404d5c32aa92c57cba8057b3cf8a0) Regenerated Clients
- [26c58a0](https://github.com/aws/aws-sdk-go-v2/commit/26c58a0c6f861e7be7d6439bceea77bec71fc97a) Update API model
- [bcff115](https://github.com/aws/aws-sdk-go-v2/commit/bcff11552060a39aa275ffda9714ecfb6e2572ab) Release 2024-10-09
- [5272445](https://github.com/aws/aws-sdk-go-v2/commit/527244530ff3208d45f9b34bde45fac2bd300476) Regenerated Clients

Additional commits viewable in the [compare view](https://github.com/aws/aws-sdk-go-v2/compare/service/s3/v1.61.2...service/s3/v1.65.3).

[Dependabot compatibility score](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Dependabot commands and options. You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
[PR] build(deps): bump github.com/aws/aws-sdk-go-v2/service/glue from 1.99.2 to 1.100.2 [iceberg-go]
dependabot[bot] opened a new pull request, #171: URL: https://github.com/apache/iceberg-go/pull/171

Bumps [github.com/aws/aws-sdk-go-v2/service/glue](https://github.com/aws/aws-sdk-go-v2) from 1.99.2 to 1.100.2.

Commits:
- [0cbb5aa](https://github.com/aws/aws-sdk-go-v2/commit/0cbb5aa17f9078cb45dc0e82d3e1d0abee3744a9) Release 2024-10-08
- [54c1dd6](https://github.com/aws/aws-sdk-go-v2/commit/54c1dd6c74185b0c7df78159ec4d5b2c27e9e280) Regenerated Clients
- [2cde144](https://github.com/aws/aws-sdk-go-v2/commit/2cde144eedda9f509141751c3011ca64a6b6528e) Update endpoints model
- [67fbd35](https://github.com/aws/aws-sdk-go-v2/commit/67fbd35762ef8694839df209714d2ec2c33d3df9) Update API model
- [aa04330](https://github.com/aws/aws-sdk-go-v2/commit/aa04330cb6978ccb6a7bb3e198b3fb21abbd6333) Allow non-nil but empty headers (#2826)
- [5a4e5bb](https://github.com/aws/aws-sdk-go-v2/commit/5a4e5bb42c08ff5a4e0e601a7461c8466565e44e) add feature tracking for cbor protocol (#2821)
- [183987c](https://github.com/aws/aws-sdk-go-v2/commit/183987cda0c2487a1b25c8e9cbf8dba510046c73) add annotations to deprecated services and introduce codegen integration for ...
- [b737dc9](https://github.com/aws/aws-sdk-go-v2/commit/b737dc9eb14847cd97d3b30ad6a1394efd73245b) Release 2024-10-07
- [7279a51](https://github.com/aws/aws-sdk-go-v2/commit/7279a51bbcd597f4aa7aeeb599c017d3d1679fb6) Regenerated Clients
- [a1b1f5a](https://github.com/aws/aws-sdk-go-v2/commit/a1b1f5a17c687371cc53c5dfbb2bf5ff467ff51a) Update endpoints model

Additional commits viewable in the [compare view](https://github.com/aws/aws-sdk-go-v2/compare/service/glue/v1.99.2...service/glue/v1.100.2).

[Dependabot compatibility score](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. The standard Dependabot commands and options listed under PR #170 above apply here as well.
[PR] docs:README uses iceberg-rust instead of we [iceberg-rust]
caicancai opened a new pull request, #667: URL: https://github.com/apache/iceberg-rust/pull/667 (no comment)
Re: [PR] Api, Spark: Make StrictMetricsEvaluator not fail on nested column predicates [iceberg]
zhongyujiang commented on PR #11261: URL: https://github.com/apache/iceberg/pull/11261#issuecomment-2408925671 @amogh-jahagirdar Thanks for reviewing, tests updated.
[PR] chore(deps): Bump crate-ci/typos from 1.25.0 to 1.26.0 [iceberg-rust]
dependabot[bot] opened a new pull request, #668: URL: https://github.com/apache/iceberg-rust/pull/668

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.25.0 to 1.26.0.

Release notes (sourced from [crate-ci/typos's releases](https://github.com/crate-ci/typos/releases) and changelog), v1.26.0 - 2024-10-07:
- Compatibility (pre-commit): requires 3.2+
- Fixes (pre-commit): resolve deprecations in 4.0 about deprecated stage names

Commits:
- [6802cc6](https://github.com/crate-ci/typos/commit/6802cc60d4e7f78b9d5454f6cf3935c042d5e1e3) chore: Release
- [caa5502](https://github.com/crate-ci/typos/commit/caa55026aee3d2cdcaf1f9b0c258651dbb01c283) docs: Update changelog
- [2114c19](https://github.com/crate-ci/typos/commit/2114c1924169510820bc12e59427851514624ac2) Merge pull request #1114 from tobiasraabe/patch-1
- [9de7b2c](https://github.com/crate-ci/typos/commit/9de7b2c6bed6e32c6b34ed91702ac6eaba138a99) Updates stage names in .pre-commit-hooks.yaml.
- [14f49f4](https://github.com/crate-ci/typos/commit/14f49f455cf3b6a38841665e82c3b9135b91c929) Merge pull request #1105 from crate-ci/renovate/unicode-width-0.x
- [58ffa4b](https://github.com/crate-ci/typos/commit/58ffa4baefb10b607bbc30bd16f7fe8a4446a643) Merge pull request #1108 from crate-ci/renovate/stable-1.x
- [003cb76](https://github.com/crate-ci/typos/commit/003cb769377a25a6c659c67429585644c5321348) chore(deps): Update dependency STABLE to v1.81.0
- [bc00184](https://github.com/crate-ci/typos/commit/bc00184a2367b7d354946b042483a30d92e012e9) chore(deps): Update Rust crate unicode-width to 0.2.0

See full diff in the [compare view](https://github.com/crate-ci/typos/compare/v1.25.0...v1.26.0).

[Dependabot compatibility score](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. The standard Dependabot commands and options listed under PR #170 above apply here as well.
Re: [PR] [WIP] Core: Prototype for DVs in V3 [iceberg]
aokolnychyi commented on code in PR #11302: URL: https://github.com/apache/iceberg/pull/11302#discussion_r1798616398 ## api/src/main/java/org/apache/iceberg/DataFile.java: ## @@ -98,12 +98,23 @@ public interface DataFile extends ContentFile { Types.NestedField SORT_ORDER_ID = optional(140, "sort_order_id", IntegerType.get(), "Sort order ID"); Types.NestedField SPEC_ID = optional(141, "spec_id", IntegerType.get(), "Partition spec ID"); + Types.NestedField REFERENCED_DATA_FILE = Review Comment: Follows the proposed spec, reserving 142 for row lineage.
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1798617437 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at Review Comment: True, we may keep this generic for referencing manifests in the future.
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1798622067 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64`

Review Comment: While I don't think the magic number check is critical, I do believe it is beneficial. If things start to fail, we would want to have as much helpful information as possible. Having the magic number allows us to cross-check the serialization format without reading the footer and may help debug issues with offsets. It will also be useful if we add more serialization formats in the future. I agree it is unlikely we will be able to successfully deserialize the rest of the content if the offset is invalid, but still: if we end up in that situation, it would mean there was an ugly bug, and having more metadata will only help. Overall, it does seem like a reasonable sanity check to me, similar to the magic numbers in zstd and gzip.

We once had to debug issues with bit flips while reading manifests. There was no easy way to prove that we hadn't corrupted the files and that the cause was a faulty disk. The CRC check would catch those and save a ton of time. I'd propose keeping the magic number and CRC independently of whether we decide to follow the Delta Lake blob layout.

The length and byte orders are more controversial: there is no merit in those beyond compatibility with Delta Lake.
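To make the trade-off concrete, here is a minimal reader-side sketch of the sanity checks discussed above, assuming the blob layout proposed in this PR (a big-endian length covering magic plus vector, the `D1 D3 39 64` magic, the serialized vector, then a big-endian CRC-32 of the vector bytes). It is only an illustration, not the reference implementation:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

class DeleteVectorBlobCheck {
  // Magic sequence from the delete-vector-v1 draft: D1 D3 39 64
  private static final byte[] MAGIC = {(byte) 0xD1, (byte) 0xD3, 0x39, 0x64};

  // Returns the serialized vector bytes after verifying the framing.
  static byte[] readVector(byte[] blob) {
    ByteBuffer buf = ByteBuffer.wrap(blob); // ByteBuffer defaults to big-endian
    int len = buf.getInt(); // length of magic + vector

    byte[] magic = new byte[MAGIC.length];
    buf.get(magic);
    if (!Arrays.equals(magic, MAGIC)) {
      // A bad magic sequence usually means the blob offset was wrong
      throw new IllegalStateException("Invalid delete vector magic bytes");
    }

    byte[] vector = new byte[len - MAGIC.length];
    buf.get(vector);

    CRC32 crc = new CRC32();
    crc.update(vector);
    if ((int) crc.getValue() != buf.getInt()) {
      // Catches corruption such as the bit flips mentioned above
      throw new IllegalStateException("Delete vector CRC-32 mismatch");
    }
    return vector;
  }
}
```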
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1798623532 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` +* The vector, serialized as described below +* A CRC-32 checksum of the serialized vector as 4 bytes, big-endian + +The position vector is serialized using the Roaring bitmap +["portable" format][roaring-bitmap-portable-serialization]. This representation +consists of: + +* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian +* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys: +- The key stored as 4 bytes, little-endian +- A [32-bit Roaring bitmap][roaring-bitmap-general-layout] + +Note that the length and CRC fields are stored using big-endian, but the +Roaring bitmap format uses little-endian values. Big endian values were chosen +for compatibility with existing deletion vectors. + +The blob metadata must include the following properties: + +* `referenced-data-file`: location of the data file the delete vector applies Review Comment: The cardinality is part of making these delete files self-describing and is up for discussion. I can imagine some maintenance operations compacting DV files without touching the rest of the metadata (I speculate).
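As a small illustration of the self-describing metadata in question, the two required blob properties from the draft could be assembled as below; the data file location is a placeholder, and the bitmap type assumes the org.roaringbitmap library rather than anything in this PR:

```java
import java.util.Map;
import org.roaringbitmap.longlong.Roaring64Bitmap;

class BlobMetadataSketch {
  // Property names come from the spec draft quoted above; values are illustrative.
  static Map<String, String> blobProperties(Roaring64Bitmap positions, String dataFileLocation) {
    return Map.of(
        "referenced-data-file", dataFileLocation,
        "cardinality", Long.toString(positions.getLongCardinality()));
  }
}
```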
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1798622661 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` +* The vector, serialized as described below +* A CRC-32 checksum of the serialized vector as 4 bytes, big-endian Review Comment: I think a good question to ask the community is how many vendors/engines would be interested in potentially reusing the code if they support both Iceberg and Delta. Delta DVs are widely used at Databricks, but it is hard to tell about other engines.
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1798624051 ## format/puffin-spec.md: ## @@ -123,6 +123,44 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector that represents the positions of rows in a file that +are deleted. A set bit at position P indicates that the row at position P is +deleted. + +The bitmap supports positive 64-bit positions, but is optimized for cases where +most positions fit in 32 bits by using a collection of 32-bit Roaring bitmaps. +64-bit positions are divided into a 32-bit "key" using the most significant 4 +bytes and a 32-bit position using the least significant 4 bytes. For each key +in the set of positions, a 32-bit Roaring bitmap is maintained to store a set +of 32-bit positions for that key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes are +tested for inclusion in the bitmap. If a bitmap is not found for the key, then +it is not set. + +The serialized blob starts with a 4-byte magic sequence, `D1D33964` (1681511377 +stored as 4 bytes, little-endian). Following the magic bytes is the serialized +collection of bitmaps. The collection is stored using the Roaring bitmap +["portable" format][roaring-bitmap-portable-serialization]. This representation +consists of: + +* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian +* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys: +- The key stored as 4 bytes, little-endian +- A [32-bit Roaring bitmap][roaring-bitmap-general-layout] + +The blob metadata must include the following properties: + +* `referenced-data-file`: location of the data file the delete vector applies to +* `cardinality`: number of deleted rows (set positions) in the delete vector Review Comment: I am +1 for exploring what it would take to make those fields optional. My opinion would depend on the amount of work needed.
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on PR #11238: URL: https://github.com/apache/iceberg/pull/11238#issuecomment-2409353898 PR #11302 contains a sample implementation of this spec.
Re: [PR] API, Core: Add scan planning apis to REST Catalog [iceberg]
singhpk234 commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1795908515 ## core/src/main/java/org/apache/iceberg/rest/RESTContentFileParser.java: ## @@ -0,0 +1,250 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.rest; + +import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.databind.JsonNode; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import java.util.Map; +import org.apache.iceberg.ContentFile; +import org.apache.iceberg.DataFile; +import org.apache.iceberg.FileContent; +import org.apache.iceberg.FileFormat; +import org.apache.iceberg.GenericDataFile; +import org.apache.iceberg.GenericDeleteFile; +import org.apache.iceberg.Metrics; +import org.apache.iceberg.PartitionData; +import org.apache.iceberg.SingleValueParser; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.util.JsonUtil; + +public class RESTContentFileParser { + private static final String SPEC_ID = "spec-id"; + private static final String CONTENT = "content"; + private static final String FILE_PATH = "file-path"; + private static final String FILE_FORMAT = "file-format"; + private static final String PARTITION = "partition"; + private static final String RECORD_COUNT = "record-count"; + private static final String FILE_SIZE_IN_BYTES = "file-size-in-bytes"; + private static final String COLUMN_SIZES = "column-sizes"; + private static final String VALUE_COUNTS = "value-counts"; + private static final String NULL_VALUE_COUNTS = "null-value-counts"; + private static final String NAN_VALUE_COUNTS = "nan-value-counts"; + private static final String LOWER_BOUNDS = "lower-bounds"; + private static final String UPPER_BOUNDS = "upper-bounds"; + private static final String KEY_METADATA = "key-metadata"; + private static final String SPLIT_OFFSETS = "split-offsets"; + private static final String EQUALITY_IDS = "equality-ids"; + private static final String SORT_ORDER_ID = "sort-order-id"; + + private RESTContentFileParser() {} + + public static String toJson(ContentFile contentFile) { +return JsonUtil.generate( +generator -> RESTContentFileParser.toJson(contentFile, generator), false); + } + + public static void toJson(ContentFile contentFile, JsonGenerator generator) + throws IOException { +Preconditions.checkArgument(contentFile != null, "Invalid content file: null"); +Preconditions.checkArgument(generator != null, "Invalid JSON generator: null"); + +generator.writeStartObject(); + +generator.writeNumberField(SPEC_ID, contentFile.specId()); +generator.writeStringField(CONTENT, contentFile.content().name()); +generator.writeStringField(FILE_PATH, contentFile.path().toString()); 
+generator.writeStringField(FILE_FORMAT, contentFile.format().name()); + +generator.writeFieldName(PARTITION); + +// TODO at the time of serialization we dont have the partition spec we just have spec id. +// we will need to get the spec from table metadata using spec id. +// or we will need to send parition spec, put null here for now until refresh +SingleValueParser.toJson(null, contentFile.partition(), generator); + +generator.writeNumberField(FILE_SIZE_IN_BYTES, contentFile.fileSizeInBytes()); + +metricsToJson(contentFile, generator); + +if (contentFile.keyMetadata() != null) { + generator.writeFieldName(KEY_METADATA); + SingleValueParser.toJson(DataFile.KEY_METADATA.type(), contentFile.keyMetadata(), generator); +} + +if (contentFile.splitOffsets() != null) { + JsonUtil.writeLongArray(SPLIT_OFFSETS, contentFile.splitOffsets(), generator); +} + +if (contentFile.equalityFieldIds() != null) { + JsonUtil.writeIntegerArray(EQUALITY_IDS, contentFile.equalityFieldIds(), generator); +} + +if (contentFile.sortOrderId() != null) { + generator.writeNumberField(SORT_ORDER_ID, contentFile.sortOrderId()); +} + +generator.writeEndObject(); + } + + public static ContentFile fromJson(JsonNode jsonNode) { +Preconditions.checkArgument(jsonNode != null, "Inv
Re: [PR] API, Core: Add scan planning apis to REST Catalog [iceberg]
singhpk234 commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1798632724 ## core/src/main/java/org/apache/iceberg/rest/RESTContentFileParser.java: ## @@ -0,0 +1,250 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.rest; + +import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.databind.JsonNode; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import java.util.Map; +import org.apache.iceberg.ContentFile; +import org.apache.iceberg.DataFile; +import org.apache.iceberg.FileContent; +import org.apache.iceberg.FileFormat; +import org.apache.iceberg.GenericDataFile; +import org.apache.iceberg.GenericDeleteFile; +import org.apache.iceberg.Metrics; +import org.apache.iceberg.PartitionData; +import org.apache.iceberg.SingleValueParser; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.util.JsonUtil; + +public class RESTContentFileParser { + private static final String SPEC_ID = "spec-id"; + private static final String CONTENT = "content"; + private static final String FILE_PATH = "file-path"; + private static final String FILE_FORMAT = "file-format"; + private static final String PARTITION = "partition"; + private static final String RECORD_COUNT = "record-count"; + private static final String FILE_SIZE_IN_BYTES = "file-size-in-bytes"; + private static final String COLUMN_SIZES = "column-sizes"; + private static final String VALUE_COUNTS = "value-counts"; + private static final String NULL_VALUE_COUNTS = "null-value-counts"; + private static final String NAN_VALUE_COUNTS = "nan-value-counts"; + private static final String LOWER_BOUNDS = "lower-bounds"; + private static final String UPPER_BOUNDS = "upper-bounds"; + private static final String KEY_METADATA = "key-metadata"; + private static final String SPLIT_OFFSETS = "split-offsets"; + private static final String EQUALITY_IDS = "equality-ids"; + private static final String SORT_ORDER_ID = "sort-order-id"; + + private RESTContentFileParser() {} + + public static String toJson(ContentFile contentFile) { +return JsonUtil.generate( +generator -> RESTContentFileParser.toJson(contentFile, generator), false); + } + + public static void toJson(ContentFile contentFile, JsonGenerator generator) + throws IOException { +Preconditions.checkArgument(contentFile != null, "Invalid content file: null"); +Preconditions.checkArgument(generator != null, "Invalid JSON generator: null"); + +generator.writeStartObject(); + +generator.writeNumberField(SPEC_ID, contentFile.specId()); +generator.writeStringField(CONTENT, contentFile.content().name()); +generator.writeStringField(FILE_PATH, contentFile.path().toString()); 
+generator.writeStringField(FILE_FORMAT, contentFile.format().name()); + +generator.writeFieldName(PARTITION); + +// TODO at the time of serialization we dont have the partition spec we just have spec id. +// we will need to get the spec from table metadata using spec id. +// or we will need to send parition spec, put null here for now until refresh +SingleValueParser.toJson(null, contentFile.partition(), generator); + +generator.writeNumberField(FILE_SIZE_IN_BYTES, contentFile.fileSizeInBytes()); + +metricsToJson(contentFile, generator); + +if (contentFile.keyMetadata() != null) { + generator.writeFieldName(KEY_METADATA); + SingleValueParser.toJson(DataFile.KEY_METADATA.type(), contentFile.keyMetadata(), generator); +} + +if (contentFile.splitOffsets() != null) { + JsonUtil.writeLongArray(SPLIT_OFFSETS, contentFile.splitOffsets(), generator); +} + +if (contentFile.equalityFieldIds() != null) { + JsonUtil.writeIntegerArray(EQUALITY_IDS, contentFile.equalityFieldIds(), generator); +} + +if (contentFile.sortOrderId() != null) { + generator.writeNumberField(SORT_ORDER_ID, contentFile.sortOrderId()); +} + +generator.writeEndObject(); + } + + public static ContentFile fromJson(JsonNode jsonNode) { +Preconditions.checkArgument(jsonNode != null, "Inv
Re: [PR] Spark: Merge new position deletes with old deletes during writing [iceberg]
amogh-jahagirdar commented on code in PR #11273: URL: https://github.com/apache/iceberg/pull/11273#discussion_r1798634769 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java: ## @@ -185,6 +196,7 @@ public void commit(WriterCommitMessage[] messages) { int addedDataFilesCount = 0; int addedDeleteFilesCount = 0; + int removedDeleteFilesCount = 0; Review Comment: Good point, it's not being used, but I'd rather include it since it will be useful, especially as this is a new thing we're adding. I'll update the logs.
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
emkornfield commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1798450250 ## format/spec.md: ## @@ -841,19 +855,45 @@ Notes: ## Delete Formats -This section details how to encode row-level deletes in Iceberg delete files. Row-level deletes are not supported in v1. +This section details how to encode row-level deletes in Iceberg delete files. Row-level deletes are added by v2 and are not supported in v1. Deletion vectors are added in v3 and are not supported in v2 or earlier. Position delete files must not be added to v3 tables, but existing position delete files are valid. + +There are three types of row-level deletes: +* Deletion vectors (DVs) identify deleted rows within a single referenced data file by position in a bitmap +* Position delete files identify deleted rows by file location and row position (**deprecated**) +* Equality delete files identify deleted rows by the value of one or more columns + +Deletion vectors are a binary representation of deletes for a single data file that is more efficient at execution time than position delete files. Unlike equality or position delete files, there can be at most one deletion vector for a given data file in a table. Writers must ensure that there is at most one deletion vector per data file and must merge new deletes with existing vectors or position delete files. + +Row-level delete files (both equality and position delete files) are valid Iceberg data files: files must use valid Iceberg formats, schemas, and column projection. It is recommended that these delete files are written using the table's default file format. + +Row-level delete files and deletion vectors are tracked by manifests. A separate set of manifests is used for delete files and DVs, but the same manifest schema is used for both data and delete manifests. Deletion vectors are tracked individually by file location, offset, and length within the containing file. Deletion vector metadata must include the referenced data file. + +Both position and equality delete files allow encoding deleted row values with a delete. This can be used to reconstruct a stream of changes to a table. + -Row-level delete files are valid Iceberg data files: files must use valid Iceberg formats, schemas, and column projection. It is recommended that delete files are written using the table's default file format. +### Deletion Vectors -Row-level delete files are tracked by manifests, like data files. A separate set of manifests is used for delete files, but the manifest schemas are identical. +Deletion vectors identify deleted rows of a file by encoding deleted positions in a bitmap. A set bit at position P indicates that the row at position P is deleted. -Both position and equality deletes allow encoding deleted row values with a delete. This can be used to reconstruct a stream of changes to a table. +These vectors are stored using the `delete-vector-v1` blob definition from the [Puffin spec][puffin-spec]. +Deletion vectors support positive 64-bit positions, but are optimized for cases where most positions fit in 32 bits by using a collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a 32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using the least significant 4 bytes. For each key in the set of positions, a 32-bit Roaring bitmap is maintained to store a set of 32-bit sub-positions for that key. 
+To test whether a certain position is set, its most significant 4 bytes (the key) are used to find a 32-bit bitmap and the least significant 4 bytes (the sub-position) are tested for inclusion in the bitmap. If a bitmap is not found for the key, then it is not set. + +Delete manifests track deletion vectors individually by the containing file location (`file_path`), starting offset of the DV magic bytes (`blob_offset`), and total length of the deletion vector blob (`blob_size_in_bytes`). Multiple deletion vectors can be stored in the same file. There are no restrictions on the data files that can be referenced by deletion vectors in the same Puffin file. + +At most one deletion vector is allowed per data file in a table. If a DV is written for a data file, it must replace all previously written position delete files so that when a DV is present, readers can safely ignore matching position delete files. + + +[puffin-spec]: https://iceberg.apache.org/puffin-spec/ ### Position Delete Files Position-based delete files identify deleted rows by file and position in one or more data files, and may optionally contain the deleted row. +_Note: Position delete files are **deprecated** in v3. Existing position deletes must be written to delete vectors when updating the position deletes for a data file._ Review Comment: Is there a technical reason to force deprecation here?
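A minimal sketch of the key/sub-position split described in the quoted spec text, using the org.roaringbitmap library; the actual implementation is in PR #11302, so treat this only as an illustration of the lookup logic:

```java
import java.util.Map;
import java.util.TreeMap;
import org.roaringbitmap.RoaringBitmap;

class PositionVectorSketch {
  // 32-bit key -> bitmap of 32-bit sub-positions, kept in unsigned key order
  // as the "portable" serialization described above requires.
  private final Map<Integer, RoaringBitmap> bitmaps = new TreeMap<>(Integer::compareUnsigned);

  void set(long pos) {
    int key = (int) (pos >>> 32); // most significant 4 bytes
    int subPos = (int) pos;       // least significant 4 bytes
    bitmaps.computeIfAbsent(key, k -> new RoaringBitmap()).add(subPos);
  }

  boolean isSet(long pos) {
    RoaringBitmap bitmap = bitmaps.get((int) (pos >>> 32));
    return bitmap != null && bitmap.contains((int) pos);
  }
}
```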
Re: [I] manifest exception [iceberg]
github-actions[bot] closed issue #8994: manifest exception URL: https://github.com/apache/iceberg/issues/8994
Re: [I] Iceberg Glue - Timeouts (maybe others client side error cases) can result in missing metadata_location [iceberg]
github-actions[bot] commented on issue #9618: URL: https://github.com/apache/iceberg/issues/9618#issuecomment-2409443905 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [I] Migrate RESTCatalogServlet to use jakarta.* package for Spring boot 3 [iceberg]
github-actions[bot] commented on issue #9626: URL: https://github.com/apache/iceberg/issues/9626#issuecomment-2409443950 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [PR] Flink: Optionally Overwrite All Partitions [iceberg]
github-actions[bot] commented on PR #9644: URL: https://github.com/apache/iceberg/pull/9644#issuecomment-2409443994 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [PR] Spark 3.5: Add a procedure to remove corrupt snapshots [iceberg]
github-actions[bot] commented on PR #9645: URL: https://github.com/apache/iceberg/pull/9645#issuecomment-2409444041 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [I] manifest exception [iceberg]
github-actions[bot] commented on issue #8994: URL: https://github.com/apache/iceberg/issues/8994#issuecomment-2409443156 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Re: [I] I can't find any detailed explanation about column metric options on the official docs for Iceberg configuration [iceberg]
github-actions[bot] commented on issue #8995: URL: https://github.com/apache/iceberg/issues/8995#issuecomment-2409443202 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Re: [I] Doc Bug: Iceberg Flink Example uses unsupported UNIQUE constraint [iceberg]
github-actions[bot] commented on issue #8997: URL: https://github.com/apache/iceberg/issues/8997#issuecomment-2409443247 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Re: [I] Doc Bug: Iceberg Flink Example uses unsupported UNIQUE constraint [iceberg]
github-actions[bot] closed issue #8997: Doc Bug: Iceberg Flink Example uses unsupported UNIQUE constraint URL: https://github.com/apache/iceberg/issues/8997
Re: [I] I can't find any detailed explanation about column metric options on the official docs for Iceberg configuration [iceberg]
github-actions[bot] closed issue #8995: I can't find any detailed explanation about column metric options on the official docs for Iceberg configuration URL: https://github.com/apache/iceberg/issues/8995
Re: [PR] API, Core: Add scan planning apis to REST Catalog [iceberg]
singhpk234 commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1798637028 ## core/src/main/java/org/apache/iceberg/rest/RESTFileScanTaskParser.java: ## @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.rest; + +import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.databind.JsonNode; +import java.io.IOException; +import java.util.List; +import org.apache.iceberg.BaseFileScanTask; +import org.apache.iceberg.DataFile; +import org.apache.iceberg.DeleteFile; +import org.apache.iceberg.FileScanTask; +import org.apache.iceberg.expressions.Expression; +import org.apache.iceberg.expressions.ExpressionParser; +import org.apache.iceberg.expressions.Expressions; +import org.apache.iceberg.expressions.ResidualEvaluator; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; + +public class RESTFileScanTaskParser { + private static final String DATA_FILE = "data-file"; + private static final String DELETE_FILE_REFERENCES = "delete-file-references"; + private static final String RESIDUAL = "residual-filter"; + + private RESTFileScanTaskParser() {} + + public static void toJson( + FileScanTask fileScanTask, List deleteFiles, JsonGenerator generator) + throws IOException { +Preconditions.checkArgument(fileScanTask != null, "Invalid file scan task: null"); +Preconditions.checkArgument(generator != null, "Invalid JSON generator: null"); + +generator.writeStartObject(); +generator.writeFieldName(DATA_FILE); +RESTContentFileParser.toJson(fileScanTask.file(), generator); + +// TODO revisit this logic +if (deleteFiles != null) { + generator.writeArrayFieldStart(DELETE_FILE_REFERENCES); + for (int delIndex = 0; delIndex < deleteFiles.size(); delIndex++) { +generator.writeNumber(delIndex); + } + generator.writeEndArray(); +} +if (fileScanTask.residual() != null) { + generator.writeFieldName(RESIDUAL); + ExpressionParser.toJson(fileScanTask.residual(), generator); +} +generator.writeEndObject(); + } + + public static FileScanTask fromJson(JsonNode jsonNode, List allDeleteFiles) { +Preconditions.checkArgument(jsonNode != null, "Invalid JSON node for file scan task: null"); +Preconditions.checkArgument( +jsonNode.isObject(), "Invalid JSON node for file scan task: non-object (%s)", jsonNode); + +DataFile dataFile = (DataFile) RESTContentFileParser.fromJson(jsonNode.get(DATA_FILE)); + +DeleteFile[] matchedDeleteFiles = null; +List deleteFileReferences = null; +if (jsonNode.has(DELETE_FILE_REFERENCES)) { + ImmutableList.Builder deleteFileReferencesBuilder = ImmutableList.builder(); + JsonNode deletesArray = jsonNode.get(DELETE_FILE_REFERENCES); + for 
(JsonNode deleteRef : deletesArray) { +deleteFileReferencesBuilder.add(deleteRef); + } + deleteFileReferences = deleteFileReferencesBuilder.build(); +} + +if (deleteFileReferences != null) { + ImmutableList.Builder matchedDeleteFilesBuilder = ImmutableList.builder(); + for (Integer deleteFileIdx : deleteFileReferences) { +matchedDeleteFilesBuilder.add(allDeleteFiles.get(deleteFileIdx)); + } + matchedDeleteFiles = (DeleteFile[]) matchedDeleteFilesBuilder.build().stream().toArray(); +} + +// TODO revisit this in spec +Expression filter = Expressions.alwaysTrue(); +if (jsonNode.has(RESIDUAL)) { + filter = ExpressionParser.fromJson(jsonNode.get(RESIDUAL)); +} + +ResidualEvaluator residualEvaluator = ResidualEvaluator.of(filter); + +// TODO at the time of creation we dont have the schemaString and specString so can we avoid +// setting this +// will need to refresh before returning closed iterable of tasks, for now put place holder null +BaseFileScanTask baseFileScanTask = +new BaseFileScanTask(dataFile, matchedDeleteFiles, null, null, residualEvaluator); Review Comment: [doubt] These fileScanTasks can belong to diff snapshots (let's say we
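For readers skimming the quoted parser: the format under discussion writes each task's delete files as integer indices into a single plan-level delete-file list rather than repeating full delete-file metadata per task. A rough sketch of that indirection, mirroring the `fromJson` path above (`String` stands in for `DeleteFile`, and both method names are invented):

```java
import java.util.ArrayList;
import java.util.List;

public class DeleteFileReferenceSketch {
  // Encode a task's delete files as indices into the shared plan-level list.
  static List<Integer> encodeReferences(List<String> taskDeletes, List<String> allDeletes) {
    List<Integer> refs = new ArrayList<>();
    for (String deleteFile : taskDeletes) {
      refs.add(allDeletes.indexOf(deleteFile)); // position in the shared list
    }
    return refs;
  }

  // Decode indices back into delete files using the same shared list.
  static List<String> resolveReferences(List<Integer> refs, List<String> allDeletes) {
    List<String> resolved = new ArrayList<>();
    for (int idx : refs) {
      resolved.add(allDeletes.get(idx));
    }
    return resolved;
  }
}
```

Note that the quoted `toJson` currently writes the indices 0..n-1 of the task's own list and is flagged with a `// TODO revisit this logic` comment, so the exact encoding may still change before the PR merges.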
Re: [PR] API, Core: Add scan planning apis to REST Catalog [iceberg]
amogh-jahagirdar commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1798635808

## core/src/main/java/org/apache/iceberg/RESTTable.java:

@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.Map;
+import java.util.function.Supplier;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.metrics.MetricsReporter;
+import org.apache.iceberg.rest.RESTClient;
+import org.apache.iceberg.rest.ResourcePaths;
+
+public class RESTTable extends BaseTable {

Review Comment: Sounds good, I can understand the appeal of the `RESTTable` concept, especially with being able to override the operation implementations. I'm not very against it; I'm just trying to avoid any unnecessary public classes being exposed if there's some way we can handle the redirection to the REST implementation internally for those things you mentioned. I'd say let's stick with what you have for now; as we write tests and see more integration code with engines like Spark, we can determine the right pattern.
Re: [I] Can we make commits inside compaction jobs with partial-progress.enabled sequential to avoid CommitFailedException? [iceberg]
github-actions[bot] commented on issue #9687: URL: https://github.com/apache/iceberg/issues/9687#issuecomment-2409444642 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [I] java.lang.IllegalArgumentException: requirement failed: length (-6235972) cannot be smaller than -1 [iceberg]
github-actions[bot] commented on issue #9689: URL: https://github.com/apache/iceberg/issues/9689#issuecomment-2409444689 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [PR] Core: Pass input file into iterators to get the file name [iceberg]
github-actions[bot] commented on PR #9691: URL: https://github.com/apache/iceberg/pull/9691#issuecomment-2409444741 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [I] Docs: Go over docs to check rendering of pages/sections [iceberg]
github-actions[bot] commented on issue #9657: URL: https://github.com/apache/iceberg/issues/9657#issuecomment-2409444136 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [I] Docs: Add Mandarin translation of the docs site [iceberg]
github-actions[bot] commented on issue #9665: URL: https://github.com/apache/iceberg/issues/9665#issuecomment-2409444237 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [PR] Website: Add release schedule on the releases page [iceberg]
github-actions[bot] commented on PR #9666: URL: https://github.com/apache/iceberg/pull/9666#issuecomment-2409444288 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [I] Operations on partition columns in `WHERE` clause not used in pruning [iceberg]
github-actions[bot] commented on issue #9678: URL: https://github.com/apache/iceberg/issues/9678#issuecomment-240952 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [PR] Support for pushdown like filter (endsWith and contains) [iceberg]
github-actions[bot] commented on PR #9683: URL: https://github.com/apache/iceberg/pull/9683#issuecomment-2409444593 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [PR] Build: Bump junit from 5.10.1 to 5.10.2 [iceberg]
github-actions[bot] commented on PR #9699: URL: https://github.com/apache/iceberg/pull/9699#issuecomment-2409444871 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [I] HMS lock timeout [iceberg]
github-actions[bot] commented on issue #9654: URL: https://github.com/apache/iceberg/issues/9654#issuecomment-2409444091 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [PR] HIVE-28021: escape percent symbol [iceberg]
github-actions[bot] commented on PR #9667: URL: https://github.com/apache/iceberg/pull/9667#issuecomment-2409444347 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [I] Iceberg Rewrite DataFiles unmanageable behavior [iceberg]
github-actions[bot] commented on issue #9674: URL: https://github.com/apache/iceberg/issues/9674#issuecomment-240901 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [PR] Flink: Made IcebergFilesCommitter work with single phase commit [iceberg]
github-actions[bot] commented on PR #9694: URL: https://github.com/apache/iceberg/pull/9694#issuecomment-2409444787 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [PR] Docs: Deprecate distinct_counts since it is no longer used in codebase [iceberg]
github-actions[bot] commented on PR #9680: URL: https://github.com/apache/iceberg/pull/9680#issuecomment-240998 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [PR] Fix header links with underscores in title [iceberg]
github-actions[bot] commented on PR #9697: URL: https://github.com/apache/iceberg/pull/9697#issuecomment-2409444828 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [PR] API, Core: Add scan planning apis to REST Catalog [iceberg]
amogh-jahagirdar commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1798639774

## core/src/main/java/org/apache/iceberg/RESTPlanningMode.java:

@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.Locale;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+
+public enum RESTPlanningMode {
+  REQUIRED("required"),
+  SUPPORTED("supported"),
+  UNSUPPORTED("unsupported");

Review Comment: Sounds good. After some more thought, my 2c is that I'd rather we try to get the model + client side changes into 1.7 rather than expand the scope with this aspect, since it's quite useful without these things defined. I'd rather not have client side changes depend on another spec change decision. For now, I think keeping it simple with a catalog client-side property controlling whether server-side planning is performed is the way forward. Once the model and client side changes are in, I think it'd make total sense to revisit the spec changes you mentioned and these specific planning mode changes. cc @rdblue @danielcweeks for their thoughts.
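To make the "catalog client-side property" idea concrete, it would presumably surface as an ordinary entry in the catalog properties map. The key `rest.planning-mode` and its values below are placeholders invented for illustration; nothing here is a committed Iceberg API.

```java
import java.util.Map;

public class PlanningModeSketch {
  public static void main(String[] args) {
    // Hypothetical client-side configuration; the property name and values are
    // placeholders, since the PR discussion had not finalized them.
    Map<String, String> catalogProps = Map.of(
        "uri", "https://example.com/iceberg",  // example REST catalog endpoint
        "rest.planning-mode", "supported");    // required | supported | unsupported

    boolean serverSidePlanning =
        !"unsupported".equals(catalogProps.getOrDefault("rest.planning-mode", "unsupported"));
    System.out.println("Use server-side scan planning: " + serverSidePlanning);
  }
}
```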
Re: [PR] Spark: Merge new position deletes with old deletes during writing [iceberg]
amogh-jahagirdar commented on code in PR #11273: URL: https://github.com/apache/iceberg/pull/11273#discussion_r1798640841

## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchQueryScan.java:

@@ -158,6 +163,26 @@ public void filter(Predicate[] predicates) { } }
+  protected Map dataToFileScopedDeletes() {
+    if (dataToFileScopedDeletes == null) {
+      dataToFileScopedDeletes = Maps.newHashMap();
+      for (ScanTask task : tasks()) {
+        FileScanTask fileScanTask = task.asFileScanTask();
+        List fileScopedDeletes =
+            fileScanTask.deletes().stream()
+                .filter(file -> ContentFileUtil.referencedDataFileLocation(file) != null)

Review Comment: Agreed, added an `isFileScopedDelete` API to `ContentFileUtil`!
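For context on the review point: a delete file is "file scoped" when it applies to exactly one data file, which the quoted diff detects by checking `ContentFileUtil.referencedDataFileLocation(file) != null`. Below is a minimal sketch of the predicate being centralized; the nested `DeleteFile` record is a stand-in for the real `org.apache.iceberg.DeleteFile` type, and the class name is invented.

```java
import java.util.List;
import java.util.stream.Collectors;

public class FileScopedDeleteSketch {
  // Stand-in for org.apache.iceberg.DeleteFile: a file-scoped delete records
  // the location of the single data file it applies to, otherwise null.
  record DeleteFile(String path, String referencedDataFileLocation) {}

  static boolean isFileScopedDelete(DeleteFile file) {
    return file.referencedDataFileLocation() != null;
  }

  static List<DeleteFile> fileScopedOnly(List<DeleteFile> deletes) {
    return deletes.stream()
        .filter(FileScopedDeleteSketch::isFileScopedDelete)
        .collect(Collectors.toList());
  }
}
```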
Re: [PR] API, Core: Add scan planning apis to REST Catalog [iceberg]
rahil-c commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1798718031 ## core/src/main/java/org/apache/iceberg/rest/RESTContentFileParser.java: ## @@ -0,0 +1,250 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.rest; + +import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.databind.JsonNode; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import java.util.Map; +import org.apache.iceberg.ContentFile; +import org.apache.iceberg.DataFile; +import org.apache.iceberg.FileContent; +import org.apache.iceberg.FileFormat; +import org.apache.iceberg.GenericDataFile; +import org.apache.iceberg.GenericDeleteFile; +import org.apache.iceberg.Metrics; +import org.apache.iceberg.PartitionData; +import org.apache.iceberg.SingleValueParser; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.util.JsonUtil; + +public class RESTContentFileParser { + private static final String SPEC_ID = "spec-id"; + private static final String CONTENT = "content"; + private static final String FILE_PATH = "file-path"; + private static final String FILE_FORMAT = "file-format"; + private static final String PARTITION = "partition"; + private static final String RECORD_COUNT = "record-count"; + private static final String FILE_SIZE_IN_BYTES = "file-size-in-bytes"; + private static final String COLUMN_SIZES = "column-sizes"; + private static final String VALUE_COUNTS = "value-counts"; + private static final String NULL_VALUE_COUNTS = "null-value-counts"; + private static final String NAN_VALUE_COUNTS = "nan-value-counts"; + private static final String LOWER_BOUNDS = "lower-bounds"; + private static final String UPPER_BOUNDS = "upper-bounds"; + private static final String KEY_METADATA = "key-metadata"; + private static final String SPLIT_OFFSETS = "split-offsets"; + private static final String EQUALITY_IDS = "equality-ids"; + private static final String SORT_ORDER_ID = "sort-order-id"; + + private RESTContentFileParser() {} + + public static String toJson(ContentFile contentFile) { +return JsonUtil.generate( +generator -> RESTContentFileParser.toJson(contentFile, generator), false); + } + + public static void toJson(ContentFile contentFile, JsonGenerator generator) + throws IOException { +Preconditions.checkArgument(contentFile != null, "Invalid content file: null"); +Preconditions.checkArgument(generator != null, "Invalid JSON generator: null"); + +generator.writeStartObject(); + +generator.writeNumberField(SPEC_ID, contentFile.specId()); +generator.writeStringField(CONTENT, contentFile.content().name()); +generator.writeStringField(FILE_PATH, contentFile.path().toString()); 
+generator.writeStringField(FILE_FORMAT, contentFile.format().name()); + +generator.writeFieldName(PARTITION); + +// TODO at the time of serialization we dont have the partition spec we just have spec id. +// we will need to get the spec from table metadata using spec id. +// or we will need to send parition spec, put null here for now until refresh +SingleValueParser.toJson(null, contentFile.partition(), generator); + +generator.writeNumberField(FILE_SIZE_IN_BYTES, contentFile.fileSizeInBytes()); + +metricsToJson(contentFile, generator); + +if (contentFile.keyMetadata() != null) { + generator.writeFieldName(KEY_METADATA); + SingleValueParser.toJson(DataFile.KEY_METADATA.type(), contentFile.keyMetadata(), generator); +} + +if (contentFile.splitOffsets() != null) { + JsonUtil.writeLongArray(SPLIT_OFFSETS, contentFile.splitOffsets(), generator); +} + +if (contentFile.equalityFieldIds() != null) { + JsonUtil.writeIntegerArray(EQUALITY_IDS, contentFile.equalityFieldIds(), generator); +} + +if (contentFile.sortOrderId() != null) { + generator.writeNumberField(SORT_ORDER_ID, contentFile.sortOrderId()); +} + +generator.writeEndObject(); + } + + public static ContentFile fromJson(JsonNode jsonNode) { +Preconditions.checkArgument(jsonNode != null, "Invali
Re: [PR] API, Core: Add scan planning apis to REST Catalog [iceberg]
singhpk234 commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1798744690 ## core/src/main/java/org/apache/iceberg/rest/RESTContentFileParser.java: ## @@ -0,0 +1,250 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.rest; + +import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.databind.JsonNode; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import java.util.Map; +import org.apache.iceberg.ContentFile; +import org.apache.iceberg.DataFile; +import org.apache.iceberg.FileContent; +import org.apache.iceberg.FileFormat; +import org.apache.iceberg.GenericDataFile; +import org.apache.iceberg.GenericDeleteFile; +import org.apache.iceberg.Metrics; +import org.apache.iceberg.PartitionData; +import org.apache.iceberg.SingleValueParser; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.util.JsonUtil; + +public class RESTContentFileParser { + private static final String SPEC_ID = "spec-id"; + private static final String CONTENT = "content"; + private static final String FILE_PATH = "file-path"; + private static final String FILE_FORMAT = "file-format"; + private static final String PARTITION = "partition"; + private static final String RECORD_COUNT = "record-count"; + private static final String FILE_SIZE_IN_BYTES = "file-size-in-bytes"; + private static final String COLUMN_SIZES = "column-sizes"; + private static final String VALUE_COUNTS = "value-counts"; + private static final String NULL_VALUE_COUNTS = "null-value-counts"; + private static final String NAN_VALUE_COUNTS = "nan-value-counts"; + private static final String LOWER_BOUNDS = "lower-bounds"; + private static final String UPPER_BOUNDS = "upper-bounds"; + private static final String KEY_METADATA = "key-metadata"; + private static final String SPLIT_OFFSETS = "split-offsets"; + private static final String EQUALITY_IDS = "equality-ids"; + private static final String SORT_ORDER_ID = "sort-order-id"; + + private RESTContentFileParser() {} + + public static String toJson(ContentFile contentFile) { +return JsonUtil.generate( +generator -> RESTContentFileParser.toJson(contentFile, generator), false); + } + + public static void toJson(ContentFile contentFile, JsonGenerator generator) + throws IOException { +Preconditions.checkArgument(contentFile != null, "Invalid content file: null"); +Preconditions.checkArgument(generator != null, "Invalid JSON generator: null"); + +generator.writeStartObject(); + +generator.writeNumberField(SPEC_ID, contentFile.specId()); +generator.writeStringField(CONTENT, contentFile.content().name()); +generator.writeStringField(FILE_PATH, contentFile.path().toString()); 
+generator.writeStringField(FILE_FORMAT, contentFile.format().name()); + +generator.writeFieldName(PARTITION); + +// TODO at the time of serialization we dont have the partition spec we just have spec id. +// we will need to get the spec from table metadata using spec id. +// or we will need to send parition spec, put null here for now until refresh +SingleValueParser.toJson(null, contentFile.partition(), generator); + +generator.writeNumberField(FILE_SIZE_IN_BYTES, contentFile.fileSizeInBytes()); + +metricsToJson(contentFile, generator); + +if (contentFile.keyMetadata() != null) { + generator.writeFieldName(KEY_METADATA); + SingleValueParser.toJson(DataFile.KEY_METADATA.type(), contentFile.keyMetadata(), generator); +} + +if (contentFile.splitOffsets() != null) { + JsonUtil.writeLongArray(SPLIT_OFFSETS, contentFile.splitOffsets(), generator); +} + +if (contentFile.equalityFieldIds() != null) { + JsonUtil.writeIntegerArray(EQUALITY_IDS, contentFile.equalityFieldIds(), generator); +} + +if (contentFile.sortOrderId() != null) { + generator.writeNumberField(SORT_ORDER_ID, contentFile.sortOrderId()); +} + +generator.writeEndObject(); + } + + public static ContentFile fromJson(JsonNode jsonNode) { +Preconditions.checkArgument(jsonNode != null, "Inv
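Piecing together the field-name constants in the quoted parser, a serialized data file would presumably look like the JSON below. All values are invented, `partition` is still a TODO in the PR (serialized against a null type for now), and the final wire shape may differ.

```java
public class ContentFileJsonSketch {
  // Hypothetical payload assembled from the parser's field-name constants;
  // values are made up for illustration.
  static final String EXAMPLE_DATA_FILE =
      "{"
          + "\"spec-id\": 0,"
          + "\"content\": \"DATA\","
          + "\"file-path\": \"s3://bucket/db/table/data/00000-0.parquet\","
          + "\"file-format\": \"PARQUET\","
          + "\"partition\": null,"
          + "\"record-count\": 100,"
          + "\"file-size-in-bytes\": 1024,"
          + "\"split-offsets\": [4],"
          + "\"sort-order-id\": 0"
          + "}";

  public static void main(String[] args) {
    System.out.println(EXAMPLE_DATA_FILE);
  }
}
```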
Re: [PR] API, Core: Add scan planning apis to REST Catalog [iceberg]
singhpk234 commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1798744690

## core/src/main/java/org/apache/iceberg/rest/RESTContentFileParser.java:
## @@ -0,0 +1,250 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.rest;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.ContentFile;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.FileContent;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.GenericDataFile;
+import org.apache.iceberg.GenericDeleteFile;
+import org.apache.iceberg.Metrics;
+import org.apache.iceberg.PartitionData;
+import org.apache.iceberg.SingleValueParser;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.util.JsonUtil;
+
+public class RESTContentFileParser {
+  private static final String SPEC_ID = "spec-id";
+  private static final String CONTENT = "content";
+  private static final String FILE_PATH = "file-path";
+  private static final String FILE_FORMAT = "file-format";
+  private static final String PARTITION = "partition";
+  private static final String RECORD_COUNT = "record-count";
+  private static final String FILE_SIZE_IN_BYTES = "file-size-in-bytes";
+  private static final String COLUMN_SIZES = "column-sizes";
+  private static final String VALUE_COUNTS = "value-counts";
+  private static final String NULL_VALUE_COUNTS = "null-value-counts";
+  private static final String NAN_VALUE_COUNTS = "nan-value-counts";
+  private static final String LOWER_BOUNDS = "lower-bounds";
+  private static final String UPPER_BOUNDS = "upper-bounds";
+  private static final String KEY_METADATA = "key-metadata";
+  private static final String SPLIT_OFFSETS = "split-offsets";
+  private static final String EQUALITY_IDS = "equality-ids";
+  private static final String SORT_ORDER_ID = "sort-order-id";
+
+  private RESTContentFileParser() {}
+
+  public static String toJson(ContentFile contentFile) {
+    return JsonUtil.generate(
+        generator -> RESTContentFileParser.toJson(contentFile, generator), false);
+  }
+
+  public static void toJson(ContentFile contentFile, JsonGenerator generator)
+      throws IOException {
+    Preconditions.checkArgument(contentFile != null, "Invalid content file: null");
+    Preconditions.checkArgument(generator != null, "Invalid JSON generator: null");
+
+    generator.writeStartObject();
+
+    generator.writeNumberField(SPEC_ID, contentFile.specId());
+    generator.writeStringField(CONTENT, contentFile.content().name());
+    generator.writeStringField(FILE_PATH, contentFile.path().toString());
+    generator.writeStringField(FILE_FORMAT, contentFile.format().name());
+
+    generator.writeFieldName(PARTITION);
+
+    // TODO: at the time of serialization we don't have the partition spec, we just have
+    // the spec id. We will need to get the spec from table metadata using the spec id,
+    // or we will need to send the partition spec; put null here for now until refresh.
+    SingleValueParser.toJson(null, contentFile.partition(), generator);
+
+    generator.writeNumberField(FILE_SIZE_IN_BYTES, contentFile.fileSizeInBytes());
+
+    metricsToJson(contentFile, generator);
+
+    if (contentFile.keyMetadata() != null) {
+      generator.writeFieldName(KEY_METADATA);
+      SingleValueParser.toJson(DataFile.KEY_METADATA.type(), contentFile.keyMetadata(), generator);
+    }
+
+    if (contentFile.splitOffsets() != null) {
+      JsonUtil.writeLongArray(SPLIT_OFFSETS, contentFile.splitOffsets(), generator);
+    }
+
+    if (contentFile.equalityFieldIds() != null) {
+      JsonUtil.writeIntegerArray(EQUALITY_IDS, contentFile.equalityFieldIds(), generator);
+    }
+
+    if (contentFile.sortOrderId() != null) {
+      generator.writeNumberField(SORT_ORDER_ID, contentFile.sortOrderId());
+    }
+
+    generator.writeEndObject();
+  }
+
+  public static ContentFile fromJson(JsonNode jsonNode) {
+    Preconditions.checkArgument(jsonNode != null, "Inv
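For reference, a minimal round-trip sketch against the parser in the diff above (editorially added, not part of the PR; it assumes an already-built `DataFile` instance and that `org.apache.iceberg.util.JsonUtil.mapper()` exposes the shared Jackson `ObjectMapper`):

```java
import java.io.IOException;
import org.apache.iceberg.ContentFile;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.rest.RESTContentFileParser;
import org.apache.iceberg.util.JsonUtil;

// Sketch only: serialize a ContentFile with the parser above, then parse the
// resulting JSON back. `dataFile` is an assumed, already-built DataFile.
static ContentFile roundTrip(DataFile dataFile) throws IOException {
  String json = RESTContentFileParser.toJson(dataFile);
  return RESTContentFileParser.fromJson(JsonUtil.mapper().readTree(json));
}
```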
Re: [PR] Arrow: Deprecate unused fixed width binary reader classes [iceberg]
nastra merged PR #11292: URL: https://github.com/apache/iceberg/pull/11292
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
wgtmac commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1798786736

## format/spec.md:
## @@ -454,35 +457,40 @@ The schema of a manifest file is a struct called `manifest_entry` with the following fields:

`data_file` is a struct with the following fields:

-| v1 | v2 | Field id, name | Type | Description |
-| -- | -- | --- | --- | --- |
-| | _required_ | **`134 content`** | `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) |
-| _required_ | _required_ | **`100 file_path`** | `string` | Full URI for the file with FS scheme |
-| _required_ | _required_ | **`101 file_format`** | `string` | String file format name, avro, orc or parquet |
-| _required_ | _required_ | **`102 partition`** | `struct<...>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |
-| _required_ | _required_ | **`103 record_count`** | `long` | Number of records in this file |
-| _required_ | _required_ | **`104 file_size_in_bytes`** | `long` | Total file size in bytes |
-| _required_ | | ~~**`105 block_size_in_bytes`**~~ | `long` | **Deprecated. Always write a default in v1. Do not write in v2.** |
-| _optional_ | | ~~**`106 file_ordinal`**~~ | `int` | **Deprecated. Do not write.** |
-| _optional_ | | ~~**`107 sort_columns`**~~ | `list<112: int>` | **Deprecated. Do not write.** |
-| _optional_ | _optional_ | **`108 column_sizes`** | `map<117: int, 118: long>` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) |
-| _optional_ | _optional_ | **`109 value_counts`** | `map<119: int, 120: long>` | Map from column id to number of values in the column (including null and NaN values) |
-| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column |
-| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column |
-| _optional_ | _optional_ | **`111 distinct_counts`** | `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts |
-| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] |
-| _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-NaN values in the column for the file [2] |
-| _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption |
-| _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>` | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending |
-| | _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file |
-| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. |
+| v1 | v2 | v3 | Field id, name | Type | Description |
+| -- | -- | -- | --- | --- | --- |
+| | _required_ | _required_ | **`134 content`** | `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) |
+| _required_ | _required_ | _requi
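As an editorial reading aid: the field ids in the table above correspond to `ContentFile` accessors in the Java API, the same ones used by the `RESTContentFileParser` diff quoted earlier in this digest. A minimal sketch, assuming an already-built `DataFile`:

```java
import java.util.List;
import java.util.Map;
import org.apache.iceberg.DataFile;

// Sketch only: reading a few data_file fields from the table above through
// the Java ContentFile API.
static void printFileStats(DataFile dataFile) {
  long recordCount = dataFile.recordCount();                  // 103 record_count
  long sizeInBytes = dataFile.fileSizeInBytes();              // 104 file_size_in_bytes
  Map<Integer, Long> nullCounts = dataFile.nullValueCounts(); // 110 null_value_counts
  List<Long> splitOffsets = dataFile.splitOffsets();          // 132 split_offsets (nullable)
  System.out.printf("%d records, %d bytes, offsets=%s, null counts for %d columns%n",
      recordCount, sizeInBytes, splitOffsets, nullCounts == null ? 0 : nullCounts.size());
}
```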
Re: [I] If we replaced or dropped partition spec field and drop the corresponding column, we can't select table again [iceberg]
bknbkn commented on issue #11314: URL: https://github.com/apache/iceberg/issues/11314#issuecomment-2409677616

The reason for this problem seems to be that each spec is resolved against the latest schema, and historical specs may not be able to find their fields in the latest schema. I think it is necessary to persist the schema id into each spec; based on that, each PartitionSpec can find the schema it was generated from.
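Editorial sketch of that idea (not from the issue): resolve each historical `PartitionSpec` against the schema it was created with instead of the table's latest schema. `Table.schemas()` and `Table.specs()` exist in the Java API; `spec.schemaId()` is hypothetical and stands in for the proposed persisted schema id:

```java
import org.apache.iceberg.PartitionField;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.types.Types;

// Sketch only: spec.schemaId() does not exist today; it illustrates the
// proposal to persist a schema id on each partition spec.
static void resolveSpecFields(Table table, PartitionSpec spec) {
  Schema boundSchema = table.schemas().get(spec.schemaId()); // hypothetical accessor
  for (PartitionField field : spec.fields()) {
    // a source field dropped from the latest schema is still present here
    Types.NestedField source = boundSchema.findField(field.sourceId());
  }
}
```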
Re: [PR] docs: README uses iceberg-rust instead of we [iceberg-rust]
Xuanwo merged PR #667: URL: https://github.com/apache/iceberg-rust/pull/667
Re: [PR] chore(deps): Bump crate-ci/typos from 1.25.0 to 1.26.0 [iceberg-rust]
Xuanwo merged PR #668: URL: https://github.com/apache/iceberg-rust/pull/668
Re: [PR] Spec: Adds Row Lineage [iceberg]
wgtmac commented on code in PR #11130: URL: https://github.com/apache/iceberg/pull/11130#discussion_r1798768067

## format/spec.md:
## @@ -598,6 +702,14 @@ Notes:

1. Lower and upper bounds are serialized to bytes using the single-object serialization in Appendix D. The type used to encode the value is the type of the partition field data.
2. If -0.0 is a value of the partition field, the `lower_bound` must not be +0.0, and if +0.0 is a value of the partition field, the `upper_bound` must not be -0.0.

+First Row ID Assignment
+
+Row ID inheritance is used when row lineage is enabled. When not enabled, a manifest's `first_row_id` must always be set to `null`. The rest of this section applies when row lineage is enabled.

Review Comment: A related question: if we revert the table to a snapshot before enabling row lineage, should we disable row lineage? If not, what about `next_row_id`?
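Editorial illustration of the quoted rule (every name below is hypothetical; this is not the spec's or Iceberg's Java API):

```java
// Illustrative only: per the quoted text, a manifest's first_row_id must be
// written as null whenever row lineage is not enabled.
Long firstRowId = rowLineageEnabled
    ? nextRowId   // assumed running row-id counter from table metadata
    : null;       // spec: first_row_id must always be null when lineage is off
manifestWriter.setFirstRowId(firstRowId); // hypothetical writer hook
```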
Re: [PR] OpenAPI: Standardize credentials in loadTable/loadView responses [iceberg]
nastra commented on code in PR #10722: URL: https://github.com/apache/iceberg/pull/10722#discussion_r1798841576

## open-api/rest-catalog-open-api.yaml:
## @@ -3129,6 +3145,11 @@ components:

- `s3.secret-access-key`: secret for credentials that provide access to data in S3
- `s3.session-token`: if present, this value should be used as the session token
- `s3.remote-signing-enabled`: if `true`, remote signing should be performed as described in the `s3-signer-open-api.yaml` specification
+
+## Storage Credentials
+
+Credentials for ADLS / GCS / S3 / ... are provided through the `storage-credentials` field.

Review Comment: Yes, exactly: we're trying to move the docs to this new credentials section.
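Editorial sketch (not from the PR) of how a client might consume the new field; the per-credential shape, a `prefix` plus a `config` map, is an assumption drawn from this discussion rather than quoted spec text:

```java
import java.util.List;
import java.util.Map;

// Sketch only: pick the storage credential whose prefix matches a file
// location. The StorageCredential shape (prefix + config) is assumed.
record StorageCredential(String prefix, Map<String, String> config) {}

static Map<String, String> configFor(List<StorageCredential> credentials, String location) {
  return credentials.stream()
      .filter(c -> location.startsWith(c.prefix()))
      .findFirst()
      .map(StorageCredential::config)
      .orElse(Map.of()); // fall back to table-level config properties
}
```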
Re: [PR] Api, Spark: Make StrictMetricsEvaluator not fail on nested column predicates [iceberg]
nastra merged PR #11261: URL: https://github.com/apache/iceberg/pull/11261
Re: [I] Cannot delete column with nested field filter [iceberg]
nastra closed issue #7065: Cannot delete column with nested field filter URL: https://github.com/apache/iceberg/issues/7065