[PR] Kafka Connect: Add config to route to tables using topic name [iceberg]
xiasongh opened a new pull request, #11313: URL: https://github.com/apache/iceberg/pull/11313 Add a new config `iceberg.tables.route-pattern` to dynamically route records to Iceberg tables using the Kafka topic name. Closes #11163
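For illustration, a sink configured with the new option might look like the snippet below. The `{topic}` placeholder comes from the PR's implementation; the property value and namespace prefix here are made-up examples, not taken from the PR:

```properties
# Hypothetical example: route records from each topic to a table whose name
# embeds the topic name via the {topic} placeholder.
iceberg.tables.route-pattern=db.events_{topic}
```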
Re: [PR] Kafka Connect: Add config to route to tables using topic name [iceberg]
xiasongh commented on code in PR #11313: URL: https://github.com/apache/iceberg/pull/11313#discussion_r1798594599

## kafka-connect/kafka-connect/src/main/java/org/apache/iceberg/connect/data/SinkWriter.java: ## @@ -133,6 +145,20 @@ private String extractRouteValue(Object recordValue, String routeField) { return routeValue == null ? null : routeValue.toString(); } + private String formatRoutePattern(SinkRecord record, String routePattern) { +if (routePattern == null) { + return null; +} + +String topicName = record.topic(); +if (topicName == null) { + return null; +} + +// replace topic namespace separator +return routePattern.replace("{topic}", topicName.replace(".", "_"));

Review Comment: > topicName.replace(".", "_")

I use the AWS Glue catalog, which doesn't support nested namespaces. Topic names with more than one `.` would produce invalid table names, so one thing we can do is replace all the `.` characters. The Debezium JDBC sink connector [0] also does this, so it's not totally unheard of. It probably makes the most sense to turn this into its own config, maybe something like `iceberg.tables.route-pattern.namespace-separator`? Thoughts?

[0] https://debezium.io/documentation/reference/stable/connectors/jdbc.html#jdbc-property-table-naming-strategy
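A minimal sketch of the separator idea floated above, assuming the still-hypothetical `iceberg.tables.route-pattern.namespace-separator` property is wired through to this method; this is an illustration, not the PR's actual code:

```java
import org.apache.kafka.connect.sink.SinkRecord;

class RoutePatternSketch {
  // "separator" would come from the proposed (hypothetical)
  // iceberg.tables.route-pattern.namespace-separator config, e.g. "_".
  static String formatRoutePattern(SinkRecord record, String routePattern, String separator) {
    if (routePattern == null || record.topic() == null) {
      return null;
    }
    // Replace the topic namespace separator so catalogs without nested
    // namespaces (such as AWS Glue) still get a valid table name.
    return routePattern.replace("{topic}", record.topic().replace(".", separator));
  }
}
```

With `separator` set to `"_"`, a record from topic `cdc.public.users` and pattern `{topic}` would route to `cdc_public_users`.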
Re: [I] Kafka Connect: route to table using topic name [iceberg]
xiasongh commented on issue #11163: URL: https://github.com/apache/iceberg/issues/11163#issuecomment-2409142288 @bryanck Sorry for the delay; I was able to give this a shot. Please have a look. I'm not at all familiar with Java, so forgive me if I did anything silly.
[I] Flink: Add RowConverter for Iceberg Source [iceberg]
abharath9 opened a new issue, #11312: URL: https://github.com/apache/iceberg/issues/11312

### Feature Request / Improvement

Currently we can't create views on top of IcebergSource DataStreams directly; we need to convert the RowData to Row explicitly using a map function (sketched below). I thought creating a RowConverter to convert RowData to Row and return an IcebergSource would be a good idea. This approach enables the use of the Iceberg schema to create a DataStream and further create a view on top of it. What are your thoughts?

### Query engine

Flink

### Willingness to contribute

- [X] I can contribute this improvement/feature independently
- [ ] I would be willing to contribute this improvement/feature with guidance from the Iceberg community
- [ ] I cannot contribute this improvement/feature at this time
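For context, a sketch of today's workaround: an explicit map from RowData to Row for an assumed two-column schema (`id BIGINT`, `name STRING`). The proposed RowConverter would derive this conversion from the Iceberg schema instead; all names below are illustrative:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.flink.types.Row;

class RowDataToRowSketch {
  // Hand-written per-field conversion for an assumed schema; this is the
  // boilerplate a generic RowConverter would eliminate.
  static DataStream<Row> toRows(DataStream<RowData> source) {
    return source
        .map(rd -> Row.of(rd.getLong(0), rd.getString(1).toString()))
        .returns(Types.ROW(Types.LONG, Types.STRING));
  }
}
```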
Re: [I] Flink: Add RowConverter for Iceberg Source [iceberg]
abharath9 commented on issue #11312: URL: https://github.com/apache/iceberg/issues/11312#issuecomment-2409092087 Implemented this change and created a PR. Can I get a review on this? https://github.com/apache/iceberg/pull/11301
Re: [PR] build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.61.2 to 1.65.2 [iceberg-go]
dependabot[bot] commented on PR #166: URL: https://github.com/apache/iceberg-go/pull/166#issuecomment-2408876440 Superseded by #170.
Re: [PR] build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.61.2 to 1.65.2 [iceberg-go]
dependabot[bot] closed pull request #166: build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.61.2 to 1.65.2 URL: https://github.com/apache/iceberg-go/pull/166
[PR] build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.61.2 to 1.65.3 [iceberg-go]
dependabot[bot] opened a new pull request, #170: URL: https://github.com/apache/iceberg-go/pull/170

Bumps [github.com/aws/aws-sdk-go-v2/service/s3](https://github.com/aws/aws-sdk-go-v2) from 1.61.2 to 1.65.3.

Commits:
- [071b493](https://github.com/aws/aws-sdk-go-v2/commit/071b493afc547a04084be261af44ba204e97c612) Release 2024-10-11
- [c70d011](https://github.com/aws/aws-sdk-go-v2/commit/c70d0118c74a13c807b16b45fcbc8b82e061da30) Regenerated Clients
- [f98b7e1](https://github.com/aws/aws-sdk-go-v2/commit/f98b7e121460ce1c7e29f916c60d1f3f9f8895e8) Update API model
- [10c8fe2](https://github.com/aws/aws-sdk-go-v2/commit/10c8fe26fbe46b3abd5ee66d7ecbbabae4c95b46) Remove requirement of internal tool to check for version on AWS models (#2832)
- [28d943f](https://github.com/aws/aws-sdk-go-v2/commit/28d943f7f66c7095685c8d57dea18944fc3b5c22) S3 ReplicationRuleFilter and LifecycleRuleFilter shapes are being changed fro...
- [b34ecd4](https://github.com/aws/aws-sdk-go-v2/commit/b34ecd46bb2e14f2786934ef34ed7747c5fe89a8) Release 2024-10-10
- [ead7ba3](https://github.com/aws/aws-sdk-go-v2/commit/ead7ba38611404d5c32aa92c57cba8057b3cf8a0) Regenerated Clients
- [26c58a0](https://github.com/aws/aws-sdk-go-v2/commit/26c58a0c6f861e7be7d6439bceea77bec71fc97a) Update API model
- [bcff115](https://github.com/aws/aws-sdk-go-v2/commit/bcff11552060a39aa275ffda9714ecfb6e2572ab) Release 2024-10-09
- [5272445](https://github.com/aws/aws-sdk-go-v2/commit/527244530ff3208d45f9b34bde45fac2bd300476) Regenerated Clients

Additional commits viewable in the [compare view](https://github.com/aws/aws-sdk-go-v2/compare/service/s3/v1.61.2...service/s3/v1.65.3).

[Dependabot compatibility score](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Dependabot commands and options. You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
[PR] build(deps): bump github.com/aws/aws-sdk-go-v2/service/glue from 1.99.2 to 1.100.2 [iceberg-go]
dependabot[bot] opened a new pull request, #171: URL: https://github.com/apache/iceberg-go/pull/171

Bumps [github.com/aws/aws-sdk-go-v2/service/glue](https://github.com/aws/aws-sdk-go-v2) from 1.99.2 to 1.100.2.

Commits:
- [0cbb5aa](https://github.com/aws/aws-sdk-go-v2/commit/0cbb5aa17f9078cb45dc0e82d3e1d0abee3744a9) Release 2024-10-08
- [54c1dd6](https://github.com/aws/aws-sdk-go-v2/commit/54c1dd6c74185b0c7df78159ec4d5b2c27e9e280) Regenerated Clients
- [2cde144](https://github.com/aws/aws-sdk-go-v2/commit/2cde144eedda9f509141751c3011ca64a6b6528e) Update endpoints model
- [67fbd35](https://github.com/aws/aws-sdk-go-v2/commit/67fbd35762ef8694839df209714d2ec2c33d3df9) Update API model
- [aa04330](https://github.com/aws/aws-sdk-go-v2/commit/aa04330cb6978ccb6a7bb3e198b3fb21abbd6333) Allow non-nil but empty headers (#2826)
- [5a4e5bb](https://github.com/aws/aws-sdk-go-v2/commit/5a4e5bb42c08ff5a4e0e601a7461c8466565e44e) add feature tracking for cbor protocol (#2821)
- [183987c](https://github.com/aws/aws-sdk-go-v2/commit/183987cda0c2487a1b25c8e9cbf8dba510046c73) add annotations to deprecated services and introduce codegen integration for ...
- [b737dc9](https://github.com/aws/aws-sdk-go-v2/commit/b737dc9eb14847cd97d3b30ad6a1394efd73245b) Release 2024-10-07
- [7279a51](https://github.com/aws/aws-sdk-go-v2/commit/7279a51bbcd597f4aa7aeeb599c017d3d1679fb6) Regenerated Clients
- [a1b1f5a](https://github.com/aws/aws-sdk-go-v2/commit/a1b1f5a17c687371cc53c5dfbb2bf5ff467ff51a) Update endpoints model

Additional commits viewable in the [compare view](https://github.com/aws/aws-sdk-go-v2/compare/service/glue/v1.99.2...service/glue/v1.100.2).

[Dependabot compatibility score](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. The standard Dependabot commands and options listed under PR #170 above apply here as well.
[PR] docs:README uses iceberg-rust instead of we [iceberg-rust]
caicancai opened a new pull request, #667: URL: https://github.com/apache/iceberg-rust/pull/667 (no comment)
Re: [PR] Api, Spark: Make StrictMetricsEvaluator not fail on nested column predicates [iceberg]
zhongyujiang commented on PR #11261: URL: https://github.com/apache/iceberg/pull/11261#issuecomment-2408925671 @amogh-jahagirdar Thanks for reviewing, tests updated.
[PR] chore(deps): Bump crate-ci/typos from 1.25.0 to 1.26.0 [iceberg-rust]
dependabot[bot] opened a new pull request, #668: URL: https://github.com/apache/iceberg-rust/pull/668

Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.25.0 to 1.26.0.

Release notes (sourced from [crate-ci/typos's releases](https://github.com/crate-ci/typos/releases) and changelog), v1.26.0 - 2024-10-07:
- Compatibility (pre-commit): requires 3.2+
- Fixes (pre-commit): resolve deprecations in 4.0 about deprecated stage names

Commits:
- [6802cc6](https://github.com/crate-ci/typos/commit/6802cc60d4e7f78b9d5454f6cf3935c042d5e1e3) chore: Release
- [caa5502](https://github.com/crate-ci/typos/commit/caa55026aee3d2cdcaf1f9b0c258651dbb01c283) docs: Update changelog
- [2114c19](https://github.com/crate-ci/typos/commit/2114c1924169510820bc12e59427851514624ac2) Merge pull request #1114 from tobiasraabe/patch-1
- [9de7b2c](https://github.com/crate-ci/typos/commit/9de7b2c6bed6e32c6b34ed91702ac6eaba138a99) Updates stage names in .pre-commit-hooks.yaml.
- [14f49f4](https://github.com/crate-ci/typos/commit/14f49f455cf3b6a38841665e82c3b9135b91c929) Merge pull request #1105 from crate-ci/renovate/unicode-width-0.x
- [58ffa4b](https://github.com/crate-ci/typos/commit/58ffa4baefb10b607bbc30bd16f7fe8a4446a643) Merge pull request #1108 from crate-ci/renovate/stable-1.x
- [003cb76](https://github.com/crate-ci/typos/commit/003cb769377a25a6c659c67429585644c5321348) chore(deps): Update dependency STABLE to v1.81.0
- [bc00184](https://github.com/crate-ci/typos/commit/bc00184a2367b7d354946b042483a30d92e012e9) chore(deps): Update Rust crate unicode-width to 0.2.0

See full diff in the [compare view](https://github.com/crate-ci/typos/compare/v1.25.0...v1.26.0).

[Dependabot compatibility score](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. The standard Dependabot commands and options listed under PR #170 above apply here as well.
Re: [PR] [WIP] Core: Prototype for DVs in V3 [iceberg]
aokolnychyi commented on code in PR #11302: URL: https://github.com/apache/iceberg/pull/11302#discussion_r1798616398 ## api/src/main/java/org/apache/iceberg/DataFile.java: ## @@ -98,12 +98,23 @@ public interface DataFile extends ContentFile { Types.NestedField SORT_ORDER_ID = optional(140, "sort_order_id", IntegerType.get(), "Sort order ID"); Types.NestedField SPEC_ID = optional(141, "spec_id", IntegerType.get(), "Partition spec ID"); + Types.NestedField REFERENCED_DATA_FILE = Review Comment: Follows the proposed spec, reserving 142 for row lineage.
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1798617437 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at Review Comment: True, we may keep this generic for referencing manifests in the future.
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1798622067 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64`

Review Comment: While I don't think the magic number check is critical, I do believe it is beneficial. If things start to fail, we would want to have as much helpful information as possible. Having the magic number allows us to cross-check the serialization format without reading the footer and may help debug issues with offsets. It will also be useful if we add more serialization formats in the future. I agree it is unlikely we will be able to successfully deserialize the rest of the content if the offset is invalid, but still: if we end up in that situation, it would mean there was an ugly bug, and having more metadata will only help. Overall, it does seem like a reasonable sanity check to me, similar to the magic numbers in zstd and gzip.

We once had to debug issues with bit flips while reading manifests. There was no easy way to prove that we hadn't corrupted the files and that the cause was a faulty disk. The CRC check would catch those and save a ton of time. I'd propose keeping the magic number and CRC independently of whether we decide to follow the Delta Lake blob layout.

The length and byte orders are more controversial: there is no merit in those beyond compatibility with Delta Lake.
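To make the trade-off concrete, here is a minimal reader-side sketch of the sanity checks discussed above, assuming the blob layout proposed in this PR (a big-endian length covering magic plus vector, the `D1 D3 39 64` magic, the serialized vector, then a big-endian CRC-32 of the vector bytes). It is only an illustration, not the reference implementation:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

class DeleteVectorBlobCheck {
  // Magic sequence from the delete-vector-v1 draft: D1 D3 39 64
  private static final byte[] MAGIC = {(byte) 0xD1, (byte) 0xD3, 0x39, 0x64};

  // Returns the serialized vector bytes after verifying the framing.
  static byte[] readVector(byte[] blob) {
    ByteBuffer buf = ByteBuffer.wrap(blob); // ByteBuffer defaults to big-endian
    int len = buf.getInt(); // length of magic + vector

    byte[] magic = new byte[MAGIC.length];
    buf.get(magic);
    if (!Arrays.equals(magic, MAGIC)) {
      // A bad magic sequence usually means the blob offset was wrong
      throw new IllegalStateException("Invalid delete vector magic bytes");
    }

    byte[] vector = new byte[len - MAGIC.length];
    buf.get(vector);

    CRC32 crc = new CRC32();
    crc.update(vector);
    if ((int) crc.getValue() != buf.getInt()) {
      // Catches corruption such as the bit flips mentioned above
      throw new IllegalStateException("Delete vector CRC-32 mismatch");
    }
    return vector;
  }
}
```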
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1798623532 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` +* The vector, serialized as described below +* A CRC-32 checksum of the serialized vector as 4 bytes, big-endian + +The position vector is serialized using the Roaring bitmap +["portable" format][roaring-bitmap-portable-serialization]. This representation +consists of: + +* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian +* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys: +- The key stored as 4 bytes, little-endian +- A [32-bit Roaring bitmap][roaring-bitmap-general-layout] + +Note that the length and CRC fields are stored using big-endian, but the +Roaring bitmap format uses little-endian values. Big endian values were chosen +for compatibility with existing deletion vectors. + +The blob metadata must include the following properties: + +* `referenced-data-file`: location of the data file the delete vector applies Review Comment: The cardinality is part of making these delete files self-describing and is up for discussion. I can imagine some maintenance operations compacting DV files without touching the rest of the metadata (I speculate).
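As a small illustration of the self-describing metadata in question, the two required blob properties from the draft could be assembled as below; the data file location is a placeholder, and the bitmap type assumes the org.roaringbitmap library rather than anything in this PR:

```java
import java.util.Map;
import org.roaringbitmap.longlong.Roaring64Bitmap;

class BlobMetadataSketch {
  // Property names come from the spec draft quoted above; values are illustrative.
  static Map<String, String> blobProperties(Roaring64Bitmap positions, String dataFileLocation) {
    return Map.of(
        "referenced-data-file", dataFileLocation,
        "cardinality", Long.toString(positions.getLongCardinality()));
  }
}
```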
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1798622661 ## format/puffin-spec.md: ## @@ -123,6 +123,54 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector (bitmap) that represents the positions of rows in a +file that are deleted. A set bit at position P indicates that the row at +position P is deleted. + +The vector supports positive 64-bit positions (the most significant bit must be +0), but is optimized for cases where most positions fit in 32 bits by using a +collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a +32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using +the least significant 4 bytes. For each key in the set of positions, a 32-bit +Roaring bitmap is maintained to store a set of 32-bit sub-positions for that +key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes (the +sub-position) are tested for inclusion in the bitmap. If a bitmap is not found +for the key, then it is not set. + +The serialized blob contains: +* The length of the vector and magic bytes stored as 4 bytes, big-endian +* A 4-byte magic sequence, `D1 D3 39 64` +* The vector, serialized as described below +* A CRC-32 checksum of the serialized vector as 4 bytes, big-endian Review Comment: I think a good question to ask the community is how many vendors/engines would be interested in potentially reusing the code if they support both Iceberg and Delta. Delta DVs are widely used at Databricks, but it is hard to tell about other engines.
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on code in PR #11238: URL: https://github.com/apache/iceberg/pull/11238#discussion_r1798624051 ## format/puffin-spec.md: ## @@ -123,6 +123,44 @@ The blob metadata for this blob may include following properties: - `ndv`: estimate of number of distinct values, derived from the sketch. + `delete-vector-v1` blob type + +A serialized delete vector that represents the positions of rows in a file that +are deleted. A set bit at position P indicates that the row at position P is +deleted. + +The bitmap supports positive 64-bit positions, but is optimized for cases where +most positions fit in 32 bits by using a collection of 32-bit Roaring bitmaps. +64-bit positions are divided into a 32-bit "key" using the most significant 4 +bytes and a 32-bit position using the least significant 4 bytes. For each key +in the set of positions, a 32-bit Roaring bitmap is maintained to store a set +of 32-bit positions for that key. + +To test whether a certain position is set, its most significant 4 bytes (the +key) are used to find a 32-bit bitmap and the least significant 4 bytes are +tested for inclusion in the bitmap. If a bitmap is not found for the key, then +it is not set. + +The serialized blob starts with a 4-byte magic sequence, `D1D33964` (1681511377 +stored as 4 bytes, little-endian). Following the magic bytes is the serialized +collection of bitmaps. The collection is stored using the Roaring bitmap +["portable" format][roaring-bitmap-portable-serialization]. This representation +consists of: + +* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian +* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys: +- The key stored as 4 bytes, little-endian +- A [32-bit Roaring bitmap][roaring-bitmap-general-layout] + +The blob metadata must include the following properties: + +* `referenced-data-file`: location of the data file the delete vector applies to +* `cardinality`: number of deleted rows (set positions) in the delete vector Review Comment: I am +1 for exploring what it would take to make those fields optional. My opinion would depend on the amount of work needed.
Re: [PR] Puffin: Add delete-vector-v1 blob type [iceberg]
aokolnychyi commented on PR #11238: URL: https://github.com/apache/iceberg/pull/11238#issuecomment-2409353898 PR #11302 contains a sample implementation of this spec.
Re: [PR] API, Core: Add scan planning apis to REST Catalog [iceberg]
singhpk234 commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1795908515 ## core/src/main/java/org/apache/iceberg/rest/RESTContentFileParser.java: ## @@ -0,0 +1,250 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.rest; + +import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.databind.JsonNode; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import java.util.Map; +import org.apache.iceberg.ContentFile; +import org.apache.iceberg.DataFile; +import org.apache.iceberg.FileContent; +import org.apache.iceberg.FileFormat; +import org.apache.iceberg.GenericDataFile; +import org.apache.iceberg.GenericDeleteFile; +import org.apache.iceberg.Metrics; +import org.apache.iceberg.PartitionData; +import org.apache.iceberg.SingleValueParser; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.util.JsonUtil; + +public class RESTContentFileParser { + private static final String SPEC_ID = "spec-id"; + private static final String CONTENT = "content"; + private static final String FILE_PATH = "file-path"; + private static final String FILE_FORMAT = "file-format"; + private static final String PARTITION = "partition"; + private static final String RECORD_COUNT = "record-count"; + private static final String FILE_SIZE_IN_BYTES = "file-size-in-bytes"; + private static final String COLUMN_SIZES = "column-sizes"; + private static final String VALUE_COUNTS = "value-counts"; + private static final String NULL_VALUE_COUNTS = "null-value-counts"; + private static final String NAN_VALUE_COUNTS = "nan-value-counts"; + private static final String LOWER_BOUNDS = "lower-bounds"; + private static final String UPPER_BOUNDS = "upper-bounds"; + private static final String KEY_METADATA = "key-metadata"; + private static final String SPLIT_OFFSETS = "split-offsets"; + private static final String EQUALITY_IDS = "equality-ids"; + private static final String SORT_ORDER_ID = "sort-order-id"; + + private RESTContentFileParser() {} + + public static String toJson(ContentFile contentFile) { +return JsonUtil.generate( +generator -> RESTContentFileParser.toJson(contentFile, generator), false); + } + + public static void toJson(ContentFile contentFile, JsonGenerator generator) + throws IOException { +Preconditions.checkArgument(contentFile != null, "Invalid content file: null"); +Preconditions.checkArgument(generator != null, "Invalid JSON generator: null"); + +generator.writeStartObject(); + +generator.writeNumberField(SPEC_ID, contentFile.specId()); +generator.writeStringField(CONTENT, contentFile.content().name()); +generator.writeStringField(FILE_PATH, contentFile.path().toString()); 
+generator.writeStringField(FILE_FORMAT, contentFile.format().name()); + +generator.writeFieldName(PARTITION); + +// TODO at the time of serialization we dont have the partition spec we just have spec id. +// we will need to get the spec from table metadata using spec id. +// or we will need to send parition spec, put null here for now until refresh +SingleValueParser.toJson(null, contentFile.partition(), generator); + +generator.writeNumberField(FILE_SIZE_IN_BYTES, contentFile.fileSizeInBytes()); + +metricsToJson(contentFile, generator); + +if (contentFile.keyMetadata() != null) { + generator.writeFieldName(KEY_METADATA); + SingleValueParser.toJson(DataFile.KEY_METADATA.type(), contentFile.keyMetadata(), generator); +} + +if (contentFile.splitOffsets() != null) { + JsonUtil.writeLongArray(SPLIT_OFFSETS, contentFile.splitOffsets(), generator); +} + +if (contentFile.equalityFieldIds() != null) { + JsonUtil.writeIntegerArray(EQUALITY_IDS, contentFile.equalityFieldIds(), generator); +} + +if (contentFile.sortOrderId() != null) { + generator.writeNumberField(SORT_ORDER_ID, contentFile.sortOrderId()); +} + +generator.writeEndObject(); + } + + public static ContentFile fromJson(JsonNode jsonNode) { +Preconditions.checkArgument(jsonNode != null, "Inv
Re: [PR] API, Core: Add scan planning apis to REST Catalog [iceberg]
singhpk234 commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1798632724 ## core/src/main/java/org/apache/iceberg/rest/RESTContentFileParser.java: ## @@ -0,0 +1,250 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.rest; + +import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.databind.JsonNode; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import java.util.Map; +import org.apache.iceberg.ContentFile; +import org.apache.iceberg.DataFile; +import org.apache.iceberg.FileContent; +import org.apache.iceberg.FileFormat; +import org.apache.iceberg.GenericDataFile; +import org.apache.iceberg.GenericDeleteFile; +import org.apache.iceberg.Metrics; +import org.apache.iceberg.PartitionData; +import org.apache.iceberg.SingleValueParser; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.util.JsonUtil; + +public class RESTContentFileParser { + private static final String SPEC_ID = "spec-id"; + private static final String CONTENT = "content"; + private static final String FILE_PATH = "file-path"; + private static final String FILE_FORMAT = "file-format"; + private static final String PARTITION = "partition"; + private static final String RECORD_COUNT = "record-count"; + private static final String FILE_SIZE_IN_BYTES = "file-size-in-bytes"; + private static final String COLUMN_SIZES = "column-sizes"; + private static final String VALUE_COUNTS = "value-counts"; + private static final String NULL_VALUE_COUNTS = "null-value-counts"; + private static final String NAN_VALUE_COUNTS = "nan-value-counts"; + private static final String LOWER_BOUNDS = "lower-bounds"; + private static final String UPPER_BOUNDS = "upper-bounds"; + private static final String KEY_METADATA = "key-metadata"; + private static final String SPLIT_OFFSETS = "split-offsets"; + private static final String EQUALITY_IDS = "equality-ids"; + private static final String SORT_ORDER_ID = "sort-order-id"; + + private RESTContentFileParser() {} + + public static String toJson(ContentFile contentFile) { +return JsonUtil.generate( +generator -> RESTContentFileParser.toJson(contentFile, generator), false); + } + + public static void toJson(ContentFile contentFile, JsonGenerator generator) + throws IOException { +Preconditions.checkArgument(contentFile != null, "Invalid content file: null"); +Preconditions.checkArgument(generator != null, "Invalid JSON generator: null"); + +generator.writeStartObject(); + +generator.writeNumberField(SPEC_ID, contentFile.specId()); +generator.writeStringField(CONTENT, contentFile.content().name()); +generator.writeStringField(FILE_PATH, contentFile.path().toString()); 
+generator.writeStringField(FILE_FORMAT, contentFile.format().name()); + +generator.writeFieldName(PARTITION); + +// TODO at the time of serialization we dont have the partition spec we just have spec id. +// we will need to get the spec from table metadata using spec id. +// or we will need to send parition spec, put null here for now until refresh +SingleValueParser.toJson(null, contentFile.partition(), generator); + +generator.writeNumberField(FILE_SIZE_IN_BYTES, contentFile.fileSizeInBytes()); + +metricsToJson(contentFile, generator); + +if (contentFile.keyMetadata() != null) { + generator.writeFieldName(KEY_METADATA); + SingleValueParser.toJson(DataFile.KEY_METADATA.type(), contentFile.keyMetadata(), generator); +} + +if (contentFile.splitOffsets() != null) { + JsonUtil.writeLongArray(SPLIT_OFFSETS, contentFile.splitOffsets(), generator); +} + +if (contentFile.equalityFieldIds() != null) { + JsonUtil.writeIntegerArray(EQUALITY_IDS, contentFile.equalityFieldIds(), generator); +} + +if (contentFile.sortOrderId() != null) { + generator.writeNumberField(SORT_ORDER_ID, contentFile.sortOrderId()); +} + +generator.writeEndObject(); + } + + public static ContentFile fromJson(JsonNode jsonNode) { +Preconditions.checkArgument(jsonNode != null, "Inv
Re: [PR] Spark: Merge new position deletes with old deletes during writing [iceberg]
amogh-jahagirdar commented on code in PR #11273: URL: https://github.com/apache/iceberg/pull/11273#discussion_r1798634769 ## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java: ## @@ -185,6 +196,7 @@ public void commit(WriterCommitMessage[] messages) { int addedDataFilesCount = 0; int addedDeleteFilesCount = 0; + int removedDeleteFilesCount = 0; Review Comment: Good point, it's not being used, but I'd rather include it since it will be useful, especially as this is a new thing we're adding. I'll update the logs.
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
emkornfield commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1798450250 ## format/spec.md: ## @@ -841,19 +855,45 @@ Notes: ## Delete Formats -This section details how to encode row-level deletes in Iceberg delete files. Row-level deletes are not supported in v1. +This section details how to encode row-level deletes in Iceberg delete files. Row-level deletes are added by v2 and are not supported in v1. Deletion vectors are added in v3 and are not supported in v2 or earlier. Position delete files must not be added to v3 tables, but existing position delete files are valid. + +There are three types of row-level deletes: +* Deletion vectors (DVs) identify deleted rows within a single referenced data file by position in a bitmap +* Position delete files identify deleted rows by file location and row position (**deprecated**) +* Equality delete files identify deleted rows by the value of one or more columns + +Deletion vectors are a binary representation of deletes for a single data file that is more efficient at execution time than position delete files. Unlike equality or position delete files, there can be at most one deletion vector for a given data file in a table. Writers must ensure that there is at most one deletion vector per data file and must merge new deletes with existing vectors or position delete files. + +Row-level delete files (both equality and position delete files) are valid Iceberg data files: files must use valid Iceberg formats, schemas, and column projection. It is recommended that these delete files are written using the table's default file format. + +Row-level delete files and deletion vectors are tracked by manifests. A separate set of manifests is used for delete files and DVs, but the same manifest schema is used for both data and delete manifests. Deletion vectors are tracked individually by file location, offset, and length within the containing file. Deletion vector metadata must include the referenced data file. + +Both position and equality delete files allow encoding deleted row values with a delete. This can be used to reconstruct a stream of changes to a table. + -Row-level delete files are valid Iceberg data files: files must use valid Iceberg formats, schemas, and column projection. It is recommended that delete files are written using the table's default file format. +### Deletion Vectors -Row-level delete files are tracked by manifests, like data files. A separate set of manifests is used for delete files, but the manifest schemas are identical. +Deletion vectors identify deleted rows of a file by encoding deleted positions in a bitmap. A set bit at position P indicates that the row at position P is deleted. -Both position and equality deletes allow encoding deleted row values with a delete. This can be used to reconstruct a stream of changes to a table. +These vectors are stored using the `delete-vector-v1` blob definition from the [Puffin spec][puffin-spec]. +Deletion vectors support positive 64-bit positions, but are optimized for cases where most positions fit in 32 bits by using a collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a 32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using the least significant 4 bytes. For each key in the set of positions, a 32-bit Roaring bitmap is maintained to store a set of 32-bit sub-positions for that key. 
+To test whether a certain position is set, its most significant 4 bytes (the key) are used to find a 32-bit bitmap and the least significant 4 bytes (the sub-position) are tested for inclusion in the bitmap. If a bitmap is not found for the key, then it is not set. + +Delete manifests track deletion vectors individually by the containing file location (`file_path`), starting offset of the DV magic bytes (`blob_offset`), and total length of the deletion vector blob (`blob_size_in_bytes`). Multiple deletion vectors can be stored in the same file. There are no restrictions on the data files that can be referenced by deletion vectors in the same Puffin file. + +At most one deletion vector is allowed per data file in a table. If a DV is written for a data file, it must replace all previously written position delete files so that when a DV is present, readers can safely ignore matching position delete files. + + +[puffin-spec]: https://iceberg.apache.org/puffin-spec/ ### Position Delete Files Position-based delete files identify deleted rows by file and position in one or more data files, and may optionally contain the deleted row. +_Note: Position delete files are **deprecated** in v3. Existing position deletes must be written to delete vectors when updating the position deletes for a data file._ Review Comment: Is there a technical reason to force deprecation here?
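A minimal sketch of the key/sub-position split described in the quoted spec text, using the org.roaringbitmap library; the actual implementation is in PR #11302, so treat this only as an illustration of the lookup logic:

```java
import java.util.Map;
import java.util.TreeMap;
import org.roaringbitmap.RoaringBitmap;

class PositionVectorSketch {
  // 32-bit key -> bitmap of 32-bit sub-positions, kept in unsigned key order
  // as the "portable" serialization described above requires.
  private final Map<Integer, RoaringBitmap> bitmaps = new TreeMap<>(Integer::compareUnsigned);

  void set(long pos) {
    int key = (int) (pos >>> 32); // most significant 4 bytes
    int subPos = (int) pos;       // least significant 4 bytes
    bitmaps.computeIfAbsent(key, k -> new RoaringBitmap()).add(subPos);
  }

  boolean isSet(long pos) {
    RoaringBitmap bitmap = bitmaps.get((int) (pos >>> 32));
    return bitmap != null && bitmap.contains((int) pos);
  }
}
```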
Re: [I] manifest exception [iceberg]
github-actions[bot] closed issue #8994: manifest exception URL: https://github.com/apache/iceberg/issues/8994
Re: [I] Iceberg Glue - Timeouts (maybe others client side error cases) can result in missing metadata_location [iceberg]
github-actions[bot] commented on issue #9618: URL: https://github.com/apache/iceberg/issues/9618#issuecomment-2409443905 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [I] Migrate RESTCatalogServlet to use jakarta.* package for Spring boot 3 [iceberg]
github-actions[bot] commented on issue #9626: URL: https://github.com/apache/iceberg/issues/9626#issuecomment-2409443950 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [PR] Flink: Optionally Overwrite All Partitions [iceberg]
github-actions[bot] commented on PR #9644: URL: https://github.com/apache/iceberg/pull/9644#issuecomment-2409443994 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [PR] Spark 3.5: Add a procedure to remove corrupt snapshots [iceberg]
github-actions[bot] commented on PR #9645: URL: https://github.com/apache/iceberg/pull/9645#issuecomment-2409444041 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [I] manifest exception [iceberg]
github-actions[bot] commented on issue #8994: URL: https://github.com/apache/iceberg/issues/8994#issuecomment-2409443156 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Re: [I] I can't find any detailed explanation about column metric options on the official docs for Iceberg configuration [iceberg]
github-actions[bot] commented on issue #8995: URL: https://github.com/apache/iceberg/issues/8995#issuecomment-2409443202 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Re: [I] Doc Bug: Iceberg Flink Example uses unsupported UNIQUE constraint [iceberg]
github-actions[bot] commented on issue #8997: URL: https://github.com/apache/iceberg/issues/8997#issuecomment-2409443247 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Re: [I] Doc Bug: Iceberg Flink Example uses unsupported UNIQUE constraint [iceberg]
github-actions[bot] closed issue #8997: Doc Bug: Iceberg Flink Example uses unsupported UNIQUE constraint URL: https://github.com/apache/iceberg/issues/8997
Re: [I] I can't find any detailed explanation about column metric options on the official docs for Iceberg configuration [iceberg]
github-actions[bot] closed issue #8995: I can't find any detailed explanation about column metric options on the official docs for Iceberg configuration URL: https://github.com/apache/iceberg/issues/8995
Re: [PR] API, Core: Add scan planning apis to REST Catalog [iceberg]
singhpk234 commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1798637028 ## core/src/main/java/org/apache/iceberg/rest/RESTFileScanTaskParser.java: ## @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.rest; + +import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.databind.JsonNode; +import java.io.IOException; +import java.util.List; +import org.apache.iceberg.BaseFileScanTask; +import org.apache.iceberg.DataFile; +import org.apache.iceberg.DeleteFile; +import org.apache.iceberg.FileScanTask; +import org.apache.iceberg.expressions.Expression; +import org.apache.iceberg.expressions.ExpressionParser; +import org.apache.iceberg.expressions.Expressions; +import org.apache.iceberg.expressions.ResidualEvaluator; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList; + +public class RESTFileScanTaskParser { + private static final String DATA_FILE = "data-file"; + private static final String DELETE_FILE_REFERENCES = "delete-file-references"; + private static final String RESIDUAL = "residual-filter"; + + private RESTFileScanTaskParser() {} + + public static void toJson( + FileScanTask fileScanTask, List deleteFiles, JsonGenerator generator) + throws IOException { +Preconditions.checkArgument(fileScanTask != null, "Invalid file scan task: null"); +Preconditions.checkArgument(generator != null, "Invalid JSON generator: null"); + +generator.writeStartObject(); +generator.writeFieldName(DATA_FILE); +RESTContentFileParser.toJson(fileScanTask.file(), generator); + +// TODO revisit this logic +if (deleteFiles != null) { + generator.writeArrayFieldStart(DELETE_FILE_REFERENCES); + for (int delIndex = 0; delIndex < deleteFiles.size(); delIndex++) { +generator.writeNumber(delIndex); + } + generator.writeEndArray(); +} +if (fileScanTask.residual() != null) { + generator.writeFieldName(RESIDUAL); + ExpressionParser.toJson(fileScanTask.residual(), generator); +} +generator.writeEndObject(); + } + + public static FileScanTask fromJson(JsonNode jsonNode, List allDeleteFiles) { +Preconditions.checkArgument(jsonNode != null, "Invalid JSON node for file scan task: null"); +Preconditions.checkArgument( +jsonNode.isObject(), "Invalid JSON node for file scan task: non-object (%s)", jsonNode); + +DataFile dataFile = (DataFile) RESTContentFileParser.fromJson(jsonNode.get(DATA_FILE)); + +DeleteFile[] matchedDeleteFiles = null; +List deleteFileReferences = null; +if (jsonNode.has(DELETE_FILE_REFERENCES)) { + ImmutableList.Builder deleteFileReferencesBuilder = ImmutableList.builder(); + JsonNode deletesArray = jsonNode.get(DELETE_FILE_REFERENCES); + for 
(JsonNode deleteRef : deletesArray) { +deleteFileReferencesBuilder.add(deleteRef); + } + deleteFileReferences = deleteFileReferencesBuilder.build(); +} + +if (deleteFileReferences != null) { + ImmutableList.Builder matchedDeleteFilesBuilder = ImmutableList.builder(); + for (Integer deleteFileIdx : deleteFileReferences) { +matchedDeleteFilesBuilder.add(allDeleteFiles.get(deleteFileIdx)); + } + matchedDeleteFiles = (DeleteFile[]) matchedDeleteFilesBuilder.build().stream().toArray(); +} + +// TODO revisit this in spec +Expression filter = Expressions.alwaysTrue(); +if (jsonNode.has(RESIDUAL)) { + filter = ExpressionParser.fromJson(jsonNode.get(RESIDUAL)); +} + +ResidualEvaluator residualEvaluator = ResidualEvaluator.of(filter); + +// TODO at the time of creation we dont have the schemaString and specString so can we avoid +// setting this +// will need to refresh before returning closed iterable of tasks, for now put place holder null +BaseFileScanTask baseFileScanTask = +new BaseFileScanTask(dataFile, matchedDeleteFiles, null, null, residualEvaluator); Review Comment: [doubt] These fileScanTasks can belong to diff snapshots (let's say we
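For readers skimming the quoted parser: the format under discussion writes each task's delete files as integer indices into a single plan-level delete-file list rather than repeating full delete-file metadata per task. A rough sketch of that indirection, mirroring the `fromJson` path above (`String` stands in for `DeleteFile`, and both method names are invented):

```java
import java.util.ArrayList;
import java.util.List;

public class DeleteFileReferenceSketch {
  // Encode a task's delete files as indices into the shared plan-level list.
  static List<Integer> encodeReferences(List<String> taskDeletes, List<String> allDeletes) {
    List<Integer> refs = new ArrayList<>();
    for (String deleteFile : taskDeletes) {
      refs.add(allDeletes.indexOf(deleteFile)); // position in the shared list
    }
    return refs;
  }

  // Decode indices back into delete files using the same shared list.
  static List<String> resolveReferences(List<Integer> refs, List<String> allDeletes) {
    List<String> resolved = new ArrayList<>();
    for (int idx : refs) {
      resolved.add(allDeletes.get(idx));
    }
    return resolved;
  }
}
```

Note that the quoted `toJson` currently writes the indices 0..n-1 of the task's own list and is flagged with a `// TODO revisit this logic` comment, so the exact encoding may still change before the PR merges.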
Re: [PR] API, Core: Add scan planning apis to REST Catalog [iceberg]
amogh-jahagirdar commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1798635808

## core/src/main/java/org/apache/iceberg/RESTTable.java:

@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.Map;
+import java.util.function.Supplier;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.metrics.MetricsReporter;
+import org.apache.iceberg.rest.RESTClient;
+import org.apache.iceberg.rest.ResourcePaths;
+
+public class RESTTable extends BaseTable {

Review Comment: Sounds good, I can understand the appeal of the `RESTTable` concept, especially with being able to override the operation implementations. I'm not very against it; I'm just trying to avoid any unnecessary public classes being exposed if there's some way we can handle the redirection to the REST implementation internally for those things you mentioned. I'd say let's stick with what you have for now; as we write tests and see more integration code with engines like Spark, we can determine the right pattern.
Re: [I] Can we make commits inside compaction jobs with partial-progress.enabled sequential to avoid CommitFailedException? [iceberg]
github-actions[bot] commented on issue #9687: URL: https://github.com/apache/iceberg/issues/9687#issuecomment-2409444642 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [I] java.lang.IllegalArgumentException: requirement failed: length (-6235972) cannot be smaller than -1 [iceberg]
github-actions[bot] commented on issue #9689: URL: https://github.com/apache/iceberg/issues/9689#issuecomment-2409444689 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [PR] Core: Pass input file into iterators to get the file name [iceberg]
github-actions[bot] commented on PR #9691: URL: https://github.com/apache/iceberg/pull/9691#issuecomment-2409444741 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [I] Docs: Go over docs to check rendering of pages/sections [iceberg]
github-actions[bot] commented on issue #9657: URL: https://github.com/apache/iceberg/issues/9657#issuecomment-2409444136 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [I] Docs: Add Mandarin translation of the docs site [iceberg]
github-actions[bot] commented on issue #9665: URL: https://github.com/apache/iceberg/issues/9665#issuecomment-2409444237 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [PR] Website: Add release schedule on the releases page [iceberg]
github-actions[bot] commented on PR #9666: URL: https://github.com/apache/iceberg/pull/9666#issuecomment-2409444288 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [I] Operations on partition columns in `WHERE` clause not used in pruning [iceberg]
github-actions[bot] commented on issue #9678: URL: https://github.com/apache/iceberg/issues/9678#issuecomment-240952 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [PR] Support for pushdown like filter (endsWith and contains) [iceberg]
github-actions[bot] commented on PR #9683: URL: https://github.com/apache/iceberg/pull/9683#issuecomment-2409444593 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [PR] Build: Bump junit from 5.10.1 to 5.10.2 [iceberg]
github-actions[bot] commented on PR #9699: URL: https://github.com/apache/iceberg/pull/9699#issuecomment-2409444871 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [I] HMS lock timeout [iceberg]
github-actions[bot] commented on issue #9654: URL: https://github.com/apache/iceberg/issues/9654#issuecomment-2409444091 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [PR] HIVE-28021: escape percent symbol [iceberg]
github-actions[bot] commented on PR #9667: URL: https://github.com/apache/iceberg/pull/9667#issuecomment-2409444347 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [I] Iceberg Rewrite DataFiles unmanageable behavior [iceberg]
github-actions[bot] commented on issue #9674: URL: https://github.com/apache/iceberg/issues/9674#issuecomment-240901 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Re: [PR] Flink: Made IcebergFilesCommitter work with single phase commit [iceberg]
github-actions[bot] commented on PR #9694: URL: https://github.com/apache/iceberg/pull/9694#issuecomment-2409444787 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [PR] Docs: Deprecate distinct_counts since it is no longer used in codebase [iceberg]
github-actions[bot] commented on PR #9680: URL: https://github.com/apache/iceberg/pull/9680#issuecomment-240998 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [PR] Fix header links with underscores in title [iceberg]
github-actions[bot] commented on PR #9697: URL: https://github.com/apache/iceberg/pull/9697#issuecomment-2409444828 This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the d...@iceberg.apache.org list. Thank you for your contributions.
Re: [PR] API, Core: Add scan planning apis to REST Catalog [iceberg]
amogh-jahagirdar commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1798639774

## core/src/main/java/org/apache/iceberg/RESTPlanningMode.java:

@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.Locale;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+
+public enum RESTPlanningMode {
+  REQUIRED("required"),
+  SUPPORTED("supported"),
+  UNSUPPORTED("unsupported");

Review Comment: Sounds good. After some more thought, my 2c is that I'd rather we try to get the model + client side changes into 1.7 rather than expand the scope with this aspect, since it's quite useful without these things defined. I'd rather not have client side changes depend on another spec change decision. For now, I think keeping it simple with a catalog client-side property controlling whether server-side planning is performed is the way forward. Once the model and client side changes are in, I think it'd make total sense to revisit the spec changes you mentioned and these specific planning mode changes. cc @rdblue @danielcweeks for their thoughts.
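To make the "catalog client-side property" idea concrete, it would presumably surface as an ordinary entry in the catalog properties map. The key `rest.planning-mode` and its values below are placeholders invented for illustration; nothing here is a committed Iceberg API.

```java
import java.util.Map;

public class PlanningModeSketch {
  public static void main(String[] args) {
    // Hypothetical client-side configuration; the property name and values are
    // placeholders, since the PR discussion had not finalized them.
    Map<String, String> catalogProps = Map.of(
        "uri", "https://example.com/iceberg",  // example REST catalog endpoint
        "rest.planning-mode", "supported");    // required | supported | unsupported

    boolean serverSidePlanning =
        !"unsupported".equals(catalogProps.getOrDefault("rest.planning-mode", "unsupported"));
    System.out.println("Use server-side scan planning: " + serverSidePlanning);
  }
}
```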
Re: [PR] Spark: Merge new position deletes with old deletes during writing [iceberg]
amogh-jahagirdar commented on code in PR #11273: URL: https://github.com/apache/iceberg/pull/11273#discussion_r1798640841

## spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchQueryScan.java:

@@ -158,6 +163,26 @@ public void filter(Predicate[] predicates) { } }
+  protected Map dataToFileScopedDeletes() {
+    if (dataToFileScopedDeletes == null) {
+      dataToFileScopedDeletes = Maps.newHashMap();
+      for (ScanTask task : tasks()) {
+        FileScanTask fileScanTask = task.asFileScanTask();
+        List fileScopedDeletes =
+            fileScanTask.deletes().stream()
+                .filter(file -> ContentFileUtil.referencedDataFileLocation(file) != null)

Review Comment: Agreed, added an `isFileScopedDelete` API to `ContentFileUtil`!
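For context on the review point: a delete file is "file scoped" when it applies to exactly one data file, which the quoted diff detects by checking `ContentFileUtil.referencedDataFileLocation(file) != null`. Below is a minimal sketch of the predicate being centralized; the nested `DeleteFile` record is a stand-in for the real `org.apache.iceberg.DeleteFile` type, and the class name is invented.

```java
import java.util.List;
import java.util.stream.Collectors;

public class FileScopedDeleteSketch {
  // Stand-in for org.apache.iceberg.DeleteFile: a file-scoped delete records
  // the location of the single data file it applies to, otherwise null.
  record DeleteFile(String path, String referencedDataFileLocation) {}

  static boolean isFileScopedDelete(DeleteFile file) {
    return file.referencedDataFileLocation() != null;
  }

  static List<DeleteFile> fileScopedOnly(List<DeleteFile> deletes) {
    return deletes.stream()
        .filter(FileScopedDeleteSketch::isFileScopedDelete)
        .collect(Collectors.toList());
  }
}
```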
Re: [PR] API, Core: Add scan planning apis to REST Catalog [iceberg]
rahil-c commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1798718031 ## core/src/main/java/org/apache/iceberg/rest/RESTContentFileParser.java: ## @@ -0,0 +1,250 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.rest; + +import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.databind.JsonNode; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import java.util.Map; +import org.apache.iceberg.ContentFile; +import org.apache.iceberg.DataFile; +import org.apache.iceberg.FileContent; +import org.apache.iceberg.FileFormat; +import org.apache.iceberg.GenericDataFile; +import org.apache.iceberg.GenericDeleteFile; +import org.apache.iceberg.Metrics; +import org.apache.iceberg.PartitionData; +import org.apache.iceberg.SingleValueParser; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.util.JsonUtil; + +public class RESTContentFileParser { + private static final String SPEC_ID = "spec-id"; + private static final String CONTENT = "content"; + private static final String FILE_PATH = "file-path"; + private static final String FILE_FORMAT = "file-format"; + private static final String PARTITION = "partition"; + private static final String RECORD_COUNT = "record-count"; + private static final String FILE_SIZE_IN_BYTES = "file-size-in-bytes"; + private static final String COLUMN_SIZES = "column-sizes"; + private static final String VALUE_COUNTS = "value-counts"; + private static final String NULL_VALUE_COUNTS = "null-value-counts"; + private static final String NAN_VALUE_COUNTS = "nan-value-counts"; + private static final String LOWER_BOUNDS = "lower-bounds"; + private static final String UPPER_BOUNDS = "upper-bounds"; + private static final String KEY_METADATA = "key-metadata"; + private static final String SPLIT_OFFSETS = "split-offsets"; + private static final String EQUALITY_IDS = "equality-ids"; + private static final String SORT_ORDER_ID = "sort-order-id"; + + private RESTContentFileParser() {} + + public static String toJson(ContentFile contentFile) { +return JsonUtil.generate( +generator -> RESTContentFileParser.toJson(contentFile, generator), false); + } + + public static void toJson(ContentFile contentFile, JsonGenerator generator) + throws IOException { +Preconditions.checkArgument(contentFile != null, "Invalid content file: null"); +Preconditions.checkArgument(generator != null, "Invalid JSON generator: null"); + +generator.writeStartObject(); + +generator.writeNumberField(SPEC_ID, contentFile.specId()); +generator.writeStringField(CONTENT, contentFile.content().name()); +generator.writeStringField(FILE_PATH, contentFile.path().toString()); 
+generator.writeStringField(FILE_FORMAT, contentFile.format().name()); + +generator.writeFieldName(PARTITION); + +// TODO at the time of serialization we dont have the partition spec we just have spec id. +// we will need to get the spec from table metadata using spec id. +// or we will need to send parition spec, put null here for now until refresh +SingleValueParser.toJson(null, contentFile.partition(), generator); + +generator.writeNumberField(FILE_SIZE_IN_BYTES, contentFile.fileSizeInBytes()); + +metricsToJson(contentFile, generator); + +if (contentFile.keyMetadata() != null) { + generator.writeFieldName(KEY_METADATA); + SingleValueParser.toJson(DataFile.KEY_METADATA.type(), contentFile.keyMetadata(), generator); +} + +if (contentFile.splitOffsets() != null) { + JsonUtil.writeLongArray(SPLIT_OFFSETS, contentFile.splitOffsets(), generator); +} + +if (contentFile.equalityFieldIds() != null) { + JsonUtil.writeIntegerArray(EQUALITY_IDS, contentFile.equalityFieldIds(), generator); +} + +if (contentFile.sortOrderId() != null) { + generator.writeNumberField(SORT_ORDER_ID, contentFile.sortOrderId()); +} + +generator.writeEndObject(); + } + + public static ContentFile fromJson(JsonNode jsonNode) { +Preconditions.checkArgument(jsonNode != null, "Invali
Re: [PR] API, Core: Add scan planning apis to REST Catalog [iceberg]
singhpk234 commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1798744690 ## core/src/main/java/org/apache/iceberg/rest/RESTContentFileParser.java: ## @@ -0,0 +1,250 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.rest; + +import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.databind.JsonNode; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.util.List; +import java.util.Map; +import org.apache.iceberg.ContentFile; +import org.apache.iceberg.DataFile; +import org.apache.iceberg.FileContent; +import org.apache.iceberg.FileFormat; +import org.apache.iceberg.GenericDataFile; +import org.apache.iceberg.GenericDeleteFile; +import org.apache.iceberg.Metrics; +import org.apache.iceberg.PartitionData; +import org.apache.iceberg.SingleValueParser; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.util.JsonUtil; + +public class RESTContentFileParser { + private static final String SPEC_ID = "spec-id"; + private static final String CONTENT = "content"; + private static final String FILE_PATH = "file-path"; + private static final String FILE_FORMAT = "file-format"; + private static final String PARTITION = "partition"; + private static final String RECORD_COUNT = "record-count"; + private static final String FILE_SIZE_IN_BYTES = "file-size-in-bytes"; + private static final String COLUMN_SIZES = "column-sizes"; + private static final String VALUE_COUNTS = "value-counts"; + private static final String NULL_VALUE_COUNTS = "null-value-counts"; + private static final String NAN_VALUE_COUNTS = "nan-value-counts"; + private static final String LOWER_BOUNDS = "lower-bounds"; + private static final String UPPER_BOUNDS = "upper-bounds"; + private static final String KEY_METADATA = "key-metadata"; + private static final String SPLIT_OFFSETS = "split-offsets"; + private static final String EQUALITY_IDS = "equality-ids"; + private static final String SORT_ORDER_ID = "sort-order-id"; + + private RESTContentFileParser() {} + + public static String toJson(ContentFile contentFile) { +return JsonUtil.generate( +generator -> RESTContentFileParser.toJson(contentFile, generator), false); + } + + public static void toJson(ContentFile contentFile, JsonGenerator generator) + throws IOException { +Preconditions.checkArgument(contentFile != null, "Invalid content file: null"); +Preconditions.checkArgument(generator != null, "Invalid JSON generator: null"); + +generator.writeStartObject(); + +generator.writeNumberField(SPEC_ID, contentFile.specId()); +generator.writeStringField(CONTENT, contentFile.content().name()); +generator.writeStringField(FILE_PATH, contentFile.path().toString()); 
+generator.writeStringField(FILE_FORMAT, contentFile.format().name()); + +generator.writeFieldName(PARTITION); + +// TODO at the time of serialization we dont have the partition spec we just have spec id. +// we will need to get the spec from table metadata using spec id. +// or we will need to send parition spec, put null here for now until refresh +SingleValueParser.toJson(null, contentFile.partition(), generator); + +generator.writeNumberField(FILE_SIZE_IN_BYTES, contentFile.fileSizeInBytes()); + +metricsToJson(contentFile, generator); + +if (contentFile.keyMetadata() != null) { + generator.writeFieldName(KEY_METADATA); + SingleValueParser.toJson(DataFile.KEY_METADATA.type(), contentFile.keyMetadata(), generator); +} + +if (contentFile.splitOffsets() != null) { + JsonUtil.writeLongArray(SPLIT_OFFSETS, contentFile.splitOffsets(), generator); +} + +if (contentFile.equalityFieldIds() != null) { + JsonUtil.writeIntegerArray(EQUALITY_IDS, contentFile.equalityFieldIds(), generator); +} + +if (contentFile.sortOrderId() != null) { + generator.writeNumberField(SORT_ORDER_ID, contentFile.sortOrderId()); +} + +generator.writeEndObject(); + } + + public static ContentFile fromJson(JsonNode jsonNode) { +Preconditions.checkArgument(jsonNode != null, "Inv
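Piecing together the field-name constants in the quoted parser, a serialized data file would presumably look like the JSON below. All values are invented, `partition` is still a TODO in the PR (serialized against a null type for now), and the final wire shape may differ.

```java
public class ContentFileJsonSketch {
  // Hypothetical payload assembled from the parser's field-name constants;
  // values are made up for illustration.
  static final String EXAMPLE_DATA_FILE =
      "{"
          + "\"spec-id\": 0,"
          + "\"content\": \"DATA\","
          + "\"file-path\": \"s3://bucket/db/table/data/00000-0.parquet\","
          + "\"file-format\": \"PARQUET\","
          + "\"partition\": null,"
          + "\"record-count\": 100,"
          + "\"file-size-in-bytes\": 1024,"
          + "\"split-offsets\": [4],"
          + "\"sort-order-id\": 0"
          + "}";

  public static void main(String[] args) {
    System.out.println(EXAMPLE_DATA_FILE);
  }
}
```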
Re: [PR] API, Core: Add scan planning apis to REST Catalog [iceberg]
singhpk234 commented on code in PR #11180: URL: https://github.com/apache/iceberg/pull/11180#discussion_r1798744690

## core/src/main/java/org/apache/iceberg/rest/RESTContentFileParser.java:
## @@ -0,0 +1,250 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.rest;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.ContentFile;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.FileContent;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.GenericDataFile;
+import org.apache.iceberg.GenericDeleteFile;
+import org.apache.iceberg.Metrics;
+import org.apache.iceberg.PartitionData;
+import org.apache.iceberg.SingleValueParser;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.util.JsonUtil;
+
+public class RESTContentFileParser {
+  private static final String SPEC_ID = "spec-id";
+  private static final String CONTENT = "content";
+  private static final String FILE_PATH = "file-path";
+  private static final String FILE_FORMAT = "file-format";
+  private static final String PARTITION = "partition";
+  private static final String RECORD_COUNT = "record-count";
+  private static final String FILE_SIZE_IN_BYTES = "file-size-in-bytes";
+  private static final String COLUMN_SIZES = "column-sizes";
+  private static final String VALUE_COUNTS = "value-counts";
+  private static final String NULL_VALUE_COUNTS = "null-value-counts";
+  private static final String NAN_VALUE_COUNTS = "nan-value-counts";
+  private static final String LOWER_BOUNDS = "lower-bounds";
+  private static final String UPPER_BOUNDS = "upper-bounds";
+  private static final String KEY_METADATA = "key-metadata";
+  private static final String SPLIT_OFFSETS = "split-offsets";
+  private static final String EQUALITY_IDS = "equality-ids";
+  private static final String SORT_ORDER_ID = "sort-order-id";
+
+  private RESTContentFileParser() {}
+
+  public static String toJson(ContentFile contentFile) {
+    return JsonUtil.generate(
+        generator -> RESTContentFileParser.toJson(contentFile, generator), false);
+  }
+
+  public static void toJson(ContentFile contentFile, JsonGenerator generator)
+      throws IOException {
+    Preconditions.checkArgument(contentFile != null, "Invalid content file: null");
+    Preconditions.checkArgument(generator != null, "Invalid JSON generator: null");
+
+    generator.writeStartObject();
+
+    generator.writeNumberField(SPEC_ID, contentFile.specId());
+    generator.writeStringField(CONTENT, contentFile.content().name());
+    generator.writeStringField(FILE_PATH, contentFile.path().toString());
+    generator.writeStringField(FILE_FORMAT, contentFile.format().name());
+
+    generator.writeFieldName(PARTITION);
+
+    // TODO: at the time of serialization we don't have the partition spec, we just have
+    // the spec id. We will need to get the spec from table metadata using the spec id,
+    // or we will need to send the partition spec; put null here for now until refresh.
+    SingleValueParser.toJson(null, contentFile.partition(), generator);
+
+    generator.writeNumberField(FILE_SIZE_IN_BYTES, contentFile.fileSizeInBytes());
+
+    metricsToJson(contentFile, generator);
+
+    if (contentFile.keyMetadata() != null) {
+      generator.writeFieldName(KEY_METADATA);
+      SingleValueParser.toJson(DataFile.KEY_METADATA.type(), contentFile.keyMetadata(), generator);
+    }
+
+    if (contentFile.splitOffsets() != null) {
+      JsonUtil.writeLongArray(SPLIT_OFFSETS, contentFile.splitOffsets(), generator);
+    }
+
+    if (contentFile.equalityFieldIds() != null) {
+      JsonUtil.writeIntegerArray(EQUALITY_IDS, contentFile.equalityFieldIds(), generator);
+    }
+
+    if (contentFile.sortOrderId() != null) {
+      generator.writeNumberField(SORT_ORDER_ID, contentFile.sortOrderId());
+    }
+
+    generator.writeEndObject();
+  }
+
+  public static ContentFile fromJson(JsonNode jsonNode) {
+    Preconditions.checkArgument(jsonNode != null, "Inv
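For reference, a minimal round-trip sketch against the parser in the diff above (editorially added, not part of the PR; it assumes an already-built `DataFile` instance and that `org.apache.iceberg.util.JsonUtil.mapper()` exposes the shared Jackson `ObjectMapper`):

```java
import java.io.IOException;
import org.apache.iceberg.ContentFile;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.rest.RESTContentFileParser;
import org.apache.iceberg.util.JsonUtil;

// Sketch only: serialize a ContentFile with the parser above, then parse the
// resulting JSON back. `dataFile` is an assumed, already-built DataFile.
static ContentFile roundTrip(DataFile dataFile) throws IOException {
  String json = RESTContentFileParser.toJson(dataFile);
  return RESTContentFileParser.fromJson(JsonUtil.mapper().readTree(json));
}
```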
Re: [PR] Arrow: Deprecate unused fixed width binary reader classes [iceberg]
nastra merged PR #11292: URL: https://github.com/apache/iceberg/pull/11292
Re: [PR] Spec v3: Add deletion vectors to the table spec [iceberg]
wgtmac commented on code in PR #11240: URL: https://github.com/apache/iceberg/pull/11240#discussion_r1798786736

## format/spec.md:
## @@ -454,35 +457,40 @@ The schema of a manifest file is a struct called `manifest_entry` with the following fields:

`data_file` is a struct with the following fields:

-| v1 | v2 | Field id, name | Type | Description |
-| -- | -- | --- | --- | --- |
-| | _required_ | **`134 content`** | `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) |
-| _required_ | _required_ | **`100 file_path`** | `string` | Full URI for the file with FS scheme |
-| _required_ | _required_ | **`101 file_format`** | `string` | String file format name, avro, orc or parquet |
-| _required_ | _required_ | **`102 partition`** | `struct<...>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |
-| _required_ | _required_ | **`103 record_count`** | `long` | Number of records in this file |
-| _required_ | _required_ | **`104 file_size_in_bytes`** | `long` | Total file size in bytes |
-| _required_ | | ~~**`105 block_size_in_bytes`**~~ | `long` | **Deprecated. Always write a default in v1. Do not write in v2.** |
-| _optional_ | | ~~**`106 file_ordinal`**~~ | `int` | **Deprecated. Do not write.** |
-| _optional_ | | ~~**`107 sort_columns`**~~ | `list<112: int>` | **Deprecated. Do not write.** |
-| _optional_ | _optional_ | **`108 column_sizes`** | `map<117: int, 118: long>` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) |
-| _optional_ | _optional_ | **`109 value_counts`** | `map<119: int, 120: long>` | Map from column id to number of values in the column (including null and NaN values) |
-| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column |
-| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column |
-| _optional_ | _optional_ | **`111 distinct_counts`** | `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts |
-| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] |
-| _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-NaN values in the column for the file [2] |
-| _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption |
-| _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>` | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending |
-| | _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file |
-| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. |
+| v1 | v2 | v3 | Field id, name | Type | Description |
+| -- | -- | -- | --- | --- | --- |
+| | _required_ | _required_ | **`134 content`** | `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) |
+| _required_ | _required_ | _requi
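As an editorial reading aid: the field ids in the table above correspond to `ContentFile` accessors in the Java API, the same ones used by the `RESTContentFileParser` diff quoted earlier in this digest. A minimal sketch, assuming an already-built `DataFile`:

```java
import java.util.List;
import java.util.Map;
import org.apache.iceberg.DataFile;

// Sketch only: reading a few data_file fields from the table above through
// the Java ContentFile API.
static void printFileStats(DataFile dataFile) {
  long recordCount = dataFile.recordCount();                  // 103 record_count
  long sizeInBytes = dataFile.fileSizeInBytes();              // 104 file_size_in_bytes
  Map<Integer, Long> nullCounts = dataFile.nullValueCounts(); // 110 null_value_counts
  List<Long> splitOffsets = dataFile.splitOffsets();          // 132 split_offsets (nullable)
  System.out.printf("%d records, %d bytes, offsets=%s, null counts for %d columns%n",
      recordCount, sizeInBytes, splitOffsets, nullCounts == null ? 0 : nullCounts.size());
}
```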
Re: [I] If we replaced or dropped partition spec field and drop the corresponding column, we can't select table again [iceberg]
bknbkn commented on issue #11314: URL: https://github.com/apache/iceberg/issues/11314#issuecomment-2409677616

The reason for this problem seems to be that each spec is resolved against the latest schema, and historical specs may not be able to find their fields in the latest schema. I think it is necessary to persist the schema id into each spec; based on that, each PartitionSpec can find the schema it was generated from.
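Editorial sketch of that idea (not from the issue): resolve each historical `PartitionSpec` against the schema it was created with instead of the table's latest schema. `Table.schemas()` and `Table.specs()` exist in the Java API; `spec.schemaId()` is hypothetical and stands in for the proposed persisted schema id:

```java
import org.apache.iceberg.PartitionField;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.types.Types;

// Sketch only: spec.schemaId() does not exist today; it illustrates the
// proposal to persist a schema id on each partition spec.
static void resolveSpecFields(Table table, PartitionSpec spec) {
  Schema boundSchema = table.schemas().get(spec.schemaId()); // hypothetical accessor
  for (PartitionField field : spec.fields()) {
    // a source field dropped from the latest schema is still present here
    Types.NestedField source = boundSchema.findField(field.sourceId());
  }
}
```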
Re: [PR] docs: README uses iceberg-rust instead of we [iceberg-rust]
Xuanwo merged PR #667: URL: https://github.com/apache/iceberg-rust/pull/667
Re: [PR] chore(deps): Bump crate-ci/typos from 1.25.0 to 1.26.0 [iceberg-rust]
Xuanwo merged PR #668: URL: https://github.com/apache/iceberg-rust/pull/668
Re: [PR] Spec: Adds Row Lineage [iceberg]
wgtmac commented on code in PR #11130: URL: https://github.com/apache/iceberg/pull/11130#discussion_r1798768067

## format/spec.md:
## @@ -598,6 +702,14 @@ Notes:

1. Lower and upper bounds are serialized to bytes using the single-object serialization in Appendix D. The type used to encode the value is the type of the partition field data.
2. If -0.0 is a value of the partition field, the `lower_bound` must not be +0.0, and if +0.0 is a value of the partition field, the `upper_bound` must not be -0.0.

+First Row ID Assignment
+
+Row ID inheritance is used when row lineage is enabled. When not enabled, a manifest's `first_row_id` must always be set to `null`. The rest of this section applies when row lineage is enabled.

Review Comment: A related question: if we revert the table to a snapshot before enabling row lineage, should we disable row lineage? If not, what about `next_row_id`?
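Editorial illustration of the quoted rule (every name below is hypothetical; this is not the spec's or Iceberg's Java API):

```java
// Illustrative only: per the quoted text, a manifest's first_row_id must be
// written as null whenever row lineage is not enabled.
Long firstRowId = rowLineageEnabled
    ? nextRowId   // assumed running row-id counter from table metadata
    : null;       // spec: first_row_id must always be null when lineage is off
manifestWriter.setFirstRowId(firstRowId); // hypothetical writer hook
```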
Re: [PR] OpenAPI: Standardize credentials in loadTable/loadView responses [iceberg]
nastra commented on code in PR #10722: URL: https://github.com/apache/iceberg/pull/10722#discussion_r1798841576

## open-api/rest-catalog-open-api.yaml:
## @@ -3129,6 +3145,11 @@ components:

- `s3.secret-access-key`: secret for credentials that provide access to data in S3
- `s3.session-token`: if present, this value should be used as the session token
- `s3.remote-signing-enabled`: if `true`, remote signing should be performed as described in the `s3-signer-open-api.yaml` specification
+
+## Storage Credentials
+
+Credentials for ADLS / GCS / S3 / ... are provided through the `storage-credentials` field.

Review Comment: Yes, exactly: we're trying to move the docs to this new credentials section.
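Editorial sketch (not from the PR) of how a client might consume the new field; the per-credential shape, a `prefix` plus a `config` map, is an assumption drawn from this discussion rather than quoted spec text:

```java
import java.util.List;
import java.util.Map;

// Sketch only: pick the storage credential whose prefix matches a file
// location. The StorageCredential shape (prefix + config) is assumed.
record StorageCredential(String prefix, Map<String, String> config) {}

static Map<String, String> configFor(List<StorageCredential> credentials, String location) {
  return credentials.stream()
      .filter(c -> location.startsWith(c.prefix()))
      .findFirst()
      .map(StorageCredential::config)
      .orElse(Map.of()); // fall back to table-level config properties
}
```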
Re: [PR] Api, Spark: Make StrictMetricsEvaluator not fail on nested column predicates [iceberg]
nastra merged PR #11261: URL: https://github.com/apache/iceberg/pull/11261
Re: [I] Cannot delete column with nested field filter [iceberg]
nastra closed issue #7065: Cannot delete column with nested field filter URL: https://github.com/apache/iceberg/issues/7065