[PR] build(deps): bump github.com/aws/aws-sdk-go-v2 from 1.21.2 to 1.22.1 [iceberg-go]
dependabot[bot] opened a new pull request, #27: URL: https://github.com/apache/iceberg-go/pull/27

Bumps [github.com/aws/aws-sdk-go-v2](https://github.com/aws/aws-sdk-go-v2) from 1.21.2 to 1.22.1.

Commits
- [ee5e3f0](https://github.com/aws/aws-sdk-go-v2/commit/ee5e3f05637540596cc7aab1359742000a8d533a) Release 2023-11-01
- [b65c226](https://github.com/aws/aws-sdk-go-v2/commit/b65c226f47aa1f837699664bdc65c3c3e3611765) Regenerated Clients
- [7a194b9](https://github.com/aws/aws-sdk-go-v2/commit/7a194b9b0344774a5af100d11ea2066c5b0cf234) Update API model
- [0cb924a](https://github.com/aws/aws-sdk-go-v2/commit/0cb924a0007bc681d12f382a604368e0660827ee) Add support for configured endpoints. ([#2328](https://redirect.github.com/aws/aws-sdk-go-v2/issues/2328))
- [61039fe](https://github.com/aws/aws-sdk-go-v2/commit/61039fea9cc9e080c53382850c87685b5406fd68) Release 2023-10-31
- [797e056](https://github.com/aws/aws-sdk-go-v2/commit/797e0560769725635218fc30a2554c1bbaccc01b) Regenerated Clients
- [822585d](https://github.com/aws/aws-sdk-go-v2/commit/822585d3f621a7c5844584d8e471c32f852702aa) Update SDK's smithy-go dependency to v1.16.0
- [abf753d](https://github.com/aws/aws-sdk-go-v2/commit/abf753db747dd256f3ee69712a19d1d3dc681f23) Update API model
- [99861c0](https://github.com/aws/aws-sdk-go-v2/commit/99861c071109ce5ee4f1cb3b72ead2062b3bd86c) lang: bump minimum go version to 1.19 ([#2338](https://redirect.github.com/aws/aws-sdk-go-v2/issues/2338))
- [2ac0a53](https://github.com/aws/aws-sdk-go-v2/commit/2ac0a53ac45acaadc4526fd25b643dc46032b02a) Release 2023-10-30

Additional commits viewable in the [compare view](https://github.com/aws/aws-sdk-go-v2/compare/v1.21.2...v1.22.1).

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org. For queries about this service, please contact Infrastructure at: us...@infra.apache.org. For additional commands, e-mail: issues-h...@iceberg.apache.org.
[PR] build(deps): bump github.com/hamba/avro/v2 from 2.16.0 to 2.17.1 [iceberg-go]
dependabot[bot] opened a new pull request, #28: URL: https://github.com/apache/iceberg-go/pull/28

Bumps [github.com/hamba/avro/v2](https://github.com/hamba/avro) from 2.16.0 to 2.17.1.

Release notes (sourced from [github.com/hamba/avro/v2's releases](https://github.com/hamba/avro/releases)):

v2.17.1
- fix: issue with dereferencing schemas by [@nrwiersma](https://github.com/nrwiersma) in [hamba/avro#319](https://redirect.github.com/hamba/avro/pull/319)
- Full Changelog: https://github.com/hamba/avro/compare/v2.17.0...v2.17.1

v2.17.0
- Allow tag style "original" for additional tags by [@founderio](https://github.com/founderio) in [hamba/avro#313](https://redirect.github.com/hamba/avro/pull/313)
- Added Types methods to Protocol by [@EliaBracciSumo](https://github.com/EliaBracciSumo) in [hamba/avro#315](https://redirect.github.com/hamba/avro/pull/315)
- New contributors: @founderio made their first contribution in hamba/avro#313; @EliaBracciSumo made their first contribution in hamba/avro#315
- Full Changelog: https://github.com/hamba/avro/compare/v2.16.0...v2.17.0

Commits
- [0429db3](https://github.com/hamba/avro/commit/0429db3bae0390938223d14e8b36737b5fb3ef3c) fix: issue with dereferencing schemas ([#319](https://redirect.github.com/hamba/avro/issues/319))
- [50a7897](https://github.com/hamba/avro/commit/50a7897f6ce66c9f9907128355c618468578bd2b) feat: added Types methods to Protocol ([#315](https://redirect.github.com/hamba/avro/issues/315))
- [3ac44d5](https://github.com/hamba/avro/commit/3ac44d5d4fbed8a582e47f6ba91a65bbc23fe5bd) feat: allow tag style "original" for additional tags ([#313](https://redirect.github.com/hamba/avro/issues/313))
- [00fb9ac](https://github.com/hamba/avro/commit/00fb9ace37cb66d12f28cc09fe4f58089574) chore: add dependency groups ([#312](https://redirect.github.com/hamba/avro/issues/312))

See full diff in the [compare view](https://github.com/hamba/avro/compare/v2.16.0...v2.17.1).

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. The usual Dependabot commands and options apply (see PR #27 above).
[PR] build(deps): bump github.com/google/uuid from 1.3.1 to 1.4.0 [iceberg-go]
dependabot[bot] opened a new pull request, #29: URL: https://github.com/apache/iceberg-go/pull/29

Bumps [github.com/google/uuid](https://github.com/google/uuid) from 1.3.1 to 1.4.0.

Release notes (sourced from [github.com/google/uuid's releases](https://github.com/google/uuid/releases); the [changelog](https://github.com/google/uuid/blob/master/CHANGELOG.md) entry is identical):

[1.4.0](https://github.com/google/uuid/compare/v1.3.1...v1.4.0) (2023-10-26)

Features
- UUIDs slice type with Strings() convenience method ([#133](https://redirect.github.com/google/uuid/issues/133)) ([cd5fbbd](https://github.com/google/uuid/commit/cd5fbbdd02f3e3467ac18940e07e062be1f864b4))

Fixes
- Clarify that Parse's job is to parse but not necessarily validate strings. (Documents current behavior)

Commits
- [8de8764](https://github.com/google/uuid/commit/8de8764e294f072b7a2f1a209e88fdcdb1ebc875) chore(master): release 1.4.0 ([#134](https://redirect.github.com/google/uuid/issues/134))
- [7c22e97](https://github.com/google/uuid/commit/7c22e97ff7647f3b21c3e0870ab335c3889de467) Clarify the documentation of Parse to state its job is to parse, not validate...
- [cd5fbbd](https://github.com/google/uuid/commit/cd5fbbdd02f3e3467ac18940e07e062be1f864b4) feat: UUIDs slice type with Strings() convenience method ([#133](https://redirect.github.com/google/uuid/issues/133))
- [47f5b39](https://github.com/google/uuid/commit/47f5b3936c94efb365bdfc62716912ed9e66326f) docs: fix a typo in CONTRIBUTING.md ([#130](https://redirect.github.com/google/uuid/issues/130))
- [542ddab](https://github.com/google/uuid/commit/542ddabd47d7bfa79359b7b4e2af7f975354e35f) chore(tests): add Fuzz tests ([#128](https://redirect.github.com/google/uuid/issues/128))
- [06716f6](https://github.com/google/uuid/commit/06716f6a60da5ba158f1d53a8236a534968ff76e) chore(tests): Add json.Unmarshal test with empty value cases ([#116](https://redirect.github.com/google/uuid/issues/116))

See full diff in the [compare view](https://github.com/google/uuid/compare/v1.3.1...v1.4.0).

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. The usual Dependabot commands and options apply (see PR #27 above).
[PR] build(deps): bump github.com/wolfeidau/s3iofs from 1.3.0 to 1.3.1 [iceberg-go]
dependabot[bot] opened a new pull request, #31: URL: https://github.com/apache/iceberg-go/pull/31

Bumps [github.com/wolfeidau/s3iofs](https://github.com/wolfeidau/s3iofs) from 1.3.0 to 1.3.1.

Release notes (sourced from [github.com/wolfeidau/s3iofs's releases](https://github.com/wolfeidau/s3iofs/releases)):

v1.3.1
- docs(README): add some badges with a godoc link by [@wolfeidau](https://github.com/wolfeidau) in [wolfeidau/s3iofs#14](https://redirect.github.com/wolfeidau/s3iofs/pull/14)
- feat(testing): increase integration test coverage :rocket: by @wolfeidau in [wolfeidau/s3iofs#15](https://redirect.github.com/wolfeidau/s3iofs/pull/15)
- feat(tests): added flags for vscode to ensure integration test coverage works by @wolfeidau in [wolfeidau/s3iofs#16](https://redirect.github.com/wolfeidau/s3iofs/pull/16)
- chore(deps): upgrade go deps by @wolfeidau in [wolfeidau/s3iofs#19](https://redirect.github.com/wolfeidau/s3iofs/pull/19)
- chore(deps): upgrade go deps for integration tests by @wolfeidau in [wolfeidau/s3iofs#20](https://redirect.github.com/wolfeidau/s3iofs/pull/20)
- Full Changelog: https://github.com/wolfeidau/s3iofs/compare/v1.3.0...v1.3.1

Commits
- [7107882](https://github.com/wolfeidau/s3iofs/commit/710788272cd775c490622c9fd2d56a25ea138929) Merge pull request [#20](https://redirect.github.com/wolfeidau/s3iofs/issues/20) from wolfeidau/chore_upgrade_integration_deps
- [8e14816](https://github.com/wolfeidau/s3iofs/commit/8e14816297b4761912d1f65d7b25ffa5145d1a41) chore(deps): upgrade go deps for integration tests
- [8737876](https://github.com/wolfeidau/s3iofs/commit/87378762a59e2b2ec85822e8e217a4322771db39) Merge pull request [#19](https://redirect.github.com/wolfeidau/s3iofs/issues/19) from wolfeidau/chore_oct_dep_upgrades
- [5bcee15](https://github.com/wolfeidau/s3iofs/commit/5bcee15b28710992fea999ddacd931e206eccef2) chore(deps): upgrade go deps
- [ba8909f](https://github.com/wolfeidau/s3iofs/commit/ba8909f07876d88ae05ae3cec4756736bf185371) Merge pull request [#16](https://redirect.github.com/wolfeidau/s3iofs/issues/16) from wolfeidau/feat_vscode_test_coverage
- [385abc4](https://github.com/wolfeidau/s3iofs/commit/385abc4f78bff56a39ff673baaf8f306b19cfd40) feat(tests): added flags for vscode to ensure integration test coverage works
- [405c842](https://github.com/wolfeidau/s3iofs/commit/405c8424b0cc3a8e5b65a870f9d5092188920b44) Merge pull request [#15](https://redirect.github.com/wolfeidau/s3iofs/issues/15) from wolfeidau/feat_testing
- [144df58](https://github.com/wolfeidau/s3iofs/commit/144df5813d4d373436dcced549f49dd82f4afffe) feat(testing): increase integration test coverage :rocket:
- [2281ace](https://github.com/wolfeidau/s3iofs/commit/2281acecd4ee81ba62fed349555642fb4399e0e1) Merge pull request [#14](https://redirect.github.com/wolfeidau/s3iofs/issues/14) from wolfeidau/docs_readme
- [b9f30f1](https://github.com/wolfeidau/s3iofs/commit/b9f30f1374dd7e1ba2f940e7058887f5c99b9874) docs(README): add some badges with a godoc link

See full diff in the [compare view](https://github.com/wolfeidau/s3iofs/compare/v1.3.0...v1.3.1).

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. The usual Dependabot commands and options apply (see PR #27 above).
[PR] build(deps): bump github.com/aws/aws-sdk-go-v2/config from 1.19.1 to 1.22.0 [iceberg-go]
dependabot[bot] opened a new pull request, #30: URL: https://github.com/apache/iceberg-go/pull/30

Bumps [github.com/aws/aws-sdk-go-v2/config](https://github.com/aws/aws-sdk-go-v2) from 1.19.1 to 1.22.0.

Commits
- [61039fe](https://github.com/aws/aws-sdk-go-v2/commit/61039fea9cc9e080c53382850c87685b5406fd68) Release 2023-10-31
- [797e056](https://github.com/aws/aws-sdk-go-v2/commit/797e0560769725635218fc30a2554c1bbaccc01b) Regenerated Clients
- [822585d](https://github.com/aws/aws-sdk-go-v2/commit/822585d3f621a7c5844584d8e471c32f852702aa) Update SDK's smithy-go dependency to v1.16.0
- [abf753d](https://github.com/aws/aws-sdk-go-v2/commit/abf753db747dd256f3ee69712a19d1d3dc681f23) Update API model
- [99861c0](https://github.com/aws/aws-sdk-go-v2/commit/99861c071109ce5ee4f1cb3b72ead2062b3bd86c) lang: bump minimum go version to 1.19 ([#2338](https://redirect.github.com/aws/aws-sdk-go-v2/issues/2338))
- [2ac0a53](https://github.com/aws/aws-sdk-go-v2/commit/2ac0a53ac45acaadc4526fd25b643dc46032b02a) Release 2023-10-30
- [c10aa0a](https://github.com/aws/aws-sdk-go-v2/commit/c10aa0ad45a155d7a6a9968894aed0d8e1cb4e81) Regenerated Clients
- [9c456c1](https://github.com/aws/aws-sdk-go-v2/commit/9c456c10923952d6bd1d7d59ded3d70588e1ff36) Update API model
- [3cb5dc1](https://github.com/aws/aws-sdk-go-v2/commit/3cb5dc1d777c4e28cd360728c45e8b5aa2a7b2b0) Release 2023-10-27
- [9b3ad7b](https://github.com/aws/aws-sdk-go-v2/commit/9b3ad7b1e6ce72730896fe7c9d165543ff158ed3) Regenerated Clients

Additional commits viewable in the [compare view](https://github.com/aws/aws-sdk-go-v2/compare/v1.19.1...v1.22.0).

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. The usual Dependabot commands and options apply (see PR #27 above).
Re: [I] Substitute in-memory data struct's timestamp type with DateTime rather than i64 to simplify usage. [iceberg-rust]
my-vegetable-has-exploded commented on issue #90: URL: https://github.com/apache/iceberg-rust/issues/90#issuecomment-1793775352

I'd like to have a try.
Re: [I] Substitute in-memory data struct's timestamp type with DateTime rather than i64 to simplify usage. [iceberg-rust]
liurenjie1024 commented on issue #90: URL: https://github.com/apache/iceberg-rust/issues/90#issuecomment-1793783924

> I'd like to have a try.

Sure, welcome to contribute!
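The issue proposes storing timestamps as a date-time type instead of raw i64 microseconds. As a rough illustration (in Python rather than Rust, with a hypothetical helper name; the Rust change would use a DateTime type from e.g. chrono), the conversion that the new type would encapsulate looks like:

```python
from datetime import datetime, timezone

# Hypothetical helper for illustration only: wraps the i64-microseconds-to-
# date-time conversion that callers currently have to do by hand.
def micros_to_datetime(micros: int) -> datetime:
    """Convert an Iceberg microsecond timestamp to a timezone-aware datetime."""
    return datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)

# 2023-11-01T00:00:00 UTC expressed as microseconds since the Unix epoch
print(micros_to_datetime(1_698_796_800_000_000).isoformat())
```

With the change, in-memory structs would hold the rich type directly, so this conversion happens once at the serialization boundary instead of at every call site.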
Re: [PR] Replace black by Ruff Formatter [iceberg-python]
rdblue commented on PR #127: URL: https://github.com/apache/iceberg-python/pull/127#issuecomment-1793797091

Looks fine overall, but it seems like too many changes with string normalization. Why force string normalization? That's going to cause a ton of pull requests to fail formatting validation.
Re: [PR] Support of before and after actions in preorderschema traversal [iceberg-python]
rdblue commented on PR #42: URL: https://github.com/apache/iceberg-python/pull/42#issuecomment-1793797992

Maybe it's me, but I don't understand the value of adding before and after callbacks to this visitor. A node's children are traversed when the future is called and that allows you to do whatever you want before and after further schema traversal. I think it makes more sense to consolidate the logic in the usual methods rather than use callbacks in this case.
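The pattern rdblue describes can be sketched as follows (minimal hypothetical types, not the actual pyiceberg visitor API): the visitor receives a callable that traverses the node's children, so any "before" and "after" work can simply wrap that call without dedicated callbacks.

```python
from typing import Callable, List, Optional

class Node:
    def __init__(self, name: str, children: Optional[List["Node"]] = None) -> None:
        self.name = name
        self.children = children or []

def visit(node: Node, visitor: Callable[[Node, Callable[[], None]], None]) -> None:
    def traverse_children() -> None:
        for child in node.children:
            visit(child, visitor)
    # The visitor decides when (and whether) to descend by calling the callable.
    visitor(node, traverse_children)

events: List[str] = []

def record(node: Node, traverse: Callable[[], None]) -> None:
    events.append(f"before {node.name}")  # pre-order work happens here
    traverse()                            # children are visited inside this call
    events.append(f"after {node.name}")   # post-traversal work happens here

visit(Node("root", [Node("a"), Node("b")]), record)
print(events)
```

Because `traverse()` is an explicit call, the same method body already has natural "before" and "after" positions, which is why separate callback hooks add little.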
Re: [PR] Support of before and after actions in preorderschema traversal [iceberg-python]
MehulBatra commented on PR #42: URL: https://github.com/apache/iceberg-python/pull/42#issuecomment-1793802279

> Maybe it's me, but I don't understand the value of adding before and after callbacks to this visitor. A node's children are traversed when the future is called and that allows you to do whatever you want before and after further schema traversal. I think it makes more sense to consolidate the logic in the usual methods rather than use callbacks in this case.

It's still half-baked; I am working on it. Thanks for the feedback, I will take that into consideration.
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
rdblue commented on code in PR #61: URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1382618622

## pyiceberg/table/snapshots.py

@@ -116,3 +144,202 @@ class MetadataLogEntry(IcebergBaseModel):
 class SnapshotLogEntry(IcebergBaseModel):
     snapshot_id: int = Field(alias="snapshot-id")
     timestamp_ms: int = Field(alias="timestamp-ms")
+
+
+class SnapshotSummaryCollector:
+    added_size: int
+    removed_size: int
+    added_files: int
+    removed_files: int
+    added_eq_delete_files: int
+    removed_eq_delete_files: int
+    added_pos_delete_files: int
+    removed_pos_delete_files: int
+    added_delete_files: int
+    removed_delete_files: int
+    added_records: int
+    deleted_records: int
+    added_pos_deletes: int
+    removed_pos_deletes: int
+    added_eq_deletes: int
+    removed_eq_deletes: int
+
+    def __init__(self) -> None:
+        self.added_size = 0
+        self.removed_size = 0
+        self.added_files = 0
+        self.removed_files = 0
+        self.added_eq_delete_files = 0
+        self.removed_eq_delete_files = 0
+        self.added_pos_delete_files = 0
+        self.removed_pos_delete_files = 0
+        self.added_delete_files = 0
+        self.removed_delete_files = 0
+        self.added_records = 0
+        self.deleted_records = 0
+        self.added_pos_deletes = 0
+        self.removed_pos_deletes = 0
+        self.added_eq_deletes = 0
+        self.removed_eq_deletes = 0
+
+    def add_file(self, data_file: DataFile) -> None:
+        self.added_size += data_file.file_size_in_bytes
+
+        if data_file.content == DataFileContent.DATA:
+            self.added_files += 1
+            self.added_records += data_file.record_count
+        elif data_file.content == DataFileContent.POSITION_DELETES:
+            self.added_delete_files += 1
+            self.added_pos_delete_files += 1
+            self.added_pos_deletes += data_file.record_count
+        elif data_file.content == DataFileContent.EQUALITY_DELETES:
+            self.added_delete_files += 1
+            self.added_eq_delete_files += 1
+            self.added_eq_deletes += data_file.record_count
+        else:
+            raise ValueError(f"Unknown data file content: {data_file.content}")
+
+    def removed_file(self, data_file: DataFile) -> None:
+        self.removed_size += data_file.file_size_in_bytes
+
+        if data_file.content == DataFileContent.DATA:
+            self.removed_files += 1
+            self.deleted_records += data_file.record_count
+        elif data_file.content == DataFileContent.POSITION_DELETES:
+            self.removed_delete_files += 1
+            self.removed_pos_delete_files += 1
+            self.removed_pos_deletes += data_file.record_count
+        elif data_file.content == DataFileContent.EQUALITY_DELETES:
+            self.removed_delete_files += 1
+            self.removed_eq_delete_files += 1
+            self.removed_eq_deletes += data_file.record_count
+        else:
+            raise ValueError(f"Unknown data file content: {data_file.content}")
+
+    def added_manifest(self, manifest: ManifestFile) -> None:
+        if manifest.content == ManifestContent.DATA:
+            self.added_files += manifest.added_files_count or 0
+            self.added_records += manifest.added_rows_count or 0
+            self.removed_files += manifest.deleted_files_count or 0
+            self.deleted_records += manifest.deleted_rows_count or 0
+        elif manifest.content == ManifestContent.DELETES:
+            self.added_delete_files += manifest.added_files_count or 0
+            self.removed_delete_files += manifest.deleted_files_count or 0
+        else:
+            raise ValueError(f"Unknown manifest file content: {manifest.content}")
+
+    def build(self) -> Dict[str, str]:
+        def set_non_zero(properties: Dict[str, str], num: int, property_name: str) -> None:
+            if num > 0:
+                properties[property_name] = str(num)
+
+        properties: Dict[str, str] = {}
+        set_non_zero(properties, self.added_size, 'added-files-size')
+        set_non_zero(properties, self.removed_size, 'removed-files-size')
+        set_non_zero(properties, self.added_files, 'added-data-files')
+        set_non_zero(properties, self.removed_files, 'removed-data-files')

Review Comment: In Java, this is [`deleted-data-files`](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/SnapshotSummary.java#L31) because the property was created before we had delete files (so delete and remove were the same thing).
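The naming point in the review can be sketched as follows (a hypothetical, simplified builder, not the PR's actual code): only non-zero counts are emitted, and the removed-data-file count is written under the Java-compatible key `deleted-data-files`.

```python
from typing import Dict

# Simplified sketch of a snapshot-summary builder. `build_summary` is a
# hypothetical name; the real collector tracks many more counters.
def build_summary(added_files: int, removed_files: int, added_size: int) -> Dict[str, str]:
    def set_non_zero(props: Dict[str, str], num: int, name: str) -> None:
        if num > 0:
            props[name] = str(num)  # snapshot summary values are strings

    props: Dict[str, str] = {}
    set_non_zero(props, added_size, "added-files-size")
    set_non_zero(props, added_files, "added-data-files")
    # Java uses `deleted-data-files`: the property predates delete files,
    # when "delete" and "remove" meant the same thing.
    set_non_zero(props, removed_files, "deleted-data-files")
    return props

print(build_summary(added_files=2, removed_files=0, added_size=1024))
```

Matching the Java key names matters because engines reading the summary expect the same properties regardless of which writer produced the snapshot.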
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
rdblue commented on code in PR #61: URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1382619167 ## pyiceberg/table/snapshots.py: ## @@ -116,3 +144,202 @@ class MetadataLogEntry(IcebergBaseModel): class SnapshotLogEntry(IcebergBaseModel): snapshot_id: int = Field(alias="snapshot-id") timestamp_ms: int = Field(alias="timestamp-ms") + + +class SnapshotSummaryCollector: +added_size: int +removed_size: int +added_files: int +removed_files: int +added_eq_delete_files: int +removed_eq_delete_files: int +added_pos_delete_files: int +removed_pos_delete_files: int +added_delete_files: int +removed_delete_files: int +added_records: int +deleted_records: int +added_pos_deletes: int +removed_pos_deletes: int +added_eq_deletes: int +removed_eq_deletes: int + +def __init__(self) -> None: +self.added_size = 0 +self.removed_size = 0 +self.added_files = 0 +self.removed_files = 0 +self.added_eq_delete_files = 0 +self.removed_eq_delete_files = 0 +self.added_pos_delete_files = 0 +self.removed_pos_delete_files = 0 +self.added_delete_files = 0 +self.removed_delete_files = 0 +self.added_records = 0 +self.deleted_records = 0 +self.added_pos_deletes = 0 +self.removed_pos_deletes = 0 +self.added_eq_deletes = 0 +self.removed_eq_deletes = 0 + +def add_file(self, data_file: DataFile) -> None: +self.added_size += data_file.file_size_in_bytes + +if data_file.content == DataFileContent.DATA: +self.added_files += 1 +self.added_records += data_file.record_count +elif data_file.content == DataFileContent.POSITION_DELETES: +self.added_delete_files += 1 +self.added_pos_delete_files += 1 +self.added_pos_deletes += data_file.record_count +elif data_file.content == DataFileContent.EQUALITY_DELETES: +self.added_delete_files += 1 +self.added_eq_delete_files += 1 +self.added_eq_deletes += data_file.record_count +else: +raise ValueError(f"Unknown data file content: {data_file.content}") + +def removed_file(self, data_file: DataFile) -> None: +self.removed_size += data_file.file_size_in_bytes 
+ +if data_file.content == DataFileContent.DATA: +self.removed_files += 1 +self.deleted_records += data_file.record_count +elif data_file.content == DataFileContent.POSITION_DELETES: +self.removed_delete_files += 1 +self.removed_pos_delete_files += 1 +self.removed_pos_deletes += data_file.record_count +elif data_file.content == DataFileContent.EQUALITY_DELETES: +self.removed_delete_files += 1 +self.removed_eq_delete_files += 1 +self.removed_eq_deletes += data_file.record_count +else: +raise ValueError(f"Unknown data file content: {data_file.content}") + +def added_manifest(self, manifest: ManifestFile) -> None: +if manifest.content == ManifestContent.DATA: +self.added_files += manifest.added_files_count or 0 +self.added_records += manifest.added_rows_count or 0 +self.removed_files += manifest.deleted_files_count or 0 +self.deleted_records += manifest.deleted_rows_count or 0 +elif manifest.content == ManifestContent.DELETES: +self.added_delete_files += manifest.added_files_count or 0 +self.removed_delete_files += manifest.deleted_files_count or 0 +else: +raise ValueError(f"Unknown manifest file content: {manifest.content}") + +def build(self) -> Dict[str, str]: +def set_non_zero(properties: Dict[str, str], num: int, property_name: str) -> None: +if num > 0: +properties[property_name] = str(num) + +properties: Dict[str, str] = {} +set_non_zero(properties, self.added_size, 'added-files-size') +set_non_zero(properties, self.removed_size, 'removed-files-size') +set_non_zero(properties, self.added_files, 'added-data-files') +set_non_zero(properties, self.removed_files, 'removed-data-files') Review Comment: Looks like this is correct in the `_update_totals` call. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
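The `SnapshotSummaryCollector` quoted in the diff above accumulates per-content-type counters and serializes only the non-zero ones as string-valued summary properties. The following is a standalone, simplified sketch of that pattern; `DataFileContent`, `DataFile`, and the property names mirror the quoted code, but this is an illustrative reimplementation, not the pyiceberg API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class DataFileContent(Enum):
    DATA = 0
    POSITION_DELETES = 1
    EQUALITY_DELETES = 2


@dataclass
class DataFile:
    content: DataFileContent
    record_count: int
    file_size_in_bytes: int


class SnapshotSummaryCollector:
    """Simplified collector: tracks a subset of the counters in the diff."""

    def __init__(self) -> None:
        self.added_size = 0
        self.added_files = 0
        self.added_records = 0
        self.added_delete_files = 0

    def add_file(self, data_file: DataFile) -> None:
        # As in the quoted version, file size is counted for every content type.
        self.added_size += data_file.file_size_in_bytes
        if data_file.content == DataFileContent.DATA:
            self.added_files += 1
            self.added_records += data_file.record_count
        else:
            self.added_delete_files += 1

    def build(self) -> Dict[str, str]:
        # Only non-zero counters are emitted, and values are strings,
        # matching the snapshot-summary property format.
        def set_non_zero(props: Dict[str, str], num: int, name: str) -> None:
            if num > 0:
                props[name] = str(num)

        props: Dict[str, str] = {}
        set_non_zero(props, self.added_size, 'added-files-size')
        set_non_zero(props, self.added_files, 'added-data-files')
        set_non_zero(props, self.added_records, 'added-records')
        set_non_zero(props, self.added_delete_files, 'added-delete-files')
        return props


collector = SnapshotSummaryCollector()
collector.add_file(DataFile(DataFileContent.DATA, record_count=10, file_size_in_bytes=1024))
print(collector.build())
# → {'added-files-size': '1024', 'added-data-files': '1', 'added-records': '10'}
```

Because zero counters are skipped, `added-delete-files` is absent from the output here; a summary only mentions the kinds of changes a snapshot actually made.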
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
rdblue commented on code in PR #61: URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1382619372 (quotes the same `SnapshotSummaryCollector` hunk of pyiceberg/table/snapshots.py shown above)
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
rdblue commented on code in PR #61: URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1382619528 (quotes the same `SnapshotSummaryCollector` hunk of pyiceberg/table/snapshots.py shown above)
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
rdblue commented on code in PR #61: URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1382619653 (quotes the same `SnapshotSummaryCollector` hunk of pyiceberg/table/snapshots.py shown above)
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
rdblue commented on code in PR #61: URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1382620295 (quotes the same `SnapshotSummaryCollector` hunk of pyiceberg/table/snapshots.py shown above)
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
rdblue commented on code in PR #61: URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1382620579

## pyiceberg/table/snapshots.py: ##
@@ -116,3 +144,199 @@ class MetadataLogEntry(IcebergBaseModel):
 class SnapshotLogEntry(IcebergBaseModel):
     snapshot_id: int = Field(alias="snapshot-id")
     timestamp_ms: int = Field(alias="timestamp-ms")
+
+
+class SnapshotSummaryCollector:
+    added_size: int
+    removed_size: int
+    added_files: int
+    removed_files: int
+    added_eq_delete_files: int
+    removed_eq_delete_files: int
+    added_pos_delete_files: int
+    removed_pos_delete_files: int
+    added_delete_files: int
+    removed_delete_files: int
+    added_records: int
+    deleted_records: int
+    added_pos_deletes: int
+    removed_pos_deletes: int
+    added_eq_deletes: int
+    removed_eq_deletes: int
+
+    def __init__(self) -> None:
+        self.added_size = 0
+        self.removed_size = 0
+        self.added_files = 0
+        self.removed_files = 0
+        self.added_eq_delete_files = 0
+        self.removed_eq_delete_files = 0
+        self.added_pos_delete_files = 0
+        self.removed_pos_delete_files = 0
+        self.added_delete_files = 0
+        self.removed_delete_files = 0
+        self.added_records = 0
+        self.deleted_records = 0
+        self.added_pos_deletes = 0
+        self.removed_pos_deletes = 0
+        self.added_eq_deletes = 0
+        self.removed_eq_deletes = 0
+
+    def add_file(self, data_file: DataFile) -> None:
+        if data_file.content == DataFileContent.DATA:
+            self.added_files += 1
+            self.added_records += data_file.record_count
+            self.added_size += data_file.file_size_in_bytes
+        elif data_file.content == DataFileContent.POSITION_DELETES:
+            self.added_delete_files += 1
+            self.added_pos_delete_files += 1
+            self.added_pos_deletes += data_file.record_count
+        elif data_file.content == DataFileContent.EQUALITY_DELETES:
+            self.added_delete_files += 1
+            self.added_eq_delete_files += 1
+            self.added_eq_deletes += data_file.record_count
+        else:
+            raise ValueError(f"Unknown data file content: {data_file.content}")
+
+    def removed_file(self, data_file: DataFile) -> None:
+        if data_file.content == DataFileContent.DATA:
+            self.removed_files += 1
+            self.deleted_records += data_file.record_count
+        elif data_file.content == DataFileContent.POSITION_DELETES:
+            self.removed_delete_files += 1
+            self.removed_pos_delete_files += 1
+            self.removed_pos_deletes += data_file.record_count
+        elif data_file.content == DataFileContent.EQUALITY_DELETES:
+            self.removed_delete_files += 1
+            self.removed_eq_delete_files += 1
+            self.removed_eq_deletes += data_file.record_count
+        else:
+            raise ValueError(f"Unknown data file content: {data_file.content}")
+
+    def added_manifest(self, manifest: ManifestFile) -> None:
+        if manifest.content == ManifestContent.DATA:
+            self.added_files += manifest.added_files_count or 0
+            self.added_records += manifest.added_rows_count or 0
+            self.removed_files += manifest.deleted_files_count or 0
+            self.deleted_records += manifest.deleted_rows_count or 0
+        elif manifest.content == ManifestContent.DELETES:
+            self.added_delete_files += manifest.added_files_count or 0
+            self.removed_delete_files += manifest.deleted_files_count or 0
+        else:
+            raise ValueError(f"Unknown manifest file content: {manifest.content}")
+
+    def build(self) -> Dict[str, str]:
+        def set_non_zero(properties: Dict[str, str], num: int, property_name: str) -> None:
+            if num > 0:
+                properties[property_name] = str(num)
+
+        properties: Dict[str, str] = {}
+        set_non_zero(properties, self.added_size, 'added-files-size')
+        set_non_zero(properties, self.removed_size, 'removed-files-size')
+        set_non_zero(properties, self.added_files, 'added-data-files')
+        set_non_zero(properties, self.removed_files, 'removed-data-files')
+        set_non_zero(properties, self.added_eq_delete_files, 'added-equality-delete-files')
+        set_non_zero(properties, self.removed_eq_delete_files, 'removed-equality-delete-files')
+        set_non_zero(properties, self.added_pos_delete_files, 'added-position-delete-files')
+        set_non_zero(properties, self.removed_pos_delete_files, 'removed-position-delete-files')
+        set_non_zero(properties, self.added_delete_files, 'added-delete-files')
+        set_non_zero(properties, self.removed_delete_files, 'removed-delete-files')
+        set_non_zero(properties, self.added_records, 'added-records')
+        set_non_zero(properties, self.deleted_records, 'deleted-records')
+        set_non_zero(p
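In the diff quoted in this comment, the `file_size_in_bytes` accumulation sits inside the `DataFileContent.DATA` branch of `add_file` rather than before the branching (as in the diff quoted earlier), so only data files contribute to the size counter. A minimal sketch of that revised rule, illustrative only and not the pyiceberg implementation:

```python
from enum import Enum


class DataFileContent(Enum):
    DATA = 0
    POSITION_DELETES = 1
    EQUALITY_DELETES = 2


def added_size_delta(content: DataFileContent, file_size_in_bytes: int) -> int:
    # Revised accounting: only data files contribute to 'added-files-size';
    # delete files of either kind add nothing to the size counter.
    return file_size_in_bytes if content == DataFileContent.DATA else 0


print(added_size_delta(DataFileContent.DATA, 2048))             # → 2048
print(added_size_delta(DataFileContent.POSITION_DELETES, 512))  # → 0
```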
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
rdblue commented on code in PR #61: URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1382620743 (quotes the same `SnapshotSummaryCollector` hunk of pyiceberg/table/snapshots.py shown above)
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
rdblue commented on code in PR #61: URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1382620705

## pyiceberg/table/snapshots.py: ##

@@ -116,3 +144,202 @@ class MetadataLogEntry(IcebergBaseModel):
 class SnapshotLogEntry(IcebergBaseModel):
     snapshot_id: int = Field(alias="snapshot-id")
     timestamp_ms: int = Field(alias="timestamp-ms")
+
+
+class SnapshotSummaryCollector:
+    added_size: int
+    removed_size: int
+    added_files: int
+    removed_files: int
+    added_eq_delete_files: int
+    removed_eq_delete_files: int
+    added_pos_delete_files: int
+    removed_pos_delete_files: int
+    added_delete_files: int
+    removed_delete_files: int
+    added_records: int
+    deleted_records: int
+    added_pos_deletes: int
+    removed_pos_deletes: int
+    added_eq_deletes: int
+    removed_eq_deletes: int
+
+    def __init__(self) -> None:
+        self.added_size = 0
+        self.removed_size = 0
+        self.added_files = 0
+        self.removed_files = 0
+        self.added_eq_delete_files = 0
+        self.removed_eq_delete_files = 0
+        self.added_pos_delete_files = 0
+        self.removed_pos_delete_files = 0
+        self.added_delete_files = 0
+        self.removed_delete_files = 0
+        self.added_records = 0
+        self.deleted_records = 0
+        self.added_pos_deletes = 0
+        self.removed_pos_deletes = 0
+        self.added_eq_deletes = 0
+        self.removed_eq_deletes = 0
+
+    def add_file(self, data_file: DataFile) -> None:
+        self.added_size += data_file.file_size_in_bytes
+
+        if data_file.content == DataFileContent.DATA:
+            self.added_files += 1
+            self.added_records += data_file.record_count
+        elif data_file.content == DataFileContent.POSITION_DELETES:
+            self.added_delete_files += 1
+            self.added_pos_delete_files += 1
+            self.added_pos_deletes += data_file.record_count
+        elif data_file.content == DataFileContent.EQUALITY_DELETES:
+            self.added_delete_files += 1
+            self.added_eq_delete_files += 1
+            self.added_eq_deletes += data_file.record_count
+        else:
+            raise ValueError(f"Unknown data file content: {data_file.content}")
+
+    def removed_file(self, data_file: DataFile) -> None:
+        self.removed_size += data_file.file_size_in_bytes
+
+        if data_file.content == DataFileContent.DATA:
+            self.removed_files += 1
+            self.deleted_records += data_file.record_count
+        elif data_file.content == DataFileContent.POSITION_DELETES:
+            self.removed_delete_files += 1
+            self.removed_pos_delete_files += 1
+            self.removed_pos_deletes += data_file.record_count
+        elif data_file.content == DataFileContent.EQUALITY_DELETES:
+            self.removed_delete_files += 1
+            self.removed_eq_delete_files += 1
+            self.removed_eq_deletes += data_file.record_count
+        else:
+            raise ValueError(f"Unknown data file content: {data_file.content}")
+
+    def added_manifest(self, manifest: ManifestFile) -> None:
+        if manifest.content == ManifestContent.DATA:
+            self.added_files += manifest.added_files_count or 0
+            self.added_records += manifest.added_rows_count or 0
+            self.removed_files += manifest.deleted_files_count or 0
+            self.deleted_records += manifest.deleted_rows_count or 0
+        elif manifest.content == ManifestContent.DELETES:
+            self.added_delete_files += manifest.added_files_count or 0
+            self.removed_delete_files += manifest.deleted_files_count or 0
+        else:
+            raise ValueError(f"Unknown manifest file content: {manifest.content}")
+
+    def build(self) -> Dict[str, str]:
+        def set_non_zero(properties: Dict[str, str], num: int, property_name: str) -> None:
+            if num > 0:
+                properties[property_name] = str(num)
+
+        properties: Dict[str, str] = {}
+        set_non_zero(properties, self.added_size, 'added-files-size')
+        set_non_zero(properties, self.removed_size, 'removed-files-size')
+        set_non_zero(properties, self.added_files, 'added-data-files')
+        set_non_zero(properties, self.removed_files, 'removed-data-files')
+        set_non_zero(properties, self.added_eq_delete_files, 'added-equality-delete-files')
+        set_non_zero(properties, self.removed_eq_delete_files, 'removed-equality-delete-files')
+        set_non_zero(properties, self.added_pos_delete_files, 'added-position-delete-files')
+        set_non_zero(properties, self.removed_pos_delete_files, 'removed-position-delete-files')
+        set_non_zero(properties, self.added_delete_files, 'added-delete-files')
+        set_non_zero(properties, self.removed_delete_files, 'removed-delete-files')
+        set_non_zero(properties, self.added_records, 'added-records')
+        set_non_zero(properties, self.
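The diff above defines a collector that tallies per-snapshot file metrics and then emits only the positive counters as string-valued summary properties. A minimal, self-contained sketch of that pattern follows; the `FileContent`, `FileInfo`, and `SummaryCollector` names here are illustrative stand-ins (with the delete-file counters collapsed into one), not pyiceberg's actual classes:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class FileContent(Enum):
    DATA = 0
    POSITION_DELETES = 1
    EQUALITY_DELETES = 2


@dataclass
class FileInfo:
    content: FileContent
    file_size_in_bytes: int
    record_count: int


class SummaryCollector:
    """Tallies added files and emits only the positive counters as properties."""

    def __init__(self) -> None:
        self.added_size = 0
        self.added_files = 0
        self.added_records = 0
        self.added_delete_files = 0

    def add_file(self, f: FileInfo) -> None:
        self.added_size += f.file_size_in_bytes
        if f.content == FileContent.DATA:
            self.added_files += 1
            self.added_records += f.record_count
        else:
            # Simplified: both delete-file flavors share one counter here.
            self.added_delete_files += 1

    def build(self) -> Dict[str, str]:
        # Mirror set_non_zero: only strictly positive counters become properties.
        candidates = {
            'added-files-size': self.added_size,
            'added-data-files': self.added_files,
            'added-records': self.added_records,
            'added-delete-files': self.added_delete_files,
        }
        return {k: str(v) for k, v in candidates.items() if v > 0}
```

A collector fed one data file and one position-delete file yields only the four touched counters; untouched ones never appear in the summary dict.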
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
rdblue commented on code in PR #61: URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1382620882

## pyiceberg/table/snapshots.py: ##

+    def add_file(self, data_file: DataFile) -> None:

Review Comment: `DataFile` is used for both data and delete files?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
rdblue commented on code in PR #61: URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1382621051

## pyiceberg/table/snapshots.py: ##

+    def removed_file(self, data_file: DataFile) -> None:

Review Comment: Nit: `add_file` and `removed_file` use different tenses. It would be better to use `added_file` and `removed_file` or `add_file` and `remove_file`.
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
rdblue commented on code in PR #61: URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1382621288

## pyiceberg/table/snapshots.py: ##

+    def added_manifest(self, manifest: ManifestFile) -> None:

Review Comment: Since there is no way to add manifests right now, should we just remove this?
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
rdblue commented on code in PR #61: URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1382621381

## pyiceberg/table/snapshots.py: ##

+        def set_non_zero(properties: Dict[str, str], num: int, property_name: str) -> None:

Review Comment: Nit: this checks for positive, not just non-zero.
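The distinction the reviewer raises is that `num > 0` silently drops negative values as well as zeros. A rough sketch of what a more accurately named helper could look like; `set_when_positive` is a hypothetical name for illustration, not something in the PR:

```python
from typing import Dict


def set_when_positive(properties: Dict[str, str], num: int, property_name: str) -> None:
    """Record the counter only when it is strictly positive.

    The name reflects the actual behavior: zero AND negative values are both
    skipped, so "non_zero" in the original helper name was misleading.
    """
    if num > 0:
        properties[property_name] = str(num)


props: Dict[str, str] = {}
set_when_positive(props, 3, 'added-data-files')
set_when_positive(props, 0, 'removed-data-files')   # skipped: zero
set_when_positive(props, -1, 'added-records')       # skipped: negative, not just zero
# props == {'added-data-files': '3'}
```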
Re: [PR] Add flake8-pie to ruff [iceberg-python]
rdblue merged PR #86: URL: https://github.com/apache/iceberg-python/pull/86
Re: [PR] Update pre-commit [iceberg-python]
rdblue merged PR #85: URL: https://github.com/apache/iceberg-python/pull/85
Re: [PR] Update pre-commit [iceberg-python]
rdblue commented on PR #85: URL: https://github.com/apache/iceberg-python/pull/85#issuecomment-1793808567

Thanks, @Fokko!
Re: [PR] Replace black by Ruff Formatter [iceberg-python]
Fokko commented on code in PR #127: URL: https://github.com/apache/iceberg-python/pull/127#discussion_r1382622265

## .pre-commit-config.yaml: ##

@@ -29,15 +29,11 @@ repos:
   - id: check-ast
 - repo: https://github.com/astral-sh/ruff-pre-commit
   # Ruff version (Used for linting)
-  rev: v0.0.291

Review Comment: Does it come with the new version? I don't see any related config (for example, line length).
Re: [PR] Bump version to 0.6.0 [iceberg-python]
rdblue commented on PR #72: URL: https://github.com/apache/iceberg-python/pull/72#issuecomment-1793809215

Looks good to me. Merge when you're ready.
Re: [PR] Support of before and after actions in preorderschema traversal [iceberg-python]
Fokko commented on PR #42: URL: https://github.com/apache/iceberg-python/pull/42#issuecomment-1793809220

This was suggested here: https://github.com/apache/iceberg/pull/7831/files#r1285259053 I'll leave it up to @rdblue to decide if he thinks this is valuable.
Re: [PR] Consider moving to ParallelIterable in Deletes::toPositionIndex [iceberg]
rdblue commented on PR #6432: URL: https://github.com/apache/iceberg/pull/6432#issuecomment-1793809960

#8805 was merged so I'll close this. I should also note that @aokolnychyi raised some concerns about this approach instead of a more comprehensive fix. This is probably a good start if we don't add further caching.
Re: [PR] Consider moving to ParallelIterable in Deletes::toPositionIndex [iceberg]
rdblue closed pull request #6432: Consider moving to ParallelIterable in Deletes::toPositionIndex URL: https://github.com/apache/iceberg/pull/6432
Re: [PR] added contributing.md file [iceberg-python]
Fokko commented on PR #102: URL: https://github.com/apache/iceberg-python/pull/102#issuecomment-1793815821

But what are your thoughts on linking from the `CONTRIBUTING.md` to the website? Otherwise, it is bound to get out of sync.
Re: [PR] Bump version to 0.6.0 [iceberg-python]
Fokko merged PR #72: URL: https://github.com/apache/iceberg-python/pull/72
Re: [PR] Bump version to 0.6.0 [iceberg-python]
Fokko commented on PR #72: URL: https://github.com/apache/iceberg-python/pull/72#issuecomment-1793816199

👍 Thanks for the review @rdblue
Re: [PR] added contributing.md file [iceberg-python]
onemriganka commented on PR #102: URL: https://github.com/apache/iceberg-python/pull/102#issuecomment-1793819378

OK sir, if you think the website is more helpful then ok... Thanks
Re: [PR] Core: Enable column statistics filtering after planning [iceberg]
stevenzwu commented on code in PR #8803: URL: https://github.com/apache/iceberg/pull/8803#discussion_r1382659616

## api/src/main/java/org/apache/iceberg/Scan.java: ##

@@ -77,6 +78,21 @@ public interface Scan> {
    */
   ThisT includeColumnStats();

+  /**
+   * Create a new scan from this that loads the column stats for the specific columns with each
+   * data file. If the columns set is empty or null then all column stats will be kept, if
+   * {@link #includeColumnStats()} is set.
+   *
+   * Column stats include: value count, null value count, lower bounds, and upper bounds.
+   *
+   * @param columnsToKeepStats column ids from the table's schema

Review Comment: +1 on using string
Re: [PR] Core: Enable column statistics filtering after planning [iceberg]
stevenzwu commented on code in PR #8803: URL: https://github.com/apache/iceberg/pull/8803#discussion_r1382661567

## core/src/main/java/org/apache/iceberg/util/ContentFileUtil.java: ##

@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.util;
+
+import java.util.Set;
+import org.apache.iceberg.ContentFile;
+
+public class ContentFileUtil {
+  private ContentFileUtil() {}
+
+  /**
+   * Copies the {@link ContentFile} with the specific stat settings.
+   *
+   * @param file a generic data file to copy.
+   * @param withStats whether to keep any stats
+   * @param columnsToKeepStats a set of column ids to keep stats. If empty or null then
+   *     every column stat is kept.

Review Comment: @aokolnychyi the proposed interfaces for `Scan` and `ContentFile` make sense to me.

> A collection with a single element * means we will call copy() on files.
> Null or empty collection means we will call copyWithoutStats() on files.

Regarding the collection of a single `*` element, it feels a little tacky to me. The util method/class takes the two configs from the `TableScanContext` and implements the copy logic.

> All of the logic above can be encapsulated in a single copy method in our base scan.

This `ContentFileUtil` was intended for that purpose of encapsulating all that logic. Note that the `ManifestGroup` class also leverages this util method.
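The copy semantics under discussion (null/empty collection vs. a singleton `*` element vs. an explicit column set) can be sketched compactly. This is an illustrative Python stand-in, not the Java API under review; `filter_stats` and its parameters are hypothetical names:

```python
from typing import Dict, Optional, Set


def filter_stats(stats: Dict[str, int], columns_to_keep: Optional[Set[str]]) -> Dict[str, int]:
    """Sketch of the proposed per-column stats copy semantics.

    - None or an empty set -> drop all stats (like copyWithoutStats())
    - {"*"}                -> keep every column's stats (like a plain copy())
    - anything else        -> keep stats only for the named columns
    """
    if not columns_to_keep:
        return {}
    if columns_to_keep == {"*"}:
        return dict(stats)
    return {col: value for col, value in stats.items() if col in columns_to_keep}
```

The reviewer's reservation is about the middle case: overloading a collection with a sentinel `*` element to mean "keep everything" mixes two kinds of signal in one parameter.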
Re: [PR] Spark 3.5: Don't throw exception when decoding dictionary of type INT96 [iceberg]
manuzhang commented on PR #8988: URL: https://github.com/apache/iceberg/pull/8988#issuecomment-1793942486

@yabola @nastra PTAL, thanks!
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
ajantha-bhat commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1382704220

## core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java: ##

@@ -400,8 +400,15 @@ public void replaceTableViaTransactionThatAlreadyExistsAsView() {
             .buildTable(viewIdentifier, SCHEMA)
             .replaceTransaction()
             .commitTransaction())
-        .isInstanceOf(NoSuchTableException.class)
-        .hasMessageStartingWith("Table does not exist: ns.view");
+        .satisfiesAnyOf(
+            throwable ->
+                assertThat(throwable)
+                    .isInstanceOf(NoSuchTableException.class)
+                    .hasMessageStartingWith("Table does not exist: ns.view"),
+            throwable ->
+                assertThat(throwable)

Review Comment: So far, the REST and in-memory catalogs follow one pattern (NoSuchTableException) and all other catalogs follow another (AlreadyExistsException). I tried making Nessie follow the REST catalog, but it breaks other test cases. I will check more.
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
ajantha-bhat commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1382710543

## nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java: ##

@@ -0,0 +1,351 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.nessie;
+
+import static org.apache.iceberg.types.Types.NestedField.optional;
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+import java.io.File;
+import java.io.IOException;
+import java.net.URI;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import java.util.Comparator;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.types.Types;
+import org.apache.iceberg.view.SQLViewRepresentation;
+import org.apache.iceberg.view.View;
+import org.assertj.core.api.Assertions;
+import org.junit.jupiter.api.AfterEach;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.projectnessie.client.ext.NessieClientFactory;
+import org.projectnessie.client.ext.NessieClientUri;
+import org.projectnessie.error.NessieNotFoundException;
+import org.projectnessie.model.Branch;
+import org.projectnessie.model.CommitMeta;
+import org.projectnessie.model.ContentKey;
+import org.projectnessie.model.IcebergView;
+import org.projectnessie.model.ImmutableTableReference;
+import org.projectnessie.model.LogResponse.LogEntry;
+
+public class TestNessieView extends BaseTestIceberg {
+
+  private static final String BRANCH = "iceberg-view-test";
+
+  private static final String DB_NAME = "db";
+  private static final String VIEW_NAME = "view";
+  private static final TableIdentifier VIEW_IDENTIFIER = TableIdentifier.of(DB_NAME, VIEW_NAME);
+  private static final ContentKey KEY = ContentKey.of(DB_NAME, VIEW_NAME);
+  private static final Schema schema =
+      new Schema(Types.StructType.of(required(1, "id", Types.LongType.get())).fields());
+  private static final Schema altered =
+      new Schema(
+          Types.StructType.of(
+                  required(1, "id", Types.LongType.get()),
+                  optional(2, "data", Types.LongType.get()))
+              .fields());
+
+  private String viewLocation;
+
+  public TestNessieView() {
+    super(BRANCH);
+  }
+
+  @Override
+  @BeforeEach
+  public void beforeEach(NessieClientFactory clientFactory, @NessieClientUri URI nessieUri)
+      throws IOException {
+    super.beforeEach(clientFactory, nessieUri);
+    this.viewLocation =
+        createView(catalog, VIEW_IDENTIFIER, schema).location().replaceFirst("file:", "");
+  }
+
+  @Override
+  @AfterEach
+  public void afterEach() throws Exception {
+    // drop the view data
+    if (viewLocation != null) {
+      try (Stream walk = Files.walk(Paths.get(viewLocation))) {
+        walk.sorted(Comparator.reverseOrder()).map(Path::toFile).forEach(File::delete);
+      }
+      catalog.dropView(VIEW_IDENTIFIER);
+    }
+
+    super.afterEach();
+  }
+
+  private IcebergView getView(ContentKey key) throws NessieNotFoundException {
+    return getView(BRANCH, key);
+  }
+
+  private IcebergView getView(String ref, ContentKey key) throws NessieNotFoundException {
+    return api.getContent().key(key).refName(ref).get().get(key).unwrap(IcebergView.class).get();
+  }
+
+  /** Verify that Nessie always returns the globally-current global-content w/ only DMLs. */
+  @Test
+  public void verifyStateMovesForDML() throws Exception {
+    // 1. initialize view
+    View icebergView = catalog.loadView(VIEW_IDENTIFIER);
+    icebergView
+        .replaceVersion()
+        .withQuery("spark", "some query")
+        .withSchema(schema)
+        .withDefaultNamespace(VIEW_IDENTIFIER.namespace())
+        .commit();
+
+    // 2. create 2nd branch
+    String testCaseBranch = "verify-global-moving";
+    api.createReference()
+        .sourceRefName(BRANCH)
+        .reference(Branch.of(testCaseBranch, catalog.currentHash()))
+        .create();
+    try (NessieCatalog ignore = initCatalog(testCaseBranch)) {
+
+      IcebergView conte
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
ajantha-bhat commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1382711672 ## nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java: ## @@ -0,0 +1,351 @@

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
package org.apache.iceberg.nessie;

import static org.apache.iceberg.types.Types.NestedField.optional;
import static org.apache.iceberg.types.Types.NestedField.required;

import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import org.apache.iceberg.Schema;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.view.SQLViewRepresentation;
import org.apache.iceberg.view.View;
import org.assertj.core.api.Assertions;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.projectnessie.client.ext.NessieClientFactory;
import org.projectnessie.client.ext.NessieClientUri;
import org.projectnessie.error.NessieNotFoundException;
import org.projectnessie.model.Branch;
import org.projectnessie.model.CommitMeta;
import org.projectnessie.model.ContentKey;
import org.projectnessie.model.IcebergView;
import org.projectnessie.model.ImmutableTableReference;
import org.projectnessie.model.LogResponse.LogEntry;

public class TestNessieView extends BaseTestIceberg {

  private static final String BRANCH = "iceberg-view-test";

  private static final String DB_NAME = "db";
  private static final String VIEW_NAME = "view";
  private static final TableIdentifier VIEW_IDENTIFIER = TableIdentifier.of(DB_NAME, VIEW_NAME);
  private static final ContentKey KEY = ContentKey.of(DB_NAME, VIEW_NAME);
  private static final Schema schema =
      new Schema(Types.StructType.of(required(1, "id", Types.LongType.get())).fields());
  private static final Schema altered =
      new Schema(
          Types.StructType.of(
                  required(1, "id", Types.LongType.get()),
                  optional(2, "data", Types.LongType.get()))
              .fields());

  private String viewLocation;

  public TestNessieView() {
    super(BRANCH);
  }

  @Override
  @BeforeEach
  public void beforeEach(NessieClientFactory clientFactory, @NessieClientUri URI nessieUri)
      throws IOException {
    super.beforeEach(clientFactory, nessieUri);
    this.viewLocation =
        createView(catalog, VIEW_IDENTIFIER, schema).location().replaceFirst("file:", "");
  }

  @Override
  @AfterEach
  public void afterEach() throws Exception {
    // drop the view data
    if (viewLocation != null) {
      try (Stream<Path> walk = Files.walk(Paths.get(viewLocation))) {
        walk.sorted(Comparator.reverseOrder()).map(Path::toFile).forEach(File::delete);
      }
      catalog.dropView(VIEW_IDENTIFIER);
    }

    super.afterEach();
  }

  private IcebergView getView(ContentKey key) throws NessieNotFoundException {
    return getView(BRANCH, key);
  }

  private IcebergView getView(String ref, ContentKey key) throws NessieNotFoundException {
    return api.getContent().key(key).refName(ref).get().get(key).unwrap(IcebergView.class).get();
  }

  /** Verify that Nessie always returns the globally-current global-content w/ only DMLs. */
  @Test
  public void verifyStateMovesForDML() throws Exception {
    // 1. initialize view
    View icebergView = catalog.loadView(VIEW_IDENTIFIER);
    icebergView
        .replaceVersion()
        .withQuery("spark", "some query")
        .withSchema(schema)
        .withDefaultNamespace(VIEW_IDENTIFIER.namespace())
        .commit();

    // 2. create 2nd branch
    String testCaseBranch = "verify-global-moving";
    api.createReference()
        .sourceRefName(BRANCH)
        .reference(Branch.of(testCaseBranch, catalog.currentHash()))
        .create();
    try (NessieCatalog ignore = initCatalog(testCaseBranch)) {

      IcebergView conte
```
Re: [I] Flink write iceberg bug(org.apache.iceberg.exceptions.NotFoundException) [iceberg]
pvary commented on issue #5846: URL: https://github.com/apache/iceberg/issues/5846#issuecomment-1794170117 @lirui-apache: For the record: to restore the state of the Flink job, you need the previous snapshot (to identify the last committed snapshot), and the new data files and temporary manifest files (if the failure happened between `snapshotState` and `notifySnapshotComplete`). So snapshot expiration and orphan file cleanup could also corrupt the state of the Flink job. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
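The failure mode pvary describes can be shown with a toy model. All names here (`ToyTable`, `can_restore`) are illustrative stand-ins, not Iceberg's or Flink's API: a checkpoint records the last committed snapshot id, and a restore must find that snapshot in the table to decide which staged files were already committed; if retention has expired it, the restore cannot proceed.

```python
# Toy model of why snapshot expiration can corrupt a Flink job restore.
# ToyTable and can_restore are hypothetical names for illustration only.

class ToyTable:
    def __init__(self):
        self.snapshots = []  # committed snapshot ids, oldest first

    def commit(self, snapshot_id):
        self.snapshots.append(snapshot_id)

    def expire_snapshots(self, keep_last):
        # retention: drop all but the newest `keep_last` snapshots
        self.snapshots = self.snapshots[-keep_last:]


def can_restore(table, checkpointed_snapshot_id):
    # A restore needs the snapshot recorded in the checkpoint to tell
    # which staged files were already committed before the failure.
    return checkpointed_snapshot_id in table.snapshots


table = ToyTable()
for s in ("s1", "s2", "s3"):
    table.commit(s)

state = "s1"                      # checkpoint taken after committing s1
assert can_restore(table, state)  # fine while s1 is retained

table.expire_snapshots(keep_last=1)   # retention keeps only s3
assert not can_restore(table, state)  # restore can no longer verify s1
```

The same reasoning applies to orphan file removal: the "orphan" data and manifest files may be exactly the staged files a restarted job needs to finish its in-flight commit.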
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
ajantha-bhat commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1382844294 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieViewOperations.java: ## @@ -0,0 +1,153 @@

```java
package org.apache.iceberg.nessie;

import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.iceberg.exceptions.AlreadyExistsException;
import org.apache.iceberg.exceptions.NoSuchViewException;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.relocated.com.google.common.collect.Maps;
import org.apache.iceberg.view.BaseViewOperations;
import org.apache.iceberg.view.ViewMetadata;
import org.apache.iceberg.view.ViewMetadataParser;
import org.projectnessie.client.http.HttpClientException;
import org.projectnessie.error.NessieBadRequestException;
import org.projectnessie.error.NessieConflictException;
import org.projectnessie.error.NessieNotFoundException;
import org.projectnessie.model.Content;
import org.projectnessie.model.ContentKey;
import org.projectnessie.model.IcebergTable;
import org.projectnessie.model.IcebergView;
import org.projectnessie.model.Reference;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class NessieViewOperations extends BaseViewOperations {

  private static final Logger LOG = LoggerFactory.getLogger(NessieViewOperations.class);

  private final NessieIcebergClient client;

  private final ContentKey key;
  private final FileIO fileIO;
  private final Map<String, String> catalogOptions;
  private IcebergView icebergView;

  NessieViewOperations(
      ContentKey key,
      NessieIcebergClient client,
      FileIO fileIO,
      Map<String, String> catalogOptions) {
    this.key = key;
    this.client = client;
    this.fileIO = fileIO;
    this.catalogOptions = catalogOptions;
  }

  @Override
  public void doRefresh() {
    try {
      client.refresh();
    } catch (NessieNotFoundException e) {
      throw new RuntimeException(
          String.format(
              "Failed to refresh as ref '%s' is no longer valid.", client.getRef().getName()),
          e);
    }
    String metadataLocation = null;
    Reference reference = client.getRef().getReference();
    try {
      Content content =
          client.getApi().getContent().key(key).reference(reference).get().get(key);
      LOG.debug("Content '{}' at '{}': {}", key, reference, content);
      if (content == null) {
        if (currentMetadataLocation() != null) {
          throw new NoSuchViewException("View does not exist: %s in %s", key, reference);
        }
      } else {
        this.icebergView =
            content
                .unwrap(IcebergView.class)
                .orElseThrow(
                    () -> {
                      if (content instanceof IcebergTable) {
                        return new AlreadyExistsException(
                            "Table with same name already exists: %s in %s", key, reference);
                      } else {
                        return new IllegalStateException(
                            String.format(
                                "Cannot refresh Iceberg view: Nessie points to a non-Iceberg object for path: %s.",
                                key));
                      }
                    });
        metadataLocation = icebergView.getMetadataLocation();
      }
    } catch (NessieNotFoundException ex) {
      if (currentMetadataLocation() != null) {
        throw new NoSuchViewException("View does not exist: %s in %s", key, reference);
      }
    }
    refreshFromMetadataLocation(metadataLocation, null, 2, l -> loadViewMetadata(l, reference));
  }

  private ViewMetadata loadViewMetadata(String metadataLocation, Reference reference) {
    ViewMetadata metadata = ViewMetadataParser.read(io().newInputFile(metadataLocation));
    Map<String, String> newProperties = Maps.newHashMap(metadata.properties());
    newProperties.put(NessieTableOperations.NESSIE_COMMIT_ID_PROPERTY, reference.getHash());

    return ViewMetadata.buildFrom(
        ViewMetadata.buildFrom(metadata).setProperties(newProperties).build()
```
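The `doRefresh` logic quoted here distinguishes four cases for the content stored at the key: nothing there, an Iceberg view, an Iceberg table under the same name, or an unknown object. A minimal sketch of that dispatch, using hypothetical stand-in types (`resolve_view_metadata`, dict-based content, toy exception classes) rather than the real Iceberg/Nessie API:

```python
# Stand-in sketch of the refresh dispatch in NessieViewOperations.doRefresh().
# All names and the dict-based content model are illustrative assumptions.

class NoSuchViewError(Exception):
    pass


class AlreadyExistsError(Exception):
    pass


def resolve_view_metadata(content, current_metadata_location):
    if content is None:
        if current_metadata_location is not None:
            # we had loaded this view before, and it is gone now
            raise NoSuchViewError("view no longer exists")
        return None  # first load: nothing to refresh yet
    kind = content.get("type")
    if kind == "ICEBERG_VIEW":
        return content["metadata_location"]
    if kind == "ICEBERG_TABLE":
        # a table occupies the key, so the view name is taken
        raise AlreadyExistsError("table with same name already exists")
    raise RuntimeError("Nessie points to a non-Iceberg object")


assert resolve_view_metadata(None, None) is None
assert resolve_view_metadata(
    {"type": "ICEBERG_VIEW", "metadata_location": "s3://bucket/v1.json"}, None
) == "s3://bucket/v1.json"
```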
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
ajantha-bhat commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r138284 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java: ##

```diff
@@ -135,71 +135,26 @@ protected void doCommit(TableMetadata base, TableMetadata metadata) {
     boolean newTable = base == null;
     String newMetadataLocation = writeNewMetadataIfRequired(newTable, metadata);
-    String refName = client.refName();
-    boolean failure = false;
+    AtomicBoolean failure = new AtomicBoolean(false);
```

Review Comment: removed `AtomicBoolean` as it can be simplified.
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
ajantha-bhat commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1382844582 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieUtil.java: ##

```diff
@@ -165,4 +180,77 @@ public static TableMetadata updateTableMetadataWithNessieSpecificProperties(
     return builder.discardChanges().build();
   }
+
+  static void handleExceptionsForCommits(
+      Exception exception, String refName, AtomicBoolean failure, Content.Type type) {
+    if (exception instanceof NessieConflictException) {
+      failure.set(true);
```

Review Comment: removed `AtomicBoolean` as it can be simplified.
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
ajantha-bhat commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1382844788 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java: ##

```diff
@@ -132,74 +131,36 @@ protected void doRefresh() {
   @Override
   protected void doCommit(TableMetadata base, TableMetadata metadata) {
+    Content content = null;
+    try {
+      content =
+          client.getApi().getContent().key(key).reference(client.getReference()).get().get(key);
+    } catch (NessieNotFoundException e) {
```

Review Comment: updated.

## nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java: ##

```diff
@@ -132,74 +131,36 @@ protected void doRefresh() {
   @Override
   protected void doCommit(TableMetadata base, TableMetadata metadata) {
+    Content content = null;
+    try {
+      content =
+          client.getApi().getContent().key(key).reference(client.getReference()).get().get(key);
+    } catch (NessieNotFoundException e) {
+      // Ignore the exception as the first commit may not have the content present for the key.
+    }
+
+    if (content != null && content.getType() == Content.Type.ICEBERG_VIEW) {
```

Review Comment: updated.
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
ajantha-bhat commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1382845319 ## nessie/src/test/java/org/apache/iceberg/nessie/TestNessieView.java: ##
Re: [I] Ability to the write Metadata JSON [iceberg-python]
HonahX commented on issue #22: URL: https://github.com/apache/iceberg-python/issues/22#issuecomment-1794191459 Hi @Fokko. Is there an update on this issue? I am interested in taking this if it's still open. In terms of implementation, I was thinking of something like this:

```python
def update_table_metadata(base_metadata: TableMetadata, updates: Tuple[TableUpdate, ...]) -> TableMetadata:
    builder = TableMetadataUpdateBuilder(base_metadata)
    for update in updates:
        builder.update_table_metadata(update)
    return builder.build()
```

Does this approach align with your expectations? Thanks!
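The builder HonahX sketches can be made concrete as a self-contained toy. `TableMetadata` and `TableUpdate` are modeled here as plain dicts, and `TableMetadataUpdateBuilder` is a hypothetical implementation, not PyIceberg's actual API; the point is only the pattern: each update is applied in order on top of accumulated state, and the base metadata is never mutated.

```python
# Toy version of the proposed builder pattern; the dict-based metadata
# model and TableMetadataUpdateBuilder are stand-ins for illustration.
from typing import Tuple


class TableMetadataUpdateBuilder:
    def __init__(self, base_metadata: dict):
        self._metadata = dict(base_metadata)  # copy: never mutate the base

    def update_table_metadata(self, update: dict) -> "TableMetadataUpdateBuilder":
        # each update is folded into the accumulated state, in order
        self._metadata.update(update)
        return self

    def build(self) -> dict:
        return dict(self._metadata)


def update_table_metadata(base_metadata: dict, updates: Tuple[dict, ...]) -> dict:
    builder = TableMetadataUpdateBuilder(base_metadata)
    for update in updates:
        builder.update_table_metadata(update)
    return builder.build()


base = {"format-version": 1, "current-schema-id": 0}
new = update_table_metadata(base, ({"format-version": 2}, {"current-schema-id": 1}))
assert new == {"format-version": 2, "current-schema-id": 1}
assert base == {"format-version": 1, "current-schema-id": 0}  # base untouched
```

Keeping the builder separate from the apply loop means each `TableUpdate` kind can later get its own handler while the driver function stays unchanged.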
Re: [PR] Clarify which columns can be used for equality delete files. [iceberg]
liurenjie1024 commented on code in PR #8981: URL: https://github.com/apache/iceberg/pull/8981#discussion_r1382867987 ## format/spec.md: ##

```diff
@@ -842,7 +842,8 @@ The rows in the delete file must be sorted by `file_path` then `pos` to optimize
 Equality delete files identify deleted rows in a collection of data files by one or more column values, and may optionally contain additional columns of the deleted row.
-Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](#manifests). Float and double columns cannot be used as delete columns in equality delete files.
+Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](#manifests). The column restrictions for columns used in equality delete files are the same as those for [identifier fields](#identifier-field-ids) with the exception that optional columns and columns nested under optional structs are allowed (if a
+parent struct column is null it implies the leaf column is null).
```

Review Comment: I think there is one missing part: how are null values treated in equality ids? Are they treated as equal or unequal? Identifier fields don't allow null values.
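The null-semantics question raised in this review can be illustrated with a small sketch. The `matches` function below is a hypothetical matcher, not Iceberg's implementation: under SQL-style equality (`NULL != NULL`) a delete row containing a null never matches anything, while under null-safe equality it matches rows whose delete column is also null, and the two semantics delete different rows.

```python
# Illustration (hypothetical matcher) of the two possible null semantics
# for equality-delete columns: SQL equality vs null-safe equality.

def matches(row, delete_row, delete_cols, null_safe):
    for col in delete_cols:
        a, b = row.get(col), delete_row.get(col)
        if a is None or b is None:
            if null_safe:
                if a is not b:  # null-safe: both must be null to match
                    return False
            else:
                return False  # SQL equality: any null comparison fails
        elif a != b:
            return False
    return True


row = {"id": 1, "data": None}
delete = {"id": 1, "data": None}

# null-safe equality deletes this row; SQL equality leaves it in place
assert matches(row, delete, ["id", "data"], null_safe=True)
assert not matches(row, delete, ["id", "data"], null_safe=False)
```

This is exactly why the spec wording matters once optional columns are allowed as delete columns: readers and writers must agree on one of the two behaviors.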