[GitHub] [iceberg] wanlce commented on issue #8510: Iceberg does not trigger actually rewrite_data_files in certain situations
wanlce commented on issue #8510: URL: https://github.com/apache/iceberg/issues/8510#issuecomment-1709634179 Sincerely hope to receive answers from the teachers @nastra @wypoon Thanks in advance! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko opened a new issue, #8518: Python: Write Metadata json
Fokko opened a new issue, #8518: URL: https://github.com/apache/iceberg/issues/8518 ### Feature Request / Improvement With the REST catalog setting properties, and updating the schema is working. This is easy since the REST catalog is in charge of updating the metadata, and writing a new version of it. For the other catalogs, we need to write the metadata using PyIceberg and update the table property, pointing to the latest version of the metadata. ### Query engine None -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on issue #8514: Python: implement GlueCatalog._commit_table
Fokko commented on issue #8514: URL: https://github.com/apache/iceberg/issues/8514#issuecomment-1709791580 @antonysouthworth-halter This is not implemented today but is high on the list. Also, other users are looking for snapshot expiration, which is a very similar operation. I've created [this](https://github.com/apache/iceberg/issues/8518) ticket which has to happen first, then updating the `SortOrder` should be straightforward. I'll keep this ticket open for visibility. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] szehon-ho commented on pull request #7105: Spec: Add partition stats spec
szehon-ho commented on PR #7105: URL: https://github.com/apache/iceberg/pull/7105#issuecomment-1709898544 @ajantha-bhat i think that makes sense to me, discussion is here for reference: https://github.com/apache/iceberg/pull/7105#discussion_r1318005645 Yea let's see if there is consensus. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] szehon-ho commented on pull request #8299: Core: use ZSTD compression parquet by default for new tables
szehon-ho commented on PR #8299: URL: https://github.com/apache/iceberg/pull/8299#issuecomment-1709907691 Rebased on #3438 , TestCompression tests now pass, fyi @aokolnychyi when you are back -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] szehon-ho commented on a diff in pull request #8517: To fix the issue https://github.com/apache/iceberg/issues/8516
szehon-ho commented on code in PR #8517: URL: https://github.com/apache/iceberg/pull/8517#discussion_r1318413830 ## docs/configuration.md: ## @@ -74,7 +74,7 @@ Iceberg tables support table properties to configure table behavior, like the de | write.metadata.metrics.column.col1 | (not set) | Metrics mode for column 'col1' to allow per-column tuning; none, counts, truncate(length), or full | | write.target-file-size-bytes | 536870912 (512 MB) | Controls the size of files generated to target about this many bytes | | write.delete.target-file-size-bytes | 67108864 (64 MB) | Controls the size of delete files generated to target about this many bytes | -| write.distribution-mode | none | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder | +| write.distribution-mode | hash | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder | Review Comment: Looking at code, it looks a bit more nuanced: none for unpartitioned table, hash for partitioned table, range for sorted table. Ref: https://github.com/apache/iceberg/blob/master/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkWriteConf.java#L263 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] szehon-ho commented on pull request #8517: To fix the issue https://github.com/apache/iceberg/issues/8516
szehon-ho commented on PR #8517: URL: https://github.com/apache/iceberg/pull/8517#issuecomment-1709914728 Thanks for pr, also please make the PR description clear what it does (we can put the link to the issue in the PR description if you want) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on issue #8499: Python: Add support for Python 3.12
Fokko commented on issue #8499: URL: https://github.com/apache/iceberg/issues/8499#issuecomment-1709922231 Thanks @steinsgateted that would be awesome -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] frankliee commented on issue #8487: Python: Cannot resolve avro file
frankliee commented on issue #8487: URL: https://github.com/apache/iceberg/issues/8487#issuecomment-1709973457 @Fokko Thanks for your reminder, I have found the reason is that the wrong iceberg version is used. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] frankliee commented on issue #8487: Python: Cannot resolve avro file
frankliee commented on issue #8487: URL: https://github.com/apache/iceberg/issues/8487#issuecomment-1709976000 I could query the iceberg table by the latest version. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] frankliee commented on issue #8482: Python: Cannot install from source code.
frankliee commented on issue #8482: URL: https://github.com/apache/iceberg/issues/8482#issuecomment-1709979652 @Fokko Thanks for you reminder, the root cause is pip version, and pip 23.0 could compile pyiceberg correctly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] atifiu commented on a diff in pull request #8517: To fix the issue https://github.com/apache/iceberg/issues/8516
atifiu commented on code in PR #8517: URL: https://github.com/apache/iceberg/pull/8517#discussion_r1318465020 ## docs/configuration.md: ## @@ -74,7 +74,7 @@ Iceberg tables support table properties to configure table behavior, like the de | write.metadata.metrics.column.col1 | (not set) | Metrics mode for column 'col1' to allow per-column tuning; none, counts, truncate(length), or full | | write.target-file-size-bytes | 536870912 (512 MB) | Controls the size of files generated to target about this many bytes | | write.delete.target-file-size-bytes | 67108864 (64 MB) | Controls the size of delete files generated to target about this many bytes | -| write.distribution-mode | none | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder | +| write.distribution-mode | hash | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder | Review Comment: Shall we document this. > Defines distribution of write data: none: don’t shuffle rows; hash: hash distribute by partition key ; range: range distribute by partition key or sort key if table has an SortOrder. Default value **none** for unpartitioned table, **hash** for partitioned table and **range** for sorted table. ## docs/configuration.md: ## @@ -74,7 +74,7 @@ Iceberg tables support table properties to configure table behavior, like the de | write.metadata.metrics.column.col1 | (not set) | Metrics mode for column 'col1' to allow per-column tuning; none, counts, truncate(length), or full | | write.target-file-size-bytes | 536870912 (512 MB) | Controls the size of files generated to target about this many bytes | | write.delete.target-file-size-bytes | 67108864 (64 MB) | Controls the size of delete files generated to target about this many bytes | -| write.distribution-mode | none | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder | +| write.distribution-mode | hash | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder | Review Comment: Shall we document this. > Defines distribution of write data: none: don’t shuffle rows; hash: hash distribute by partition key ; range: range distribute by partition key or sort key if table has an SortOrder. Default value **none** for unpartitioned table, **hash** for partitioned table and **range** for sorted table. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] atifiu commented on a diff in pull request #8517: To fix the table configuration documentation for write.distrubition-mode
atifiu commented on code in PR #8517: URL: https://github.com/apache/iceberg/pull/8517#discussion_r1318465020 ## docs/configuration.md: ## @@ -74,7 +74,7 @@ Iceberg tables support table properties to configure table behavior, like the de | write.metadata.metrics.column.col1 | (not set) | Metrics mode for column 'col1' to allow per-column tuning; none, counts, truncate(length), or full | | write.target-file-size-bytes | 536870912 (512 MB) | Controls the size of files generated to target about this many bytes | | write.delete.target-file-size-bytes | 67108864 (64 MB) | Controls the size of delete files generated to target about this many bytes | -| write.distribution-mode | none | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder | +| write.distribution-mode | hash | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder | Review Comment: Shall we document this in the description and leave the default value as none. > Defines distribution of write data: none: don’t shuffle rows; hash: hash distribute by partition key ; range: range distribute by partition key or sort key if table has an SortOrder. Default value **none** for unpartitioned table, **hash** for partitioned table and **range** for sorted table. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on issue #8487: Python: Cannot resolve avro file
Fokko commented on issue #8487: URL: https://github.com/apache/iceberg/issues/8487#issuecomment-1710007874 Thank you for letting me know. Also pyiceberg==0.5.0rc1 is available on PyPi for testing :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko closed issue #8487: Python: Cannot resolve avro file
Fokko closed issue #8487: Python: Cannot resolve avro file URL: https://github.com/apache/iceberg/issues/8487 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on issue #8482: Python: Cannot install from source code.
Fokko commented on issue #8482: URL: https://github.com/apache/iceberg/issues/8482#issuecomment-1710014853 @frankliee Thanks for letting me know. I've added a note on the website to update `pip`: https://github.com/apache/iceberg/pull/8484 Let's close this one as well, thanks for reporting this, since it will improve PyIceberg and the documentation -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko closed issue #8482: Python: Cannot install from source code.
Fokko closed issue #8482: Python: Cannot install from source code. URL: https://github.com/apache/iceberg/issues/8482 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] sandflee opened a new pull request, #8519: add the process of snapshots queryParam for RESTCatalogAdapter
sandflee opened a new pull request, #8519: URL: https://github.com/apache/iceberg/pull/8519 #6850 add lazy snapshot loading for RESTCatalog, but RESTCatalogAdapter doesn't process the 'snapshots' queryParam. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] juanrondineau commented on issue #8333: Unable to merge CDC data into snapshot data. java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.la
juanrondineau commented on issue #8333: URL: https://github.com/apache/iceberg/issues/8333#issuecomment-1710250429 Hi, we are running spark jobs from DBT fwk, and we have the same issue when we run a model, with merge strategie, that has a table with struct data_type columns in its schema as source. Internally, for models with merge strategie, dbt creates a temporary view with the data that has to merge in the destination table. We recently discover that if insteed of a temprary view, we materialize the new data as a table and the we run the merge we save the CastException. That seems to be a workaround and not the final solution to this issue. We are actually running spark 3.3.2, iceberg 1.3.0 and hive 2.3.9 as metastore. Hope this help to solve the case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #8517: To fix the table configuration documentation for write.distrubition-mode
RussellSpitzer commented on code in PR #8517: URL: https://github.com/apache/iceberg/pull/8517#discussion_r1318702274 ## docs/configuration.md: ## @@ -74,7 +74,7 @@ Iceberg tables support table properties to configure table behavior, like the de | write.metadata.metrics.column.col1 | (not set) | Metrics mode for column 'col1' to allow per-column tuning; none, counts, truncate(length), or full | | write.target-file-size-bytes | 536870912 (512 MB) | Controls the size of files generated to target about this many bytes | | write.delete.target-file-size-bytes | 67108864 (64 MB) | Controls the size of delete files generated to target about this many bytes | -| write.distribution-mode | none | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder | +| write.distribution-mode | hash | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder | Review Comment: Let's say the default is hash ... or "depends"? I still think none is wrong since almost no one is using unpartitioned tables -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] atifiu commented on a diff in pull request #8517: To fix the table configuration documentation for write.distrubition-mode
atifiu commented on code in PR #8517: URL: https://github.com/apache/iceberg/pull/8517#discussion_r1318706970 ## docs/configuration.md: ## @@ -74,7 +74,7 @@ Iceberg tables support table properties to configure table behavior, like the de | write.metadata.metrics.column.col1 | (not set) | Metrics mode for column 'col1' to allow per-column tuning; none, counts, truncate(length), or full | | write.target-file-size-bytes | 536870912 (512 MB) | Controls the size of files generated to target about this many bytes | | write.delete.target-file-size-bytes | 67108864 (64 MB) | Controls the size of delete files generated to target about this many bytes | -| write.distribution-mode | none | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder | +| write.distribution-mode | hash | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder | Review Comment: This is not a normal case because default itself is depended on other factors. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] ext-diego-duenas commented on issue #8333: Unable to merge CDC data into snapshot data. java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java
ext-diego-duenas commented on issue #8333: URL: https://github.com/apache/iceberg/issues/8333#issuecomment-1710293072 Just to add to the Thread, we made a test changing all data types to String in the table with structs data type, and the error persist, so we don't think the error is related with the schema. Is applying a merge into in an iceberg table, with a source table with struct data types, regardless the underlying schema. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] amogh-jahagirdar opened a new pull request, #8520: Core: Clear deleted files set between commit attempts for simple transactions
amogh-jahagirdar opened a new pull request, #8520: URL: https://github.com/apache/iceberg/pull/8520 For transactions, a `deletedFiles` set is maintained which keeps track of the files to delete as part of the most recent commit. For simple transactions, this set is not being cleared between attempts to commit the transaction. It could be possible that on the first transaction attempt, an operation marks certain files for deletion. When the transaction fails and is retried those files may actually be used, and when the retry succeeds those files are cleaned up when they should not be. The set should preserve the invariant of representing the set of files to clear after a transaction is committed, thus this PR clears the set between transaction commit attempts. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] amogh-jahagirdar commented on pull request #8520: Core: Clear deleted files set between commit attempts for simple transactions
amogh-jahagirdar commented on PR #8520: URL: https://github.com/apache/iceberg/pull/8520#issuecomment-1710405905 Still need to add test cases -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg-go] rdblue commented on pull request #3: feat(manifests): Adding implementation of manifest files
rdblue commented on PR #3: URL: https://github.com/apache/iceberg-go/pull/3#issuecomment-1710410048 @zeroshade, sorry for the delay! I'll find some time for reviews over here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg-rust] Xuanwo commented on issue #55: Iceberg View Support
Xuanwo commented on issue #55: URL: https://github.com/apache/iceberg-rust/issues/55#issuecomment-1710446025 I have a new idea for naming: How about using `Tabular`? So we will have: - `load_tabular()` / `load_table()` / `load_view()` - `Tabular::Table` / `Tabular::View` What do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a diff in pull request #8397: API, Core: Add strict metadata cleanup to TableOperations
rdblue commented on code in PR #8397: URL: https://github.com/apache/iceberg/pull/8397#discussion_r1318868914 ## core/src/main/java/org/apache/iceberg/BaseTransaction.java: ## @@ -380,14 +375,9 @@ private void commitReplaceTransaction(boolean orCreate) { } catch (RuntimeException e) { // the commit failed and no files were committed. clean up each update. - Tasks.foreach(updates) - .suppressFailureWhenFinished() - .run( - update -> { -if (update instanceof SnapshotProducer) { - ((SnapshotProducer) update).cleanAll(); -} - }); + if (!ops.requireStrictCleanup() || e instanceof CleanableFailure) { +cleanUpOnCommitFailure(); Review Comment: Shouldn't this be `cleanAllUpdates`? Both `cleanUpOnCommitFailure` and the `finally` block are going to run the delete function on every file. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on a diff in pull request #8397: API, Core: Add strict metadata cleanup to TableOperations
rdblue commented on code in PR #8397: URL: https://github.com/apache/iceberg/pull/8397#discussion_r1318870175 ## core/src/main/java/org/apache/iceberg/BaseTransaction.java: ## @@ -380,14 +375,9 @@ private void commitReplaceTransaction(boolean orCreate) { } catch (RuntimeException e) { // the commit failed and no files were committed. clean up each update. - Tasks.foreach(updates) - .suppressFailureWhenFinished() - .run( - update -> { -if (update instanceof SnapshotProducer) { - ((SnapshotProducer) update).cleanAll(); -} - }); + if (!ops.requireStrictCleanup() || e instanceof CleanableFailure) { +cleanUpOnCommitFailure(); Review Comment: Otherwise, I think this looks good. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg-rust] Samrose-Ahmed commented on issue #55: Iceberg View Support
Samrose-Ahmed commented on issue #55: URL: https://github.com/apache/iceberg-rust/issues/55#issuecomment-1710482809 > I have a new idea for naming: How about using `Tabular`? > > So we will have: > > - `load_tabular()` / `load_table()` / `load_view()` > - `Tabular::Table(table)` / `Tabular::View(view)` > > What do you think? I would avoid the confusion with the vendor. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on pull request #8397: API, Core: Add strict metadata cleanup to TableOperations
rdblue commented on PR #8397: URL: https://github.com/apache/iceberg/pull/8397#issuecomment-1710483728 Looks good to me. Let's merge when tests are passing. Thanks, @amogh-jahagirdar! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #8397: API, Core: Add strict metadata cleanup to TableOperations
amogh-jahagirdar commented on code in PR #8397: URL: https://github.com/apache/iceberg/pull/8397#discussion_r1318884428 ## core/src/main/java/org/apache/iceberg/BaseTransaction.java: ## @@ -380,14 +375,9 @@ private void commitReplaceTransaction(boolean orCreate) { } catch (RuntimeException e) { // the commit failed and no files were committed. clean up each update. - Tasks.foreach(updates) - .suppressFailureWhenFinished() - .run( - update -> { -if (update instanceof SnapshotProducer) { - ((SnapshotProducer) update).cleanAll(); -} - }); + if (!ops.requireStrictCleanup() || e instanceof CleanableFailure) { +cleanUpOnCommitFailure(); Review Comment: Oh yes good catch, we're duplicating the call to delete the files. Although I wonder how the create/replace tests for this case were passing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #8397: API, Core: Add strict metadata cleanup to TableOperations
amogh-jahagirdar commented on code in PR #8397: URL: https://github.com/apache/iceberg/pull/8397#discussion_r1318885267 ## core/src/main/java/org/apache/iceberg/BaseTransaction.java: ## @@ -380,14 +375,9 @@ private void commitReplaceTransaction(boolean orCreate) { } catch (RuntimeException e) { // the commit failed and no files were committed. clean up each update. - Tasks.foreach(updates) - .suppressFailureWhenFinished() - .run( - update -> { -if (update instanceof SnapshotProducer) { - ((SnapshotProducer) update).cleanAll(); -} - }); + if (!ops.requireStrictCleanup() || e instanceof CleanableFailure) { +cleanUpOnCommitFailure(); Review Comment: Ah the delete is best effort so the second time when the file no longer exists the delete would fail but we catch it and emit a warn log. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg-rust] Xuanwo commented on issue #55: Iceberg View Support
Xuanwo commented on issue #55: URL: https://github.com/apache/iceberg-rust/issues/55#issuecomment-171048 > I would avoid the confusion with the vendor. Oh, I didn't take the vendor into consideration. I think tabular is a good alternative to table like and widely used in pandas ecosystem. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg-go] JustinTimperio commented on a diff in pull request #2: feat(partitioning): Add partition specs/fields and basic transform interface
JustinTimperio commented on code in PR #2: URL: https://github.com/apache/iceberg-go/pull/2#discussion_r1318954109 ## partitions.go: ## @@ -0,0 +1,218 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +package iceberg + +import ( + "encoding/json" + "fmt" + "strings" + + "golang.org/x/exp/slices" Review Comment: Generally I am really against using the experimental packages for anything that will be used likely in production. There recently has been a lot of people ignoring this and creating havoc in the release cycles of Go as seen in the following issues: https://github.com/golang/go/issues/51317 https://github.com/golang/go/issues/61374 I think it would be better at this time to avoid using anything in the exp branch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg-go] JustinTimperio commented on a diff in pull request #2: feat(partitioning): Add partition specs/fields and basic transform interface
JustinTimperio commented on code in PR #2: URL: https://github.com/apache/iceberg-go/pull/2#discussion_r1318955697 ## partitions.go: ## @@ -0,0 +1,218 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +package iceberg + +import ( + "encoding/json" + "fmt" + "strings" + + "golang.org/x/exp/slices" +) + +const ( + partitionDataIDStart = 1000 + InitialPartitionSpecID = 0 +) + +// UnpartitionedPartitionSpec is the default unpartitioned spec which can +// be used for comparisons or to just provide a convenience for referencing +// the same unpartitioned spec object. +var UnpartitionedPartitionSpec = &PartitionSpec{id: 0} + +// PartitionField represents how one partition value is derived from the +// source column by transformation. +type PartitionField struct { + // SourceID is the source column id of the table's schema + SourceID int `json:"source-id"` + // FieldID is the partition field id across all the table partition specs + FieldID int `json:"field-id"` + // Name is the name of the partition field itself + Name string `json:"name"` + // Transform is the transform used to produce the partition value + Transform Transform `json:"transform"` +} + +func (p *PartitionField) String() string { + return fmt.Sprintf("%d: %s: %s(%d)", p.FieldID, p.Name, p.Transform, p.SourceID) +} + +func (p *PartitionField) UnmarshalJSON(b []byte) error { + type Alias PartitionField + aux := struct { + TransformString string `json:"transform"` + *Alias + }{ + Alias: (*Alias)(p), + } + + err := json.Unmarshal(b, &aux) + if err != nil { + return err + } + + if p.Transform, err = ParseTransform(aux.TransformString); err != nil { + return err + } + + return nil +} + +// PartitionSpec captures the transformation from table data to partition values +type PartitionSpec struct { + // any change to a PartitionSpec will produce a new spec id + id int + fields []PartitionField + + // this is populated by initialize after creation + sourceIdToFields map[int][]PartitionField +} + +func NewPartitionSpec(fields ...PartitionField) PartitionSpec { + return NewPartitionSpecID(InitialPartitionSpecID, fields...) +} + +func NewPartitionSpecID(id int, fields ...PartitionField) PartitionSpec { + ret := PartitionSpec{id: id, fields: fields} + ret.initialize() + return ret +} + +// CompatibleWith returns true if this partition spec is considered +// compatible with the passed in partition spec. This means that the two +// specs have equivalent field lists regardless of the spec id. +func (ps *PartitionSpec) CompatibleWith(other *PartitionSpec) bool { + if ps == other { + return true + } + + if len(ps.fields) != len(other.fields) { + return false + } + + return slices.EqualFunc(ps.fields, other.fields, func(left, right PartitionField) bool { Review Comment: Happy to write a replacement function for this to avoid the exp package -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] amogh-jahagirdar commented on pull request #8397: API, Core: Add strict metadata cleanup to TableOperations
amogh-jahagirdar commented on PR #8397: URL: https://github.com/apache/iceberg/pull/8397#issuecomment-1710574693 Thanks for review @rdblue ! merging -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] amogh-jahagirdar merged pull request #8397: API, Core: Add strict metadata cleanup to TableOperations
amogh-jahagirdar merged PR #8397: URL: https://github.com/apache/iceberg/pull/8397 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] Fokko opened a new pull request, #8521: Python: Non-Cython fallback Avro parser
Fokko opened a new pull request, #8521: URL: https://github.com/apache/iceberg/pull/8521 I think it would be good to include a pure Python fallback. @rustyconover I'm seeing some weird errors here: https://github.com/pola-rs/polars/actions/runs/6109489610/job/16580630135?pr=10375 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] hiteshbedre commented on pull request #8491: Python: Improved Readability and Alignment of Regex Patterns
hiteshbedre commented on PR #8491: URL: https://github.com/apache/iceberg/pull/8491#issuecomment-1710683580 @amogh-jahagirdar who can help me with the next approval? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rafoid opened a new issue, #8522: ALTER TABLE ... DROP COLUMN allows dropping the last column of a table
rafoid opened a new issue, #8522: URL: https://github.com/apache/iceberg/issues/8522 ### Apache Iceberg version 1.3.1 (latest release) ### Query engine None ### Please describe the bug 🐞 When performing ALTER TABLE ... DROP COLUMN, it drops columns as expected. However, the API also allows dropping the last column of a table, rendering the table useless. There are no more operations allowed on the table unless a new column is added. I think we need to check whether the final schema has any columns or not before committing the changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rafoid commented on issue #8522: ALTER TABLE ... DROP COLUMN allows dropping the last column of a table
rafoid commented on issue #8522: URL: https://github.com/apache/iceberg/issues/8522#issuecomment-1710706057 I think we need to modify the `commit()` method as following: ``` @Override public void commit() { TableMetadata update = applyChangesToMetadata(base.updateSchema(apply(), lastColumnId)); if (update.schema().columns().size() == 0) throw new UnsupportedOperationException("Unable to commit table without any columns."); ops.commit(base, update); } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rafoid opened a new pull request, #8523: Prevent dropping last column.
rafoid opened a new pull request, #8523: URL: https://github.com/apache/iceberg/pull/8523 This code intends to address the problem of allowing API dropping the last column of a table mentioned here: https://github.com/apache/iceberg/issues/8522 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] github-actions[bot] commented on issue #7071: Dynamic Split generation based on table size
github-actions[bot] commented on issue #7071: URL: https://github.com/apache/iceberg/issues/7071#issuecomment-1710909025 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg-docs] melvynator opened a new pull request, #274: Update vendors.md
melvynator opened a new pull request, #274: URL: https://github.com/apache/iceberg-docs/pull/274 Added clickhouse in the list of vendors -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg-rust] liurenjie1024 commented on issue #55: Iceberg View Support
liurenjie1024 commented on issue #55: URL: https://github.com/apache/iceberg-rust/issues/55#issuecomment-1710958778 > I have a new idea for naming: How about using `Tabular`? > > So we will have: > > * `load_tabular()` / `load_table()` / `load_view()` > * `Tabular::Table(table)` / `Tabular::View(view)` > > What do you think? I asked chatgpt, and he also thinks it's good idea. BTW, should we raise discussion in iceberg community to add an api to catalog interface? Both java and python all have catalog interface: https://github.com/apache/iceberg/blob/f0149a5860efe3f78d7cb07b81b4e1910b438020/api/src/main/java/org/apache/iceberg/catalog/Catalog.java#L33 https://github.com/apache/iceberg/blob/master/python/pyiceberg/catalog/__init__.py -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] ConeyLiu commented on pull request #8523: Prevent dropping last column.
ConeyLiu commented on PR #8523: URL: https://github.com/apache/iceberg/pull/8523#issuecomment-1710964389 What kind of scenario needs to drop all columns? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] frankliee opened a new pull request, #8524: Python: Fix pyarrow hdfs support
frankliee opened a new pull request, #8524: URL: https://github.com/apache/iceberg/pull/8524 I have tested pyiceberg on our hdfs cluster, and this pr fixed some bugs. Changed. 1. Using optional parameters to create hdfs client, similar to gcs's style. 2. Fix `PyArrowFileIO.parse_location` for hdfs style, which cannot be resolved by HadoopFileSystem. Previous behavior: parse_location("hdfs://127.0.0.1:9000/root/abc") -> "hdfs", "127.0.0.1:9000/root/abc" New behavior: parse_location("hdfs://127.0.0.1:9000/root/abc") -> "hdfs", "hdfs://127.0.0.1:9000/root/abc" -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rafoid commented on pull request #8523: Prevent dropping last column.
rafoid commented on PR #8523: URL: https://github.com/apache/iceberg/pull/8523#issuecomment-1711009177 @ConeyLiu That's exactly my point - under no circumstances a user should be allowed to drop all columns. At this point the API is permissive enough for such a scenario to occur. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] rustyconover commented on pull request #8521: Python: Non-Cython fallback Avro parser
rustyconover commented on PR #8521: URL: https://github.com/apache/iceberg/pull/8521#issuecomment-1711014712 Hi @Fokko, Reading the logs it looks like it's a packaging problem for Windows. I'm pretty sure Cython works on Windows, but the wheel may not have the necessary module packaged. Linux packages worked well (no PyIceberg errors), but the Windows build failed. I don't have access to Windows so I can't really test it for you, but you may want to revisit including the shared library that is built by Cython. Maybe we need to include *.dll rather than *.so or some other extension on Windows? Rusty -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] amogh-jahagirdar opened a new pull request, #8525: Core: Add a validation API to DeleteFiles which validates files exist prior to attempting to deletion.
amogh-jahagirdar opened a new pull request, #8525: URL: https://github.com/apache/iceberg/pull/8525 Currently there is no validation on delete files API for verifying that the file to be deleted actually exists prior to the commit. This can cause unexpected behavior, for example: 1.) Rewrite Data files compacts FILE_A + some delete files 2.) Concurrently a DeleteFiles call was done for FILE_A. 3.) Deletion gets retried after 1 completes. At the point it retries, FILE_A no longer exists due to the compaction so the deletion is a no-op. From a user's perspective when they go and query the table after 3, they'll still see the data in FILE_A, even though the wouldn't expect it since they recieved a successful delete of FILE_A. This change currently adds a new validation API as opposed to changing the behavior to always do strict validation for purpose of backwards compatibility. I wanted to open this up and discuss with the community to see about if we want the new API or if we just want to change this behavior since it's mainly just used by maintenance procedures. Also I'd need to compare what serializable isolation vs snapshot isolation should guarantee in this concurrent delete scenario. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg-rust] JanKaul commented on issue #55: Iceberg View Support
JanKaul commented on issue #55: URL: https://github.com/apache/iceberg-rust/issues/55#issuecomment-1711086189 > I have a new idea for naming: How about using `Tabular`? > > So we will have: > > - `load_tabular()` / `load_table()` / `load_view()` > - `Tabular::Table(table)` / `Tabular::View(view)` > > What do you think? Nice one, I think it's a fitting name. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg-rust] JanKaul commented on issue #55: Iceberg View Support
JanKaul commented on issue #55: URL: https://github.com/apache/iceberg-rust/issues/55#issuecomment-1711087506 > > I have a new idea for naming: How about using `Tabular`? > > > > So we will have: > > > > * `load_tabular()` / `load_table()` / `load_view()` > > * `Tabular::Table(table)` / `Tabular::View(view)` > > > > What do you think? > > I asked chatgpt, and he also thinks it's good idea. BTW, should we raise discussion in iceberg community to add an api to catalog interface? > > Both java and python all have catalog interface: > > https://github.com/apache/iceberg/blob/f0149a5860efe3f78d7cb07b81b4e1910b438020/api/src/main/java/org/apache/iceberg/catalog/Catalog.java#L33 > > https://github.com/apache/iceberg/blob/master/python/pyiceberg/catalog/__init__.py > > Good point. I think this should be really valuable for java and python. I don't see a way for them to easily get around this issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] liurenjie1024 opened a new issue, #8526: Rest catalog server doesn't return table configuration as expected.
liurenjie1024 opened a new issue, #8526: URL: https://github.com/apache/iceberg/issues/8526 ### Apache Iceberg version 1.3.1 (latest release) ### Query engine None ### Please describe the bug 🐞 According to rest [open api's spec](https://github.com/apache/iceberg/blob/621d301b17417af4b71a0e8c93db9db6f8266839/open-api/rest-catalog-open-api.yaml#L1899), the `loadTable` api should return configuration of table in `config` property. This is not implemented in [rest server handler](https://github.com/apache/iceberg/blob/f3826bd9bd08262109df741ca8bf5aa0b697/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java#L256) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] liurenjie1024 commented on issue #8526: Rest catalog server doesn't return table configuration as expected.
liurenjie1024 commented on issue #8526: URL: https://github.com/apache/iceberg/issues/8526#issuecomment-1711123130 I'm happy to take this if confirmed that it's a bug. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg-rust] JanKaul commented on a diff in pull request #42: feat: support transform function
JanKaul commented on code in PR #42: URL: https://github.com/apache/iceberg-rust/pull/42#discussion_r1319404484 ## crates/iceberg/src/transform/truncate.rs: ## @@ -0,0 +1,177 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::sync::Arc; + +use arrow_array::ArrayRef; +use arrow_schema::DataType; + +use crate::Error; + +use super::TransformFunction; + +pub struct Truncate { +width: u32, +} + +impl Truncate { +pub fn new(width: u32) -> Self { +Self { width } +} +} + +impl TransformFunction for Truncate { +fn transform(&self, input: ArrayRef) -> crate::Result { +match input.data_type() { +DataType::Int32 => { +let width: i32 = self.width.try_into().map_err(|_| { +Error::new( +crate::ErrorKind::DataInvalid, +"width is failed to convert to i32 when truncate Int32Array", +) +})?; +let res: arrow_array::Int32Array = input +.as_any() +.downcast_ref::() +.unwrap() +.unary(|v| v - (((v % width) + width) % width)); Review Comment: We could use `v - v.rem_euclid(width)` instead of `v - (((v % width) + width) % width)`. But it's also fine as it is. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg] chandu-1101 commented on issue #8333: Unable to merge CDC data into snapshot data. java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang
chandu-1101 commented on issue #8333: URL: https://github.com/apache/iceberg/issues/8333#issuecomment-1711162146 @juanrondineau , @ext-diego-duenas Glad to see updates on this thread. @nastra even I don't think this is a schema issue! @juanrondineau Can you explain further on below > We materialize the new data as a table and then we run the merge In my case, I had a new iceberg table with the existing data from the snapshot added to it. meaning, the iceberg table has a single version --more like a materialized view (my guess of the materialized view is (a.)Format is iceberg (b.)Only 1 version of data is present). Please correct me if I am wrong. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [iceberg-rust] JanKaul commented on a diff in pull request #56: feat: support read Manifest List
JanKaul commented on code in PR #56: URL: https://github.com/apache/iceberg-rust/pull/56#discussion_r1319419895 ## crates/iceberg/src/spec/manifest_list.rs: ## @@ -0,0 +1,881 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! ManifestList for Iceberg. + +use crate::{avro::schema_to_avro_schema, spec::Literal, Error}; +use apache_avro::{from_value, types::Value, Reader}; +use once_cell::sync::Lazy; +use std::sync::Arc; + +use super::{FormatVersion, ListType, NestedField, NestedFieldRef, Schema, StructType}; + +/// Snapshots are embedded in table metadata, but the list of manifests for a +/// snapshot are stored in a separate manifest list file. +/// +/// A new manifest list is written for each attempt to commit a snapshot +/// because the list of manifests always changes to produce a new snapshot. +/// When a manifest list is written, the (optimistic) sequence number of the +/// snapshot is written for all new manifest files tracked by the list. +/// +/// A manifest list includes summary metadata that can be used to avoid +/// scanning all of the manifests in a snapshot when planning a table scan. +/// This includes the number of added, existing, and deleted files, and a +/// summary of values for each field of the partition spec used to write the +/// manifest. +#[derive(Debug, Clone)] +pub struct ManifestList { +/// Entries in a manifest list. +entries: Vec, +} + +impl ManifestList { +/// Parse manifest list from bytes. +/// +/// QUESTION: Will we have more than one manifest list in a single file? +pub fn parse_with_version( +bs: &[u8], +version: FormatVersion, +partition_type: &StructType, +) -> Result { +match version { +FormatVersion::V2 => { +let schema = schema_to_avro_schema("manifest_list", &Self::v2_schema()).unwrap(); +let reader = Reader::with_schema(&schema, bs)?; +let values = Value::Array(reader.collect::, _>>()?); + from_value::<_serde::ManifestListV2>(&values)?.try_into(partition_type) +} +FormatVersion::V1 => { +let schema = schema_to_avro_schema("manifest_list", &Self::v1_schema()).unwrap(); +let reader = Reader::with_schema(&schema, bs)?; +let values = Value::Array(reader.collect::, _>>()?); + from_value::<_serde::ManifestListV1>(&values)?.try_into(partition_type) +} +} +} + +/// Get the entries in the manifest list. +pub fn entries(&self) -> &[ManifestListEntry] { +&self.entries +} + +const MANIFEST_PATH: Lazy = { +Lazy::new(|| { +Arc::new(NestedField::required( +500, +"manifest_path", +super::Type::Primitive(super::PrimitiveType::String), +)) +}) +}; +const MANIFEST_LENGTH: Lazy = { +Lazy::new(|| { +Arc::new(NestedField::required( +501, +"manifest_length", +super::Type::Primitive(super::PrimitiveType::Long), +)) +}) +}; +const PARTITION_SPEC_ID: Lazy = { +Lazy::new(|| { +Arc::new(NestedField::required( +502, +"partition_spec_id", +super::Type::Primitive(super::PrimitiveType::Int), +)) +}) +}; +const CONTENT: Lazy = { +Lazy::new(|| { +Arc::new(NestedField::required( +517, +"content", +super::Type::Primitive(super::PrimitiveType::Int), +)) +}) +}; +const SEQUENCE_NUMBER: Lazy = { +Lazy::new(|| { +Arc::new(NestedField::required( +515, +"sequence_number", +super::Type::Primitive(super::PrimitiveType::Long), +)) +}) +}; +const MIN_SEQUENCE_NUMBER: Lazy = { +Lazy::new(|| { +Arc::new(NestedField::required( +516, +"min_sequence_number", +super::Type::Primitive(super::PrimitiveType::Long), +))
[GitHub] [iceberg-rust] JanKaul commented on a diff in pull request #56: feat: support read Manifest List
JanKaul commented on code in PR #56: URL: https://github.com/apache/iceberg-rust/pull/56#discussion_r1319441548 ## crates/iceberg/src/spec/manifest_list.rs: ## @@ -0,0 +1,581 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! ManifestList for Iceberg. + +use apache_avro::{from_value, Reader}; + +use crate::{spec::Literal, Error}; + +use super::{FormatVersion, StructType}; + +/// Snapshots are embedded in table metadata, but the list of manifests for a +/// snapshot are stored in a separate manifest list file. +/// +/// A new manifest list is written for each attempt to commit a snapshot +/// because the list of manifests always changes to produce a new snapshot. +/// When a manifest list is written, the (optimistic) sequence number of the +/// snapshot is written for all new manifest files tracked by the list. +/// +/// A manifest list includes summary metadata that can be used to avoid +/// scanning all of the manifests in a snapshot when planning a table scan. +/// This includes the number of added, existing, and deleted files, and a +/// summary of values for each field of the partition spec used to write the +/// manifest. +#[derive(Debug, Clone)] +pub struct ManifestList { +version: FormatVersion, +/// Entries in a manifest list. +entries: Vec, +} + +impl ManifestList { +/// Parse manifest list from bytes. +/// +/// QUESTION: Will we have more than one manifest list in a single file? Review Comment: I think there should only be one manifest list with many manifest list entries. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org