Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
nastra commented on code in PR #8701:
URL: https://github.com/apache/iceberg/pull/8701#discussion_r1382910267

## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/AvroUtil.java:

@@ -0,0 +1,94 @@
```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
package org.apache.iceberg.connect.events;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.IndexedRecord;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.PartitionData;
import org.apache.iceberg.avro.AvroEncoderUtil;
import org.apache.iceberg.avro.AvroSchemaUtil;
import org.apache.iceberg.data.avro.DecoderResolver;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
import org.apache.iceberg.types.Types;

/** Class for Avro-related utility methods. */
class AvroUtil {
  static final Map<Integer, String> FIELD_ID_TO_CLASS =
      ImmutableMap.of(
          DataComplete.ASSIGNMENTS_ELEMENT,
          TopicPartitionOffset.class.getName(),
          DataFile.PARTITION_ID,
          PartitionData.class.getName(),
          DataWritten.TABLE_REFERENCE,
          TableReference.class.getName(),
          DataWritten.DATA_FILES_ELEMENT,
          "org.apache.iceberg.GenericDataFile",
          DataWritten.DELETE_FILES_ELEMENT,
          "org.apache.iceberg.GenericDeleteFile",
          CommitToTable.TABLE_REFERENCE,
          TableReference.class.getName());

  public static byte[] encode(Event event) {
    try {
      return AvroEncoderUtil.encode(event, event.getSchema());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static Event decode(byte[] bytes) {
    try {
      Event event = AvroEncoderUtil.decode(bytes);
      // clear the cache to avoid memory leak
      DecoderResolver.clearCache();
      return event;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  static Schema convert(Types.StructType icebergSchema, Class<? extends IndexedRecord> javaClass) {
    return convert(icebergSchema, javaClass, FIELD_ID_TO_CLASS);
  }

  static Schema convert(
      Types.StructType icebergSchema,
      Class<? extends IndexedRecord> javaClass,
      Map<Integer, String> typeMap) {
    return AvroSchemaUtil.convert(
        icebergSchema,
        (fieldId, struct) ->
            struct.equals(icebergSchema) ? javaClass.getName() : typeMap.get(fieldId));
  }

  static int positionToId(int position, Schema avroSchema) {
    List<Schema.Field> fields = avroSchema.getFields();
    Preconditions.checkArgument(position < fields.size(), "Invalid field position: " + position);
```

Review Comment:
nit: should this also check that position isn't negative?
Re: [PR] Clarify time travel implementation in Iceberg [iceberg]
nastra commented on code in PR #8982:
URL: https://github.com/apache/iceberg/pull/8982#discussion_r1382914816

## format/spec.md:

```diff
@@ -1370,3 +1370,16 @@ Writing v2 metadata:
 * `sort_columns` was removed
 
 Note that these requirements apply when writing data to a v2 table. Tables that are upgraded from v1 may contain metadata that does not follow these requirements. Implementations should remain backward-compatible with v1 metadata requirements.
+
+## Appendix F: Implementation Notes
+
+This section covers topics not required by the specification but recommendations for systems implementing the Iceberg specification
+to help maintain a uniform experience.
+
+### Point in Time Reads (Time Travel)
+
+Iceberg supports two types of histories for tables. A history of previous "current snapshots" stored ["snapshot-log" table metadata](#table-metadata-fields) and [parent-child lineage stored in "snapshots"](#table-metadata-fields). These two histories
+might not be consistent for determining a table states at a given point in time due to a variety of table operations (e.g. branch-merge table workflows OR forcing the current state of table to specific snapshot ID).
+
+When processing point in time queries the Iceberg community has chosen to use "snapshot-log" metadata lookup the table state
```

Review Comment:
```suggestion
When processing point in time queries the Iceberg community has chosen to use "snapshot-log" metadata to lookup the table state
```
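For readers skimming the thread: the appendix says point-in-time lookups resolve against the snapshot-log, not against parent-child snapshot lineage. A minimal sketch of that lookup, assuming log entries are ordered by timestamp (an illustration of the semantics, not engine code):

```python
from typing import List, Optional, Tuple

def snapshot_as_of(snapshot_log: List[Tuple[int, int]], timestamp_ms: int) -> Optional[int]:
    """Return the id of the last snapshot that became current at or before
    the given timestamp, scanning the snapshot-log in order."""
    snapshot_id = None
    for entry_ts, entry_id in snapshot_log:  # entries ordered by timestamp
        if entry_ts <= timestamp_ms:
            snapshot_id = entry_id
    return snapshot_id

# (timestamp-ms, snapshot-id) pairs as they would appear in "snapshot-log"
log = [(1_000, 101), (2_000, 102), (3_000, 103)]
print(snapshot_as_of(log, 2_500))  # 102: the snapshot that was current at time 2500
```

The point of using the snapshot-log is that it answers "what was the current snapshot at time T" even when branch merges or a forced rollback make the lineage history tell a different story.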
Re: [PR] Clarify time travel implementation in Iceberg [iceberg]
nastra commented on code in PR #8982:
URL: https://github.com/apache/iceberg/pull/8982#discussion_r1382915561

## format/spec.md:

[quotes the same Appendix F hunk as above, ending at the commented line:]
+Iceberg supports two types of histories for tables. A history of previous "current snapshots" stored ["snapshot-log" table metadata](#table-metadata-fields) and [parent-child lineage stored in "snapshots"](#table-metadata-fields). These two histories

Review Comment:
```suggestion
Iceberg supports two types of histories for tables. A history of previous "current snapshots" stored in ["snapshot-log" table metadata](#table-metadata-fields) and [parent-child lineage stored in "snapshots"](#table-metadata-fields). These two histories
```
Re: [I] ICEBERG_CANNOT_OPEN_SPLIT: Error opening Iceberg split s3 [iceberg]
pawankukreja01 commented on issue #8427:
URL: https://github.com/apache/iceberg/issues/8427#issuecomment-1794305805

This error occurs only in Athena; running a query on the same table using Spark works fine. According to the [Amazon EMR documentation](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg.html), Apache Iceberg is an open table format for large data sets in Amazon Simple Storage Service (Amazon S3). It provides fast query performance over large tables, atomic commits, concurrent writes, and SQL-compatible table evolution. Starting with Amazon EMR 6.5.0, you can use Apache Spark 3 on Amazon EMR clusters with the Iceberg table format.
Re: [PR] Clarify which columns can be used for equality delete files. [iceberg]
gaborkaszab commented on code in PR #8981:
URL: https://github.com/apache/iceberg/pull/8981#discussion_r1382965493

## format/spec.md:

```diff
@@ -842,7 +842,8 @@ The rows in the delete file must be sorted by `file_path` then `pos` to optimize
 
 Equality delete files identify deleted rows in a collection of data files by one or more column values, and may optionally contain additional columns of the deleted row.
 
-Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](#manifests). Float and double columns cannot be used as delete columns in equality delete files.
+Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](#manifests). The column restrictions for columns used in equality delete files are the same as those for [identifier fields](#identifier-field-ids) with the exception that optional columns and columns nested under optional structs are allowed (if a
+parent struct column is null it implies the leaf column is null).
```

Review Comment:
What would be the meaning of a null in the list of equality IDs? If I'm not mistaken, equality IDs are used for identifying which columns are used for the equality checks, and a 'null column' doesn't make sense to me.

If you mean null values in the equality delete files, that's a good question. I believe in the SQL world NULL doesn't equal any value; even NULL doesn't equal NULL. So I wonder what the desired semantics would be when we find a NULL value in an equality delete file. Should we apply "IS NULL" on that particular column? But then it won't be an equality check, as the name "equality delete" would suggest, but an IS NULL check. I'd simply not allow NULL values in the delete files, TBH.
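As an aside for readers, the difference between the two candidate semantics is easy to state in code. A minimal sketch, assuming rows and delete records are plain dicts keyed by column name (hypothetical helpers, not part of any Iceberg API):

```python
from typing import Any, Dict

def matches_sql_equality(row: Dict[str, Any], delete: Dict[str, Any]) -> bool:
    """SQL-style equality: NULL (None) never equals anything, so a delete
    record containing a NULL can never match any row."""
    return all(
        row[col] is not None and delete[col] is not None and row[col] == delete[col]
        for col in delete
    )

def matches_is_null(row: Dict[str, Any], delete: Dict[str, Any]) -> bool:
    """Alternative semantics: treat a NULL in the delete record as an
    IS NULL predicate on that column."""
    return all(
        (row[col] is None) if delete[col] is None else (row[col] == delete[col])
        for col in delete
    )

row = {"id": 1, "category": None}
print(matches_sql_equality(row, {"category": None}))  # False: NULL != NULL
print(matches_is_null(row, {"category": None}))       # True: IS NULL check
```

Under SQL equality semantics a NULL in the delete file can never match anything, which is why the comment suggests disallowing NULLs outright.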
[I] Remove properties from `JdbcUtil` [iceberg]
Fokko opened a new issue, #8989:
URL: https://github.com/apache/iceberg/issues/8989

### Feature Request / Improvement

There are some duplicate properties:
https://github.com/apache/iceberg/blob/b0bf62a448617bd5f57ca72c2648452e6600fa20/core/src/main/java/org/apache/iceberg/jdbc/JdbcUtil.java#L44-L45

We can just re-use the ones in `BaseMetastoreTableOperations`:
https://github.com/apache/iceberg/blob/b0bf62a448617bd5f57ca72c2648452e6600fa20/core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java#L58

Since they are private in `JdbcUtil`, we can safely remove them.

### Query engine

None
Re: [I] Duplicate file name in Iceberg's metadata [iceberg]
github-raphael-douyere commented on issue #8953:
URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1794358478

We enabled S3 versioning on the bucket and can see a file name being used twice by two distinct micro-batches, so it is not a case of task retry inside Spark. This issue leads to data loss, as the original file is replaced, and to metadata corruption, as there is a reference to a file that no longer exists.
Re: [I] Ability to the write Metadata JSON [iceberg-python]
Fokko commented on issue #22:
URL: https://github.com/apache/iceberg-python/issues/22#issuecomment-1794369668

@HonahX Yes, I think that is how it should be done (except the builder pattern, that's very much Java style :). I think we can split the work into several work packages:

- Ability to write the JSON to the object store (that was the intent of this PR).
- Logic to update the metadata dictionary, as you pointed out above. I think we can do this per operation (update schema, update partition-spec, update sort-order, etc.) to keep each change small and get it in quickly.
- The commit method per catalog to update the properties (point to the latest metadata, and [set the previous metadata](https://github.com/apache/iceberg/blob/b0bf62a448617bd5f57ca72c2648452e6600fa20/core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java#L58)); see the sketch after this list.
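A rough sketch of that per-catalog commit flow, under loudly stated assumptions: `catalog.write_bytes`, `catalog.update_table_pointer`, and the `table` attributes below are hypothetical placeholders, not the pyiceberg API.

```python
import json
import uuid

def commit(catalog, table, updated_metadata: dict) -> None:
    # 1. Write the new metadata JSON under a unique, versioned name.
    #    (write_bytes stands in for whatever FileIO call ends up being used.)
    new_version = table.version + 1
    location = f"{table.location}/metadata/{new_version:05d}-{uuid.uuid4()}.metadata.json"
    catalog.write_bytes(location, json.dumps(updated_metadata).encode("utf-8"))

    # 2. Atomically swap the catalog pointer: the new file becomes the current
    #    metadata location and the old one is recorded as the previous location,
    #    mirroring what BaseMetastoreTableOperations does on the Java side.
    catalog.update_table_pointer(
        table.identifier,
        new_metadata_location=location,
        previous_metadata_location=table.metadata_location,
    )
```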
Re: [I] support partition spec update in pyiceberg [iceberg-python]
Fokko commented on issue #124:
URL: https://github.com/apache/iceberg-python/issues/124#issuecomment-1794371184

Thanks for raising this! @puchengy, which catalog are you using? Related is https://github.com/apache/iceberg-python/issues/22
Re: [I] Flink write iceberg bug(org.apache.iceberg.exceptions.NotFoundException) [iceberg]
lirui-apache commented on issue #5846:
URL: https://github.com/apache/iceberg/issues/5846#issuecomment-1794403680

Hi @chenwyi2 @pvary, thanks for the clarifications. Yeah, we changed our snapshot-expire routine to keep the last snapshot created by Flink. If this is a limitation by design, I guess it's better to mention this in the doc.
Re: [PR] Build: Bump mypy-boto3-glue from 1.28.63 to 1.28.77 [iceberg-python]
Fokko merged PR #130:
URL: https://github.com/apache/iceberg-python/pull/130
Re: [PR] Catch warning in PyLint tests [iceberg-python]
Fokko commented on code in PR #33:
URL: https://github.com/apache/iceberg-python/pull/33#discussion_r1383041054

## pyiceberg/manifest.py:

```diff
@@ -783,8 +783,8 @@ def __init__(self, spec: PartitionSpec, schema: Schema, output_file: OutputFile,
             output_file,
             snapshot_id,
             {
-                "schema": schema.json(),
-                "partition-spec": spec.json(),
+                "schema": schema.model_dump_json(),
```

Review Comment:
`.json()` is deprecated and will be replaced with `.model_dump_json()`
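For context, this is the Pydantic v1 to v2 rename; a tiny standalone example (plain Pydantic, not pyiceberg code):

```python
from typing import List

from pydantic import BaseModel

class ExampleSchema(BaseModel):
    schema_id: int
    fields: List[str]

s = ExampleSchema(schema_id=1, fields=["id", "data"])

# Pydantic v1 spelling; still works in v2 but emits a DeprecationWarning:
print(s.json())

# Pydantic v2 replacement, identical output:
print(s.model_dump_json())  # {"schema_id":1,"fields":["id","data"]}
```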
Re: [PR] Catch warning in PyLint tests [iceberg-python]
Fokko merged PR #33:
URL: https://github.com/apache/iceberg-python/pull/33
Re: [PR] Parquet: don't throw exception on row group filters when reading INT96 column [iceberg]
nastra commented on code in PR #8988:
URL: https://github.com/apache/iceberg/pull/8988#discussion_r1383166362

## parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java:

```diff
@@ -199,7 +201,7 @@ public Boolean lt(BoundReference<T> ref, Literal<T> lit) {
       int id = ref.fieldId();
 
       Boolean hasNonDictPage = isFallback.get(id);
-      if (hasNonDictPage == null || hasNonDictPage) {
+      if (hasNonDictPage == null || hasNonDictPage || isInt96Column(id)) {
```

Review Comment:
I'm a bit sceptical about having this check here. Can you elaborate why this is the right place to perform the check?
Re: [PR] Parquet: don't throw exception on row group filters when reading INT96 column [iceberg]
nastra commented on code in PR #8988:
URL: https://github.com/apache/iceberg/pull/8988#discussion_r1383168479

## parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java:

[quotes the same `isInt96Column(id)` hunk as above]

Review Comment:
I would have expected something like this:

```diff
--- a/parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java
+++ b/parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java
@@ -19,6 +19,7 @@
 package org.apache.iceberg.parquet;
 
 import java.io.IOException;
+import java.nio.ByteOrder;
 import java.util.Comparator;
 import java.util.Map;
 import java.util.Set;
@@ -447,6 +448,13 @@ public class ParquetDictionaryRowGroupFilter {
       case INT64:
         dictSet.add((T) conversion.apply(dict.decodeToLong(i)));
         break;
+      case INT96:
+        dictSet.add(
+            (T)
+                conversion.apply(
+                    ParquetUtil.extractTimestampInt96(
+                        dict.decodeToBinary(i).toByteBuffer().order(ByteOrder.LITTLE_ENDIAN))));
+        break;
```
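For background, `extractTimestampInt96` decodes Parquet's legacy 12-byte INT96 timestamp layout: 8 bytes of nanoseconds-of-day followed by a 4-byte Julian day, both little-endian. A small standalone Python sketch of that conversion (an illustration of the format, not the Iceberg code):

```python
import struct
from datetime import datetime, timedelta, timezone

UNIX_EPOCH_JULIAN_DAY = 2_440_588  # 1970-01-01 as a Julian day number

def extract_timestamp_int96(raw: bytes) -> int:
    """Decode a little-endian 12-byte INT96 value to microseconds since epoch."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)  # 8-byte long + 4-byte int
    days_since_epoch = julian_day - UNIX_EPOCH_JULIAN_DAY
    return days_since_epoch * 86_400_000_000 + nanos_of_day // 1_000

# 2023-11-06 00:00:01 UTC encoded as INT96: one second of nanos, Julian day 2460255
raw = struct.pack("<qi", 1_000_000_000, 2_460_255)
micros = extract_timestamp_int96(raw)
print(datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=micros))
# 2023-11-06 00:00:01+00:00
```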
Re: [PR] Build: Bump arrow from 13.0.0 to 14.0.0 [iceberg]
nastra merged PR #8984:
URL: https://github.com/apache/iceberg/pull/8984
Re: [PR] Build: Bump software.amazon.awssdk:bom from 2.21.10 to 2.21.15 [iceberg]
nastra merged PR #8983:
URL: https://github.com/apache/iceberg/pull/8983
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383228406

## pyiceberg/table/snapshots.py:

@@ -116,3 +144,202 @@ class MetadataLogEntry(IcebergBaseModel):
```python
class SnapshotLogEntry(IcebergBaseModel):
    snapshot_id: int = Field(alias="snapshot-id")
    timestamp_ms: int = Field(alias="timestamp-ms")


class SnapshotSummaryCollector:
    added_size: int
    removed_size: int
    added_files: int
    removed_files: int
    added_eq_delete_files: int
    removed_eq_delete_files: int
    added_pos_delete_files: int
    removed_pos_delete_files: int
    added_delete_files: int
    removed_delete_files: int
    added_records: int
    deleted_records: int
    added_pos_deletes: int
    removed_pos_deletes: int
    added_eq_deletes: int
    removed_eq_deletes: int

    def __init__(self) -> None:
        self.added_size = 0
        self.removed_size = 0
        self.added_files = 0
        self.removed_files = 0
        self.added_eq_delete_files = 0
        self.removed_eq_delete_files = 0
        self.added_pos_delete_files = 0
        self.removed_pos_delete_files = 0
        self.added_delete_files = 0
        self.removed_delete_files = 0
        self.added_records = 0
        self.deleted_records = 0
        self.added_pos_deletes = 0
        self.removed_pos_deletes = 0
        self.added_eq_deletes = 0
        self.removed_eq_deletes = 0

    def add_file(self, data_file: DataFile) -> None:
        self.added_size += data_file.file_size_in_bytes

        if data_file.content == DataFileContent.DATA:
            self.added_files += 1
            self.added_records += data_file.record_count
        elif data_file.content == DataFileContent.POSITION_DELETES:
            self.added_delete_files += 1
            self.added_pos_delete_files += 1
            self.added_pos_deletes += data_file.record_count
        elif data_file.content == DataFileContent.EQUALITY_DELETES:
            self.added_delete_files += 1
            self.added_eq_delete_files += 1
            self.added_eq_deletes += data_file.record_count
        else:
            raise ValueError(f"Unknown data file content: {data_file.content}")

    def removed_file(self, data_file: DataFile) -> None:
        self.removed_size += data_file.file_size_in_bytes

        if data_file.content == DataFileContent.DATA:
            self.removed_files += 1
            self.deleted_records += data_file.record_count
        elif data_file.content == DataFileContent.POSITION_DELETES:
            self.removed_delete_files += 1
            self.removed_pos_delete_files += 1
            self.removed_pos_deletes += data_file.record_count
        elif data_file.content == DataFileContent.EQUALITY_DELETES:
            self.removed_delete_files += 1
            self.removed_eq_delete_files += 1
            self.removed_eq_deletes += data_file.record_count
        else:
            raise ValueError(f"Unknown data file content: {data_file.content}")

    def added_manifest(self, manifest: ManifestFile) -> None:
        if manifest.content == ManifestContent.DATA:
            self.added_files += manifest.added_files_count or 0
            self.added_records += manifest.added_rows_count or 0
            self.removed_files += manifest.deleted_files_count or 0
            self.deleted_records += manifest.deleted_rows_count or 0
        elif manifest.content == ManifestContent.DELETES:
            self.added_delete_files += manifest.added_files_count or 0
            self.removed_delete_files += manifest.deleted_files_count or 0
        else:
            raise ValueError(f"Unknown manifest file content: {manifest.content}")

    def build(self) -> Dict[str, str]:
        def set_non_zero(properties: Dict[str, str], num: int, property_name: str) -> None:
            if num > 0:
                properties[property_name] = str(num)

        properties: Dict[str, str] = {}
        set_non_zero(properties, self.added_size, 'added-files-size')
        set_non_zero(properties, self.removed_size, 'removed-files-size')
        set_non_zero(properties, self.added_files, 'added-data-files')
        set_non_zero(properties, self.removed_files, 'removed-data-files')
```

Review Comment:
Ahh, thanks!
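To see what the collector produces, a quick usage sketch against the class as written in this diff; `FakeDataFile` is a hypothetical stand-in exposing just the three attributes `add_file` reads, and the import path assumes this PR's module layout:

```python
from dataclasses import dataclass

from pyiceberg.manifest import DataFileContent
from pyiceberg.table.snapshots import SnapshotSummaryCollector

@dataclass
class FakeDataFile:
    content: DataFileContent
    record_count: int
    file_size_in_bytes: int

collector = SnapshotSummaryCollector()
collector.add_file(FakeDataFile(DataFileContent.DATA, record_count=100, file_size_in_bytes=1234))
# Only non-zero counters end up in the snapshot summary properties:
print(collector.build())
# {'added-files-size': '1234', 'added-data-files': '1', 'added-records': '100'}
```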
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383228989

## pyiceberg/table/snapshots.py:

[quoted `SnapshotSummaryCollector` hunk omitted; identical to the diff shown above. The review comment on this thread was not captured in this digest.]
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383232094

## pyiceberg/table/snapshots.py:

[quoted `SnapshotSummaryCollector` hunk omitted; identical to the diff shown above. The review comment on this thread was not captured in this digest.]
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383244841

## pyiceberg/table/snapshots.py:

[quoted `SnapshotSummaryCollector` hunk omitted; identical to the diff shown above. The review comment on this thread was not captured in this digest.]
Re: [I] Flink write iceberg bug(org.apache.iceberg.exceptions.NotFoundException) [iceberg]
pvary commented on issue #5846:
URL: https://github.com/apache/iceberg/issues/5846#issuecomment-1794748179

@lirui-apache: Would you mind adding this to `docs/flink-writes.md`? I would be happy to review.

Thanks, Peter
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383290717

## pyiceberg/table/snapshots.py:

[quoted `SnapshotSummaryCollector` hunk omitted; identical to the diff shown above. The review comment on this thread was not captured in this digest.]
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383292093

## pyiceberg/table/snapshots.py:

[quoted `SnapshotSummaryCollector` hunk omitted; identical to the diff shown above, ending at the commented line:]
+    def add_file(self, data_file: DataFile) -> None:

Review Comment:
Yes, this represents the `data_file` in the Manifest: https://iceberg.apache.org/spec/#manifests
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383293089

## pyiceberg/table/snapshots.py:

[quoted `SnapshotSummaryCollector` hunk omitted; identical to the diff shown above, ending at the commented line:]
+    def removed_file(self, data_file: DataFile) -> None:

Review Comment:
Good one, I went for `add_file` and `remove_file`.
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383296211

## pyiceberg/table/snapshots.py:

[quoted `SnapshotSummaryCollector` hunk omitted; identical to the diff shown above, ending at the commented line:]
+    def added_manifest(self, manifest: ManifestFile) -> None:

Review Comment:
When you append to a table, we'll append the existing manifests: https://github.com/apache/iceberg-python/pull/41/files#diff-23e8153e0fd497a9212215bd2067068f3b56fa071770c7ef326db3d3d03cee9bR1811-R1821 Any concerns?
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
Fokko commented on code in PR #61:
URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383296718

## pyiceberg/table/snapshots.py:

[quoted `SnapshotSummaryCollector` hunk omitted; identical to the diff shown above, ending at the commented line:]
+        def set_non_zero(properties: Dict[str, str], num: int, property_name: str) -> None:

Review Comment:
Good one!
Re: [PR] Core: Iceberg streaming streaming-skip-overwrite-snapshots SparkMicroBatchStream only skips over one file per trigger [iceberg]
cccs-jc commented on code in PR #8980:
URL: https://github.com/apache/iceberg/pull/8980#discussion_r1383313970

## core/src/main/java/org/apache/iceberg/MicroBatches.java:

```diff
@@ -92,7 +92,7 @@ private static List<Pair<ManifestFile, Integer>> indexManifests(
     for (ManifestFile manifest : manifestFiles) {
       manifestIndexes.add(Pair.of(manifest, currentFileIndex));
-      currentFileIndex += manifest.addedFilesCount() + manifest.existingFilesCount();
+      currentFileIndex += manifest.addedFilesCount();
```

Review Comment:
I would, but I don't know how?
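For context on what `indexManifests` computes: it pairs each manifest with the position of its first file in a global file ordering, so a micro-batch can resume from an arbitrary file index. A simplified Python sketch of the idea (illustrative only, with dicts standing in for `ManifestFile`):

```python
from typing import Dict, List, Tuple

def index_manifests(manifests: List[Dict[str, int]]) -> List[Tuple[Dict[str, int], int]]:
    """Pair each manifest with the global index of its first file."""
    indexes = []
    current_file_index = 0
    for manifest in manifests:
        indexes.append((manifest, current_file_index))
        # Mirrors the PR's proposed change: count only files added in this
        # snapshot; also counting existing (carried-over) files inflates
        # the offsets between manifests.
        current_file_index += manifest["added_files_count"]
    return indexes

manifests = [{"added_files_count": 3}, {"added_files_count": 2}]
print(index_manifests(manifests))
# [({'added_files_count': 3}, 0), ({'added_files_count': 2}, 3)]
```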
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
Fokko commented on code in PR #61: URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383329498 ## pyiceberg/table/snapshots.py: ## @@ -116,3 +144,199 @@ class MetadataLogEntry(IcebergBaseModel): class SnapshotLogEntry(IcebergBaseModel): snapshot_id: int = Field(alias="snapshot-id") timestamp_ms: int = Field(alias="timestamp-ms") + + +class SnapshotSummaryCollector: +added_size: int +removed_size: int +added_files: int +removed_files: int +added_eq_delete_files: int +removed_eq_delete_files: int +added_pos_delete_files: int +removed_pos_delete_files: int +added_delete_files: int +removed_delete_files: int +added_records: int +deleted_records: int +added_pos_deletes: int +removed_pos_deletes: int +added_eq_deletes: int +removed_eq_deletes: int + +def __init__(self) -> None: +self.added_size = 0 +self.removed_size = 0 +self.added_files = 0 +self.removed_files = 0 +self.added_eq_delete_files = 0 +self.removed_eq_delete_files = 0 +self.added_pos_delete_files = 0 +self.removed_pos_delete_files = 0 +self.added_delete_files = 0 +self.removed_delete_files = 0 +self.added_records = 0 +self.deleted_records = 0 +self.added_pos_deletes = 0 +self.removed_pos_deletes = 0 +self.added_eq_deletes = 0 +self.removed_eq_deletes = 0 + +def add_file(self, data_file: DataFile) -> None: +if data_file.content == DataFileContent.DATA: +self.added_files += 1 +self.added_records += data_file.record_count +self.added_size += data_file.file_size_in_bytes +elif data_file.content == DataFileContent.POSITION_DELETES: +self.added_delete_files += 1 +self.added_pos_delete_files += 1 +self.added_pos_deletes += data_file.record_count +elif data_file.content == DataFileContent.EQUALITY_DELETES: +self.added_delete_files += 1 +self.added_eq_delete_files += 1 +self.added_eq_deletes += data_file.record_count +else: +raise ValueError(f"Unknown data file content: {data_file.content}") + +def removed_file(self, data_file: DataFile) -> None: +if data_file.content == DataFileContent.DATA: +self.removed_files += 1 +self.deleted_records += data_file.record_count +elif data_file.content == DataFileContent.POSITION_DELETES: +self.removed_delete_files += 1 +self.removed_pos_delete_files += 1 +self.removed_pos_deletes += data_file.record_count +elif data_file.content == DataFileContent.EQUALITY_DELETES: +self.removed_delete_files += 1 +self.removed_eq_delete_files += 1 +self.removed_eq_deletes += data_file.record_count +else: +raise ValueError(f"Unknown data file content: {data_file.content}") + +def added_manifest(self, manifest: ManifestFile) -> None: +if manifest.content == ManifestContent.DATA: +self.added_files += manifest.added_files_count or 0 +self.added_records += manifest.added_rows_count or 0 +self.removed_files += manifest.deleted_files_count or 0 +self.deleted_records += manifest.deleted_rows_count or 0 +elif manifest.content == ManifestContent.DELETES: +self.added_delete_files += manifest.added_files_count or 0 +self.removed_delete_files += manifest.deleted_files_count or 0 +else: +raise ValueError(f"Unknown manifest file content: {manifest.content}") + +def build(self) -> Dict[str, str]: +def set_non_zero(properties: Dict[str, str], num: int, property_name: str) -> None: +if num > 0: +properties[property_name] = str(num) + +properties: Dict[str, str] = {} +set_non_zero(properties, self.added_size, 'added-files-size') +set_non_zero(properties, self.removed_size, 'removed-files-size') +set_non_zero(properties, self.added_files, 'added-data-files') +set_non_zero(properties, self.removed_files, 'removed-data-files') 
+set_non_zero(properties, self.added_eq_delete_files, 'added-equality-delete-files') +set_non_zero(properties, self.removed_eq_delete_files, 'removed-equality-delete-files') +set_non_zero(properties, self.added_pos_delete_files, 'added-position-delete-files') +set_non_zero(properties, self.removed_pos_delete_files, 'removed-position-delete-files') +set_non_zero(properties, self.added_delete_files, 'added-delete-files') +set_non_zero(properties, self.removed_delete_files, 'removed-delete-files') +set_non_zero(properties, self.added_records, 'added-records') +set_non_zero(properties, self.deleted_records, 'deleted-records') +set_non_zero(pr
Re: [PR] Add Snapshot logic and Summary generation [iceberg-python]
Fokko commented on code in PR #61: URL: https://github.com/apache/iceberg-python/pull/61#discussion_r1383319743 ## pyiceberg/table/snapshots.py: ## @@ -116,3 +144,202 @@ class MetadataLogEntry(IcebergBaseModel): class SnapshotLogEntry(IcebergBaseModel): snapshot_id: int = Field(alias="snapshot-id") timestamp_ms: int = Field(alias="timestamp-ms") + + +class SnapshotSummaryCollector: +added_size: int +removed_size: int +added_files: int +removed_files: int +added_eq_delete_files: int +removed_eq_delete_files: int +added_pos_delete_files: int +removed_pos_delete_files: int +added_delete_files: int +removed_delete_files: int +added_records: int +deleted_records: int +added_pos_deletes: int +removed_pos_deletes: int +added_eq_deletes: int +removed_eq_deletes: int + +def __init__(self) -> None: +self.added_size = 0 +self.removed_size = 0 +self.added_files = 0 +self.removed_files = 0 +self.added_eq_delete_files = 0 +self.removed_eq_delete_files = 0 +self.added_pos_delete_files = 0 +self.removed_pos_delete_files = 0 +self.added_delete_files = 0 +self.removed_delete_files = 0 +self.added_records = 0 +self.deleted_records = 0 +self.added_pos_deletes = 0 +self.removed_pos_deletes = 0 +self.added_eq_deletes = 0 +self.removed_eq_deletes = 0 + +def add_file(self, data_file: DataFile) -> None: +self.added_size += data_file.file_size_in_bytes + +if data_file.content == DataFileContent.DATA: +self.added_files += 1 +self.added_records += data_file.record_count +elif data_file.content == DataFileContent.POSITION_DELETES: +self.added_delete_files += 1 +self.added_pos_delete_files += 1 +self.added_pos_deletes += data_file.record_count +elif data_file.content == DataFileContent.EQUALITY_DELETES: +self.added_delete_files += 1 +self.added_eq_delete_files += 1 +self.added_eq_deletes += data_file.record_count +else: +raise ValueError(f"Unknown data file content: {data_file.content}") + +def removed_file(self, data_file: DataFile) -> None: +self.removed_size += data_file.file_size_in_bytes + +if data_file.content == DataFileContent.DATA: +self.removed_files += 1 +self.deleted_records += data_file.record_count +elif data_file.content == DataFileContent.POSITION_DELETES: +self.removed_delete_files += 1 +self.removed_pos_delete_files += 1 +self.removed_pos_deletes += data_file.record_count +elif data_file.content == DataFileContent.EQUALITY_DELETES: +self.removed_delete_files += 1 +self.removed_eq_delete_files += 1 +self.removed_eq_deletes += data_file.record_count +else: +raise ValueError(f"Unknown data file content: {data_file.content}") + +def added_manifest(self, manifest: ManifestFile) -> None: Review Comment: I'm also okay with going with normal appends instead of fast-appends. I've removed the method. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383335036 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java: ## @@ -264,6 +251,146 @@ public void renameTable(TableIdentifier from, TableIdentifier originalTo) { } } + @Override + public boolean viewExists(TableIdentifier identifier) { +return HiveCatalogUtil.isTableWithTypeExists(clients, identifier, TableType.VIRTUAL_VIEW); + } + + @Override + public boolean dropView(TableIdentifier identifier) { +if (!isValidIdentifier(identifier)) { + return false; +} +try { + String database = identifier.namespace().level(0); + String viewName = identifier.name(); + Table table = clients.run(client -> client.getTable(database, viewName)); + HiveCatalogUtil.validateTableIsIcebergView(table, fullTableName(name, identifier)); + clients.run( + client -> { +client.dropTable(database, viewName); +return null; + }); + LOG.info("Dropped View: {}", identifier); + return true; + +} catch (NoSuchViewException | NoSuchObjectException e) { + LOG.info("Skipping drop, View does not exist: {}", identifier, e); + return false; +} catch (TException e) { + throw new RuntimeException("Failed to drop " + identifier, e); +} catch (InterruptedException e) { + Thread.currentThread().interrupt(); + throw new RuntimeException("Interrupted in call to dropView", e); +} + } + + @Override + public List listViews(Namespace namespace) { +try { + return listContents(namespace, TableType.VIRTUAL_VIEW.name(), icebergPredicate()); +} catch (UnknownDBException e) { + throw new NoSuchNamespaceException("Namespace does not exist: %s", namespace); + +} catch (TException e) { + throw new RuntimeException("Failed to list all views under namespace " + namespace, e); + +} catch (InterruptedException e) { + Thread.currentThread().interrupt(); + throw new RuntimeException("Interrupted in call to listViews", e); +} + } + + private List listContents( + Namespace namespace, String tableType, Predicate tablePredicate) + throws TException, InterruptedException { +Preconditions.checkArgument( +isValidateNamespace(namespace), "Missing database in namespace: %s", namespace); +String database = namespace.level(0); +List tableNames = +StringUtils.isNotEmpty(tableType) +? clients.run(client -> client.getTables(database, "*", TableType.valueOf(tableType))) +: clients.run(client -> client.getAllTables(database)); +List tableObjects = +clients.run(client -> client.getTableObjectsByName(database, tableNames)); Review Comment: This HMS call was not used before the patch when all tables were listed. After the patch we fetch all of the tables, even if it is not necessary. I remember cases where there were too many tables in a database and fetching all of the tables was problematic, so I would like us to find a better solution here -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
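One direction worth exploring, sketched below under the assumption that the HMS filter API behaves consistently across the Hive versions Iceberg supports: push the Iceberg `table_type` parameter check into the metastore with `listTableNamesByFilter`, so only matching names come back and no per-table `Table` fetch is needed. The filter-string syntax is my best understanding of the HMS filter language and should be verified:

```java
import java.util.List;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.hive_metastoreConstants;

final class IcebergNameFilter {
  private IcebergNameFilter() {}

  // Server-side filter on the table_type=ICEBERG table parameter: the HMS
  // evaluates the predicate, so the client never materializes non-Iceberg
  // Table objects for large databases.
  static List<String> listIcebergNames(IMetaStoreClient client, String database)
      throws Exception {
    String filter =
        hive_metastoreConstants.HIVE_FILTER_FIELD_PARAMS + "table_type LIKE \"ICEBERG\"";
    return client.listTableNamesByFilter(database, filter, (short) -1);
  }
}
```

Note this alone does not separate tables from views; the `getTables(database, "*", tableType)` call in the patch would still be needed for the type split.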
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383336007 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java: ## @@ -115,42 +126,13 @@ public void initialize(String inputName, Map properties) { @Override public List listTables(Namespace namespace) { -Preconditions.checkArgument( -isValidateNamespace(namespace), "Missing database in namespace: %s", namespace); -String database = namespace.level(0); try { - List tableNames = clients.run(client -> client.getAllTables(database)); - List tableIdentifiers; - if (listAllTables) { -tableIdentifiers = -tableNames.stream() -.map(t -> TableIdentifier.of(namespace, t)) -.collect(Collectors.toList()); +return listContents(namespace, null, table -> true); } else { -List tableObjects = -clients.run(client -> client.getTableObjectsByName(database, tableNames)); -tableIdentifiers = -tableObjects.stream() -.filter( -table -> -table.getParameters() != null -&& BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE -.equalsIgnoreCase( -table -.getParameters() - .get(BaseMetastoreTableOperations.TABLE_TYPE_PROP))) -.map(table -> TableIdentifier.of(namespace, table.getTableName())) -.collect(Collectors.toList()); +return listContents(namespace, TableType.EXTERNAL_TABLE.name(), icebergPredicate()); } - - LOG.debug( - "Listing of namespace: {} resulted in the following tables: {}", - namespace, - tableIdentifiers); Review Comment: Did we intentionally remove this log line? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383338484 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java: ## @@ -264,6 +251,146 @@ public void renameTable(TableIdentifier from, TableIdentifier originalTo) { } } + @Override + public boolean viewExists(TableIdentifier identifier) { +return HiveCatalogUtil.isTableWithTypeExists(clients, identifier, TableType.VIRTUAL_VIEW); + } + + @Override + public boolean dropView(TableIdentifier identifier) { +if (!isValidIdentifier(identifier)) { + return false; +} Review Comment: nit: newline -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383343370 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java: ## @@ -264,6 +251,146 @@ public void renameTable(TableIdentifier from, TableIdentifier originalTo) { } } + @Override + public boolean viewExists(TableIdentifier identifier) { +return HiveCatalogUtil.isTableWithTypeExists(clients, identifier, TableType.VIRTUAL_VIEW); + } + + @Override + public boolean dropView(TableIdentifier identifier) { +if (!isValidIdentifier(identifier)) { + return false; +} +try { + String database = identifier.namespace().level(0); + String viewName = identifier.name(); + Table table = clients.run(client -> client.getTable(database, viewName)); + HiveCatalogUtil.validateTableIsIcebergView(table, fullTableName(name, identifier)); + clients.run( + client -> { +client.dropTable(database, viewName); +return null; + }); + LOG.info("Dropped View: {}", identifier); + return true; + +} catch (NoSuchViewException | NoSuchObjectException e) { + LOG.info("Skipping drop, View does not exist: {}", identifier, e); + return false; +} catch (TException e) { + throw new RuntimeException("Failed to drop " + identifier, e); +} catch (InterruptedException e) { + Thread.currentThread().interrupt(); + throw new RuntimeException("Interrupted in call to dropView", e); +} + } + + @Override + public List listViews(Namespace namespace) { +try { + return listContents(namespace, TableType.VIRTUAL_VIEW.name(), icebergPredicate()); +} catch (UnknownDBException e) { + throw new NoSuchNamespaceException("Namespace does not exist: %s", namespace); + +} catch (TException e) { + throw new RuntimeException("Failed to list all views under namespace " + namespace, e); + +} catch (InterruptedException e) { + Thread.currentThread().interrupt(); + throw new RuntimeException("Interrupted in call to listViews", e); +} + } + + private List listContents( + Namespace namespace, String tableType, Predicate tablePredicate) + throws TException, InterruptedException { +Preconditions.checkArgument( +isValidateNamespace(namespace), "Missing database in namespace: %s", namespace); +String database = namespace.level(0); +List tableNames = +StringUtils.isNotEmpty(tableType) +? clients.run(client -> client.getTables(database, "*", TableType.valueOf(tableType))) +: clients.run(client -> client.getAllTables(database)); +List tableObjects = +clients.run(client -> client.getTableObjectsByName(database, tableNames)); +List tableIdentifiers = +tableObjects.stream() +.filter(tablePredicate) +.map(table -> TableIdentifier.of(namespace, table.getTableName())) +.collect(Collectors.toList()); + +LOG.debug( +"Listing of namespace: {} for table type {} resulted in the following: {}", +namespace, +tableType, +tableIdentifiers); +return tableIdentifiers; + } + + private Predicate icebergPredicate() { +return table -> +table.getParameters() != null +&& BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.equalsIgnoreCase( + table.getParameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP)); + } + + @Override + @SuppressWarnings("FormatStringAnnotation") + public void renameView(TableIdentifier from, TableIdentifier originalTo) { + +if (!isValidIdentifier(from)) { + throw new NoSuchViewException("Invalid identifier: %s", from); +} + +if (!namespaceExists(originalTo.namespace())) { + throw new NoSuchNamespaceException( + "Cannot rename %s to %s. 
Namespace does not exist: %s", + from, originalTo, originalTo.namespace()); +} + +TableIdentifier to = removeCatalogName(originalTo); +Preconditions.checkArgument(isValidIdentifier(to), "Invalid identifier: %s", to); + +String toDatabase = to.namespace().level(0); +String fromDatabase = from.namespace().level(0); +String fromName = from.name(); + +try { + Table fromView = clients.run(client -> client.getTable(fromDatabase, fromName)); + HiveCatalogUtil.validateTableIsIcebergView(fromView, fullTableName(name, from)); + if (tableExists(to)) { +LOG.warn("Cannot rename view {} to {}. Table {} already exists.", from, to, to); +throw new AlreadyExistsException( +String.format("Cannot rename %s to %s. Table already exists", from, to)); + } Review Comment: nit: newline -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383344359 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java: ## @@ -264,6 +251,146 @@ public void renameTable(TableIdentifier from, TableIdentifier originalTo) { } } + @Override + public boolean viewExists(TableIdentifier identifier) { +return HiveCatalogUtil.isTableWithTypeExists(clients, identifier, TableType.VIRTUAL_VIEW); + } + + @Override + public boolean dropView(TableIdentifier identifier) { +if (!isValidIdentifier(identifier)) { + return false; +} +try { + String database = identifier.namespace().level(0); + String viewName = identifier.name(); + Table table = clients.run(client -> client.getTable(database, viewName)); + HiveCatalogUtil.validateTableIsIcebergView(table, fullTableName(name, identifier)); + clients.run( + client -> { +client.dropTable(database, viewName); +return null; + }); + LOG.info("Dropped View: {}", identifier); + return true; + +} catch (NoSuchViewException | NoSuchObjectException e) { + LOG.info("Skipping drop, View does not exist: {}", identifier, e); + return false; +} catch (TException e) { + throw new RuntimeException("Failed to drop " + identifier, e); +} catch (InterruptedException e) { + Thread.currentThread().interrupt(); + throw new RuntimeException("Interrupted in call to dropView", e); +} + } + + @Override + public List listViews(Namespace namespace) { +try { + return listContents(namespace, TableType.VIRTUAL_VIEW.name(), icebergPredicate()); +} catch (UnknownDBException e) { + throw new NoSuchNamespaceException("Namespace does not exist: %s", namespace); + +} catch (TException e) { + throw new RuntimeException("Failed to list all views under namespace " + namespace, e); + +} catch (InterruptedException e) { + Thread.currentThread().interrupt(); + throw new RuntimeException("Interrupted in call to listViews", e); +} + } + + private List listContents( + Namespace namespace, String tableType, Predicate tablePredicate) + throws TException, InterruptedException { +Preconditions.checkArgument( +isValidateNamespace(namespace), "Missing database in namespace: %s", namespace); +String database = namespace.level(0); +List tableNames = +StringUtils.isNotEmpty(tableType) +? clients.run(client -> client.getTables(database, "*", TableType.valueOf(tableType))) +: clients.run(client -> client.getAllTables(database)); +List tableObjects = +clients.run(client -> client.getTableObjectsByName(database, tableNames)); +List tableIdentifiers = +tableObjects.stream() +.filter(tablePredicate) +.map(table -> TableIdentifier.of(namespace, table.getTableName())) +.collect(Collectors.toList()); + +LOG.debug( +"Listing of namespace: {} for table type {} resulted in the following: {}", +namespace, +tableType, +tableIdentifiers); +return tableIdentifiers; + } + + private Predicate icebergPredicate() { +return table -> +table.getParameters() != null +&& BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.equalsIgnoreCase( + table.getParameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP)); + } + + @Override + @SuppressWarnings("FormatStringAnnotation") + public void renameView(TableIdentifier from, TableIdentifier originalTo) { + +if (!isValidIdentifier(from)) { + throw new NoSuchViewException("Invalid identifier: %s", from); +} + +if (!namespaceExists(originalTo.namespace())) { + throw new NoSuchNamespaceException( + "Cannot rename %s to %s. 
Namespace does not exist: %s", + from, originalTo, originalTo.namespace()); +} + +TableIdentifier to = removeCatalogName(originalTo); +Preconditions.checkArgument(isValidIdentifier(to), "Invalid identifier: %s", to); + +String toDatabase = to.namespace().level(0); +String fromDatabase = from.namespace().level(0); +String fromName = from.name(); + +try { + Table fromView = clients.run(client -> client.getTable(fromDatabase, fromName)); + HiveCatalogUtil.validateTableIsIcebergView(fromView, fullTableName(name, from)); + if (tableExists(to)) { +LOG.warn("Cannot rename view {} to {}. Table {} already exists.", from, to, to); +throw new AlreadyExistsException( +String.format("Cannot rename %s to %s. Table already exists", from, to)); + } + if (viewExists(to)) { Review Comment: We are double-fetching the table from the HMS. Do we have a better solution? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
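One possible shape for avoiding the second round-trip, as a sketch only (the helper name is invented; whether it slots cleanly into `renameView` depends on the surrounding error handling): fetch the destination once and classify it by `TableType`, instead of calling `tableExists` and then `viewExists` back to back.

```java
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.TableType;
import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
import org.apache.hadoop.hive.metastore.api.Table;
import org.apache.iceberg.exceptions.AlreadyExistsException;
import org.apache.thrift.TException;

final class RenameTargetCheck {
  private RenameTargetCheck() {}

  // A single getTable() answers both questions: does anything exist at the
  // destination, and if so, is it a view or a table.
  static void checkDestinationFree(IMetaStoreClient client, String db, String name)
      throws TException {
    Table existing;
    try {
      existing = client.getTable(db, name);
    } catch (NoSuchObjectException e) {
      return; // destination is free, rename can proceed
    }

    boolean isView = TableType.VIRTUAL_VIEW.name().equalsIgnoreCase(existing.getTableType());
    throw new AlreadyExistsException(
        "Cannot rename to %s.%s: a %s with that name already exists",
        db, name, isView ? "view" : "table");
  }
}
```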
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383346182 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalogUtil.java: ## @@ -0,0 +1,110 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.hive; + +import java.util.List; +import org.apache.hadoop.hive.metastore.IMetaStoreClient; +import org.apache.hadoop.hive.metastore.TableType; +import org.apache.hadoop.hive.metastore.api.Table; +import org.apache.iceberg.BaseMetastoreTableOperations; +import org.apache.iceberg.ClientPool; +import org.apache.iceberg.catalog.TableIdentifier; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.exceptions.NoSuchIcebergTableException; +import org.apache.iceberg.exceptions.NoSuchIcebergViewException; +import org.apache.thrift.TException; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** A utility class to validate Hive Iceberg Table and Views. */ +final class HiveCatalogUtil { + + private static final Logger LOG = LoggerFactory.getLogger(HiveCatalogUtil.class); + + // the max size is based on HMS backend database. For Hive versions below 2.3, the max table + // parameter size is 4000 + // characters, see https://issues.apache.org/jira/browse/HIVE-12274 + // set to 0 to not expose Iceberg metadata in HMS Table properties. Review Comment: I do not get this comment -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383350384 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalogUtil.java: ## @@ -0,0 +1,110 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.hive; + +import java.util.List; +import org.apache.hadoop.hive.metastore.IMetaStoreClient; +import org.apache.hadoop.hive.metastore.TableType; +import org.apache.hadoop.hive.metastore.api.Table; +import org.apache.iceberg.BaseMetastoreTableOperations; +import org.apache.iceberg.ClientPool; +import org.apache.iceberg.catalog.TableIdentifier; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.exceptions.NoSuchIcebergTableException; +import org.apache.iceberg.exceptions.NoSuchIcebergViewException; +import org.apache.thrift.TException; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** A utility class to validate Hive Iceberg Table and Views. */ +final class HiveCatalogUtil { + + private static final Logger LOG = LoggerFactory.getLogger(HiveCatalogUtil.class); + + // the max size is based on HMS backend database. For Hive versions below 2.3, the max table + // parameter size is 4000 + // characters, see https://issues.apache.org/jira/browse/HIVE-12274 + // set to 0 to not expose Iceberg metadata in HMS Table properties. + static final String HIVE_TABLE_PROPERTY_MAX_SIZE = "iceberg.hive.table-property-max-size"; + static final long HIVE_TABLE_PROPERTY_MAX_SIZE_DEFAULT = 32672; + + private HiveCatalogUtil() { +// empty constructor for utility class + } + + static boolean isTableWithTypeExists( + ClientPool clients, + TableIdentifier identifier, + TableType tableType) { +String database = identifier.namespace().level(0); +String tableName = identifier.name(); +try { + List tables = clients.run(client -> client.getTables(database, tableName, tableType)); + return !tables.isEmpty(); +} catch (TException e) { + throw new RuntimeException( + "Failed to check table existence " + database + "." 
+ tableName, e); +} catch (InterruptedException e) { + Thread.currentThread().interrupt(); + throw new RuntimeException("Interrupted in call to listTables", e); +} + } + + static void validateTableIsIcebergView(Table table, String fullName) { +String tableType = table.getParameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP); +NoSuchIcebergViewException.check( +table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name()) +&& tableType != null +&& tableType.equalsIgnoreCase(BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE), +"Not an iceberg view: %s (type=%s) (tableType=%s)", +fullName, +tableType, +table.getTableType()); + } + + static void validateTableIsIceberg(Table table, String fullName) { +if (table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) { + throw new AlreadyExistsException( + "View with same name already exists: %s.%s", table.getDbName(), table.getTableName()); +} Review Comment: nit: newline -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383356758 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalogUtil.java: ## @@ -0,0 +1,110 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.hive; + +import java.util.List; +import org.apache.hadoop.hive.metastore.IMetaStoreClient; +import org.apache.hadoop.hive.metastore.TableType; +import org.apache.hadoop.hive.metastore.api.Table; +import org.apache.iceberg.BaseMetastoreTableOperations; +import org.apache.iceberg.ClientPool; +import org.apache.iceberg.catalog.TableIdentifier; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.exceptions.NoSuchIcebergTableException; +import org.apache.iceberg.exceptions.NoSuchIcebergViewException; +import org.apache.thrift.TException; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** A utility class to validate Hive Iceberg Table and Views. */ +final class HiveCatalogUtil { + + private static final Logger LOG = LoggerFactory.getLogger(HiveCatalogUtil.class); + + // the max size is based on HMS backend database. For Hive versions below 2.3, the max table + // parameter size is 4000 + // characters, see https://issues.apache.org/jira/browse/HIVE-12274 + // set to 0 to not expose Iceberg metadata in HMS Table properties. + static final String HIVE_TABLE_PROPERTY_MAX_SIZE = "iceberg.hive.table-property-max-size"; + static final long HIVE_TABLE_PROPERTY_MAX_SIZE_DEFAULT = 32672; + + private HiveCatalogUtil() { +// empty constructor for utility class + } + + static boolean isTableWithTypeExists( + ClientPool clients, + TableIdentifier identifier, + TableType tableType) { +String database = identifier.namespace().level(0); +String tableName = identifier.name(); +try { + List tables = clients.run(client -> client.getTables(database, tableName, tableType)); + return !tables.isEmpty(); +} catch (TException e) { + throw new RuntimeException( + "Failed to check table existence " + database + "." 
+ tableName, e); +} catch (InterruptedException e) { + Thread.currentThread().interrupt(); + throw new RuntimeException("Interrupted in call to listTables", e); +} + } + + static void validateTableIsIcebergView(Table table, String fullName) { +String tableType = table.getParameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP); +NoSuchIcebergViewException.check( +table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name()) +&& tableType != null +&& tableType.equalsIgnoreCase(BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE), +"Not an iceberg view: %s (type=%s) (tableType=%s)", +fullName, +tableType, +table.getTableType()); + } + + static void validateTableIsIceberg(Table table, String fullName) { +if (table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) { + throw new AlreadyExistsException( Review Comment: This is a new type of exception we are throwing here. We need to double-check that it is handled correctly everywhere -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383360341 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java: ## @@ -0,0 +1,389 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.hive; + +import java.util.Collections; +import java.util.Locale; +import java.util.Map; +import java.util.Objects; +import java.util.Optional; +import java.util.concurrent.atomic.AtomicReference; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.hive.metastore.IMetaStoreClient; +import org.apache.hadoop.hive.metastore.TableType; +import org.apache.hadoop.hive.metastore.api.InvalidObjectException; +import org.apache.hadoop.hive.metastore.api.NoSuchObjectException; +import org.apache.hadoop.hive.metastore.api.SerDeInfo; +import org.apache.hadoop.hive.metastore.api.StorageDescriptor; +import org.apache.hadoop.hive.metastore.api.Table; +import org.apache.iceberg.BaseMetastoreTableOperations; +import org.apache.iceberg.ClientPool; +import org.apache.iceberg.SchemaParser; +import org.apache.iceberg.TableProperties; +import org.apache.iceberg.catalog.TableIdentifier; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.exceptions.CommitFailedException; +import org.apache.iceberg.exceptions.CommitStateUnknownException; +import org.apache.iceberg.exceptions.NoSuchViewException; +import org.apache.iceberg.exceptions.ValidationException; +import org.apache.iceberg.hive.HiveCatalogUtil.CommitStatus; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.util.Tasks; +import org.apache.iceberg.view.BaseViewOperations; +import org.apache.iceberg.view.ViewMetadata; +import org.apache.thrift.TException; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** Hive implementation of Iceberg ViewOperations. 
*/ +final class HiveViewOperations extends BaseViewOperations { + private static final Logger LOG = LoggerFactory.getLogger(HiveViewOperations.class); + + private final String fullName; + private final String database; + private final String viewName; + private final FileIO fileIO; + private final ClientPool metaClients; + private final long maxHiveTablePropertySize; + private final TableIdentifier identifier; + + HiveViewOperations( + Configuration conf, + ClientPool metaClients, + FileIO fileIO, + String catalogName, + TableIdentifier viewIdentifier) { +this.identifier = viewIdentifier; +String dbName = viewIdentifier.namespace().level(0); +this.metaClients = metaClients; +this.fileIO = fileIO; +this.fullName = catalogName + "." + dbName + "." + viewIdentifier.name(); +this.database = dbName; +this.viewName = viewIdentifier.name(); +this.maxHiveTablePropertySize = +conf.getLong( +HiveCatalogUtil.HIVE_TABLE_PROPERTY_MAX_SIZE, +HiveCatalogUtil.HIVE_TABLE_PROPERTY_MAX_SIZE_DEFAULT); + } + + @Override + public ViewMetadata current() { +if (HiveCatalogUtil.isTableWithTypeExists(metaClients, identifier, TableType.EXTERNAL_TABLE)) { + throw new AlreadyExistsException( + "Table with same name already exists: %s.%s", database, viewName); +} +return super.current(); + } + + @Override + public void doRefresh() { +String metadataLocation = null; +try { + Table table = metaClients.run(client -> client.getTable(database, viewName)); + HiveCatalogUtil.validateTableIsIcebergView(table, fullName); + metadataLocation = + table.getParameters().get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP); +} catch (NoSuchObjectException e) { + if (currentMetadataLocation() != null) { +throw new NoSuchViewException("View does not exist: %s.%s", database, viewName); + } +} catch (TException e) { + String errMsg = + String.format("Failed to get view info from metastore %s.%s", database,
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1383362391 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java: ## @@ -0,0 +1,389 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.hive; + +import java.util.Collections; +import java.util.Locale; +import java.util.Map; +import java.util.Objects; +import java.util.Optional; +import java.util.concurrent.atomic.AtomicReference; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.hive.metastore.IMetaStoreClient; +import org.apache.hadoop.hive.metastore.TableType; +import org.apache.hadoop.hive.metastore.api.InvalidObjectException; +import org.apache.hadoop.hive.metastore.api.NoSuchObjectException; +import org.apache.hadoop.hive.metastore.api.SerDeInfo; +import org.apache.hadoop.hive.metastore.api.StorageDescriptor; +import org.apache.hadoop.hive.metastore.api.Table; +import org.apache.iceberg.BaseMetastoreTableOperations; +import org.apache.iceberg.ClientPool; +import org.apache.iceberg.SchemaParser; +import org.apache.iceberg.TableProperties; +import org.apache.iceberg.catalog.TableIdentifier; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.exceptions.CommitFailedException; +import org.apache.iceberg.exceptions.CommitStateUnknownException; +import org.apache.iceberg.exceptions.NoSuchViewException; +import org.apache.iceberg.exceptions.ValidationException; +import org.apache.iceberg.hive.HiveCatalogUtil.CommitStatus; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.relocated.com.google.common.base.Preconditions; +import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.util.Tasks; +import org.apache.iceberg.view.BaseViewOperations; +import org.apache.iceberg.view.ViewMetadata; +import org.apache.thrift.TException; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** Hive implementation of Iceberg ViewOperations. */ +final class HiveViewOperations extends BaseViewOperations { Review Comment: How big is the duplicated code here? Do we want to have a common ancestor for `HiveViewOperations` and `HiveTableOperations`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
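For scale, a rough sketch of what such a shared ancestor could look like (class and method names here are illustrative, not from the PR): the HMS fetch-and-validate plumbing is the same for both operations classes, and only the validation step differs.

```java
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;
import org.apache.iceberg.ClientPool;
import org.apache.thrift.TException;

abstract class HmsContentOperations {
  private final ClientPool<IMetaStoreClient, TException> metaClients;
  private final String database;
  private final String name;

  protected HmsContentOperations(
      ClientPool<IMetaStoreClient, TException> metaClients, String database, String name) {
    this.metaClients = metaClients;
    this.database = database;
    this.name = name;
  }

  // Shared plumbing: fetch the HMS object, then let the subclass decide what
  // "is a valid Iceberg table/view" means.
  protected Table fetchValidated() throws TException, InterruptedException {
    Table tbl = metaClients.run(client -> client.getTable(database, name));
    validate(tbl);
    return tbl;
  }

  // A table-operations subclass would check the ICEBERG table_type parameter;
  // a view-operations subclass would additionally require VIRTUAL_VIEW.
  protected abstract void validate(Table tbl);
}
```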
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
ajantha-bhat commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1383363198 ## core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java: ## @@ -400,8 +400,15 @@ public void replaceTableViaTransactionThatAlreadyExistsAsView() { .buildTable(viewIdentifier, SCHEMA) .replaceTransaction() .commitTransaction()) -.isInstanceOf(NoSuchTableException.class) -.hasMessageStartingWith("Table does not exist: ns.view"); +.satisfiesAnyOf( +throwable -> +assertThat(throwable) +.isInstanceOf(NoSuchTableException.class) +.hasMessageStartingWith("Table does not exist: ns.view"), +throwable -> +assertThat(throwable) Review Comment: In `replaceTableViaTransactionThatAlreadyExistsAsView`, `NessieTableOperations.doRefresh()` throws `AlreadyExistsException`, but the test expects `NoSuchViewException`. If I fix it, another test case (createOrReplaceTableViaTransactionThatAlreadyExistsAsView) fails, because from the same place, `doRefresh()`, we are expecting two different kinds of exceptions. I think the test case needs to be modified instead of unifying the code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]
pvary commented on code in PR #8918: URL: https://github.com/apache/iceberg/pull/8918#discussion_r1383401740 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java: ## @@ -261,6 +261,12 @@ public void renameTable(TableIdentifier from, TableIdentifier originalTo) { } catch (InterruptedException e) { Thread.currentThread().interrupt(); throw new RuntimeException("Interrupted in call to rename", e); +} catch (RuntimeException e) { Review Comment: Taking a quick look, I do not like this change. Why is this `RuntimeException`? This code tries to capture this case: ``` } catch (AlreadyExistsException e) { throw new org.apache.iceberg.exceptions.AlreadyExistsException( "Table already exists: %s", to); } ``` Do we have checks for the other cases as well? My first guess would be that `MetastoreUtil.alterTable` messes up the Hive exceptions, and maybe all of them are off? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
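If the root cause is that `MetastoreUtil.alterTable` wraps the Thrift exceptions, a narrower option than catching a bare `RuntimeException` might look like the sketch below (helper name invented; whether the wrapping actually happens, and on which Hive versions, still needs to be confirmed):

```java
import org.apache.iceberg.exceptions.AlreadyExistsException;

final class RenameExceptionTranslator {
  private RenameExceptionTranslator() {}

  // Unwrap one level: if a Hive AlreadyExistsException is hiding inside the
  // RuntimeException, surface the Iceberg flavor; otherwise return the
  // original exception untouched so unrelated failures keep their type.
  static RuntimeException translate(RuntimeException e, String destination) {
    if (e.getCause() instanceof org.apache.hadoop.hive.metastore.api.AlreadyExistsException) {
      return new AlreadyExistsException("Table already exists: %s", destination);
    }
    return e;
  }
}
```

The catch block would then be `throw RenameExceptionTranslator.translate(e, to.toString());` instead of swallowing every runtime failure the same way.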
Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]
pvary commented on code in PR #8918: URL: https://github.com/apache/iceberg/pull/8918#discussion_r1383403517 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java: ## @@ -261,6 +261,12 @@ public void renameTable(TableIdentifier from, TableIdentifier originalTo) { } catch (InterruptedException e) { Thread.currentThread().interrupt(); throw new RuntimeException("Interrupted in call to rename", e); +} catch (RuntimeException e) { Review Comment: Also maybe the exceptions thrown are different with different Hive versions? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]
pvary commented on code in PR #8918: URL: https://github.com/apache/iceberg/pull/8918#discussion_r1383411335 ## hive-metastore/src/test/java/org/apache/iceberg/hive/HiveMetastoreTest.java: ## @@ -31,6 +31,10 @@ import org.junit.jupiter.api.AfterAll; import org.junit.jupiter.api.BeforeAll; +/* + * This meta-setup has been deprecated; use {@link HiveMetastoreExtension} instead. + * */ Review Comment: nit: `* */` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
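Presumably the nit resolves to a regular Javadoc block, which also makes the `{@link}` render (my reading of the comment, not a change taken from the PR):

```java
/**
 * This meta-setup has been deprecated; use {@link HiveMetastoreExtension} instead.
 */
@Deprecated
```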
Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]
pvary commented on code in PR #8918: URL: https://github.com/apache/iceberg/pull/8918#discussion_r1383409990 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java: ## @@ -500,6 +511,9 @@ protected String defaultWarehouseLocation(TableIdentifier tableIdentifier) { return String.format("%s/%s", databaseData.getLocationUri(), tableIdentifier.name()); } +} catch (NoSuchObjectException e) { + throw new NoSuchNamespaceException( + e, "Namespace does not exist: %s", tableIdentifier.namespace().levels()[0]); Review Comment: Just to double check, the exceptions might be different with Hive2 and Hive3. Do we run the tests with both Hive2 and Hive3 on CI? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]
pvary commented on code in PR #8918: URL: https://github.com/apache/iceberg/pull/8918#discussion_r1383413298 ## hive-metastore/src/test/java/org/apache/iceberg/hive/HiveMetastoreTest.java: ## @@ -31,6 +31,10 @@ import org.junit.jupiter.api.AfterAll; import org.junit.jupiter.api.BeforeAll; +/* + * This meta-setup has been deprecated; use {@link HiveMetastoreExtension} instead. + * */ +@Deprecated Review Comment: After this could you please create a PR to remove this class? Or in this one, if it is trivial... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
[PR] docs(readme): Add feature roadmap and support to readme [iceberg-go]
zeroshade opened a new pull request, #32: URL: https://github.com/apache/iceberg-go/pull/32 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
Re: [PR] docs(readme): Add feature roadmap and support to readme [iceberg-go]
zeroshade commented on PR #32: URL: https://github.com/apache/iceberg-go/pull/32#issuecomment-1794974142 CC @nastra @Fokko @rdblue @coded9 @bitsondatadev @wolfeidau -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
[I] Cannot decode dictionary of type INT96 when reading imported Spark parquet table [iceberg]
manuzhang opened a new issue, #8990: URL: https://github.com/apache/iceberg/issues/8990 ### Apache Iceberg version 1.2.1 ### Query engine Spark ### Please describe the bug 🐞 The following exception was thrown when reading an imported Spark parquet table with a filter. ``` java.lang.IllegalArgumentException: Cannot decode dictionary of type: INT96 at org.apache.iceberg.parquet.ParquetDictionaryRowGroupFilter$EvalVisitor.dict(ParquetDictionaryRowGroupFilter.java:458) at org.apache.iceberg.parquet.ParquetDictionaryRowGroupFilter$EvalVisitor.eq(ParquetDictionaryRowGroupFilter.java:293) at org.apache.iceberg.parquet.ParquetDictionaryRowGroupFilter$EvalVisitor.eq(ParquetDictionaryRowGroupFilter.java:80) at org.apache.iceberg.expressions.ExpressionVisitors$BoundExpressionVisitor.predicate(ExpressionVisitors.java:162) at org.apache.iceberg.expressions.ExpressionVisitors.visitEvaluator(ExpressionVisitors.java:390) at org.apache.iceberg.expressions.ExpressionVisitors.visitEvaluator(ExpressionVisitors.java:409) at org.apache.iceberg.parquet.ParquetDictionaryRowGroupFilter$EvalVisitor.eval(ParquetDictionaryRowGroupFilter.java:118) at org.apache.iceberg.parquet.ParquetDictionaryRowGroupFilter$EvalVisitor.access$100(ParquetDictionaryRowGroupFilter.java:80) at org.apache.iceberg.parquet.ParquetDictionaryRowGroupFilter.shouldRead(ParquetDictionaryRowGroupFilter.java:74) at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:119) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
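For context on what a fix could look like, here is a sketch of the conservative fallback a row-group filter can take (this is not the actual patch; the class and method names are invented): when the dictionary's physical type has no decodable Iceberg mapping, keep the row group rather than throw.

```java
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.schema.PrimitiveType;

final class DictionaryFilterFallback {
  private DictionaryFilterFallback() {}

  // INT96 is the legacy Spark/Impala timestamp encoding; Iceberg has no type
  // mapping for it, so its dictionary pages cannot be decoded for pruning.
  static boolean canDecodeDictionary(ColumnDescriptor desc) {
    return desc.getPrimitiveType().getPrimitiveTypeName()
        != PrimitiveType.PrimitiveTypeName.INT96;
  }

  // Row-group filters must err on the side of reading: if the predicate cannot
  // be evaluated against the dictionary, the rows might still match.
  static boolean shouldRead(ColumnDescriptor desc) {
    if (!canDecodeDictionary(desc)) {
      return true; // ROWS_MIGHT_MATCH: scan the row group instead of failing
    }
    // ... decode the dictionary and evaluate the predicate as usual ...
    return true;
  }
}
```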
Re: [PR] docs(readme): Add feature roadmap and support to readme [iceberg-go]
nastra merged PR #32: URL: https://github.com/apache/iceberg-go/pull/32 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] support partition spec update in pyiceberg [iceberg-python]
puchengy commented on issue #124: URL: https://github.com/apache/iceberg-python/issues/124#issuecomment-1795046924 @Fokko Hi, we are interested in the Hive catalog and the REST catalog in the future. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Iceberg streaming streaming-skip-overwrite-snapshots SparkMicroBatchStream only skips over one file per trigger [iceberg]
nastra commented on code in PR #8980: URL: https://github.com/apache/iceberg/pull/8980#discussion_r1383492951 ## spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestStructuredStreamingRead3.java: ## @@ -497,6 +500,67 @@ public void testReadStreamWithSnapshotTypeOverwriteErrorsOut() throws Exception .hasMessageStartingWith("Cannot process overwrite snapshot"); } + @Test + public void testReadStreamWithSnapshotTypeRewriteDataFilesIgnoresReplace() throws Exception { +// fill table with some data +List> expected = TEST_DATA_MULTIPLE_SNAPSHOTS; +appendDataAsMultipleSnapshots(expected); + +makeRewriteDataFiles(); + +Iterable snapshots = table.snapshots(); +for (Snapshot s : snapshots) { + System.out.println(s.snapshotId()); +} + +Assert.assertEquals( +6, +microBatchCount( + ImmutableMap.of(SparkReadOptions.STREAMING_MAX_FILES_PER_MICRO_BATCH, "1"))); + } + + @Test + public void testReadStreamWithSnapshotType2RewriteDataFilesIgnoresReplace() throws Exception { +// fill table with some data +List> expected = TEST_DATA_MULTIPLE_SNAPSHOTS; +appendDataAsMultipleSnapshots(expected); + +makeRewriteDataFiles(); +makeRewriteDataFiles(); + +Iterable snapshots = table.snapshots(); +for (Snapshot s : snapshots) { + System.out.println(s.snapshotId()); +} + +Assert.assertEquals( +6, +microBatchCount( + ImmutableMap.of(SparkReadOptions.STREAMING_MAX_FILES_PER_MICRO_BATCH, "1"))); + } + + @Test + public void testReadStreamWithSnapshotTypeRewriteDataFilesIgnoresReplaceFollowedByAppend() + throws Exception { +// fill table with some data +List> expected = TEST_DATA_MULTIPLE_SNAPSHOTS; +appendDataAsMultipleSnapshots(expected); + +makeRewriteDataFiles(); + +appendDataAsMultipleSnapshots(expected); + +Iterable snapshots = table.snapshots(); +for (Snapshot s : snapshots) { + System.out.println(s.snapshotId()); +} + +Assert.assertEquals( Review Comment: please use AssertJ-style assertions for newly written test code: https://iceberg.apache.org/contribute/#assertj -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
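For reference, the AssertJ style requested above replaces the JUnit 4 assertion like so; a sketch that reuses the test's own `microBatchCount` helper and the same expected value:
```
import static org.assertj.core.api.Assertions.assertThat;

// Before (JUnit 4):
// Assert.assertEquals(6, microBatchCount(...));

// After (AssertJ, per https://iceberg.apache.org/contribute/#assertj):
assertThat(
        microBatchCount(
            ImmutableMap.of(SparkReadOptions.STREAMING_MAX_FILES_PER_MICRO_BATCH, "1")))
    .isEqualTo(6);
```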
Re: [PR] Kafka Connect: Initial project setup and event data structures [iceberg]
bryanck commented on code in PR #8701: URL: https://github.com/apache/iceberg/pull/8701#discussion_r1383589051 ## kafka-connect/kafka-connect-events/src/main/java/org/apache/iceberg/connect/events/AvroUtil.java: ## + static int positionToId(int position, Schema avroSchema) { +List fields = avroSchema.getFields(); +Preconditions.checkArgument(position < fields.size(), "Invalid field position: " + position); Review Comment: yes, I added this check also, thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
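The check being discussed amounts to guarding both ends of the range; a minimal sketch of the combined precondition, using the relocated Guava `Preconditions` this class already imports:
```
import java.util.List;
import org.apache.avro.Schema;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

class PositionCheckSketch {
  // Rejects negative positions in addition to positions at or past the end
  // of the Avro schema's field list.
  static void checkPosition(int position, Schema avroSchema) {
    List<Schema.Field> fields = avroSchema.getFields();
    Preconditions.checkArgument(
        position >= 0 && position < fields.size(), "Invalid field position: " + position);
  }
}
```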
[PR] Docs: Update site-docs/spark-quickstart.md [iceberg]
stavdav143 opened a new pull request, #8991: URL: https://github.com/apache/iceberg/pull/8991 Mount the local volume with warehouse/notebooks on the MinIO service instead of the Spark service. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Enable column statistics filtering after planning [iceberg]
pvary commented on code in PR #8803: URL: https://github.com/apache/iceberg/pull/8803#discussion_r1383682090 ## api/src/main/java/org/apache/iceberg/Scan.java: ## @@ -77,6 +78,21 @@ public interface Scan> { */ ThisT includeColumnStats(); + /** + * Create a new scan from this that loads the column stats for the specific columns with each data + * file. If the columns set is empty or null then all column stats will be kept, if + * {@link #includeColumnStats()} is set. + * + * Column stats include: value count, null value count, lower bounds, and upper bounds. + * + * @param columnsToKeepStats column ids from the table's schema + * @return a new scan based on this that loads column stats for specific columns. + */ + default ThisT columnsToKeepStats(Set columnsToKeepStats) { Review Comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Enable column statistics filtering after planning [iceberg]
pvary commented on code in PR #8803: URL: https://github.com/apache/iceberg/pull/8803#discussion_r1383682326 ## core/src/main/java/org/apache/iceberg/GenericDataFile.java: ## @@ -66,23 +68,31 @@ class GenericDataFile extends BaseFile implements DataFile { * Copy constructor. * * @param toCopy a generic data file to copy. - * @param fullCopy whether to copy all fields or to drop column-level stats + * @param copyStats whether to copy all fields or to drop column-level stats. + * @param columnsToKeepStats a set of column ids to keep stats. If empty or null then + * every column stat is kept. */ - private GenericDataFile(GenericDataFile toCopy, boolean fullCopy) { -super(toCopy, fullCopy); + private GenericDataFile( + GenericDataFile toCopy, boolean copyStats, Set columnsToKeepStats) { Review Comment: Done ## core/src/main/java/org/apache/iceberg/GenericDataFile.java: ## @@ -66,23 +68,31 @@ class GenericDataFile extends BaseFile implements DataFile { * Copy constructor. * * @param toCopy a generic data file to copy. - * @param fullCopy whether to copy all fields or to drop column-level stats + * @param copyStats whether to copy all fields or to drop column-level stats. + * @param columnsToKeepStats a set of column ids to keep stats. If empty or null then + * every column stat is kept. */ - private GenericDataFile(GenericDataFile toCopy, boolean fullCopy) { -super(toCopy, fullCopy); + private GenericDataFile( + GenericDataFile toCopy, boolean copyStats, Set columnsToKeepStats) { +super(toCopy, copyStats, columnsToKeepStats); } /** Constructor for Java serialization. */ GenericDataFile() {} @Override public DataFile copyWithoutStats() { -return new GenericDataFile(this, false /* drop stats */); +return new GenericDataFile(this, false /* drop stats */, ImmutableSet.of()); + } + + @Override + public DataFile copyWithStats(Set columnsToKeepStats) { +return new GenericDataFile(this, true, columnsToKeepStats); } @Override public DataFile copy() { -return new GenericDataFile(this, true /* full copy */); +return new GenericDataFile(this, true /* full copy */, ImmutableSet.of()); Review Comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Enable column statistics filtering after planning [iceberg]
pvary commented on code in PR #8803: URL: https://github.com/apache/iceberg/pull/8803#discussion_r1383682641 ## api/src/main/java/org/apache/iceberg/Scan.java: ## @@ -77,6 +78,21 @@ public interface Scan> { */ ThisT includeColumnStats(); + /** + * Create a new scan from this that loads the column stats for the specific columns with each data + * file. If the columns set is empty or null then all column stats will be kept, if + * {@link #includeColumnStats()} is set. + * + * Column stats include: value count, null value count, lower bounds, and upper bounds. + * + * @param columnsToKeepStats column ids from the table's schema Review Comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Enable column statistics filtering after planning [iceberg]
pvary commented on code in PR #8803: URL: https://github.com/apache/iceberg/pull/8803#discussion_r1383715528 ## core/src/main/java/org/apache/iceberg/util/ContentFileUtil.java: ## @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.util; + +import java.util.Set; +import org.apache.iceberg.ContentFile; + +public class ContentFileUtil { + private ContentFileUtil() {} + + /** + * Copies the {@link ContentFile} with the specific stat settings. + * + * @param file a generic data file to copy. + * @param withStats whether to keep any stats + * @param columnsToKeepStats a set of column ids to keep stats. If empty or null then + * every column stat is kept. Review Comment: I converted the `Scan` to use column names instead of column ids, like `Scan.includeColumnStats(Collection<String> requestedColumns)`. For the `ContentFile` we do not have the `Schema` at hand, so I had to stick to `ContentFile.copyWithStats(Set<Integer> requestedColumnIds)`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
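As a usage illustration of the API shapes mentioned above, a sketch assuming the signatures quoted in this thread; the column names and field ids are hypothetical:
```
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;

class ColumnStatsSketch {
  // Scan-level: keep column stats (value counts, null counts, bounds) only
  // for the named columns when planning.
  static TableScan planWithStats(Table table) {
    return table.newScan().includeColumnStats(ImmutableList.of("id", "event_time"));
  }

  // File-level: copy a data file keeping stats only for the given field ids.
  static DataFile keepStats(DataFile file) {
    return file.copyWithStats(ImmutableSet.of(1, 2));
  }
}
```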
Re: [PR] Implement pre-existing session support for dynamodb catalog [iceberg-python]
waifairer commented on code in PR #104: URL: https://github.com/apache/iceberg-python/pull/104#discussion_r1383886741 ## mkdocs/docs/configuration.md: ## @@ -195,6 +195,19 @@ catalog: table-name: iceberg ``` +If you prefer to pass the credentials explicitly to the client instead of relying on environment variables, + +```yaml +catalog: + default: +type: dynamodb +table-name: iceberg Review Comment: @HonahX Definitely agreed with `dynamo` as a prefix. As for hyphens vs underscores, AWS is _really_ consistent about using underscores. I'm of the opinion that the AWS-based credentials should support both underscores and hyphens, will prefer hyphens if present, but fall back to the underscore usages if necessary. Documentation should only present the hyphenated case as an option. I believe that this strategy would lead to the least number of "head banging" debug sessions. However, I think a reasonable case could be made to remove underscore support instead of supporting it as a fallback. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
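The lookup strategy described above (prefer the hyphenated key, fall back to the underscore spelling) is straightforward to sketch; the key names below are hypothetical:
```
import java.util.Map;

class PropertyResolverSketch {
  // Prefer "dynamodb.access-key-id"; fall back to "dynamodb.access_key_id"
  // so existing underscore-based configs keep working while the docs only
  // advertise the hyphenated form.
  static String resolve(Map<String, String> props, String hyphenatedKey) {
    String value = props.get(hyphenatedKey);
    return value != null ? value : props.get(hyphenatedKey.replace('-', '_'));
  }
}
```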
Re: [I] Adding new columns (mergeSchema) [iceberg]
FabricioZGalvani commented on issue #8908: URL: https://github.com/apache/iceberg/issues/8908#issuecomment-1796301185 Hello everyone, After several attempts, I managed to solve the mergeSchema issue I was facing. The solution was to apply the following configuration. I suspect that the check-ordering is set to true by default, which can cause errors when columns are out of order. This happens even when following the documentation's instructions. Without applying the configuration below, I was receiving the 'column is out of order' error. `--conf spark.sql.iceberg.check-ordering=false` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
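For completeness, the same flag can be set programmatically when building the session; a sketch using the property name from the comment above:
```
import org.apache.spark.sql.SparkSession;

public class MergeSchemaSession {
  public static void main(String[] args) {
    // Disable Iceberg's column-order check so appends with mergeSchema
    // succeed even when incoming columns are in a different order.
    SparkSession spark =
        SparkSession.builder()
            .appName("iceberg-merge-schema")
            .config("spark.sql.iceberg.check-ordering", "false")
            .getOrCreate();
  }
}
```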
Re: [PR] Clarify which columns can be used for equality delete files. [iceberg]
emkornfield commented on code in PR #8981: URL: https://github.com/apache/iceberg/pull/8981#discussion_r1384024269 ## format/spec.md: ## @@ -842,7 +842,8 @@ The rows in the delete file must be sorted by `file_path` then `pos` to optimize Equality delete files identify deleted rows in a collection of data files by one or more column values, and may optionally contain additional columns of the deleted row. -Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](#manifests). Float and double columns cannot be used as delete columns in equality delete files. +Equality delete files store any subset of a table's columns and use the table's field ids. The _delete columns_ are the columns of the delete file used to match data rows. Delete columns are identified by id in the delete file [metadata column `equality_ids`](#manifests). The column restrictions for columns used in equality delete files are the same as those for [identifier fields](#identifier-field-ids) with the exception that optional columns and columns nested under optional structs are allowed (if a +parent struct column is null it implies the leaf column is null). Review Comment: The definition of null values present is already explained in the paragraph starting on line [850](https://github.com/apache/iceberg/pull/8981/files#diff-36347a47c3bf67ea2ef6309ea96201814032d21bb5f162dfae4045508c15588aR850), please let me know if you think this is insufficient, or if there is some wordsmithing needed (I'm open to suggestions). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
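A small sketch of the null-matching semantics referenced above (one reading of the spec paragraph on equality matching, in which a null delete value matches rows whose column value is null; the helper is hypothetical):
```
import java.util.Objects;

class EqualityDeleteMatchSketch {
  // Unlike SQL '=', equality delete matching treats null as equal to null:
  // a null value in a delete column matches data rows where that column is null.
  static boolean columnMatches(Object deleteValue, Object dataValue) {
    return Objects.equals(deleteValue, dataValue);
  }
}
```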
[PR] [wip] de-dup props [iceberg]
thomaschow opened a new pull request, #8992: URL: https://github.com/apache/iceberg/pull/8992 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Build: Bump mkdocs-material from 9.4.7 to 9.4.8 [iceberg-python]
dependabot[bot] opened a new pull request, #131: URL: https://github.com/apache/iceberg-python/pull/131 Bumps [mkdocs-material](https://github.com/squidfunk/mkdocs-material) from 9.4.7 to 9.4.8.
Release notes (sourced from mkdocs-material's releases):
- mkdocs-material-9.4.8: Fixed invalid local address replacement when using instant loading; Fixed #6275: Crash after navigation caused 404 when using instant loading
Changelog (sourced from mkdocs-material's CHANGELOG):
- mkdocs-material-9.4.8+insiders-4.43.0 (2023-11-05): Added support for GitLab committers (document contributors); Fixed #6264: Fixed compatibility with Python < 3.10; Fixed #6254: Meta plugin not applying meta files to blog posts
- mkdocs-material-9.4.8 (2023-11-05): Fixed invalid local address replacement when using instant loading; Fixed #6275: Crash after navigation caused 404 when using instant loading
- mkdocs-material-9.4.7+insiders-4.42.3 (2023-10-27): Fixed #6251: Cards in grids cut off on very small screens; Fixed #6241: Using social plugin + static-i18n plugin errors
- mkdocs-material-9.4.7 (2023-10-27): Added Azerbaijani translations
- mkdocs-material-9.4.6+insiders-4.42.2 (2023-10-14): Fixed #6186: Privacy plugin ignores hash fragments on images; Fixed #6180: Projects plugin crashing when adding or removing files
- mkdocs-material-9.4.6 (2023-10-14): Updated Danish and Norwegian (Nynorsk) translations; Fixed #6169: Blog post metadata layout overflows on small screens
- mkdocs-material-9.4.5 (2023-10-10): Fixed sidebar auto-positioning (9.4.2 regression); Fixed #6166: Improve group plugin compatibility with Python < 3.10; Fixed #6157: Hiding tags does not work (9.4.3 regression)
- mkdocs-material-9.4.4+insiders-4.42.1 (2023-10-05): Fixed spacing of related links in blog posts on small screens
- mkdocs-material-9.4.4 (2023-10-05): Added support for overriding text to be copied for code blocks; Fixed broken layout in some browsers at breakpoints when using zoom; Fixed #6132: Incomplete search highlighting for code blocks in titles
- mkdocs-material-9.4.3 (2023-10-02): Added support for instant navigation progress indicator; Improved spacing and alignment of tags
... (truncated)
Commits:
- c0755bf Prepare 9.4.8 release
- fabc9bd Documentation
- 87d69a9 Fixed invalid local address when using instant loading
- 9a7a185 Documentation (#6267)
- 7353c7d Documentation (#6277)
- ca5f517 Merge branch 'master' of github.com:squidfunk/mkdocs-material
- 494cae1 Fixed crash after navigation caused 404 when using instant loading
- 1698708 Documentation (#6260)
- 551d98e Documentation (#6222)
- dfa5f0313893ff7fc254a8d74421735d5f1d3e
Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]
jacobmarble commented on code in PR #8971: URL: https://github.com/apache/iceberg/pull/8971#discussion_r1384122853 ## api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java: ## @@ -589,17 +603,17 @@ private static String sanitizeNumber(Number value, String type) { return "(" + numDigits + "-digit-" + type + ")"; } - private static String sanitizeString(CharSequence value, long now, int today) { + private static String sanitizeString(CharSequence value, long nowMillis, int today) { try { if (DATE.matcher(value).matches()) { Literal date = Literal.of(value).to(Types.DateType.get()); return sanitizeDate(date.value(), today); } else if (TIMESTAMP.matcher(value).matches()) { -Literal ts = Literal.of(value).to(Types.TimestampType.withoutZone()); -return sanitizeTimestamp(ts.value(), now); +Literal ts = Literal.of(value).to(Types.TimestampType.nanosWithoutZone()); Review Comment: Resolved in 1a6cf52c1baa35e1fdd086d8907fbed7873cf317 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]
jacobmarble commented on code in PR #8971: URL: https://github.com/apache/iceberg/pull/8971#discussion_r1384147958 ## api/src/main/java/org/apache/iceberg/transforms/PartitionSpecVisitor.java: ## @@ -121,17 +121,13 @@ static R visit(Schema schema, PartitionField field, PartitionSpecVisitor } else if (transform instanceof Truncate) { int width = ((Truncate) transform).width(); return visitor.truncate(field.fieldId(), sourceName, field.sourceId(), width); -} else if (transform == Dates.YEAR -|| transform == Timestamps.YEAR -|| transform instanceof Years) { +} else if ("year".equalsIgnoreCase(transform.toString())) { Review Comment: You're right. Fixed in 09c1f2534 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]
jacobmarble commented on code in PR #8971: URL: https://github.com/apache/iceberg/pull/8971#discussion_r1384122853 ## api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java: ## @@ -589,17 +603,17 @@ private static String sanitizeNumber(Number value, String type) { return "(" + numDigits + "-digit-" + type + ")"; } - private static String sanitizeString(CharSequence value, long now, int today) { + private static String sanitizeString(CharSequence value, long nowMillis, int today) { try { if (DATE.matcher(value).matches()) { Literal date = Literal.of(value).to(Types.DateType.get()); return sanitizeDate(date.value(), today); } else if (TIMESTAMP.matcher(value).matches()) { -Literal ts = Literal.of(value).to(Types.TimestampType.withoutZone()); -return sanitizeTimestamp(ts.value(), now); +Literal ts = Literal.of(value).to(Types.TimestampType.nanosWithoutZone()); Review Comment: Resolved in 1f95ceb31 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]
jacobmarble commented on code in PR #8971: URL: https://github.com/apache/iceberg/pull/8971#discussion_r1382165939 ## api/src/main/java/org/apache/iceberg/transforms/Days.java: ## @@ -55,14 +56,14 @@ public boolean satisfiesOrderOf(Transform other) { } if (other instanceof Timestamps) { - return Timestamps.DAY.satisfiesOrderOf(other); + return ((Timestamps) other).getResultTypeUnit() == ChronoUnit.DAYS Review Comment: Fixed in 3b87435b3 ## api/src/main/java/org/apache/iceberg/transforms/Hours.java: ## @@ -57,15 +58,16 @@ public boolean satisfiesOrderOf(Transform other) { } if (other instanceof Timestamps) { - return other == Timestamps.HOUR; -} else if (other instanceof Hours -|| other instanceof Days -|| other instanceof Months -|| other instanceof Years) { - return true; + return ((Timestamps) other).getResultTypeUnit() == ChronoUnit.HOURS Review Comment: Fixed in 3b87435b3 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]
jacobmarble commented on code in PR #8971: URL: https://github.com/apache/iceberg/pull/8971#discussion_r1382166328 ## api/src/main/java/org/apache/iceberg/transforms/Months.java: ## @@ -55,14 +57,13 @@ public boolean satisfiesOrderOf(Transform other) { } if (other instanceof Timestamps) { - return Timestamps.MONTH.satisfiesOrderOf(other); + return ((Timestamps) other).getResultTypeUnit() == ChronoUnit.MONTHS Review Comment: Fixed in 3b87435b3 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]
jacobmarble commented on code in PR #8971: URL: https://github.com/apache/iceberg/pull/8971#discussion_r1382170673 ## api/src/main/java/org/apache/iceberg/types/Types.java: ## @@ -205,27 +208,56 @@ public String toString() { } public static class TimestampType extends PrimitiveType { -private static final TimestampType INSTANCE_WITH_ZONE = new TimestampType(true); -private static final TimestampType INSTANCE_WITHOUT_ZONE = new TimestampType(false); + +private static final TimestampType INSTANCE_MICROS_WITH_ZONE = +new TimestampType(true, ChronoUnit.MICROS); +private static final TimestampType INSTANCE_MICROS_WITHOUT_ZONE = +new TimestampType(false, ChronoUnit.MICROS); +private static final TimestampType INSTANCE_NANOS_WITH_ZONE = +new TimestampType(true, ChronoUnit.NANOS); +private static final TimestampType INSTANCE_NANOS_WITHOUT_ZONE = +new TimestampType(false, ChronoUnit.NANOS); public static TimestampType withZone() { Review Comment: Fixed in 5f30948fa -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]
jacobmarble commented on code in PR #8971: URL: https://github.com/apache/iceberg/pull/8971#discussion_r1382167692 ## api/src/main/java/org/apache/iceberg/transforms/Transforms.java: ## @@ -129,10 +131,14 @@ public static Transform year(Type type) { case DATE: return (Transform) Dates.YEAR; case TIMESTAMP: -return (Transform) Timestamps.YEAR; - default: -throw new IllegalArgumentException("Cannot partition type " + type + " by year"); +switch (((TimestampType) type).unit()) { + case MICROS: +return (Transform) Timestamps.YEAR_FROM_MICROS; + case NANOS: +return (Transform) Timestamps.YEAR_FROM_NANOS; +} } +throw new IllegalArgumentException("Cannot partition type " + type + " by year"); Review Comment: I agree. Fixed in 74b90d9a5. As I fixed this, I noticed that exceptions thrown for "unsupported timestamp unit" were inconsistent, so I cleaned that up in e0f6d3b1f -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
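To illustrate the unit dispatch above, a sketch assuming the `TimestampType` unit accessors introduced in this PR:
```
import org.apache.iceberg.transforms.Transform;
import org.apache.iceberg.transforms.Transforms;
import org.apache.iceberg.types.Types;

class YearTransformSketch {
  static void example() {
    // Microsecond timestamps bind to YEAR_FROM_MICROS ...
    Transform<Long, Integer> yearMicros = Transforms.year(Types.TimestampType.withZone());
    // ... while nanosecond timestamps bind to YEAR_FROM_NANOS.
    Transform<Long, Integer> yearNanos = Transforms.year(Types.TimestampType.nanosWithoutZone());
  }
}
```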
Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]
jacobmarble commented on code in PR #8971: URL: https://github.com/apache/iceberg/pull/8971#discussion_r1384147958 ## api/src/main/java/org/apache/iceberg/transforms/PartitionSpecVisitor.java: ## @@ -121,17 +121,13 @@ static R visit(Schema schema, PartitionField field, PartitionSpecVisitor } else if (transform instanceof Truncate) { int width = ((Truncate) transform).width(); return visitor.truncate(field.fieldId(), sourceName, field.sourceId(), width); -} else if (transform == Dates.YEAR -|| transform == Timestamps.YEAR -|| transform instanceof Years) { +} else if ("year".equalsIgnoreCase(transform.toString())) { Review Comment: You're right. Fixed in 1e374c62e -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
dimas-b commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1384165960 ## core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java: ## @@ -400,8 +400,15 @@ public void replaceTableViaTransactionThatAlreadyExistsAsView() { .buildTable(viewIdentifier, SCHEMA) .replaceTransaction() .commitTransaction()) -.isInstanceOf(NoSuchTableException.class) -.hasMessageStartingWith("Table does not exist: ns.view"); +.satisfiesAnyOf( +throwable -> +assertThat(throwable) +.isInstanceOf(NoSuchTableException.class) +.hasMessageStartingWith("Table does not exist: ns.view"), +throwable -> +assertThat(throwable) Review Comment: Can we follow the in-memory catalog pattern in Nessie? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
dimas-b commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1384188720 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java: ## @@ -135,71 +135,26 @@ protected void doCommit(TableMetadata base, TableMetadata metadata) { boolean newTable = base == null; String newMetadataLocation = writeNewMetadataIfRequired(newTable, metadata); -String refName = client.refName(); -boolean failure = false; +AtomicBoolean failure = new AtomicBoolean(false); try { String contentId = table == null ? null : table.getId(); client.commitTable(base, metadata, newMetadataLocation, contentId, key); -} catch (NessieConflictException ex) { - failure = true; - if (ex instanceof NessieReferenceConflictException) { -// Throws a specialized exception, if possible -maybeThrowSpecializedException((NessieReferenceConflictException) ex); +} catch (NessieConflictException | NessieNotFoundException | HttpClientException ex) { + NessieUtil.handleExceptionsForCommits(ex, client.refName(), failure); +} catch (NessieBadRequestException ex) { Review Comment: I think it's a good idea to record the current commit hash for "new" table/view commits at the time `NessieTableOperations` is created. At the same time, if the commit is an update, I think we should use the hash from the metadata (basically track it for views too). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Spark: inconsistency in rewrite data and summary [iceberg]
github-actions[bot] commented on issue #7463: URL: https://github.com/apache/iceberg/issues/7463#issuecomment-1797064259 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Nessie: Support views for NessieCatalog [iceberg]
dimas-b commented on code in PR #8909: URL: https://github.com/apache/iceberg/pull/8909#discussion_r1384194329 ## nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java: ## @@ -135,71 +135,26 @@ protected void doCommit(TableMetadata base, TableMetadata metadata) { boolean newTable = base == null; String newMetadataLocation = writeNewMetadataIfRequired(newTable, metadata); -String refName = client.refName(); -boolean failure = false; +AtomicBoolean failure = new AtomicBoolean(false); try { String contentId = table == null ? null : table.getId(); client.commitTable(base, metadata, newMetadataLocation, contentId, key); -} catch (NessieConflictException ex) { - failure = true; - if (ex instanceof NessieReferenceConflictException) { -// Throws a specialized exception, if possible -maybeThrowSpecializedException((NessieReferenceConflictException) ex); +} catch (NessieConflictException | NessieNotFoundException | HttpClientException ex) { + NessieUtil.handleExceptionsForCommits(ex, client.refName(), failure); +} catch (NessieBadRequestException ex) { Review Comment: Re: `NessieBadRequestException`, I agree that we should not try to "handle" it. Instead the catalog should be implemented such that `NessieBadRequestException` does not happen. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Parquet: don't throw exception on row group filters when reading INT96 column [iceberg]
manuzhang commented on code in PR #8988: URL: https://github.com/apache/iceberg/pull/8988#discussion_r1384231086 ## parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java: ## @@ -199,7 +201,7 @@ public Boolean lt(BoundReference ref, Literal lit) { int id = ref.fieldId(); Boolean hasNonDictPage = isFallback.get(id); - if (hasNonDictPage == null || hasNonDictPage) { + if (hasNonDictPage == null || hasNonDictPage || isInt96Column(id)) { Review Comment: Indeed, I've updated the patch. Please review again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
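A sketch of the INT96 guard referenced in the diff above; the descriptor-based lookup is an assumption, and the helper name mirrors the quoted change:
```
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.schema.PrimitiveType;

class Int96ColumnCheckSketch {
  // INT96 dictionaries cannot be decoded by the filter, so INT96 columns are
  // treated like columns with non-dictionary pages and the visitor answers
  // ROWS_MIGHT_MATCH instead of throwing.
  static boolean isInt96(ColumnDescriptor desc) {
    return desc.getPrimitiveType().getPrimitiveTypeName()
        == PrimitiveType.PrimitiveTypeName.INT96;
  }
}
```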
Re: [PR] Clarify which columns can be used for equality delete files. [iceberg]
liurenjie1024 commented on code in PR #8981: URL: https://github.com/apache/iceberg/pull/8981#discussion_r1384272688 ## format/spec.md: ## Review Comment: Sorry, I missed the paragraph later. I think the statement is clear enough that a `NULL` value means `is not NULL`. I think it's reasonable to have that definition in the equality check. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Spark:CALL [rewrite_manifests] error Manifest is missing [iceberg]
372242283 commented on issue #4161: URL: https://github.com/apache/iceberg/issues/4161#issuecomment-1797425869 Spark: 3.3, Iceberg: 13.0. I am encountering the same problem. I use an Iceberg table with the Hive catalog, and the operations are rewrite_data_files -> rewrite_manifests -> expire_snapshots, but the "Manifest is missing..." error occasionally occurs during rewrite_manifests. How do you solve this problem? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]
nk1506 commented on code in PR #8918: URL: https://github.com/apache/iceberg/pull/8918#discussion_r1384338591 ## hive-metastore/src/test/java/org/apache/iceberg/hive/HiveMetastoreTest.java: ## @@ -31,6 +31,10 @@ import org.junit.jupiter.api.AfterAll; import org.junit.jupiter.api.BeforeAll; +/* + * This meta-setup has been deprecated; use {@link HiveMetastoreExtension} instead. + * */ +@Deprecated Review Comment: Thanks @pvary for reviewing this. Yes, we have another change in progress with all the test-related cleanup. WIP [patch](https://github.com/nk1506/iceberg/pull/2) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]
nk1506 commented on code in PR #8918: URL: https://github.com/apache/iceberg/pull/8918#discussion_r1384340506 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java: ## @@ -500,6 +511,9 @@ protected String defaultWarehouseLocation(TableIdentifier tableIdentifier) { return String.format("%s/%s", databaseData.getLocationUri(), tableIdentifier.name()); } +} catch (NoSuchObjectException e) { + throw new NoSuchNamespaceException( + e, "Namespace does not exist: %s", tableIdentifier.namespace().levels()[0]); Review Comment: I think we run Hive-CI with hive2 and hive3 both. Also I have validated manually for both versions. Both [Hive2](https://github.com/apache/hive/blob/release-2.3.9-rc0/metastore/src/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java#L137) and [Hive3](https://github.com/apache/hive/blob/release-3.1.3-rc3/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java#L143) have the same error message. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Flink write iceberg bug(org.apache.iceberg.exceptions.NotFoundException) [iceberg]
lirui-apache commented on issue #5846: URL: https://github.com/apache/iceberg/issues/5846#issuecomment-1797782929 Sure, I'll open a PR for it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Ability to the write Metadata JSON [iceberg-python]
vrd83 commented on issue #22: URL: https://github.com/apache/iceberg-python/issues/22#issuecomment-1797875405 Guys, is this a prerequisite for altering the [write-ordered-by ](https://iceberg.apache.org/docs/latest/spark-ddl/#alter-table--write-ordered-by) on a table? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Remove properties from `JdbcUtil` [iceberg]
nastra closed issue #8989: Remove properties from `JdbcUtil` URL: https://github.com/apache/iceberg/issues/8989 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] De-dup props in JdbcUtil [iceberg]
nastra merged PR #8992: URL: https://github.com/apache/iceberg/pull/8992 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Ability to the write Metadata JSON [iceberg-python]
HonahX commented on issue #22: URL: https://github.com/apache/iceberg-python/issues/22#issuecomment-1797920066 Hi @vrd83. It depends on which catalog you want to use to alter the table. For the RestCatalog, this is not a prerequisite. To enable altering the write order, we can implement a `ReplaceSortOrder` (`UpdateSortOrder`) in a similar manner to how we currently support `UpdateSchema`. You can see how `UpdateSchema` is implemented here for reference: https://github.com/apache/iceberg-python/blob/03fa9f0b6a86fc13d855b24ce92e07b145faa500/pyiceberg/table/__init__.py#L1314-L1319 https://github.com/apache/iceberg-python/blob/03fa9f0b6a86fc13d855b24ce92e07b145faa500/pyiceberg/table/__init__.py#L264-L270 For other catalogs, such as Glue, DynamoDB, SQL, and Hive, this will be a prerequisite. We need the `ReplaceSortOrder` thing and the three bullet points listed by @Fokko above. @Fokko, please correct me if I've missed anything about the RestCatalog part -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Ability to the write Metadata JSON [iceberg-python]
HonahX commented on issue #22: URL: https://github.com/apache/iceberg-python/issues/22#issuecomment-1797949558 @Fokko Thanks for the explanation! > Ability to write the JSON to the object store (that was the intent of this PR). I think we already support this: https://github.com/apache/iceberg-python/blob/8e8d39dacde067773d6d840b9bf65070399957a9/pyiceberg/catalog/__init__.py#L571-L572 Could you please elaborate on this a bit? Are there additional features we are looking to implement beyond the `_write_metadata` method? > Have logic to update the metadata dictionary as you pointed out above. I think we can do this per operation (update schema, update partition-spec, update sort-order, etc.) to keep it small and we can get it in quickly. I plan to start with `update_schema`, `set_snapshot_ref`, and `add_snapshot`, given that `update_schema` is already supported and the other two operations are pivotal for write support. I will try to make a draft PR soon for further discussion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org