Re: [I] Discussion: Replace usage of `Builder` with `TypedBuilder` [iceberg-rust]
xiaoyang-sde commented on issue #88: URL: https://github.com/apache/iceberg-rust/issues/88#issuecomment-1831394325

> Hi, @xiaoyang-sde Welcome to contribute! This seems reasonable to me.

Thanks! Could you please assign this issue to me?
Re: [I] Discussion: Replace usage of `Builder` with `TypedBuilder` [iceberg-rust]
liurenjie1024 commented on issue #88: URL: https://github.com/apache/iceberg-rust/issues/88#issuecomment-1831404891

> > Hi, @xiaoyang-sde Welcome to contribute! This seems reasonable to me.
>
> Thanks! Could you please assign this issue to me?

Done!
Re: [PR] Parquet: Add a table property to control the Parquet row-group size of position delete files [iceberg]
zhongyujiang commented on PR #9177: URL: https://github.com/apache/iceberg/pull/9177#issuecomment-1831405339

The [CI failure](https://github.com/apache/iceberg/actions/runs/7029032258/job/19125970118#step:6:214) seems unrelated:

```
> Task :iceberg-flink:iceberg-flink-1.17:test

TestIcebergSourceWithWatermarkExtractor > testThrottling FAILED
    java.lang.AssertionError: Expecting to be completed within 2M.
```
Re: [I] Duplicate file name in Iceberg's metadata [iceberg]
github-raphael-douyere commented on issue #8953: URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1831463021

@Fokko I don't know how to build a simple and reproducible setup. We hit the issue at a rate of ~10 files per week with an app producing hundreds of files per hour.

@amogh-jahagirdar And yes, I know that the file name is not only the query id. But I think the other elements can definitely repeat (`taskId` and `partitionId`). What I'm not sure about is the `fileCount` part. I think it is kept in memory but resets when the app is restarted (i.e. it is not part of the state).

So my point is: with a UUID this can't happen (barring a UUID collision), because any collision in the other parts of the filename is covered by a unique component. Another fix could be to keep the `operationId` but add a UUID as well. This would make the file names a little longer but is probably fine if it avoids data-loss issues.
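A minimal sketch of that second suggestion, assuming a hypothetical `newDataFileName` helper (the actual writer-side naming code is not shown in this thread); it keeps the existing components and appends a random UUID so a repeated `taskId`/`partitionId`/`fileCount` combination can no longer produce the same name:

```java
import java.util.UUID;

public final class FileNames {
  private FileNames() {}

  // Hypothetical helper: keep the existing components and append a UUID so a
  // restart that resets fileCount cannot regenerate an identical file name.
  static String newDataFileName(int partitionId, long taskId, String operationId, int fileCount) {
    return String.format(
        "%05d-%d-%s-%05d-%s.parquet",
        partitionId, taskId, operationId, fileCount, UUID.randomUUID());
  }

  public static void main(String[] args) {
    System.out.println(newDataFileName(0, 1, "query-id", 1));
  }
}
```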
Re: [I] Does iceberg flink streaming read support recover from last ckp? [iceberg]
qianzhen0 commented on issue #9175: URL: https://github.com/apache/iceberg/issues/9175#issuecomment-1831508633

@pvary thanks for the input! By `IcebergSource`, do you mean the Iceberg Flink connector, or the Spark engine?

So, if I run `insert into sink select * from iceberg_table` in streaming mode (with a checkpoint interval configured), it should recover reading from the last checkpoint.
Re: [I] An exception occurred while writing iceberg data through Spark: org.apache.iceberg.exceptions.CommitFailedException: metadata location has changed [iceberg]
nk1506 commented on issue #9178: URL: https://github.com/apache/iceberg/issues/9178#issuecomment-1831614906

This looks like a concurrent commit issue. Is there any chance you are running these queries in parallel?
Re: [I] DataFile creation by file path seems wrong [iceberg]
gaborkaszab closed issue #7612: DataFile creation by file path seems wrong
URL: https://github.com/apache/iceberg/issues/7612
Re: [I] DataFile creation by file path seems wrong [iceberg]
gaborkaszab commented on issue #7612: URL: https://github.com/apache/iceberg/issues/7612#issuecomment-1831734715

https://github.com/apache/iceberg/pull/7744 solves this
Re: [I] Spark3 DSv2: Handle Snapshots of Type OVERWRITE - while reading Stream from Iceberg table [iceberg]
imonteroq commented on issue #2788: URL: https://github.com/apache/iceberg/issues/2788#issuecomment-1831736292

Hi @SreeramGarlapati, @tmnd1991, are there any plans to implement this? This is a selling point for Hudi over Iceberg.
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1409243044

## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
```diff
@@ -264,6 +265,163 @@ public void renameTable(TableIdentifier from, TableIdentifier originalTo) {
     }
   }
 
+  @Override
+  public boolean viewExists(TableIdentifier identifier) {
+    return HiveCatalogUtil.isTableWithTypeExists(clients, identifier, TableType.VIRTUAL_VIEW);
+  }
+
+  @Override
+  public boolean dropView(TableIdentifier identifier) {
```

Review Comment: I am asking, because for the `dropTable` we have a `purge` flag which drives if the metadata is removed, or not...
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1409251564

## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java:
```diff
@@ -162,8 +163,11 @@ protected void doRefresh() {
         Thread.currentThread().interrupt();
         throw new RuntimeException("Interrupted during refresh", e);
       }
-
-    refreshFromMetadataLocation(metadataLocation, metadataRefreshMaxRetries);
+    if (table != null && table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) {
+      disableRefresh();
```

Review Comment: I am curious, do you know why we do not refresh?
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1409253203

## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java:
```diff
@@ -250,14 +262,15 @@ protected void doCommit(TableMetadata base, TableMetadata metadata) {
               + "iceberg.hive.lock-heartbeat-interval-ms.",
           le);
     } catch (org.apache.hadoop.hive.metastore.api.AlreadyExistsException e) {
-      throw new AlreadyExistsException(e, "Table already exists: %s.%s", database, tableName);
-
+      throw new org.apache.iceberg.exceptions.AlreadyExistsException(
```

Review Comment: Question: did you intentionally use the fully qualified name here?
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1409256826

## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java:
```diff
@@ -181,4 +186,30 @@ default Table newHmsTable(String hmsTableOwner) {
     return newTable;
   }
+
+  default void setHmsParameters(
+      Table tbl,
+      String newMetadataLocation,
+      Schema schema,
+      String uuid,
+      Supplier previousLocationSupplier) {
+    Map parameters =
+        Optional.ofNullable(tbl.getParameters()).orElseGet(Maps::newHashMap);
+
+    if (uuid != null) {
+      parameters.put(TableProperties.UUID, uuid);
+    }
```

Review Comment: nit: newline

## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java:
```diff
+    parameters.put(BaseMetastoreTableOperations.METADATA_LOCATION_PROP, newMetadataLocation);
+    parameters.put(
+        BaseMetastoreTableOperations.TABLE_TYPE_PROP,
+        BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.toUpperCase(Locale.ENGLISH));
+
+    if (previousLocationSupplier.get() != null && !previousLocationSupplier.get().isEmpty()) {
+      parameters.put(
+          BaseMetastoreTableOperations.PREVIOUS_METADATA_LOCATION_PROP,
+          previousLocationSupplier.get());
+    }
```

Review Comment: nit: newline
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1409258375

## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java:
```diff
@@ -328,6 +346,14 @@ private void setHmsTableParameters(
       Set obsoleteProps,
       boolean hiveEngineEnabled,
       Map summary) {
+
+    setHmsParameters(
```

Review Comment: Wouldn't the different ordering of this and the metadata translation cause changes?
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1409261131

## .palantir/revapi.yml:
```diff
@@ -877,6 +877,27 @@ acceptedBreaks:
   - code: "java.field.serialVersionUIDChanged"
     new: "field org.apache.iceberg.util.SerializableMap.serialVersionUID"
     justification: "Serialization is not be used"
+  - code: "java.method.nowStatic"
+    old: "method org.apache.iceberg.BaseMetastoreTableOperations.CommitStatus org.apache.iceberg.BaseMetastoreTableOperations::checkCommitStatus(java.lang.String,\
+      \ org.apache.iceberg.TableMetadata)"
+    new: "method org.apache.iceberg.BaseMetastoreTableOperations.CommitStatus org.apache.iceberg.BaseMetastoreTableOperations::checkCommitStatus(java.lang.String,\
+      \ java.lang.String, java.util.Map, java.util.function.Function)"
+    justification: "Increased visibility and refactor methods for view operation"
+  - code: "java.method.numberOfParametersChanged"
+    old: "method org.apache.iceberg.BaseMetastoreTableOperations.CommitStatus org.apache.iceberg.BaseMetastoreTableOperations::checkCommitStatus(java.lang.String,\
+      \ org.apache.iceberg.TableMetadata)"
+    new: "method org.apache.iceberg.BaseMetastoreTableOperations.CommitStatus org.apache.iceberg.BaseMetastoreTableOperations::checkCommitStatus(java.lang.String,\
+      \ java.lang.String, java.util.Map, java.util.function.Function)"
+    justification: "Increased visibility and refactor methods for view operation"
+  - code: "java.method.visibilityIncreased"
```

Review Comment: This is probably acceptable
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
pvary commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1409262085

## .palantir/revapi.yml:
```diff
@@ -877,6 +877,27 @@ acceptedBreaks:
   - code: "java.field.serialVersionUIDChanged"
     new: "field org.apache.iceberg.util.SerializableMap.serialVersionUID"
     justification: "Serialization is not be used"
+  - code: "java.method.nowStatic"
+    old: "method org.apache.iceberg.BaseMetastoreTableOperations.CommitStatus org.apache.iceberg.BaseMetastoreTableOperations::checkCommitStatus(java.lang.String,\
+      \ org.apache.iceberg.TableMetadata)"
+    new: "method org.apache.iceberg.BaseMetastoreTableOperations.CommitStatus org.apache.iceberg.BaseMetastoreTableOperations::checkCommitStatus(java.lang.String,\
+      \ java.lang.String, java.util.Map, java.util.function.Function)"
+    justification: "Increased visibility and refactor methods for view operation"
+  - code: "java.method.numberOfParametersChanged"
```

Review Comment: I am not sure about this. Do we need a deprecation, and default method for this to be backward compatible?
Re: [I] Does iceberg flink streaming read support recover from last ckp? [iceberg]
pvary commented on issue #9175: URL: https://github.com/apache/iceberg/issues/9175#issuecomment-1831887127

`IcebergSource` is using the new Flink Source API - this is the newer one.
`FlinkSource` is using the old Flink SourceFunction API - this will be removed (I plan to deprecate it in the next Iceberg release).

When using the Table API, the `table.exec.iceberg.use-flip27-source` config option can be used to choose between the two versions. See: https://iceberg.apache.org/docs/1.3.1/flink-queries/#flip-27-source-for-sql
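For reference, a minimal sketch of toggling that option from the Java Table API; the option key comes from the linked documentation, while the surrounding setup and table names are illustrative only:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class Flip27SourceToggle {
  public static void main(String[] args) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60_000); // checkpoint interval for streaming recovery
    StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

    // Choose the FLIP-27 IcebergSource instead of the legacy SourceFunction-based FlinkSource.
    tEnv.getConfig().getConfiguration().setBoolean("table.exec.iceberg.use-flip27-source", true);

    // Illustrative streaming read; with checkpointing enabled, the job resumes
    // from the last successful checkpoint after a restart.
    tEnv.executeSql("INSERT INTO sink SELECT * FROM iceberg_table");
  }
}
```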
[I] Table scan using functional filters [iceberg-python]
bigluck opened a new issue, #170: URL: https://github.com/apache/iceberg-python/issues/170

### Feature Request / Improvement

Ciao @Fokko

It seems that `table.scan()` supports a limited set of `filter` conditions, and it fails when a user specifies a complex one. In my case, I have this input query:

```sql
SELECT *
FROM wind_energy_sensor_data
WHERE observed_at::date = '2020-03-03'
```

Once mapped into a list of fields and filters using sqlglot, I get:

```
{
    "filter": "CAST(observed_at AS DATE) = '2020-03-03'",
    "name": "wind_energy_sensor_data",
    "projections": [
        "*",
        "observed_at"
    ]
}
```

But when I pass the filter to the `table.scan()` function, it raises this `ParseException`:

```python
scan = table.scan(
       ^^^
  File "/pip/pyiceberg/pyiceberg/table/__init__.py", line 473, in scan
    return DataScan(
           ^
  File "/pip/pyiceberg/pyiceberg/table/__init__.py", line 773, in __init__
    super().__init__(table, row_filter, selected_fields, case_sensitive, snapshot_id, options, limit)
  File "/pip/pyiceberg/pyiceberg/table/__init__.py", line 629, in __init__
    self.row_filter = _parse_row_filter(row_filter)
                      ^
  File "/pip/pyiceberg/pyiceberg/table/__init__.py", line 603, in _parse_row_filter
    return parser.parse(expr) if isinstance(expr, str) else expr
           ^^
  File "/pip/pyiceberg/pyiceberg/expressions/parser.py", line 270, in parse
    return boolean_expression.parse_string(expr, parse_all=True)[0]
           ^
  File "/pip/pyparsing/pyparsing/core.py", line 1197, in parse_string
    raise exc.with_traceback(None)
pyparsing.exceptions.ParseException: Expected expr, found '(' (at char 4), (line:1, col:5)
```

How hard would it be to extend support for functional filters (such as `CAST()` in my case)? And more importantly, does it make sense to have this, or do pyiceberg (and Iceberg in general) expect these kinds of conditions in a different format?

thanks, Luca
Re: [PR] Build: Bump cryptography from 41.0.4 to 41.0.6 [iceberg-python]
Fokko merged PR #169: URL: https://github.com/apache/iceberg-python/pull/169
Re: [PR] Update table metadata [iceberg-python]
Fokko commented on code in PR #139: URL: https://github.com/apache/iceberg-python/pull/139#discussion_r1409338251

## pyiceberg/table/__init__.py:
```diff
@@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate):
     removals: List[str]
 
 
+class TableMetadataUpdateContext:
+    updates: List[TableUpdate]
+    last_added_schema_id: Optional[int]
+
+    def __init__(self) -> None:
+        self.updates = []
+        self.last_added_schema_id = None
+
+    def is_added_snapshot(self, snapshot_id: int) -> bool:
+        return any(
+            update.snapshot.snapshot_id == snapshot_id
+            for update in self.updates
+            if update.action == TableUpdateAction.add_snapshot
+        )
+
+    def is_added_schema(self, schema_id: int) -> bool:
+        return any(
+            update.schema_.schema_id == schema_id for update in self.updates if update.action == TableUpdateAction.add_schema
+        )
+
+
+@singledispatch
+def apply_table_update(update: TableUpdate, base_metadata: TableMetadata, context: TableMetadataUpdateContext) -> TableMetadata:
+    """Apply a table update to the table metadata.
+
+    Args:
+        update: The update to be applied.
+        base_metadata: The base metadata to be updated.
+        context: Contains previous updates, last_added_snapshot_id and other change tracking information in the current transaction.
+
+    Returns:
+        The updated metadata.
+
+    """
+    raise NotImplementedError(f"Unsupported table update: {update}")
+
+
+@apply_table_update.register(UpgradeFormatVersionUpdate)
+def _(update: UpgradeFormatVersionUpdate, base_metadata: TableMetadata, context: TableMetadataUpdateContext) -> TableMetadata:
+    if update.format_version > SUPPORTED_TABLE_FORMAT_VERSION:
+        raise ValueError(f"Unsupported table format version: {update.format_version}")
+
+    if update.format_version < base_metadata.format_version:
+        raise ValueError(f"Cannot downgrade v{base_metadata.format_version} table to v{update.format_version}")
+
+    if update.format_version == base_metadata.format_version:
+        return base_metadata
+
+    updated_metadata_data = copy(base_metadata.model_dump())
+    updated_metadata_data["format-version"] = update.format_version
```

Review Comment: In that case, we can be cautious and set `deep=true`. I would love to see some tests that validate the behavior. Those should be easy to add.
Re: [PR] Update table metadata [iceberg-python]
Fokko commented on code in PR #139: URL: https://github.com/apache/iceberg-python/pull/139#discussion_r1409336199

## pyiceberg/table/__init__.py:
```diff
@@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate):
     removals: List[str]
 
 
+class TableMetadataUpdateContext:
+    updates: List[TableUpdate]
+    last_added_schema_id: Optional[int]
```

Review Comment: I prefer to keep them separate. I think we might need to have some additional checks here, such as what happens if you add a column and then revoke the column again. It will first create a new schema with a new ID, and then it will reuse the old schema again.

- 1: Schema(a: int), current_schema_id=1

Add column b:

- 1: Schema(a: int), 2: Schema(a: int, b: int), current_schema_id=2

Drop column b:

- 1: Schema(a: int), 2: Schema(a: int, b: int), current_schema_id=1
Re: [PR] Update table metadata [iceberg-python]
Fokko commented on code in PR #139: URL: https://github.com/apache/iceberg-python/pull/139#discussion_r1409362760

## pyiceberg/table/__init__.py:
```diff
+@apply_table_update.register(AddSchemaUpdate)
+def _(update: AddSchemaUpdate, base_metadata: TableMetadata, context: TableMetadataUpdateContext) -> TableMetadata:
+    def reuse_or_create_new_schema_id(new_schema: Schema) -> Tuple[int, bool]:
+        """Reuse schema id if schema already exists, otherwise create a new one.
+
+        Args:
+            new_schema: The new schema to be added.
+
+        Returns:
+            The new schema id and whether the schema already exists.
+        """
+        result_schema_id = base_metadata.current_schema_id
+        for schema in base_metadata.schemas:
+            if schema == new_schema:
+                return schema.schema_id, True
+            elif schema.schema_id >= result_schema_id:
+                result_schema_id = schema.schema_id + 1
+        return result_schema_id, False
+
+    if update.last_column_id < base_metadata.last_column_id:
+        raise ValueError(f"Invalid last column id {update.last_column_id}, must be >= {base_metadata.last_column_id}")
+
+    new_schema_id, schema_found = reuse_or_create_new_schema_id(update.schema_)
+    if schema_found and update.last_column_id == base_metadata.last_column_id:
+        if context.last_added_schema_id is not None and context.is_added_schema(new_schema_id):
+            context.last_added_schema_id = new_schema_id
+        return base_metadata
+
+    updated_metadata_data = copy(base_metadata.model_dump())
+    updated_metadata_data["last-column-id"] = update.last_column_id
+
+    new_schema = (
+        update.schema_
+        if new_schema_id == update.schema_.schema_id
+        else Schema(*update.schema_.fields, schema_id=new_schema_id, identifier_field_ids=update.schema_.identifier_field_ids)
+    )
+
+    if not schema_found:
+        updated_metadata_data["schemas"].append(new_schema.model_dump())
+
+    context.updates.append(update)
+    context.last_added_schema_id = new_schema_id
+    return TableMetadataUtil.parse_obj(updated_metadata_data)
+
+
+@apply_table_update.register(SetCurrentSchemaUpdate)
+def _(update: SetCurrentSchemaUpdate, base_metadata: TableMetadata, context: TableMetadataUpdateContext) -> TableMetadata:
+    if update.schema_id == -1:
+        if context.last_added_schema_id is None:
+            raise ValueError("Cannot set current schema to last added schema when no schema has been added")
+        return apply_table_update(SetCurrentSchemaUpdate(schema_id=context.last_added_schema_id), base_metadata, context)
+
+    if update.schema_id == base_metadata.current_schema_id:
+        return base_metadata
+
+    schema = base_m
```
Re: [PR] Update table metadata [iceberg-python]
Fokko commented on code in PR #139: URL: https://github.com/apache/iceberg-python/pull/139#discussion_r1409365694

## pyiceberg/table/__init__.py:
```diff
+@apply_table_update.register(AddSchemaUpdate)
+def _(update: AddSchemaUpdate, base_metadata: TableMetadata, context: TableMetadataUpdateContext) -> TableMetadata:
+    def reuse_or_create_new_schema_id(new_schema: Schema) -> Tuple[int, bool]:
+        """Reuse schema id if schema already exists, otherwise create a new one.
+
+        Args:
+            new_schema: The new schema to be added.
+
+        Returns:
+            The new schema id and whether the schema already exists.
+        """
+        result_schema_id = base_metadata.current_schema_id
+        for schema in base_metadata.schemas:
+            if schema == new_schema:
+                return schema.schema_id, True
+            elif schema.schema_id >= result_schema_id:
+                result_schema_id = schema.schema_id + 1
+        return result_schema_id, False
+
+    if update.last_column_id < base_metadata.last_column_id:
+        raise ValueError(f"Invalid last column id {update.last_column_id}, must be >= {base_metadata.last_column_id}")
+
+    new_schema_id, schema_found = reuse_or_create_new_schema_id(update.schema_)
+    if schema_found and update.last_column_id == base_metadata.last_column_id:
+        if context.last_added_schema_id is not None and context.is_added_schema(new_schema_id):
+            context.last_added_schema_id = new_schema_id
+        return base_metadata
+
+    updated_metadata_data = copy(base_metadata.model_dump())
+    updated_metadata_data["last-column-id"] = update.last_column_id
+
+    new_schema = (
+        update.schema_
+        if new_schema_id == update.schema_.schema_id
+        else Schema(*update.schema_.fields, schema_id=new_schema_id, identifier_field_ids=update.schema_.identifier_field_ids)
+    )
```

Review Comment:

> Thanks for the explanation! Just to confirm my understanding, in pyiceberg, one transaction should only have one AddSchemaUpdate because all updates related to the schema should be accumulated via UpdateSchema.

This is correct. Currently, the `Update` has to be unique, so stacking two updates will even raise a `ValueError`.

> Hence, we can trust the schema_id in the update and do not need the logic to handle multiple schema added in one transaction.

This is correct 👍

## pyiceberg/table/__init__.py:
```diff
@@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate):
     removals: List[str]
 
 
+class TableMetadataUpdateContext:
+    updates: List[TableUpdate]
+    last_added_schema_id: Optional[int]
+
+    def __init__(self) -> None:
+        self.updates = []
+        self.last_added_schema
```
Re: [I] Parquet file overwritten by spark streaming job in subsequent execution with same spark streaming checkpoint location [iceberg]
amitmittal5 commented on issue #9172: URL: https://github.com/apache/iceberg/issues/9172#issuecomment-1832293641

Just adding more details on the table the job is writing to: I tried with both an external and a managed table, but I see similar results. It took around 10 executions of the streaming job (with the AvailableNow trigger) to reproduce the error for the managed table.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS default.blob_iceberg (id string, state string, name string)
USING ICEBERG
LOCATION 'abfss://@.dfs.core.windows.net/test/blob_iceberg'
```

And the managed table:

```sql
CREATE TABLE IF NOT EXISTS default.blob_iceberg1 (id string, state string, name string)
USING ICEBERG
```
Re: [PR] Build: Bump org.apache.httpcomponents.client5:httpclient5 from 5.2.1 to 5.2.2 [iceberg]
dependabot[bot] commented on PR #9157: URL: https://github.com/apache/iceberg/pull/9157#issuecomment-1832326624

Sorry, only users with push access can use that command.
Re: [PR] Build: Bump org.apache.httpcomponents.client5:httpclient5 from 5.2.1 to 5.2.2 [iceberg]
XN137 commented on PR #9157: URL: https://github.com/apache/iceberg/pull/9157#issuecomment-1832326554

@dependabot rebase
Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]
RussellSpitzer commented on code in PR #8755: URL: https://github.com/apache/iceberg/pull/8755#discussion_r1409617517

## api/src/main/java/org/apache/iceberg/types/TypeUtil.java:
```diff
@@ -452,6 +454,59 @@ private static void checkSchemaCompatibility(
     }
   }
 
+  public static long defaultSize(Types.NestedField field) {
```

Review Comment: Why do we need to cache the extra columns? We wouldn't be using them?
Re: [PR] Add Description on Using a Separate Authorization Server [iceberg]
mrcnc commented on code in PR #8998: URL: https://github.com/apache/iceberg/pull/8998#discussion_r1409630922

## open-api/rest-catalog-open-api.yaml:
```diff
@@ -2872,6 +2872,10 @@ components:
     For unauthorized requests, services should return an appropriate 401 or 403 response.
     Implementations must not return altered success (200) responses when a request is
     unauthenticated or unauthorized.
+
+    If a separate authorization server is used, subtitute the tokenUrl with
```

Review Comment:
```suggestion
    If a separate authorization server is used, substitute the tokenUrl with
```
Re: [PR] Flink: Document watermark generation feature [iceberg]
stevenzwu commented on code in PR #9179: URL: https://github.com/apache/iceberg/pull/9179#discussion_r1409630306

## docs/flink-queries.md:
````diff
@@ -277,6 +277,58 @@ DataStream stream = env.fromSource(source, WatermarkStrategy.noWatermarks()
     "Iceberg Source as Avro GenericRecord",
     new GenericRecordAvroTypeInfo(avroSchema));
 ```
 
+### Emitting watermarks
+Emitting watermarks from the source itself could be beneficial for several purposes, like harnessing the
+[Flink Watermark Alignment](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment)
+feature to prevent runaway readers, or providing triggers for [Flink windowing](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/operators/windows/).
````

Review Comment: I don't think we should call out `windows` specifically for the benefit of emitting watermark from source itself. any event time and watermark strategy will have windows triggers

## docs/flink-queries.md:
````diff
+Enable watermark generation for an `IcebergSource` by setting the `watermarkColumn`.
+The supported column types are `timestamp`, `timestamptz` and `long`.
+Timestamp columns are automatically converted to milliseconds since the Java epoch of
+1970-01-01T00:00:00Z. Use `watermarkTimeUnit` to configure the conversion for long columns.
+
+The watermarks are generated based on column metrics stored for data files and emitted once per split.
+When using watermarks for Flink watermark alignment set `read.split.open-file-cost` to prevent
+combining multiple files to a single split.
+By default, the column metrics are collected for the first 100 columns of the table. Use [write properties](configuration.md#write-properties) starting with `write.metadata.metrics` when needed.
````

Review Comment: break off this line to a new paragraph.

## docs/flink-queries.md:
````diff
+Enable watermark generation for an `IcebergSource` by setting the `watermarkColumn`.
+The supported column types are `timestamp`, `timestamptz` and `long`.
+Timestamp columns are automatically converted to milliseconds since the Java epoch of
+1970-01-01T00:00:00Z. Use `watermarkTimeUnit` to configure the conversion for long columns.
````

Review Comment:

> Use `watermarkTimeUnit` to configure the conversion for long columns.

Before this sentence, maybe elaborate a little more.

```
Iceberg `timestamp` or `timestamptz` inherently contains the time precision. So there is no need to specify the time unit. But `long` type column doesn't contain time unit information. Use `watermarkTimeUnit` to configure the conversion for long columns.
```
Re: [PR] Flink: Document watermark generation feature [iceberg]
stevenzwu commented on PR #9179: URL: https://github.com/apache/iceberg/pull/9179#issuecomment-1832386184

@dchristle can you also help review this doc PR? Your perspective can help improve the readability of the doc.
Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]
RussellSpitzer commented on code in PR #8755: URL: https://github.com/apache/iceberg/pull/8755#discussion_r1409668358

## core/src/main/java/org/apache/iceberg/SystemConfigs.java:
```diff
@@ -43,14 +43,14 @@ private SystemConfigs() {}
       Integer::parseUnsignedInt);
 
   /**
-   * Sets the size of the delete worker pool. This limits the number of threads used to compute the
-   * PositionDeleteIndex from the position deletes for a data file.
+   * Sets the size of the delete worker pool. This limits the number of threads used to read delete
+   * files for a data file.
    */
   public static final ConfigEntry DELETE_WORKER_THREAD_POOL_SIZE =
       new ConfigEntry<>(
           "iceberg.worker.delete-num-threads",
           "ICEBERG_WORKER_DELETE_NUM_THREADS",
-          Math.max(2, Runtime.getRuntime().availableProcessors()),
+          Math.max(2, 4 * Runtime.getRuntime().availableProcessors()),
```

Review Comment: Not a huge deal, but we are avoiding the RevCheck here by putting our multiplier in a constant. We should probably move the 4 into a field so future modifications trigger the Rev checker.
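A minimal sketch of the suggested change, with a hypothetical field name (`DELETE_WORKER_THREAD_POOL_SIZE_MULTIPLIER` is illustrative, not the name used in the PR):

```java
public final class DeleteWorkerPoolDefaults {
  private DeleteWorkerPoolDefaults() {}

  // Hypothetical named field: hoisting the multiplier out of the inline
  // ConfigEntry initializer makes a future change to the default sizing an
  // explicit, reviewable constant change.
  static final int DELETE_WORKER_THREAD_POOL_SIZE_MULTIPLIER = 4;

  // Mirrors the default used by the config entry: at least 2 threads,
  // otherwise multiplier * available processors.
  static int defaultDeleteWorkerPoolSize() {
    return Math.max(
        2, DELETE_WORKER_THREAD_POOL_SIZE_MULTIPLIER * Runtime.getRuntime().availableProcessors());
  }
}
```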
Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]
RussellSpitzer commented on code in PR #8755: URL: https://github.com/apache/iceberg/pull/8755#discussion_r1409670268

## core/src/main/java/org/apache/iceberg/deletes/BitmapPositionDeleteIndex.java:
```diff
@@ -27,6 +27,15 @@ class BitmapPositionDeleteIndex implements PositionDeleteIndex {
     roaring64Bitmap = new Roaring64Bitmap();
   }
 
+  void merge(PositionDeleteIndex other) {
```

Review Comment: Why not just only allow BitmapPositionDeleteIndex here? Do we not have the type when we call merge?
Re: [PR] Flink: Document watermark generation feature [iceberg]
mas-chen commented on code in PR #9179: URL: https://github.com/apache/iceberg/pull/9179#discussion_r1409704342

## docs/flink-queries.md:
````diff
@@ -277,6 +277,58 @@ DataStream stream = env.fromSource(source, WatermarkStrategy.noWatermarks()
     "Iceberg Source as Avro GenericRecord",
     new GenericRecordAvroTypeInfo(avroSchema));
 ```
 
+### Emitting watermarks
+Emitting watermarks from the source itself could be beneficial for several purposes, like harnessing the
+[Flink Watermark Alignment](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment)
````

Review Comment: Should we use the stable documentation to avoid stale links? Ditto on the other references to Flink docs
```suggestion
[Flink Watermark Alignment](https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment)
```

## docs/flink-queries.md:
````diff
+Enable watermark generation for an `IcebergSource` by setting the `watermarkColumn`.
+The supported column types are `timestamp`, `timestamptz` and `long`.
+Timestamp columns are automatically converted to milliseconds since the Java epoch of
+1970-01-01T00:00:00Z. Use `watermarkTimeUnit` to configure the conversion for long columns.
+
+The watermarks are generated based on column metrics stored for data files and emitted once per split.
+When using watermarks for Flink watermark alignment set `read.split.open-file-cost` to prevent
+combining multiple files to a single split.
+By default, the column metrics are collected for the first 100 columns of the table. Use [write properties](configuration.md#write-properties) starting with `write.metadata.metrics` when needed.
+
+```java
+StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
+TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path");
+
+// For windowing
+DataStream stream =
+    env.fromSource(
+        IcebergSource.forRowData()
+            .tableLoader(tableLoader)
+            // Watermark using timestamp column
+            .watermarkColumn("timestamp_column")
+            .build(),
+        // Watermarks are generated by the source, no need to generate it manually
+        WatermarkStrategy.noWatermarks()
+            // Extract event timestamp from records
+            .withTimestampAssigner((record, eventTime) -> record.getTimestamp(pos, precision).getMillisecond()),
+        SOURCE_NAME,
+        TypeInformation.of(RowData.class));
+
+// For watermark alignment
+DataStream stream =
+    env.fromSource(
+        IcebergSource source = IcebergSource.forRowData()
+            .tableLoader(tableLoader)
+            // Disable combining multiple files to a single split
+            .set(FlinkReadOptions.SPLIT_FILE_OPEN_COST, String.valueOf(TableProperties.SPLIT_SIZE_DEFAULT))
+            // Watermark using long column
+            .watermarkColumn("long_column")
+            .watermarkTimeUnit(TimeUnit.MILLI_SCALE)
````

Review Comment: I'd include this in the previous example.

I read this as a more advanced example, as most users wouldn't need watermark alignment, and so `withTimestampAssigner` could also be moved down here.
Re: [PR] Flink: Document watermark generation feature [iceberg]
mas-chen commented on code in PR #9179: URL: https://github.com/apache/iceberg/pull/9179#discussion_r1409722185

## docs/flink-queries.md:
```diff
+The watermarks are generated based on column metrics stored for data files and emitted once per split.
+When using watermarks for Flink watermark alignment set `read.split.open-file-cost` to prevent
+combining multiple files to a single split.
```

Review Comment: Is there a tradeoff for increasing `read.split.open-file-cost`? Would a user need to also tune the split throttling parameter to avoid memory issues?
Re: [PR] Flink: Fix IcebergSource tableloader lifecycle management in batch mode [iceberg]
mas-chen commented on code in PR #9173: URL: https://github.com/apache/iceberg/pull/9173#discussion_r1409747023

## flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/source/IcebergSource.java:
```diff
@@ -120,26 +122,30 @@ private String planningThreadName() {
     // a public API like the protected method "OperatorCoordinator.Context getCoordinatorContext()"
     // from SourceCoordinatorContext implementation. For now, - is used as
     // the unique thread pool name.
-    return lazyTable().name() + "-" + UUID.randomUUID();
+    return tableName() + "-" + UUID.randomUUID();
   }
 
-  private List planSplitsForBatch(String threadName) {
+  private List planSplitsForBatch(
+      TableLoader plannerTableLoader, String threadName) {
     ExecutorService workerPool =
         ThreadPools.newWorkerPool(threadName, scanContext.planParallelism());
-    try {
+    plannerTableLoader.open();
```

Review Comment: I forgot to rename this variable earlier.
Re: [PR] API, Spark: Fix aggregation pushdown on struct fields [iceberg]
Fokko commented on code in PR #9176: URL: https://github.com/apache/iceberg/pull/9176#discussion_r1409850325

## api/src/main/java/org/apache/iceberg/expressions/ValueAggregate.java:
```diff
@@ -30,13 +30,16 @@ protected ValueAggregate(Operation op, BoundTerm term) {
 
   @Override
   public T eval(StructLike struct) {
-    return term().eval(struct);
+    if (struct.size() > 1) {
+      throw new UnsupportedOperationException("Expected struct like of size 1");
```

Review Comment:
```suggestion
      throw new UnsupportedOperationException("Expected struct like of size smaller or equal to 1");
```
[PR] Core: Add Comment field to ViewProperties [iceberg]
amogh-jahagirdar opened a new pull request, #9181: URL: https://github.com/apache/iceberg/pull/9181

We reference using "comment" as a property in the view spec, but it looks like we don't have a constant defined in the library. See https://github.com/trinodb/trino/pull/19818#discussion_r1408354022 for more details, but I think "comment" should be defined in the Iceberg library as it is not engine-specific.
Re: [I] Table scan using functional filters [iceberg-python]
Fokko commented on issue #170: URL: https://github.com/apache/iceberg-python/issues/170#issuecomment-1832717377 It looks like we need to add the BoundTransform to the left hand side: https://github.com/apache/iceberg/blob/d247b20f166ccb0b92443d4b05330b1e0d9c5d49/api/src/main/java/org/apache/iceberg/transforms/ProjectionUtil.java#L231-L241 When the table is partitioned by `day(observed_at)`, then we can use it to do the optimization (the transform can be dropped from both sides). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
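To illustrate the kind of predicate being discussed, here is a sketch using the Java expression API; the column name and literal are placeholders, and the exact literal handling happens at binding time:

```java
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;

// A filter on a transformed term. If the table is partitioned by day(observed_at),
// the day() transform appears on both sides of the comparison, so it can be dropped
// and the predicate projected directly onto the partition values.
Expression filter = Expressions.equal(Expressions.day("observed_at"), "2023-11-29");
```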
[I] Core: Snapshot files are not correctly deleted when a snapshot is expired as part of a transaction [iceberg]
bartash opened a new issue, #9182: URL: https://github.com/apache/iceberg/issues/9182 ### Apache Iceberg version main (development) ### Query engine Impala ### Please describe the bug 🐞 When a snapshot is expired as part of a transaction, the snapshot file(s) should be deleted when the transaction commits. A recent change (PR #6634) ensures that files are not deleted when they have also been committed as part of a transaction, but this breaks the simple case where no new files are committed. I have a candidate fix which I will post in a PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
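A minimal reproduction of the scenario, mirroring the transaction pattern used in the project's tests (the catalog, table identifier, and snapshot id are placeholders):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.Transaction;

// Expire a snapshot inside a transaction. After commitTransaction() the expired
// snapshot's manifest list should be removed from storage, but with the current
// behavior it is left behind because the transaction skips the cleanup step.
Table table = catalog.loadTable(identifier);  // placeholder: any table with at least two snapshots
Transaction txn = table.newTransaction();
txn.expireSnapshots().expireSnapshotId(firstSnapshotId).commit();
txn.commitTransaction();
```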
Re: [I] Core: Snapshot files are not correctly deleted when a snapshot is expired as part of a transaction [iceberg]
bartash commented on issue #9182: URL: https://github.com/apache/iceberg/issues/9182#issuecomment-1832727577 Can someone assign this to me please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Core: Expired Snapshot files in a transaction should be deleted. [iceberg]
bartash opened a new pull request, #9183: URL: https://github.com/apache/iceberg/pull/9183 When a snapshot is expired as part of a transaction, the snapshot file(s) should be deleted when the transaction commits. A recent change (#6634) ensured that files are not deleted when they have also been committed as part of a transaction, but this breaks the simple case where no new files are committed. Fix this by not skipping deletion when the list of committed files is empty. Closes #9182 TESTING: Extended a unit test to ensure that snapshot files are deleted. Ran the test without the fix on a branch where #6634 was reverted to show that this is a regression. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Expired Snapshot files in a transaction should be deleted. [iceberg]
bartash commented on PR #9183: URL: https://github.com/apache/iceberg/pull/9183#issuecomment-1832734372 @amogh-jahagirdar can you please take a look when you have time as you implemented #6634 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Parquet file overwritten by spark streaming job in subsequent execution with same spark streaming checkpoint location [iceberg]
amitmittal5 commented on issue #9172: URL: https://github.com/apache/iceberg/issues/9172#issuecomment-1832775989 I also tested with the latest version, **iceberg-spark-runtime-3.4_2.12-1.4.2.jar**. I could see that the second number in the file name kept increasing (1-**3200**-11773075-523f-4667-936b-88702fe9860c-1.parquet); however, after around 200 executions of the stream, the file name got reset (1-3166-11773075-523f-4667-936b-88702fe9860c-1.parquet) and files started getting overwritten. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] metadata json conflict when streaming [iceberg]
amitmittal5 commented on issue #9171: URL: https://github.com/apache/iceberg/issues/9171#issuecomment-1832783780 Hello, I am also running a Spark streaming job with the latest versions of Spark and Iceberg, and I am seeing data files getting overwritten in subsequent stream executions. I have raised my issue here: https://github.com/apache/iceberg/issues/9172, so I am wondering whether it has the same root cause as this one. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[PR] Build: Bump mypy-boto3-glue from 1.29.2 to 1.33.0 [iceberg-python]
dependabot[bot] opened a new pull request, #171: URL: https://github.com/apache/iceberg-python/pull/171 Bumps [mypy-boto3-glue](https://github.com/youtype/mypy_boto3_builder) from 1.29.2 to 1.33.0. Commits: see the full diff in the [compare view](https://github.com/youtype/mypy_boto3_builder/commits). Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Expired Snapshot files in a transaction should be deleted. [iceberg]
amogh-jahagirdar commented on PR #9183: URL: https://github.com/apache/iceberg/pull/9183#issuecomment-1832872079 Thanks for the PR @bartash I'm taking a look. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Expired Snapshot files in a transaction should be deleted. [iceberg]
amogh-jahagirdar commented on code in PR #9183: URL: https://github.com/apache/iceberg/pull/9183#discussion_r1410002841 ## core/src/main/java/org/apache/iceberg/BaseTransaction.java: ## @@ -446,20 +446,16 @@ private void commitSimpleTransaction() { } Set committedFiles = committedFiles(ops, newSnapshots); - if (committedFiles != null) { -// delete all of the files that were deleted in the most recent set of operation commits -Tasks.foreach(deletedFiles) -.suppressFailureWhenFinished() -.onFailure((file, exc) -> LOG.warn("Failed to delete uncommitted file: {}", file, exc)) -.run( -path -> { - if (!committedFiles.contains(path)) { -ops.io().deleteFile(path); - } -}); - } else { -LOG.warn("Failed to load metadata for a committed snapshot, skipping clean-up"); - } + // delete all of the files that were deleted in the most recent set of operation commits + Tasks.foreach(deletedFiles) + .suppressFailureWhenFinished() + .onFailure((file, exc) -> LOG.warn("Failed to delete uncommitted file: {}", file, exc)) + .run( + path -> { +if (committedFiles == null || !committedFiles.contains(path)) { Review Comment: Ah yes, good catch. If there are no committed files (which would be expected for a transaction with just `ExpireSnapshots`) but there are files to cleanup (which would be expected for `ExpireSnapshots` again) then we should proceed with the file removal. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] How to set Spark conf to use Parquet and Iceberg tables using glue catalog without catalog name(spark_catalog)? [iceberg]
github-actions[bot] commented on issue #7748: URL: https://github.com/apache/iceberg/issues/7748#issuecomment-1832897578 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] When using flink dataframe api for upsert, the deduplication effect is not achieved [iceberg]
github-actions[bot] commented on issue #7639: URL: https://github.com/apache/iceberg/issues/7639#issuecomment-1832897644 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Iceberg with Hive Metastore does not create a catalog in Spark and uses default [iceberg]
github-actions[bot] closed issue #7574: Iceberg with Hive Metastore does not create a catalog in Spark and uses default URL: https://github.com/apache/iceberg/issues/7574 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Iceberg with Hive Metastore does not create a catalog in Spark and uses default [iceberg]
github-actions[bot] commented on issue #7574: URL: https://github.com/apache/iceberg/issues/7574#issuecomment-1832897692 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Expired Snapshot files in a transaction should be deleted. [iceberg]
amogh-jahagirdar commented on code in PR #9183: URL: https://github.com/apache/iceberg/pull/9183#discussion_r1410012734 ## core/src/test/java/org/apache/iceberg/TestSequenceNumberForV2Table.java: ## @@ -319,12 +321,18 @@ public void testExpirationInTransaction() { V2Assert.assertEquals("Snapshot sequence number should be 2", 2, snap2.sequenceNumber()); V2Assert.assertEquals( "Last sequence number should be 2", 2, readMetadata().lastSequenceNumber()); +V2Assert.assertEquals( +"Should be 2 snapshot files", listSnapshotFiles(table.location()).size(), 2); Transaction txn = table.newTransaction(); txn.expireSnapshots().expireSnapshotId(commitId1).commit(); txn.commitTransaction(); V2Assert.assertEquals( "Last sequence number should be 2", 2, readMetadata().lastSequenceNumber()); +V2Assert.assertEquals( +"Should be 1 snapshot file as 1 was deleted", Review Comment: Nit: Typically we use the term "manifest list" instead of "snapshot file" could we call it that? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Does iceberg flink streaming read support recover from last ckp? [iceberg]
qianzhen0 closed issue #9175: Does iceberg flink streaming read support recover from last ckp? URL: https://github.com/apache/iceberg/issues/9175 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] An exception occurred while writing iceberg data through Spark: org.apache.iceberg.exceptions.CommitFailedException: metadata location has changed [iceberg]
AllenWee1106 commented on issue #9178: URL: https://github.com/apache/iceberg/issues/9178#issuecomment-1832991874 @nk1506 Thank you for your reply. Based on your analysis, I have the following questions: 1. I have multiple Spark tasks running at the same time. Is it possible that some of these tasks wrote to the same Iceberg table at the same time, and that this led to the problem? 2. In my code, I execute the Spark jobs in a loop (but the loop runs them one after another, not in parallel). Could this be the reason for the problem? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: support UnboundPartitionSpec [iceberg-rust]
my-vegetable-has-exploded commented on code in PR #106: URL: https://github.com/apache/iceberg-rust/pull/106#discussion_r1410082084 ## crates/iceberg/src/spec/partition.rs: ## @@ -60,6 +62,99 @@ impl PartitionSpec { } } +static PARTITION_DATA_ID_START: i32 = 1000; + +/// Reference to [`UnboundPartitionSpec`]. +pub type UnboundPartitionSpecRef = Arc; +#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)] +#[serde(rename_all = "kebab-case")] +/// Unbound partition field can be built without a schema and later bound to a schema. +pub struct UnboundPartitionField { +/// A source column id from the table’s schema +pub source_id: i32, +/// A partition field id that is used to identify a partition field and is unique within a partition spec. +/// In v2 table metadata, it is unique across all partition specs. +pub partition_id: Option, +/// A partition name. +pub name: String, +/// A transform that is applied to the source column to produce a partition value. +pub transform: Transform, +} + +#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Default, Builder)] Review Comment: I haven't found a neat way to add ```#[builder(setter(each(name = "with_partition_field")))]``` in TypedBuilder. According to [this](https://github.com/idanarye/rust-typed-builder/blob/168e076bae35399040a1c973901161023e48e13b/src/lib.rs#L218C19-L218C27), it may need to use mutators? Maybe we can try later? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: support UnboundPartitionSpec [iceberg-rust]
my-vegetable-has-exploded commented on code in PR #106: URL: https://github.com/apache/iceberg-rust/pull/106#discussion_r1410083813 ## crates/iceberg/src/spec/partition.rs: ## @@ -60,6 +62,99 @@ impl PartitionSpec { } } +static PARTITION_DATA_ID_START: i32 = 1000; + +/// Reference to [`UnboundPartitionSpec`]. +pub type UnboundPartitionSpecRef = Arc; +#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)] +#[serde(rename_all = "kebab-case")] +/// Unbound partition field can be built without a schema and later bound to a schema. +pub struct UnboundPartitionField { +/// A source column id from the table’s schema +pub source_id: i32, +/// A partition field id that is used to identify a partition field and is unique within a partition spec. +/// In v2 table metadata, it is unique across all partition specs. +pub partition_id: Option, +/// A partition name. +pub name: String, +/// A transform that is applied to the source column to produce a partition value. +pub transform: Transform, +} + +#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Default, Builder)] +#[serde(rename_all = "kebab-case")] +#[builder(setter(prefix = "with"))] +/// Unbound partition spec can be built without a schema and later bound to a schema. +pub struct UnboundPartitionSpec { +/// Identifier for PartitionSpec +pub spec_id: Option, +/// Details of the partition spec +#[builder(setter(each(name = "with_unbound_partition_field")))] +pub fields: Vec, +} + +impl UnboundPartitionSpec { +/// last assigned id for partitioned field +pub fn unpartitioned_last_assigned_id() -> i32 { +PARTITION_DATA_ID_START - 1 +} + +/// Create unbound partition spec builer +pub fn builder() -> UnboundPartitionSpecBuilder { +UnboundPartitionSpecBuilder::default() +} + +/// Bind unbound partition spec to a schema +pub fn bind(&self, schema: SchemaRef) -> Result { Review Comment: I'm also a little confused about the process. If there's a need, I can try it then. Thanks @liurenjie1024. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] chore: Add cargo build and build guide [iceberg-rust]
liurenjie1024 commented on code in PR #111: URL: https://github.com/apache/iceberg-rust/pull/111#discussion_r1410084539 ## CONTRIBUTING.md: ## @@ -108,6 +108,26 @@ $ cargo version cargo 1.69.0 (6e9a83356 2023-04-12) ``` +## Build + +### Compile + +```shell +make build +``` + +### Lint + +```shell +make check +``` + +### Test + +```shell +make test +``` Review Comment: Instead of using headers, how about using item lists: ``` * To invoke a build: `make build` * To check code styles: `make check` * To run tests: `make test` ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] chore: Add cargo build and build guide [iceberg-rust]
manuzhang commented on code in PR #111: URL: https://github.com/apache/iceberg-rust/pull/111#discussion_r1410085665 ## CONTRIBUTING.md: ## @@ -108,6 +108,26 @@ $ cargo version cargo 1.69.0 (6e9a83356 2023-04-12) ``` +## Build + +### Compile + +```shell +make build +``` + +### Lint + +```shell +make check +``` + +### Test + +```shell +make test +``` Review Comment: I'm trying to follow https://py.iceberg.apache.org/contributing/ but I'm not sure yet what else to add for each command. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: support UnboundPartitionSpec [iceberg-rust]
liurenjie1024 commented on code in PR #106: URL: https://github.com/apache/iceberg-rust/pull/106#discussion_r1410088445 ## crates/iceberg/src/spec/partition.rs: ## @@ -60,6 +60,44 @@ impl PartitionSpec { } } +/// Reference to [`UnboundPartitionSpec`]. +pub type UnboundPartitionSpecRef = Arc; +#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, TypedBuilder)] +#[serde(rename_all = "kebab-case")] +/// Unbound partition field can be built without a schema and later bound to a schema. +pub struct UnboundPartitionField { +/// A source column id from the table’s schema +pub source_id: i32, +/// A partition field id that is used to identify a partition field and is unique within a partition spec. +/// In v2 table metadata, it is unique across all partition specs. +#[builder(default, setter(strip_option))] +pub partition_id: Option, +/// A partition name. +pub name: String, +/// A transform that is applied to the source column to produce a partition value. +pub transform: Transform, +} + +#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Default, Builder)] +#[serde(rename_all = "kebab-case")] +#[builder(setter(prefix = "with"))] +/// Unbound partition spec can be built without a schema and later bound to a schema. +pub struct UnboundPartitionSpec { Review Comment: Could you add a test of ser/de from json for this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] chore: Add cargo build and build guide [iceberg-rust]
liurenjie1024 commented on code in PR #111: URL: https://github.com/apache/iceberg-rust/pull/111#discussion_r1410084539 ## CONTRIBUTING.md: ## @@ -108,6 +108,26 @@ $ cargo version cargo 1.69.0 (6e9a83356 2023-04-12) ``` +## Build + +### Compile + +```shell +make build +``` + +### Lint + +```shell +make check +``` + +### Test + +```shell +make test +``` Review Comment: Instead of using headers, how about using item lists: * To invoke a build: `make build` * To check code styles: `make check` * To run tests: `make test` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] chore: Add cargo build and build guide [iceberg-rust]
liurenjie1024 commented on code in PR #111: URL: https://github.com/apache/iceberg-rust/pull/111#discussion_r1410090267 ## CONTRIBUTING.md: ## @@ -108,6 +108,26 @@ $ cargo version cargo 1.69.0 (6e9a83356 2023-04-12) ``` +## Build + +### Compile + +```shell +make build +``` + +### Lint + +```shell +make check +``` + +### Test + +```shell +make test +``` Review Comment: Currently we don't have that many commands, so I think the list format would be more concise. We can always enrich these when necessary. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] feat: support UnboundPartitionSpec [iceberg-rust]
my-vegetable-has-exploded commented on code in PR #106: URL: https://github.com/apache/iceberg-rust/pull/106#discussion_r1410103369 ## crates/iceberg/src/spec/partition.rs: ## @@ -60,6 +60,44 @@ impl PartitionSpec { } } +/// Reference to [`UnboundPartitionSpec`]. +pub type UnboundPartitionSpecRef = Arc; +#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, TypedBuilder)] +#[serde(rename_all = "kebab-case")] +/// Unbound partition field can be built without a schema and later bound to a schema. +pub struct UnboundPartitionField { +/// A source column id from the table’s schema +pub source_id: i32, +/// A partition field id that is used to identify a partition field and is unique within a partition spec. +/// In v2 table metadata, it is unique across all partition specs. +#[builder(default, setter(strip_option))] +pub partition_id: Option, +/// A partition name. +pub name: String, +/// A transform that is applied to the source column to produce a partition value. +pub transform: Transform, +} + +#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Default, Builder)] +#[serde(rename_all = "kebab-case")] +#[builder(setter(prefix = "with"))] +/// Unbound partition spec can be built without a schema and later bound to a schema. +pub struct UnboundPartitionSpec { Review Comment: Sure! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] API, Spark: Fix aggregation pushdown on struct fields [iceberg]
huaxingao commented on code in PR #9176: URL: https://github.com/apache/iceberg/pull/9176#discussion_r1410118039 ## api/src/main/java/org/apache/iceberg/expressions/ValueAggregate.java: ## @@ -30,13 +30,16 @@ protected ValueAggregate(Operation op, BoundTerm term) { @Override public T eval(StructLike struct) { -return term().eval(struct); +if (struct.size() > 1) { + throw new UnsupportedOperationException("Expected struct like of size 1"); +} + +return (T) struct.get(0, term().type().typeId().javaClass()); } @Override public T eval(DataFile file) { -valueStruct.setValue(evaluateRef(file)); -return term().eval(valueStruct); +return (T) evaluateRef(file); Review Comment: Seems OK to me to just return `evaluateRef(file)`. I'd like to get @rdblue 's opinion on this as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Core: Use avro compression properties from table properties when writing manifests and manifest lists [iceberg]
wypoon commented on PR #6799: URL: https://github.com/apache/iceberg/pull/6799#issuecomment-1833125326 Hmm, I think TestExpireSnapshotsAction > dataFilesCleanupWithParallelTasks might be a flaky test? All Spark 3 tests passed with Java 8 and 11, and even for Java 17, they passed for Scala 2.13, so it seems unlikely that there is a problem just with Scala 2.12 on Java 17. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [I] Failed to create namespace using spark sql based on iceberg hadoop catalog (rest catalog) [iceberg]
TCGOGOGO commented on issue #9072: URL: https://github.com/apache/iceberg/issues/9072#issuecomment-1833130222 Any update for this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
nk1506 commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1410204384 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java: ## @@ -328,6 +346,14 @@ private void setHmsTableParameters( Set obsoleteProps, boolean hiveEngineEnabled, Map summary) { + +setHmsParameters( Review Comment: I am not able to see any impact from the re-ordering. Could you please help me with some examples? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
nk1506 commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1410206174 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java: ## @@ -162,8 +163,11 @@ protected void doRefresh() { Thread.currentThread().interrupt(); throw new RuntimeException("Interrupted during refresh", e); } - -refreshFromMetadataLocation(metadataLocation, metadataRefreshMaxRetries); +if (table != null && table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) { + disableRefresh(); Review Comment: If the corresponding metadata is of view type, `TableMetadataParser` will fail to parse it; the same applies to `ViewMetadataParser`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
nk1506 commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1410206641 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java: ## @@ -250,14 +262,15 @@ protected void doCommit(TableMetadata base, TableMetadata metadata) { + "iceberg.hive.lock-heartbeat-interval-ms.", le); } catch (org.apache.hadoop.hive.metastore.api.AlreadyExistsException e) { -throw new AlreadyExistsException(e, "Table already exists: %s.%s", database, tableName); - +throw new org.apache.iceberg.exceptions.AlreadyExistsException( Review Comment: I have reverted to make it consistent with other places. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
nk1506 commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1410207565 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java: ## @@ -0,0 +1,315 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.hive; + +import java.util.Locale; +import java.util.Map; +import java.util.Objects; +import java.util.Optional; +import java.util.concurrent.atomic.AtomicReference; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.hive.metastore.IMetaStoreClient; +import org.apache.hadoop.hive.metastore.TableType; +import org.apache.hadoop.hive.metastore.api.InvalidObjectException; +import org.apache.hadoop.hive.metastore.api.NoSuchObjectException; +import org.apache.hadoop.hive.metastore.api.Table; +import org.apache.iceberg.BaseMetastoreTableOperations; +import org.apache.iceberg.ClientPool; +import org.apache.iceberg.TableProperties; +import org.apache.iceberg.catalog.TableIdentifier; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.exceptions.CommitFailedException; +import org.apache.iceberg.exceptions.CommitStateUnknownException; +import org.apache.iceberg.exceptions.NoSuchViewException; +import org.apache.iceberg.exceptions.ValidationException; +import org.apache.iceberg.hive.HiveCatalogUtil.CommitStatus; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.util.Tasks; +import org.apache.iceberg.view.BaseViewOperations; +import org.apache.iceberg.view.ViewMetadata; +import org.apache.thrift.TException; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** Hive implementation of Iceberg ViewOperations. */ +final class HiveViewOperations extends BaseViewOperations implements HiveOperationsBase { + private static final Logger LOG = LoggerFactory.getLogger(HiveViewOperations.class); + + private final String fullName; + private final String database; + private final String viewName; + private final FileIO fileIO; + private final ClientPool metaClients; + private final long maxHiveTablePropertySize; + + HiveViewOperations( + Configuration conf, + ClientPool metaClients, + FileIO fileIO, + String catalogName, + TableIdentifier viewIdentifier) { +String dbName = viewIdentifier.namespace().level(0); +this.metaClients = metaClients; +this.fileIO = fileIO; +this.fullName = catalogName + "." + dbName + "." 
+ viewIdentifier.name(); +this.database = dbName; +this.viewName = viewIdentifier.name(); +this.maxHiveTablePropertySize = +conf.getLong(HIVE_TABLE_PROPERTY_MAX_SIZE, HIVE_TABLE_PROPERTY_MAX_SIZE_DEFAULT); + } + + @Override + public void doRefresh() { +String metadataLocation = null; +Table table = null; + +try { + table = metaClients.run(client -> client.getTable(database, viewName)); + HiveOperationsBase.validateTableIsIceberg(table, fullName); + metadataLocation = + table.getParameters().get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP); + +} catch (NoSuchObjectException e) { + if (currentMetadataLocation() != null) { +throw new NoSuchViewException("View does not exist: %s.%s", database, viewName); + } +} catch (TException e) { + String errMsg = + String.format("Failed to get view info from metastore %s.%s", database, viewName); + throw new RuntimeException(errMsg, e); +} catch (InterruptedException e) { + Thread.currentThread().interrupt(); + throw new RuntimeException("Interrupted during refresh", e); +} + +if (table != null && !table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) { + disableRefresh(); Review Comment: if table is not of type view, `ViewMetadataParser` will fail to parse the corresponding json file. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact I
[PR] Core: Suppress exceptions in case of dropTableData [iceberg]
nk1506 opened a new pull request, #9184: URL: https://github.com/apache/iceberg/pull/9184 With [dropTableData](https://github.com/apache/iceberg/blob/d247b20f166ccb0b92443d4b05330b1e0d9c5d49/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L86) we plan to delete orphan files as many as possible. With partial failure a snapshot can be corrupted. Intent here is to ignore the corrupted snapshot and continue for the remaining snapshots and delete the related files -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
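A rough sketch of the intent (not the PR's actual diff): collect each snapshot's manifests independently so that a single unreadable snapshot does not abort cleanup of the remaining files. The class and method names here are illustrative only.

```java
import java.util.Set;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.relocated.com.google.common.collect.Sets;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class DropTableDataSketch {
  private static final Logger LOG = LoggerFactory.getLogger(DropTableDataSketch.class);

  // Gather manifests reachable from all snapshots, skipping snapshots whose
  // metadata can no longer be read instead of failing the whole cleanup.
  static Set<ManifestFile> reachableManifests(TableMetadata metadata, FileIO io) {
    Set<ManifestFile> manifests = Sets.newHashSet();
    for (Snapshot snapshot : metadata.snapshots()) {
      try {
        manifests.addAll(snapshot.allManifests(io));
      } catch (RuntimeException e) {
        LOG.warn("Skipping corrupted snapshot {} during table data cleanup", snapshot.snapshotId(), e);
      }
    }
    return manifests;
  }
}
```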
Re: [PR] Core: Use avro compression properties from table properties when writing manifests and manifest lists [iceberg]
wypoon commented on PR #6799: URL: https://github.com/apache/iceberg/pull/6799#issuecomment-1833208488 @nastra @aokolnychyi I have rebased on main and resolved the conflicts with the `RewriteManifestsSparkAction` refactoring. As I mentioned before, I have introduced `ManifestWriter.Options` and `ManifestListWriter.Options` to be passed to `ManifestFiles.write` and `ManifestLists.write`. These `Options` classes provide defined options (currently just compression codec and level but extensible in future) that may be passed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
Re: [PR] Hive: Add View support for HIVE catalog [iceberg]
nk1506 commented on code in PR #8907: URL: https://github.com/apache/iceberg/pull/8907#discussion_r1410207565 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java: ## @@ -0,0 +1,315 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.iceberg.hive; + +import java.util.Locale; +import java.util.Map; +import java.util.Objects; +import java.util.Optional; +import java.util.concurrent.atomic.AtomicReference; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.hive.metastore.IMetaStoreClient; +import org.apache.hadoop.hive.metastore.TableType; +import org.apache.hadoop.hive.metastore.api.InvalidObjectException; +import org.apache.hadoop.hive.metastore.api.NoSuchObjectException; +import org.apache.hadoop.hive.metastore.api.Table; +import org.apache.iceberg.BaseMetastoreTableOperations; +import org.apache.iceberg.ClientPool; +import org.apache.iceberg.TableProperties; +import org.apache.iceberg.catalog.TableIdentifier; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.exceptions.CommitFailedException; +import org.apache.iceberg.exceptions.CommitStateUnknownException; +import org.apache.iceberg.exceptions.NoSuchViewException; +import org.apache.iceberg.exceptions.ValidationException; +import org.apache.iceberg.hive.HiveCatalogUtil.CommitStatus; +import org.apache.iceberg.io.FileIO; +import org.apache.iceberg.relocated.com.google.common.collect.Maps; +import org.apache.iceberg.util.Tasks; +import org.apache.iceberg.view.BaseViewOperations; +import org.apache.iceberg.view.ViewMetadata; +import org.apache.thrift.TException; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** Hive implementation of Iceberg ViewOperations. */ +final class HiveViewOperations extends BaseViewOperations implements HiveOperationsBase { + private static final Logger LOG = LoggerFactory.getLogger(HiveViewOperations.class); + + private final String fullName; + private final String database; + private final String viewName; + private final FileIO fileIO; + private final ClientPool metaClients; + private final long maxHiveTablePropertySize; + + HiveViewOperations( + Configuration conf, + ClientPool metaClients, + FileIO fileIO, + String catalogName, + TableIdentifier viewIdentifier) { +String dbName = viewIdentifier.namespace().level(0); +this.metaClients = metaClients; +this.fileIO = fileIO; +this.fullName = catalogName + "." + dbName + "." 
+ viewIdentifier.name(); +this.database = dbName; +this.viewName = viewIdentifier.name(); +this.maxHiveTablePropertySize = +conf.getLong(HIVE_TABLE_PROPERTY_MAX_SIZE, HIVE_TABLE_PROPERTY_MAX_SIZE_DEFAULT); + } + + @Override + public void doRefresh() { +String metadataLocation = null; +Table table = null; + +try { + table = metaClients.run(client -> client.getTable(database, viewName)); + HiveOperationsBase.validateTableIsIceberg(table, fullName); + metadataLocation = + table.getParameters().get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP); + +} catch (NoSuchObjectException e) { + if (currentMetadataLocation() != null) { +throw new NoSuchViewException("View does not exist: %s.%s", database, viewName); + } +} catch (TException e) { + String errMsg = + String.format("Failed to get view info from metastore %s.%s", database, viewName); + throw new RuntimeException(errMsg, e); +} catch (InterruptedException e) { + Thread.currentThread().interrupt(); + throw new RuntimeException("Interrupted during refresh", e); +} + +if (table != null && !table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) { + disableRefresh(); Review Comment: if table is not of type view, `ViewMetadataParser` will fail to parse the corresponding json file. Stacktrace without this change: `java.lang.IllegalArgumentException: Cannot parse missing string: view-uuid at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:218) at org.apache.iceberg.util.JsonUtil.get
Re: [PR] feat: replace 'Builder' with 'TypedBuilder' for 'Snapshot' [iceberg-rust]
xiaoyang-sde commented on PR #110: URL: https://github.com/apache/iceberg-rust/pull/110#issuecomment-1833246705 cc @Fokko -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org