Re: [I] Discussion: Replace usage of `Builder` with `TypedBuilder` [iceberg-rust]

2023-11-29 Thread via GitHub


xiaoyang-sde commented on issue #88:
URL: https://github.com/apache/iceberg-rust/issues/88#issuecomment-1831394325

   > Hi, @xiaoyang-sde Welcome to contribute! This seems reasonable to me.
   
   Thanks! Could you please assign this issue to me?
   





Re: [I] Discussion: Replace usage of `Builder` with `TypedBuilder` [iceberg-rust]

2023-11-29 Thread via GitHub


liurenjie1024 commented on issue #88:
URL: https://github.com/apache/iceberg-rust/issues/88#issuecomment-1831404891

   > > Hi, @xiaoyang-sde Welcome to contribute! This seems reasonable to me.
   > 
   > Thanks! Could you please assign this issue to me?
   
   Done!





Re: [PR] Parquet: Add a table property to control the Parquet row-group size of position delete files [iceberg]

2023-11-29 Thread via GitHub


zhongyujiang commented on PR #9177:
URL: https://github.com/apache/iceberg/pull/9177#issuecomment-1831405339

   The [CI failure](https://github.com/apache/iceberg/actions/runs/7029032258/job/19125970118#step:6:214) seems unrelated:
   
   ```
   > Task :iceberg-flink:iceberg-flink-1.17:test
   
   TestIcebergSourceWithWatermarkExtractor > testThrottling FAILED
   java.lang.AssertionError: 
   Expecting
 
   to be completed within 2M.
   ```





Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2023-11-29 Thread via GitHub


github-raphael-douyere commented on issue #8953:
URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1831463021

   @Fokko I don't know how to build a simple and reproducible setup. We hit the issue at a rate of ~10 files per week with an app producing hundreds of files per hour.
   
   @amogh-jahagirdar And yes, I know that the file name is not only the query id. But I think the other elements can definitely repeat (`taskId` and `partitionId`). What I'm not sure about is the `fileCount` part. I think it is kept in memory but resets when the app is restarted (i.e., it is not part of the state). So my point is: with a UUID this can't happen (barring a UUID collision), because any collision in the other parts of the filename is handled by a unique component.
   Another fix could be to keep the `operationId` but add a UUID as well. This would make the file names a little longer, but that is probably fine if it avoids data loss issues. A sketch of this idea follows below.
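
   A minimal sketch of that idea, assuming a hypothetical helper rather than Iceberg's actual `OutputFileFactory` naming code: the existing components are kept for traceability, and a UUID suffix guarantees uniqueness even when `taskId`, `partitionId`, and `fileCount` repeat after a restart.
   
   ```java
   import java.util.UUID;
   
   class UniqueDataFileNames {
     // Illustrative only: the field order and format string are assumptions,
     // not the actual Iceberg file-name layout. The UUID suffix is the fix.
     static String dataFileName(int partitionId, long taskId, String operationId, int fileCount) {
       return String.format(
           "%05d-%d-%s-%05d-%s.parquet",
           partitionId, taskId, operationId, fileCount, UUID.randomUUID());
     }
   }
   ```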





Re: [I] Does iceberg flink streaming read support recover from last ckp? [iceberg]

2023-11-29 Thread via GitHub


qianzhen0 commented on issue #9175:
URL: https://github.com/apache/iceberg/issues/9175#issuecomment-1831508633

   @pvary thanks for the input!
   
   By saying `IcebergSource`, do you mean the Iceberg Flink connector, or the Spark engine?
   
   So, if I run `insert into sink select * from iceberg_table` in streaming mode (with a checkpoint interval configured), it should recover reading from the last checkpoint.





Re: [I] An exception occurred while writing iceberg data through Spark: org. apache. iceberg. exceptions. CommitFailedException: metadata location has changed [iceberg]

2023-11-29 Thread via GitHub


nk1506 commented on issue #9178:
URL: https://github.com/apache/iceberg/issues/9178#issuecomment-1831614906

   This seems like a concurrent commit issue. Is there by any chance you are running these queries in parallel?
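
   If concurrent writers are expected, one common mitigation (an assumption about the setup here, not a confirmed diagnosis) is to raise Iceberg's commit retry count via the standard `commit.retry.num-retries` table property, e.g. from Spark:
   
   ```java
   import org.apache.spark.sql.SparkSession;
   
   class CommitRetryTuning {
     // 'commit.retry.num-retries' is a standard Iceberg table property
     // (default 4); the table name below is a placeholder.
     static void raiseCommitRetries(SparkSession spark) {
       spark.sql("ALTER TABLE catalog.db.tbl SET TBLPROPERTIES ('commit.retry.num-retries' = '10')");
     }
   }
   ```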





Re: [I] DataFile creation by file path seems wrong [iceberg]

2023-11-29 Thread via GitHub


gaborkaszab closed issue #7612: DataFile creation by file path seems wrong
URL: https://github.com/apache/iceberg/issues/7612





Re: [I] DataFile creation by file path seems wrong [iceberg]

2023-11-29 Thread via GitHub


gaborkaszab commented on issue #7612:
URL: https://github.com/apache/iceberg/issues/7612#issuecomment-1831734715

   https://github.com/apache/iceberg/pull/7744 solves this





Re: [I] Spark3 DSv2: Handle Snapshots of Type OVERWRITE - while reading Stream from Iceberg table [iceberg]

2023-11-29 Thread via GitHub


imonteroq commented on issue #2788:
URL: https://github.com/apache/iceberg/issues/2788#issuecomment-1831736292

   Hi @SreeramGarlapati, @tmnd1991, are there any plans to implement this? This is a selling point for Hudi over Iceberg.





Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-29 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1409243044


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
##
@@ -264,6 +265,163 @@ public void renameTable(TableIdentifier from, TableIdentifier originalTo) {
     }
   }
 
+  @Override
+  public boolean viewExists(TableIdentifier identifier) {
+    return HiveCatalogUtil.isTableWithTypeExists(clients, identifier, TableType.VIRTUAL_VIEW);
+  }
+
+  @Override
+  public boolean dropView(TableIdentifier identifier) {

Review Comment:
   I am asking because for `dropTable` we have a `purge` flag, which determines whether the metadata is removed or not...
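
   For reference, a sketch of the existing table-side API being contrasted here (`Catalog.dropTable` does take a purge flag):
   
   ```java
   import org.apache.iceberg.catalog.Catalog;
   import org.apache.iceberg.catalog.TableIdentifier;
   
   class DropTableExample {
     // purge=true also deletes the table's data and metadata files;
     // the open question above is what the analogous dropView should do.
     static void drop(Catalog catalog) {
       catalog.dropTable(TableIdentifier.of("db", "tbl"), true /* purge */);
     }
   }
   ```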






Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-29 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1409251564


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java:
##
@@ -162,8 +163,11 @@ protected void doRefresh() {
       Thread.currentThread().interrupt();
       throw new RuntimeException("Interrupted during refresh", e);
     }
-
-    refreshFromMetadataLocation(metadataLocation, metadataRefreshMaxRetries);
+    if (table != null && table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) {
+      disableRefresh();

Review Comment:
   I am curious, do you know why we do not refresh?






Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-29 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1409253203


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java:
##
@@ -250,14 +262,15 @@ protected void doCommit(TableMetadata base, TableMetadata metadata) {
                 + "iceberg.hive.lock-heartbeat-interval-ms.",
             le);
       } catch (org.apache.hadoop.hive.metastore.api.AlreadyExistsException e) {
-        throw new AlreadyExistsException(e, "Table already exists: %s.%s", database, tableName);
-
+        throw new org.apache.iceberg.exceptions.AlreadyExistsException(

Review Comment:
   Question: Did you intentionally use the fully qualified name here?






Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-29 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1409256826


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java:
##
@@ -181,4 +186,30 @@ default Table newHmsTable(String hmsTableOwner) {
 
     return newTable;
   }
+
+  default void setHmsParameters(
+      Table tbl,
+      String newMetadataLocation,
+      Schema schema,
+      String uuid,
+      Supplier<String> previousLocationSupplier) {
+    Map<String, String> parameters =
+        Optional.ofNullable(tbl.getParameters()).orElseGet(Maps::newHashMap);
+
+    if (uuid != null) {
+      parameters.put(TableProperties.UUID, uuid);
+    }

Review Comment:
   nit: newline



##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java:
##
@@ -181,4 +186,30 @@ default Table newHmsTable(String hmsTableOwner) {
 
     return newTable;
   }
+
+  default void setHmsParameters(
+      Table tbl,
+      String newMetadataLocation,
+      Schema schema,
+      String uuid,
+      Supplier<String> previousLocationSupplier) {
+    Map<String, String> parameters =
+        Optional.ofNullable(tbl.getParameters()).orElseGet(Maps::newHashMap);
+
+    if (uuid != null) {
+      parameters.put(TableProperties.UUID, uuid);
+    }
+    parameters.put(BaseMetastoreTableOperations.METADATA_LOCATION_PROP, newMetadataLocation);
+    parameters.put(
+        BaseMetastoreTableOperations.TABLE_TYPE_PROP,
+        BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.toUpperCase(Locale.ENGLISH));
+
+    if (previousLocationSupplier.get() != null && !previousLocationSupplier.get().isEmpty()) {
+      parameters.put(
+          BaseMetastoreTableOperations.PREVIOUS_METADATA_LOCATION_PROP,
+          previousLocationSupplier.get());
+    }

Review Comment:
   nit: newline






Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-29 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1409258375


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java:
##
@@ -328,6 +346,14 @@ private void setHmsTableParameters(
       Set<String> obsoleteProps,
       boolean hiveEngineEnabled,
       Map<String, String> summary) {
+
+    setHmsParameters(

Review Comment:
   Wouldn't the different ordering of this and the metadata translation cause 
changes?






Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-29 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1409261131


##
.palantir/revapi.yml:
##
@@ -877,6 +877,27 @@ acceptedBreaks:
     - code: "java.field.serialVersionUIDChanged"
       new: "field org.apache.iceberg.util.SerializableMap.serialVersionUID"
       justification: "Serialization is not be used"
+    - code: "java.method.nowStatic"
+      old: "method org.apache.iceberg.BaseMetastoreTableOperations.CommitStatus org.apache.iceberg.BaseMetastoreTableOperations::checkCommitStatus(java.lang.String,\
+        \ org.apache.iceberg.TableMetadata)"
+      new: "method org.apache.iceberg.BaseMetastoreTableOperations.CommitStatus org.apache.iceberg.BaseMetastoreTableOperations::checkCommitStatus(java.lang.String,\
+        \ java.lang.String, java.util.Map, java.util.function.Function)"
+      justification: "Increased visibility and refactor methods for view operation"
+    - code: "java.method.numberOfParametersChanged"
+      old: "method org.apache.iceberg.BaseMetastoreTableOperations.CommitStatus org.apache.iceberg.BaseMetastoreTableOperations::checkCommitStatus(java.lang.String,\
+        \ org.apache.iceberg.TableMetadata)"
+      new: "method org.apache.iceberg.BaseMetastoreTableOperations.CommitStatus org.apache.iceberg.BaseMetastoreTableOperations::checkCommitStatus(java.lang.String,\
+        \ java.lang.String, java.util.Map, java.util.function.Function)"
+      justification: "Increased visibility and refactor methods for view operation"
+    - code: "java.method.visibilityIncreased"

Review Comment:
   This is probably acceptable






Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-29 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1409262085


##
.palantir/revapi.yml:
##
@@ -877,6 +877,27 @@ acceptedBreaks:
     - code: "java.field.serialVersionUIDChanged"
      new: "field org.apache.iceberg.util.SerializableMap.serialVersionUID"
      justification: "Serialization is not be used"
+    - code: "java.method.nowStatic"
+      old: "method org.apache.iceberg.BaseMetastoreTableOperations.CommitStatus org.apache.iceberg.BaseMetastoreTableOperations::checkCommitStatus(java.lang.String,\
+        \ org.apache.iceberg.TableMetadata)"
+      new: "method org.apache.iceberg.BaseMetastoreTableOperations.CommitStatus org.apache.iceberg.BaseMetastoreTableOperations::checkCommitStatus(java.lang.String,\
+        \ java.lang.String, java.util.Map, java.util.function.Function)"
+      justification: "Increased visibility and refactor methods for view operation"
+    - code: "java.method.numberOfParametersChanged"

Review Comment:
   I am not sure about this. Do we need a deprecation and a default method for this to be backward compatible?






Re: [I] Does iceberg flink streaming read support recover from last ckp? [iceberg]

2023-11-29 Thread via GitHub


pvary commented on issue #9175:
URL: https://github.com/apache/iceberg/issues/9175#issuecomment-1831887127

   `IcebergSource` uses the new Flink Source API - this is the newer one.
   `FlinkSource` uses the old Flink SourceFunction API - this will be removed (I plan to deprecate it in the next Iceberg release).
   
   When using the Table API, the `table.exec.iceberg.use-flip27-source` config option can be used to choose between the two versions. See: https://iceberg.apache.org/docs/1.3.1/flink-queries/#flip-27-source-for-sql
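
   For illustration, a minimal sketch of toggling that option from the Table API (catalog and table names are placeholders; the config key is the one documented above):
   
   ```java
   import org.apache.flink.table.api.TableEnvironment;
   
   class Flip27SourceToggle {
     static void useFlip27(TableEnvironment tEnv) {
       // "true" selects the new FLIP-27 IcebergSource; "false" keeps the old FlinkSource.
       tEnv.getConfig().getConfiguration().setString("table.exec.iceberg.use-flip27-source", "true");
       tEnv.executeSql("SELECT * FROM my_catalog.db.iceberg_table /*+ OPTIONS('streaming'='true') */");
     }
   }
   ```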





[I] Table scan using functional filters [iceberg-python]

2023-11-29 Thread via GitHub


bigluck opened a new issue, #170:
URL: https://github.com/apache/iceberg-python/issues/170

   ### Feature Request / Improvement
   
   Ciao @Fokko
   It seems like `table.scan()` supports a limited set of `filter` conditions, and it fails when a user specifies a complex one.
   
   In my case, I have this input query:
   
   ```sql
   SELECT * FROM wind_energy_sensor_data where observed_at::date = '2020-03-03'
   ```
   
   Once mapped into a list of fields and filters using sqlglot, I get:
   
   ```
   {
       "filter": "CAST(observed_at AS DATE) = '2020-03-03'",
       "name": "wind_energy_sensor_data",
       "projections": [
           "*",
           "observed_at"
       ]
   }
   ```
   
   But then, when I pass the filter to the `table.scan()` function, it raises this `ParseException`:
   
   ```python
   scan = table.scan(
          ^^^
     File "/pip/pyiceberg/pyiceberg/table/__init__.py", line 473, in scan
       return DataScan(
              ^
     File "/pip/pyiceberg/pyiceberg/table/__init__.py", line 773, in __init__
       super().__init__(table, row_filter, selected_fields, case_sensitive, snapshot_id, options, limit)
     File "/pip/pyiceberg/pyiceberg/table/__init__.py", line 629, in __init__
       self.row_filter = _parse_row_filter(row_filter)
                         ^
     File "/pip/pyiceberg/pyiceberg/table/__init__.py", line 603, in _parse_row_filter
       return parser.parse(expr) if isinstance(expr, str) else expr
              ^^
     File "/pip/pyiceberg/pyiceberg/expressions/parser.py", line 270, in parse
       return boolean_expression.parse_string(expr, parse_all=True)[0]
              ^
     File "/pip/pyparsing/pyparsing/core.py", line 1197, in parse_string
       raise exc.with_traceback(None)
   pyparsing.exceptions.ParseException: Expected expr, found '('  (at char 4), (line:1, col:5)
   ```
   
   How hard would it be to extend support for functional filters (such as CAST() in my case)? And more importantly, is this something that makes sense to have, or does pyiceberg (and Iceberg in general) expect these kinds of conditions in a different format?
   
   thanks,
   Luca





Re: [PR] Build: Bump cryptography from 41.0.4 to 41.0.6 [iceberg-python]

2023-11-29 Thread via GitHub


Fokko merged PR #169:
URL: https://github.com/apache/iceberg-python/pull/169





Re: [PR] Update table metadata [iceberg-python]

2023-11-29 Thread via GitHub


Fokko commented on code in PR #139:
URL: https://github.com/apache/iceberg-python/pull/139#discussion_r1409338251


##
pyiceberg/table/__init__.py:
##
@@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate):
     removals: List[str]
 
 
+class TableMetadataUpdateContext:
+    updates: List[TableUpdate]
+    last_added_schema_id: Optional[int]
+
+    def __init__(self) -> None:
+        self.updates = []
+        self.last_added_schema_id = None
+
+    def is_added_snapshot(self, snapshot_id: int) -> bool:
+        return any(
+            update.snapshot.snapshot_id == snapshot_id
+            for update in self.updates
+            if update.action == TableUpdateAction.add_snapshot
+        )
+
+    def is_added_schema(self, schema_id: int) -> bool:
+        return any(
+            update.schema_.schema_id == schema_id for update in self.updates if update.action == TableUpdateAction.add_schema
+        )
+
+
+@singledispatch
+def apply_table_update(update: TableUpdate, base_metadata: TableMetadata, context: TableMetadataUpdateContext) -> TableMetadata:
+    """Apply a table update to the table metadata.
+
+    Args:
+        update: The update to be applied.
+        base_metadata: The base metadata to be updated.
+        context: Contains previous updates, last_added_snapshot_id and other change tracking information in the current transaction.
+
+    Returns:
+        The updated metadata.
+
+    """
+    raise NotImplementedError(f"Unsupported table update: {update}")
+
+
+@apply_table_update.register(UpgradeFormatVersionUpdate)
+def _(update: UpgradeFormatVersionUpdate, base_metadata: TableMetadata, context: TableMetadataUpdateContext) -> TableMetadata:
+    if update.format_version > SUPPORTED_TABLE_FORMAT_VERSION:
+        raise ValueError(f"Unsupported table format version: {update.format_version}")
+
+    if update.format_version < base_metadata.format_version:
+        raise ValueError(f"Cannot downgrade v{base_metadata.format_version} table to v{update.format_version}")
+
+    if update.format_version == base_metadata.format_version:
+        return base_metadata
+
+    updated_metadata_data = copy(base_metadata.model_dump())
+    updated_metadata_data["format-version"] = update.format_version

Review Comment:
   In that case, we can be cautious and set `deep=true`. I would love to see 
some tests that validate the behavior. Those should be easy to add.






Re: [PR] Update table metadata [iceberg-python]

2023-11-29 Thread via GitHub


Fokko commented on code in PR #139:
URL: https://github.com/apache/iceberg-python/pull/139#discussion_r1409336199


##
pyiceberg/table/__init__.py:
##
@@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate):
     removals: List[str]
 
 
+class TableMetadataUpdateContext:
+    updates: List[TableUpdate]
+    last_added_schema_id: Optional[int]

Review Comment:
   I prefer to keep them separate. I think we might need some additional checks here, such as: what happens if you add a column and then drop the column again? It will first create a new schema with a new ID, and then it will reuse the old schema again.
   
   - 1: Schema(a: int), current_schema_id=1
   Add column b:
   - 1: Schema(a: int), 2: Schema(a: int, b: int), current_schema_id=2
   Drop column b:
   - 1: Schema(a: int), 2: Schema(a: int, b: int), current_schema_id=1






Re: [PR] Update table metadata [iceberg-python]

2023-11-29 Thread via GitHub


Fokko commented on code in PR #139:
URL: https://github.com/apache/iceberg-python/pull/139#discussion_r1409362760


##
pyiceberg/table/__init__.py:
##
@@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate):
     removals: List[str]
 
 
+class TableMetadataUpdateContext:
+    updates: List[TableUpdate]
+    last_added_schema_id: Optional[int]
+
+    def __init__(self) -> None:
+        self.updates = []
+        self.last_added_schema_id = None
+
+    def is_added_snapshot(self, snapshot_id: int) -> bool:
+        return any(
+            update.snapshot.snapshot_id == snapshot_id
+            for update in self.updates
+            if update.action == TableUpdateAction.add_snapshot
+        )
+
+    def is_added_schema(self, schema_id: int) -> bool:
+        return any(
+            update.schema_.schema_id == schema_id for update in self.updates if update.action == TableUpdateAction.add_schema
+        )
+
+
+@singledispatch
+def apply_table_update(update: TableUpdate, base_metadata: TableMetadata, context: TableMetadataUpdateContext) -> TableMetadata:
+    """Apply a table update to the table metadata.
+
+    Args:
+        update: The update to be applied.
+        base_metadata: The base metadata to be updated.
+        context: Contains previous updates, last_added_snapshot_id and other change tracking information in the current transaction.
+
+    Returns:
+        The updated metadata.
+
+    """
+    raise NotImplementedError(f"Unsupported table update: {update}")
+
+
+@apply_table_update.register(UpgradeFormatVersionUpdate)
+def _(update: UpgradeFormatVersionUpdate, base_metadata: TableMetadata, context: TableMetadataUpdateContext) -> TableMetadata:
+    if update.format_version > SUPPORTED_TABLE_FORMAT_VERSION:
+        raise ValueError(f"Unsupported table format version: {update.format_version}")
+
+    if update.format_version < base_metadata.format_version:
+        raise ValueError(f"Cannot downgrade v{base_metadata.format_version} table to v{update.format_version}")
+
+    if update.format_version == base_metadata.format_version:
+        return base_metadata
+
+    updated_metadata_data = copy(base_metadata.model_dump())
+    updated_metadata_data["format-version"] = update.format_version
+
+    context.updates.append(update)
+    return TableMetadataUtil.parse_obj(updated_metadata_data)
+
+
+@apply_table_update.register(AddSchemaUpdate)
+def _(update: AddSchemaUpdate, base_metadata: TableMetadata, context: TableMetadataUpdateContext) -> TableMetadata:
+    def reuse_or_create_new_schema_id(new_schema: Schema) -> Tuple[int, bool]:
+        """Reuse schema id if schema already exists, otherwise create a new one.
+
+        Args:
+            new_schema: The new schema to be added.
+
+        Returns:
+            The new schema id and whether the schema already exists.
+        """
+        result_schema_id = base_metadata.current_schema_id
+        for schema in base_metadata.schemas:
+            if schema == new_schema:
+                return schema.schema_id, True
+            elif schema.schema_id >= result_schema_id:
+                result_schema_id = schema.schema_id + 1
+        return result_schema_id, False
+
+    if update.last_column_id < base_metadata.last_column_id:
+        raise ValueError(f"Invalid last column id {update.last_column_id}, must be >= {base_metadata.last_column_id}")
+
+    new_schema_id, schema_found = reuse_or_create_new_schema_id(update.schema_)
+    if schema_found and update.last_column_id == base_metadata.last_column_id:
+        if context.last_added_schema_id is not None and context.is_added_schema(new_schema_id):
+            context.last_added_schema_id = new_schema_id
+        return base_metadata
+
+    updated_metadata_data = copy(base_metadata.model_dump())
+    updated_metadata_data["last-column-id"] = update.last_column_id
+
+    new_schema = (
+        update.schema_
+        if new_schema_id == update.schema_.schema_id
+        else Schema(*update.schema_.fields, schema_id=new_schema_id, identifier_field_ids=update.schema_.identifier_field_ids)
+    )
+
+    if not schema_found:
+        updated_metadata_data["schemas"].append(new_schema.model_dump())
+
+    context.updates.append(update)
+    context.last_added_schema_id = new_schema_id
+    return TableMetadataUtil.parse_obj(updated_metadata_data)
+
+
+@apply_table_update.register(SetCurrentSchemaUpdate)
+def _(update: SetCurrentSchemaUpdate, base_metadata: TableMetadata, context: TableMetadataUpdateContext) -> TableMetadata:
+    if update.schema_id == -1:
+        if context.last_added_schema_id is None:
+            raise ValueError("Cannot set current schema to last added schema when no schema has been added")
+        return apply_table_update(SetCurrentSchemaUpdate(schema_id=context.last_added_schema_id), base_metadata, context)
+
+    if update.schema_id == base_metadata.current_schema_id:
+        return base_metadata
+
+    schema = base_m

Re: [PR] Update table metadata [iceberg-python]

2023-11-29 Thread via GitHub


Fokko commented on code in PR #139:
URL: https://github.com/apache/iceberg-python/pull/139#discussion_r1409365694


##
pyiceberg/table/__init__.py:
##
@@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate):
     removals: List[str]
 
 
+class TableMetadataUpdateContext:
+    updates: List[TableUpdate]
+    last_added_schema_id: Optional[int]
+
+    def __init__(self) -> None:
+        self.updates = []
+        self.last_added_schema_id = None
+
+    def is_added_snapshot(self, snapshot_id: int) -> bool:
+        return any(
+            update.snapshot.snapshot_id == snapshot_id
+            for update in self.updates
+            if update.action == TableUpdateAction.add_snapshot
+        )
+
+    def is_added_schema(self, schema_id: int) -> bool:
+        return any(
+            update.schema_.schema_id == schema_id for update in self.updates if update.action == TableUpdateAction.add_schema
+        )
+
+
+@singledispatch
+def apply_table_update(update: TableUpdate, base_metadata: TableMetadata, context: TableMetadataUpdateContext) -> TableMetadata:
+    """Apply a table update to the table metadata.
+
+    Args:
+        update: The update to be applied.
+        base_metadata: The base metadata to be updated.
+        context: Contains previous updates, last_added_snapshot_id and other change tracking information in the current transaction.
+
+    Returns:
+        The updated metadata.
+
+    """
+    raise NotImplementedError(f"Unsupported table update: {update}")
+
+
+@apply_table_update.register(UpgradeFormatVersionUpdate)
+def _(update: UpgradeFormatVersionUpdate, base_metadata: TableMetadata, context: TableMetadataUpdateContext) -> TableMetadata:
+    if update.format_version > SUPPORTED_TABLE_FORMAT_VERSION:
+        raise ValueError(f"Unsupported table format version: {update.format_version}")
+
+    if update.format_version < base_metadata.format_version:
+        raise ValueError(f"Cannot downgrade v{base_metadata.format_version} table to v{update.format_version}")
+
+    if update.format_version == base_metadata.format_version:
+        return base_metadata
+
+    updated_metadata_data = copy(base_metadata.model_dump())
+    updated_metadata_data["format-version"] = update.format_version
+
+    context.updates.append(update)
+    return TableMetadataUtil.parse_obj(updated_metadata_data)
+
+
+@apply_table_update.register(AddSchemaUpdate)
+def _(update: AddSchemaUpdate, base_metadata: TableMetadata, context: TableMetadataUpdateContext) -> TableMetadata:
+    def reuse_or_create_new_schema_id(new_schema: Schema) -> Tuple[int, bool]:
+        """Reuse schema id if schema already exists, otherwise create a new one.
+
+        Args:
+            new_schema: The new schema to be added.
+
+        Returns:
+            The new schema id and whether the schema already exists.
+        """
+        result_schema_id = base_metadata.current_schema_id
+        for schema in base_metadata.schemas:
+            if schema == new_schema:
+                return schema.schema_id, True
+            elif schema.schema_id >= result_schema_id:
+                result_schema_id = schema.schema_id + 1
+        return result_schema_id, False
+
+    if update.last_column_id < base_metadata.last_column_id:
+        raise ValueError(f"Invalid last column id {update.last_column_id}, must be >= {base_metadata.last_column_id}")
+
+    new_schema_id, schema_found = reuse_or_create_new_schema_id(update.schema_)
+    if schema_found and update.last_column_id == base_metadata.last_column_id:
+        if context.last_added_schema_id is not None and context.is_added_schema(new_schema_id):
+            context.last_added_schema_id = new_schema_id
+        return base_metadata
+
+    updated_metadata_data = copy(base_metadata.model_dump())
+    updated_metadata_data["last-column-id"] = update.last_column_id
+
+    new_schema = (
+        update.schema_
+        if new_schema_id == update.schema_.schema_id
+        else Schema(*update.schema_.fields, schema_id=new_schema_id, identifier_field_ids=update.schema_.identifier_field_ids)
+    )

Review Comment:
   > Thanks for the explanation! Just to confirm my understanding, in 
pyiceberg, one transaction should only have one AddSchemaUpdate because all 
updates related to the schema should be accumulated via UpdateSchema.
   
   This is correct. Currently, the `Update` has to be unique, so stacking two 
updates will even raise a `ValueError`.
   
   > Hence, we can trust the schema_id in the update and do not need the logic 
to handle multiple schema added in one transaction.
   
   This is correct 👍 



##
pyiceberg/table/__init__.py:
##
@@ -350,6 +357,241 @@ class RemovePropertiesUpdate(TableUpdate):
     removals: List[str]
 
 
+class TableMetadataUpdateContext:
+    updates: List[TableUpdate]
+    last_added_schema_id: Optional[int]
+
+    def __init__(self) -> None:
+        self.updates = []
+        self.last_added_schema

Re: [I] Parquet file overwritten by spark streaming job in subsequent execution with same spark streaming checkpoint location [iceberg]

2023-11-29 Thread via GitHub


amitmittal5 commented on issue #9172:
URL: https://github.com/apache/iceberg/issues/9172#issuecomment-1832293641

   Just adding more details on the table the job is writing to: I tried with both an external and a managed table, but I see similar results. With the managed table, it took around 10 executions of the streaming job (with the AvailableNow trigger) to reproduce the error.
   
   ```
   CREATE EXTERNAL TABLE IF NOT EXISTS default.blob_iceberg 
   (id string, state string, name string) 
   USING ICEBERG 
   LOCATION 'abfss://<container>@<storage-account>.dfs.core.windows.net/test/blob_iceberg'
   ```
   
   And the managed table:
   ```
   CREATE TABLE IF NOT EXISTS default.blob_iceberg1 
   (id string, state string, name string) 
   USING ICEBERG 
   ```





Re: [PR] Build: Bump org.apache.httpcomponents.client5:httpclient5 from 5.2.1 to 5.2.2 [iceberg]

2023-11-29 Thread via GitHub


dependabot[bot] commented on PR #9157:
URL: https://github.com/apache/iceberg/pull/9157#issuecomment-1832326624

   Sorry, only users with push access can use that command.





Re: [PR] Build: Bump org.apache.httpcomponents.client5:httpclient5 from 5.2.1 to 5.2.2 [iceberg]

2023-11-29 Thread via GitHub


XN137 commented on PR #9157:
URL: https://github.com/apache/iceberg/pull/9157#issuecomment-1832326554

   @dependabot rebase





Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]

2023-11-29 Thread via GitHub


RussellSpitzer commented on code in PR #8755:
URL: https://github.com/apache/iceberg/pull/8755#discussion_r1409617517


##
api/src/main/java/org/apache/iceberg/types/TypeUtil.java:
##
@@ -452,6 +454,59 @@ private static void checkSchemaCompatibility(
 }
   }
 
+  public static long defaultSize(Types.NestedField field) {

Review Comment:
   Why do we need to cache the extra columns? We wouldn't be using them?






Re: [PR] Add Description on Using a Separate Authorization Server [iceberg]

2023-11-29 Thread via GitHub


mrcnc commented on code in PR #8998:
URL: https://github.com/apache/iceberg/pull/8998#discussion_r1409630922


##
open-api/rest-catalog-open-api.yaml:
##
@@ -2872,6 +2872,10 @@ components:
 For unauthorized requests, services should return an appropriate 401 or
 403 response. Implementations must not return altered success (200)
 responses when a request is unauthenticated or unauthorized.
+
+If a separate authorization server is used, subtitute the tokenUrl with 

Review Comment:
   ```suggestion
   If a separate authorization server is used, substitute the tokenUrl with 
   ```






Re: [PR] Flink: Document watermark generation feature [iceberg]

2023-11-29 Thread via GitHub


stevenzwu commented on code in PR #9179:
URL: https://github.com/apache/iceberg/pull/9179#discussion_r1409630306


##
docs/flink-queries.md:
##
@@ -277,6 +277,58 @@ DataStream stream = env.fromSource(source, 
WatermarkStrategy.noWatermarks()
 "Iceberg Source as Avro GenericRecord", new 
GenericRecordAvroTypeInfo(avroSchema));
 ```
 
+### Emitting watermarks
+Emitting watermarks from the source itself could be beneficial for several 
purposes, like harnessing the
+[Flink Watermark 
Alignment](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment)
+feature to prevent runaway readers, or providing triggers for [Flink 
windowing](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/operators/windows/).

Review Comment:
   I don't think we should call out `windows` specifically as a benefit of emitting watermarks from the source itself. Any event-time watermark strategy will provide window triggers.



##
docs/flink-queries.md:
##
@@ -277,6 +277,58 @@ DataStream stream = env.fromSource(source, 
WatermarkStrategy.noWatermarks()
 "Iceberg Source as Avro GenericRecord", new 
GenericRecordAvroTypeInfo(avroSchema));
 ```
 
+### Emitting watermarks
+Emitting watermarks from the source itself could be beneficial for several 
purposes, like harnessing the
+[Flink Watermark 
Alignment](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment)
+feature to prevent runaway readers, or providing triggers for [Flink 
windowing](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/operators/windows/).
+
+Enable watermark generation for an `IcebergSource` by setting the 
`watermarkColumn`.
+The supported column types are `timestamp`, `timestamptz` and `long`.
+Timestamp columns are automatically converted to milliseconds since the Java 
epoch of
+1970-01-01T00:00:00Z. Use `watermarkTimeUnit` to configure the conversion for 
long columns.
+
+The watermarks are generated based on column metrics stored for data files and 
emitted once per split.
+When using watermarks for Flink watermark alignment set 
`read.split.open-file-cost` to prevent
+combining multiple files to a single split.
+By default, the column metrics are collected for the first 100 columns of the 
table. Use [write properties](configuration.md#write-properties) starting with 
`write.metadata.metrics` when needed.

Review Comment:
   break off this line to a new paragraph.



##
docs/flink-queries.md:
##
@@ -277,6 +277,58 @@ DataStream stream = env.fromSource(source, 
WatermarkStrategy.noWatermarks()
 "Iceberg Source as Avro GenericRecord", new 
GenericRecordAvroTypeInfo(avroSchema));
 ```
 
+### Emitting watermarks
+Emitting watermarks from the source itself could be beneficial for several 
purposes, like harnessing the
+[Flink Watermark 
Alignment](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment)
+feature to prevent runaway readers, or providing triggers for [Flink 
windowing](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/operators/windows/).
+
+Enable watermark generation for an `IcebergSource` by setting the 
`watermarkColumn`.
+The supported column types are `timestamp`, `timestamptz` and `long`.
+Timestamp columns are automatically converted to milliseconds since the Java 
epoch of
+1970-01-01T00:00:00Z. Use `watermarkTimeUnit` to configure the conversion for 
long columns.

Review Comment:
   > Use `watermarkTimeUnit` to configure the conversion for long columns.
   
   Before this sentence, maybe elaborate a little more. 
   ```
   Iceberg `timestamp` or `timestamptz` inherently contains the time precision. So there is no need to specify the time unit. But `long` type column doesn't contain time unit information. Use `watermarkTimeUnit` to configure the conversion for long columns.
   ```



##
docs/flink-queries.md:
##
@@ -277,6 +277,58 @@ DataStream stream = env.fromSource(source, 
WatermarkStrategy.noWatermarks()
 "Iceberg Source as Avro GenericRecord", new 
GenericRecordAvroTypeInfo(avroSchema));
 ```
 
+### Emitting watermarks
+Emitting watermarks from the source itself could be beneficial for several 
purposes, like harnessing the
+[Flink Watermark 
Alignment](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment)
+feature to prevent runaway readers, or providing triggers for [Flink 
windowing](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/operators/windows/).
+
+Enable watermark generation for an `IcebergSource` by setting the 
`watermarkColumn`.
+The supported column types are `timestamp`, `timestamptz` and `long`.
+Timestamp columns a

Re: [PR] Flink: Document watermark generation feature [iceberg]

2023-11-29 Thread via GitHub


stevenzwu commented on PR #9179:
URL: https://github.com/apache/iceberg/pull/9179#issuecomment-1832386184

   @dchristle can you also help review this doc PR? Your perspective can help improve the readability of the doc.





Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]

2023-11-29 Thread via GitHub


RussellSpitzer commented on code in PR #8755:
URL: https://github.com/apache/iceberg/pull/8755#discussion_r1409668358


##
core/src/main/java/org/apache/iceberg/SystemConfigs.java:
##
@@ -43,14 +43,14 @@ private SystemConfigs() {}
           Integer::parseUnsignedInt);
 
   /**
-   * Sets the size of the delete worker pool. This limits the number of threads used to compute the
-   * PositionDeleteIndex from the position deletes for a data file.
+   * Sets the size of the delete worker pool. This limits the number of threads used to read delete
+   * files for a data file.
    */
   public static final ConfigEntry<Integer> DELETE_WORKER_THREAD_POOL_SIZE =
       new ConfigEntry<>(
           "iceberg.worker.delete-num-threads",
           "ICEBERG_WORKER_DELETE_NUM_THREADS",
-          Math.max(2, Runtime.getRuntime().availableProcessors()),
+          Math.max(2, 4 * Runtime.getRuntime().availableProcessors()),

Review Comment:
   Not a huge deal, but we are dodging the revapi check here by inlining the multiplier. We should probably move the 4 into a field so that future modifications trigger the revapi checker; see the sketch below.
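
   Concretely, something like this (the field name is illustrative):
   
   ```java
   // Naming the multiplier makes future changes to it visible to revapi.
   private static final int DELETE_WORKER_POOL_SIZE_MULTIPLIER = 4;
   
   public static final ConfigEntry<Integer> DELETE_WORKER_THREAD_POOL_SIZE =
       new ConfigEntry<>(
           "iceberg.worker.delete-num-threads",
           "ICEBERG_WORKER_DELETE_NUM_THREADS",
           Math.max(2, DELETE_WORKER_POOL_SIZE_MULTIPLIER * Runtime.getRuntime().availableProcessors()),
           Integer::parseUnsignedInt);
   ```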






Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]

2023-11-29 Thread via GitHub


RussellSpitzer commented on code in PR #8755:
URL: https://github.com/apache/iceberg/pull/8755#discussion_r1409670268


##
core/src/main/java/org/apache/iceberg/deletes/BitmapPositionDeleteIndex.java:
##
@@ -27,6 +27,15 @@ class BitmapPositionDeleteIndex implements PositionDeleteIndex {
     roaring64Bitmap = new Roaring64Bitmap();
   }
 
+  void merge(PositionDeleteIndex other) {

Review Comment:
   Why not allow only BitmapPositionDeleteIndex here? Do we not have the concrete type when we call merge?






Re: [PR] Flink: Document watermark generation feature [iceberg]

2023-11-29 Thread via GitHub


mas-chen commented on code in PR #9179:
URL: https://github.com/apache/iceberg/pull/9179#discussion_r1409704342


##
docs/flink-queries.md:
##
@@ -277,6 +277,58 @@ DataStream stream = env.fromSource(source, 
WatermarkStrategy.noWatermarks()
 "Iceberg Source as Avro GenericRecord", new 
GenericRecordAvroTypeInfo(avroSchema));
 ```
 
+### Emitting watermarks
+Emitting watermarks from the source itself could be beneficial for several 
purposes, like harnessing the
+[Flink Watermark 
Alignment](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment)

Review Comment:
   Should we use the stable documentation to avoid stale links? Ditto on the 
other references to Flink docs
   ```suggestion
   [Flink Watermark Alignment](https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment)
   ```



##
docs/flink-queries.md:
##
@@ -277,6 +277,58 @@ DataStream stream = env.fromSource(source, 
WatermarkStrategy.noWatermarks()
 "Iceberg Source as Avro GenericRecord", new 
GenericRecordAvroTypeInfo(avroSchema));
 ```
 
+### Emitting watermarks
+Emitting watermarks from the source itself could be beneficial for several 
purposes, like harnessing the
+[Flink Watermark 
Alignment](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment)
+feature to prevent runaway readers, or providing triggers for [Flink 
windowing](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/operators/windows/).
+
+Enable watermark generation for an `IcebergSource` by setting the 
`watermarkColumn`.
+The supported column types are `timestamp`, `timestamptz` and `long`.
+Timestamp columns are automatically converted to milliseconds since the Java 
epoch of
+1970-01-01T00:00:00Z. Use `watermarkTimeUnit` to configure the conversion for 
long columns.
+
+The watermarks are generated based on column metrics stored for data files and 
emitted once per split.
+When using watermarks for Flink watermark alignment set 
`read.split.open-file-cost` to prevent
+combining multiple files to a single split.
+By default, the column metrics are collected for the first 100 columns of the 
table. Use [write properties](configuration.md#write-properties) starting with 
`write.metadata.metrics` when needed.
+
+```java
+StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
+TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path");
+
+// For windowing
+DataStream<RowData> stream =
+    env.fromSource(
+        IcebergSource.forRowData()
+            .tableLoader(tableLoader)
+            // Watermark using timestamp column
+            .watermarkColumn("timestamp_column")
+            .build(),
+        // Watermarks are generated by the source, no need to generate it manually
+        WatermarkStrategy.noWatermarks()
+            // Extract event timestamp from records
+            .withTimestampAssigner((record, eventTime) -> record.getTimestamp(pos, precision).getMillisecond()),
+        SOURCE_NAME,
+        TypeInformation.of(RowData.class));
+
+// For watermark alignment
+DataStream<RowData> stream =
+    env.fromSource(
+        IcebergSource source = IcebergSource.forRowData()
+            .tableLoader(tableLoader)
+            // Disable combining multiple files to a single split 
+            .set(FlinkReadOptions.SPLIT_FILE_OPEN_COST, String.valueOf(TableProperties.SPLIT_SIZE_DEFAULT))
+            // Watermark using long column
+            .watermarkColumn("long_column")
+            .watermarkTimeUnit(TimeUnit.MILLI_SCALE)

Review Comment:
   I'd include this in the previous example. I read this as the more advanced example, since most users wouldn't need watermark alignment, so `withTimestampAssigner` could also be moved down here.






Re: [PR] Flink: Document watermark generation feature [iceberg]

2023-11-29 Thread via GitHub


mas-chen commented on code in PR #9179:
URL: https://github.com/apache/iceberg/pull/9179#discussion_r1409722185


##
docs/flink-queries.md:
##
@@ -277,6 +277,58 @@ DataStream stream = env.fromSource(source, 
WatermarkStrategy.noWatermarks()
 "Iceberg Source as Avro GenericRecord", new 
GenericRecordAvroTypeInfo(avroSchema));
 ```
 
+### Emitting watermarks
+Emitting watermarks from the source itself could be beneficial for several 
purposes, like harnessing the
+[Flink Watermark 
Alignment](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment)
+feature to prevent runaway readers, or providing triggers for [Flink 
windowing](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/datastream/operators/windows/).
+
+Enable watermark generation for an `IcebergSource` by setting the 
`watermarkColumn`.
+The supported column types are `timestamp`, `timestamptz` and `long`.
+Timestamp columns are automatically converted to milliseconds since the Java 
epoch of
+1970-01-01T00:00:00Z. Use `watermarkTimeUnit` to configure the conversion for 
long columns.
+
+The watermarks are generated based on column metrics stored for data files and 
emitted once per split.
+When using watermarks for Flink watermark alignment set 
`read.split.open-file-cost` to prevent
+combining multiple files to a single split.

Review Comment:
   Is there a tradeoff for increasing `read.split.open-file-cost`? Would a user 
need to also tune the split throttling parameter to avoid memory issues?






Re: [PR] Flink: Fix IcebergSource tableloader lifecycle management in batch mode [iceberg]

2023-11-29 Thread via GitHub


mas-chen commented on code in PR #9173:
URL: https://github.com/apache/iceberg/pull/9173#discussion_r1409747023


##
flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/source/IcebergSource.java:
##
@@ -120,26 +122,30 @@ private String planningThreadName() {
     // a public API like the protected method "OperatorCoordinator.Context getCoordinatorContext()"
     // from SourceCoordinatorContext implementation. For now, <table name>-<random UUID> is used as
     // the unique thread pool name.
-    return lazyTable().name() + "-" + UUID.randomUUID();
+    return tableName() + "-" + UUID.randomUUID();
   }
 
-  private List<IcebergSourceSplit> planSplitsForBatch(String threadName) {
+  private List<IcebergSourceSplit> planSplitsForBatch(
+      TableLoader plannerTableLoader, String threadName) {
     ExecutorService workerPool =
         ThreadPools.newWorkerPool(threadName, scanContext.planParallelism());
-    try {
+    plannerTableLoader.open();

Review Comment:
   I forgot to rename this variable earlier
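   
   For readers following the fix, a minimal sketch of the loader lifecycle in 
batch planning (assuming the usual `TableLoader` contract of 
`clone`/`open`/`loadTable`/`close`; names are illustrative):
   
   ```java
   import java.io.IOException;
   import java.io.UncheckedIOException;
   import org.apache.iceberg.Table;
   import org.apache.iceberg.flink.TableLoader;
   
   // The planner clones the loader so it can be opened and closed independently
   // of the loader owned by the enumerator; `tableLoader` is assumed in scope.
   TableLoader plannerTableLoader = tableLoader.clone();
   plannerTableLoader.open();
   try {
     Table table = plannerTableLoader.loadTable();
     // ... plan splits against the freshly loaded table ...
   } finally {
     try {
       plannerTableLoader.close();
     } catch (IOException e) {
       throw new UncheckedIOException("Failed to close table loader", e);
     }
   }
   ```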



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Spark: Fix aggregation pushdown on struct fields [iceberg]

2023-11-29 Thread via GitHub


Fokko commented on code in PR #9176:
URL: https://github.com/apache/iceberg/pull/9176#discussion_r1409850325


##
api/src/main/java/org/apache/iceberg/expressions/ValueAggregate.java:
##
@@ -30,13 +30,16 @@ protected ValueAggregate(Operation op, BoundTerm term) {
 
   @Override
   public T eval(StructLike struct) {
-return term().eval(struct);
+if (struct.size() > 1) {
+  throw new UnsupportedOperationException("Expected struct like of size 
1");

Review Comment:
   ```suggestion
 throw new UnsupportedOperationException("Expected struct like of size 
smaller or equal to 1");
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Core: Add Comment field to ViewProperties [iceberg]

2023-11-29 Thread via GitHub


amogh-jahagirdar opened a new pull request, #9181:
URL: https://github.com/apache/iceberg/pull/9181

   We reference using "comment" as a property in the view spec, but it looks 
like we don't have a constant defined in the library. See 
https://github.com/trinodb/trino/pull/19818#discussion_r1408354022 for more 
details but I think "comment" should be defined in the Iceberg library as it is 
not engine specific. 
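   
   A minimal sketch of what that could look like (the constant name and 
placement are assumptions based on the description above, not the final API):
   
   ```java
   package org.apache.iceberg.view;
   
   /** Common view properties; "comment" is engine-agnostic, so it belongs here. */
   public final class ViewProperties {
     public static final String COMMENT = "comment";
   
     private ViewProperties() {}
   }
   ```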


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Table scan using functional filters [iceberg-python]

2023-11-29 Thread via GitHub


Fokko commented on issue #170:
URL: https://github.com/apache/iceberg-python/issues/170#issuecomment-1832717377

   It looks like we need to add the BoundTransform to the left-hand side: 
https://github.com/apache/iceberg/blob/d247b20f166ccb0b92443d4b05330b1e0d9c5d49/api/src/main/java/org/apache/iceberg/transforms/ProjectionUtil.java#L231-L241
   
   When the table is partitioned by `day(observed_at)`, then we can use it to 
do the optimization (the transform can be dropped from both sides).
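   
   For illustration, a hedged sketch of the kind of predicate this would enable 
on the Java side (the day ordinal literal is made up):
   
   ```java
   import org.apache.iceberg.expressions.Expression;
   import org.apache.iceberg.expressions.Expressions;
   
   // day("observed_at") applies the day transform to the source column; when the
   // table is partitioned by day(observed_at), the same transform appears on both
   // sides and can be dropped, leaving a plain partition-value comparison.
   Expression filter = Expressions.equal(Expressions.day("observed_at"), 19690);
   ```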


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[I] Core: Snapshot file are not correctly deleted when a snapshot is expired as part of a transaction [iceberg]

2023-11-29 Thread via GitHub


bartash opened a new issue, #9182:
URL: https://github.com/apache/iceberg/issues/9182

   ### Apache Iceberg version
   
   main (development)
   
   ### Query engine
   
   Impala
   
   ### Please describe the bug 🐞
   
   When a snapshot is expired as part of a transaction, the snapshot file(s) 
should be deleted when the transaction commits. A recent change (PR #6634) 
ensures that files are not deleted when they have also been committed as part 
of a transaction, but this breaks the simple case where no new files are 
committed. 
   
   I have a candidate fix which I will post in a PR.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Core: Snapshot file are not correctly deleted when a snapshot is expired as part of a transaction [iceberg]

2023-11-29 Thread via GitHub


bartash commented on issue #9182:
URL: https://github.com/apache/iceberg/issues/9182#issuecomment-1832727577

   Can someone assign this to me please? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Core: Expired Snapshot files in a transaction should be deleted. [iceberg]

2023-11-29 Thread via GitHub


bartash opened a new pull request, #9183:
URL: https://github.com/apache/iceberg/pull/9183

   When a snapshot is expired as part of a transaction, the snapshot file(s) 
should be deleted when the transaction commits. A recent change (#6634) ensured 
that files are not deleted when they have also been committed as part of a 
transaction, but this breaks the simple case where no new files are committed. 
Fix this by not skipping deletion when the list of committed files is empty.
   
   Closes #9182 
   
   TESTING:
   
   Extended a unit test to ensure that snapshot files are deleted. Ran the test 
without the fix on a branch where #6634 was reverted to show that this is a 
regression.
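   
   For context, a minimal sketch of the scenario being fixed (mirroring the 
extended unit test; `table` and `snapshotIdToExpire` are illustrative):
   
   ```java
   import org.apache.iceberg.Table;
   import org.apache.iceberg.Transaction;
   
   // Expire a snapshot inside a transaction that commits no new files; before this
   // fix, cleanup was skipped and the expired snapshot's manifest list survived.
   Transaction txn = table.newTransaction();
   txn.expireSnapshots().expireSnapshotId(snapshotIdToExpire).commit();
   txn.commitTransaction();
   // after commitTransaction(), the expired snapshot's manifest list should be gone
   ```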


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Expired Snapshot files in a transaction should be deleted. [iceberg]

2023-11-29 Thread via GitHub


bartash commented on PR #9183:
URL: https://github.com/apache/iceberg/pull/9183#issuecomment-1832734372

   @amogh-jahagirdar can you please take a look when you have time as you 
implemented #6634


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Parquet file overwritten by spark streaming job in subsequent execution with same spark streaming checkpoint location [iceberg]

2023-11-29 Thread via GitHub


amitmittal5 commented on issue #9172:
URL: https://github.com/apache/iceberg/issues/9172#issuecomment-1832775989

   I also tested with the latest version, 
**iceberg-spark-runtime-3.4_2.12-1.4.2.jar**, as well. I could see that the 
second number in the file name kept increasing 
(1-**3200**-11773075-523f-4667-936b-88702fe9860c-1.parquet); however, 
after around 200 executions of the stream, the file name was reset to 
1-3166-11773075-523f-4667-936b-88702fe9860c-1.parquet and files 
started getting overwritten.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] metadata json conflict when streaming [iceberg]

2023-11-29 Thread via GitHub


amitmittal5 commented on issue #9171:
URL: https://github.com/apache/iceberg/issues/9171#issuecomment-1832783780

   Hello,
   I am also running a Spark streaming job with the latest versions of Spark and 
Iceberg, and I am seeing data files getting overwritten in subsequent 
stream executions. I have raised my issue here 
https://github.com/apache/iceberg/issues/9172, so I am just wondering whether 
our issues share the same root cause. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Build: Bump mypy-boto3-glue from 1.29.2 to 1.33.0 [iceberg-python]

2023-11-29 Thread via GitHub


dependabot[bot] opened a new pull request, #171:
URL: https://github.com/apache/iceberg-python/pull/171

   Bumps [mypy-boto3-glue](https://github.com/youtype/mypy_boto3_builder) from 
1.29.2 to 1.33.0.
   
   Commits
   
   See full diff in [compare view](https://github.com/youtype/mypy_boto3_builder/commits)
   
   
   
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=mypy-boto3-glue&package-manager=pip&previous-version=1.29.2&new-version=1.33.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Expired Snapshot files in a transaction should be deleted. [iceberg]

2023-11-29 Thread via GitHub


amogh-jahagirdar commented on PR #9183:
URL: https://github.com/apache/iceberg/pull/9183#issuecomment-1832872079

   Thanks for the PR @bartash I'm taking a look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Expired Snapshot files in a transaction should be deleted. [iceberg]

2023-11-29 Thread via GitHub


amogh-jahagirdar commented on code in PR #9183:
URL: https://github.com/apache/iceberg/pull/9183#discussion_r1410002841


##
core/src/main/java/org/apache/iceberg/BaseTransaction.java:
##
@@ -446,20 +446,16 @@ private void commitSimpleTransaction() {
   }
 
   Set committedFiles = committedFiles(ops, newSnapshots);
-  if (committedFiles != null) {
-// delete all of the files that were deleted in the most recent set of 
operation commits
-Tasks.foreach(deletedFiles)
-.suppressFailureWhenFinished()
-.onFailure((file, exc) -> LOG.warn("Failed to delete uncommitted 
file: {}", file, exc))
-.run(
-path -> {
-  if (!committedFiles.contains(path)) {
-ops.io().deleteFile(path);
-  }
-});
-  } else {
-LOG.warn("Failed to load metadata for a committed snapshot, skipping 
clean-up");
-  }
+  // delete all of the files that were deleted in the most recent set of 
operation commits
+  Tasks.foreach(deletedFiles)
+  .suppressFailureWhenFinished()
+  .onFailure((file, exc) -> LOG.warn("Failed to delete uncommitted 
file: {}", file, exc))
+  .run(
+  path -> {
+if (committedFiles == null || !committedFiles.contains(path)) {

Review Comment:
   Ah yes, good catch. If there are no committed files (which would be expected 
for a transaction with just `ExpireSnapshots`) but there are files to cleanup 
(which would be expected for `ExpireSnapshots` again) then we should proceed 
with the file removal. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] How to set Spark conf to use Parquet and Iceberg tables using glue catalog without catalog name(spark_catalog)? [iceberg]

2023-11-29 Thread via GitHub


github-actions[bot] commented on issue #7748:
URL: https://github.com/apache/iceberg/issues/7748#issuecomment-1832897578

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] When using flink dataframe api for upsert, the deduplication effect is not achieved [iceberg]

2023-11-29 Thread via GitHub


github-actions[bot] commented on issue #7639:
URL: https://github.com/apache/iceberg/issues/7639#issuecomment-1832897644

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Iceberg with Hive Metastore does not create a catalog in Spark and uses default [iceberg]

2023-11-29 Thread via GitHub


github-actions[bot] closed issue #7574: Iceberg with Hive Metastore does not 
create a catalog in Spark and uses default
URL: https://github.com/apache/iceberg/issues/7574


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Iceberg with Hive Metastore does not create a catalog in Spark and uses default [iceberg]

2023-11-29 Thread via GitHub


github-actions[bot] commented on issue #7574:
URL: https://github.com/apache/iceberg/issues/7574#issuecomment-1832897692

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Expired Snapshot files in a transaction should be deleted. [iceberg]

2023-11-29 Thread via GitHub


amogh-jahagirdar commented on code in PR #9183:
URL: https://github.com/apache/iceberg/pull/9183#discussion_r1410012734


##
core/src/test/java/org/apache/iceberg/TestSequenceNumberForV2Table.java:
##
@@ -319,12 +321,18 @@ public void testExpirationInTransaction() {
 V2Assert.assertEquals("Snapshot sequence number should be 2", 2, 
snap2.sequenceNumber());
 V2Assert.assertEquals(
 "Last sequence number should be 2", 2, 
readMetadata().lastSequenceNumber());
+V2Assert.assertEquals(
+"Should be 2 snapshot files", 
listSnapshotFiles(table.location()).size(), 2);
 
 Transaction txn = table.newTransaction();
 txn.expireSnapshots().expireSnapshotId(commitId1).commit();
 txn.commitTransaction();
 V2Assert.assertEquals(
 "Last sequence number should be 2", 2, 
readMetadata().lastSequenceNumber());
+V2Assert.assertEquals(
+"Should be 1 snapshot file as 1 was deleted",

Review Comment:
   Nit: Typically we use the term "manifest list" instead of "snapshot file"; 
could we call it that? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Does iceberg flink streaming read support recover from last ckp? [iceberg]

2023-11-29 Thread via GitHub


qianzhen0 closed issue #9175: Does iceberg flink streaming read support recover 
from last ckp?
URL: https://github.com/apache/iceberg/issues/9175


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] An exception occurred while writing iceberg data through Spark: org. apache. iceberg. exceptions. CommitFailedException: metadata location has changed [iceberg]

2023-11-29 Thread via GitHub


AllenWee1106 commented on issue #9178:
URL: https://github.com/apache/iceberg/issues/9178#issuecomment-1832991874

   @nk1506 
   Thank you for your reply. 
   Based on your analysis, I have the following questions:
   1. I have multiple Spark tasks running at the same time. Is it possible that 
some of these tasks wrote to the same Iceberg table at the same time, which 
led to this problem?
   
   2. In the code, I execute the Spark jobs in a loop (the iterations run one 
after another, not in parallel). Could this be the reason for the problem?
   
![20231130-100311](https://github.com/apache/iceberg/assets/146182256/fb1f8ef9-f1f5-4530-b093-1833d0674b3d)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support UnboundPartitionSpec [iceberg-rust]

2023-11-29 Thread via GitHub


my-vegetable-has-exploded commented on code in PR #106:
URL: https://github.com/apache/iceberg-rust/pull/106#discussion_r1410082084


##
crates/iceberg/src/spec/partition.rs:
##
@@ -60,6 +62,99 @@ impl PartitionSpec {
 }
 }
 
+static PARTITION_DATA_ID_START: i32 = 1000;
+
+/// Reference to [`UnboundPartitionSpec`].
+pub type UnboundPartitionSpecRef = Arc;
+#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)]
+#[serde(rename_all = "kebab-case")]
+/// Unbound partition field can be built without a schema and later bound to a 
schema.
+pub struct UnboundPartitionField {
+/// A source column id from the table’s schema
+pub source_id: i32,
+/// A partition field id that is used to identify a partition field and is 
unique within a partition spec.
+/// In v2 table metadata, it is unique across all partition specs.
+pub partition_id: Option,
+/// A partition name.
+pub name: String,
+/// A transform that is applied to the source column to produce a 
partition value.
+pub transform: Transform,
+}
+
+#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Default, 
Builder)]

Review Comment:
   I don't find a neat way to add ```#[builder(setter(each(name = 
"with_partition_field")))]``` in TypedBuilder.
   According to 
[this](https://github.com/idanarye/rust-typed-builder/blob/168e076bae35399040a1c973901161023e48e13b/src/lib.rs#L218C19-L218C27),
 it may need to use mutators. Maybe we can try that later?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support UnboundPartitionSpec [iceberg-rust]

2023-11-29 Thread via GitHub


my-vegetable-has-exploded commented on code in PR #106:
URL: https://github.com/apache/iceberg-rust/pull/106#discussion_r1410083813


##
crates/iceberg/src/spec/partition.rs:
##
@@ -60,6 +62,99 @@ impl PartitionSpec {
 }
 }
 
+static PARTITION_DATA_ID_START: i32 = 1000;
+
+/// Reference to [`UnboundPartitionSpec`].
+pub type UnboundPartitionSpecRef = Arc;
+#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)]
+#[serde(rename_all = "kebab-case")]
+/// Unbound partition field can be built without a schema and later bound to a 
schema.
+pub struct UnboundPartitionField {
+/// A source column id from the table’s schema
+pub source_id: i32,
+/// A partition field id that is used to identify a partition field and is 
unique within a partition spec.
+/// In v2 table metadata, it is unique across all partition specs.
+pub partition_id: Option,
+/// A partition name.
+pub name: String,
+/// A transform that is applied to the source column to produce a 
partition value.
+pub transform: Transform,
+}
+
+#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Default, 
Builder)]
+#[serde(rename_all = "kebab-case")]
+#[builder(setter(prefix = "with"))]
+/// Unbound partition spec can be built without a schema and later bound to a 
schema.
+pub struct UnboundPartitionSpec {
+/// Identifier for PartitionSpec
+pub spec_id: Option,
+/// Details of the partition spec
+#[builder(setter(each(name = "with_unbound_partition_field")))]
+pub fields: Vec,
+}
+
+impl UnboundPartitionSpec {
+/// last assigned id for partitioned field
+pub fn unpartitioned_last_assigned_id() -> i32 {
+PARTITION_DATA_ID_START - 1
+}
+
+/// Create unbound partition spec builder
+pub fn builder() -> UnboundPartitionSpecBuilder {
+UnboundPartitionSpecBuilder::default()
+}
+
+/// Bind unbound partition spec to a schema
+pub fn bind(&self, schema: SchemaRef) -> Result {

Review Comment:
   I'm also a little confused about the process. If there's a need, I can try 
it then. Thanks @liurenjie1024.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] chore: Add cargo build and build guide [iceberg-rust]

2023-11-29 Thread via GitHub


liurenjie1024 commented on code in PR #111:
URL: https://github.com/apache/iceberg-rust/pull/111#discussion_r1410084539


##
CONTRIBUTING.md:
##
@@ -108,6 +108,26 @@ $ cargo version
 cargo 1.69.0 (6e9a83356 2023-04-12)
 ```
 
+## Build
+
+### Compile
+
+```shell
+make build
+```
+
+### Lint
+
+```shell
+make check
+```
+
+### Test
+
+```shell
+make test
+```

Review Comment:
   Instead of using headers, how about using item lists:
   ```
   * To invoke a build: `make build`
   * To check code styles: `make check`
   * To run tests: `make test`
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] chore: Add cargo build and build guide [iceberg-rust]

2023-11-29 Thread via GitHub


manuzhang commented on code in PR #111:
URL: https://github.com/apache/iceberg-rust/pull/111#discussion_r1410085665


##
CONTRIBUTING.md:
##
@@ -108,6 +108,26 @@ $ cargo version
 cargo 1.69.0 (6e9a83356 2023-04-12)
 ```
 
+## Build
+
+### Compile
+
+```shell
+make build
+```
+
+### Lint
+
+```shell
+make check
+```
+
+### Test
+
+```shell
+make test
+```

Review Comment:
   I'm trying to follow https://py.iceberg.apache.org/contributing/ but I'm not 
sure yet what else to add for each command. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support UnboundPartitionSpec [iceberg-rust]

2023-11-29 Thread via GitHub


liurenjie1024 commented on code in PR #106:
URL: https://github.com/apache/iceberg-rust/pull/106#discussion_r1410088445


##
crates/iceberg/src/spec/partition.rs:
##
@@ -60,6 +60,44 @@ impl PartitionSpec {
 }
 }
 
+/// Reference to [`UnboundPartitionSpec`].
+pub type UnboundPartitionSpecRef = Arc;
+#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, TypedBuilder)]
+#[serde(rename_all = "kebab-case")]
+/// Unbound partition field can be built without a schema and later bound to a 
schema.
+pub struct UnboundPartitionField {
+/// A source column id from the table’s schema
+pub source_id: i32,
+/// A partition field id that is used to identify a partition field and is 
unique within a partition spec.
+/// In v2 table metadata, it is unique across all partition specs.
+#[builder(default, setter(strip_option))]
+pub partition_id: Option,
+/// A partition name.
+pub name: String,
+/// A transform that is applied to the source column to produce a 
partition value.
+pub transform: Transform,
+}
+
+#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Default, 
Builder)]
+#[serde(rename_all = "kebab-case")]
+#[builder(setter(prefix = "with"))]
+/// Unbound partition spec can be built without a schema and later bound to a 
schema.
+pub struct UnboundPartitionSpec {

Review Comment:
   Could you add a test of ser/de from json for this? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] chore: Add cargo build and build guide [iceberg-rust]

2023-11-29 Thread via GitHub


liurenjie1024 commented on code in PR #111:
URL: https://github.com/apache/iceberg-rust/pull/111#discussion_r1410084539


##
CONTRIBUTING.md:
##
@@ -108,6 +108,26 @@ $ cargo version
 cargo 1.69.0 (6e9a83356 2023-04-12)
 ```
 
+## Build
+
+### Compile
+
+```shell
+make build
+```
+
+### Lint
+
+```shell
+make check
+```
+
+### Test
+
+```shell
+make test
+```

Review Comment:
   Instead of using headers, how about using item lists:
   
   * To invoke a build: `make build`
   * To check code styles: `make check`
   * To run tests: `make test`
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] chore: Add cargo build and build guide [iceberg-rust]

2023-11-29 Thread via GitHub


liurenjie1024 commented on code in PR #111:
URL: https://github.com/apache/iceberg-rust/pull/111#discussion_r1410090267


##
CONTRIBUTING.md:
##
@@ -108,6 +108,26 @@ $ cargo version
 cargo 1.69.0 (6e9a83356 2023-04-12)
 ```
 
+## Build
+
+### Compile
+
+```shell
+make build
+```
+
+### Lint
+
+```shell
+make check
+```
+
+### Test
+
+```shell
+make test
+```

Review Comment:
   Currently we don't have that many commands, so I think the list format would 
be more concise. We can always enrich this section when necessary.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support UnboundPartitionSpec [iceberg-rust]

2023-11-29 Thread via GitHub


my-vegetable-has-exploded commented on code in PR #106:
URL: https://github.com/apache/iceberg-rust/pull/106#discussion_r1410103369


##
crates/iceberg/src/spec/partition.rs:
##
@@ -60,6 +60,44 @@ impl PartitionSpec {
 }
 }
 
+/// Reference to [`UnboundPartitionSpec`].
+pub type UnboundPartitionSpecRef = Arc;
+#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, TypedBuilder)]
+#[serde(rename_all = "kebab-case")]
+/// Unbound partition field can be built without a schema and later bound to a 
schema.
+pub struct UnboundPartitionField {
+/// A source column id from the table’s schema
+pub source_id: i32,
+/// A partition field id that is used to identify a partition field and is 
unique within a partition spec.
+/// In v2 table metadata, it is unique across all partition specs.
+#[builder(default, setter(strip_option))]
+pub partition_id: Option,
+/// A partition name.
+pub name: String,
+/// A transform that is applied to the source column to produce a 
partition value.
+pub transform: Transform,
+}
+
+#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Default, 
Builder)]
+#[serde(rename_all = "kebab-case")]
+#[builder(setter(prefix = "with"))]
+/// Unbound partition spec can be built without a schema and later bound to a 
schema.
+pub struct UnboundPartitionSpec {

Review Comment:
   Sure!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Spark: Fix aggregation pushdown on struct fields [iceberg]

2023-11-29 Thread via GitHub


huaxingao commented on code in PR #9176:
URL: https://github.com/apache/iceberg/pull/9176#discussion_r1410118039


##
api/src/main/java/org/apache/iceberg/expressions/ValueAggregate.java:
##
@@ -30,13 +30,16 @@ protected ValueAggregate(Operation op, BoundTerm term) {
 
   @Override
   public T eval(StructLike struct) {
-return term().eval(struct);
+if (struct.size() > 1) {
+  throw new UnsupportedOperationException("Expected struct like of size 
1");
+}
+
+return (T) struct.get(0, term().type().typeId().javaClass());
   }
 
   @Override
   public T eval(DataFile file) {
-valueStruct.setValue(evaluateRef(file));
-return term().eval(valueStruct);
+return (T) evaluateRef(file);

Review Comment:
   Seems OK to me to just return `evaluateRef(file)`.
   
   I'd like to get @rdblue 's opinion on this as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Use avro compression properties from table properties when writing manifests and manifest lists [iceberg]

2023-11-29 Thread via GitHub


wypoon commented on PR #6799:
URL: https://github.com/apache/iceberg/pull/6799#issuecomment-1833125326

   Hmm, I think TestExpireSnapshotsAction > dataFilesCleanupWithParallelTasks 
might be a flaky test?
   All Spark 3 tests passed with Java 8 and 11, and even for Java 17, they 
passed for Scala 2.13, so it seems unlikely that there is a problem just with 
Scala 2.12 on Java 17.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Failed to create namespace using spark sql based on iceberg hadoop catalog (rest catalog) [iceberg]

2023-11-29 Thread via GitHub


TCGOGOGO commented on issue #9072:
URL: https://github.com/apache/iceberg/issues/9072#issuecomment-1833130222

   Any update on this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-29 Thread via GitHub


nk1506 commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1410204384


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java:
##
@@ -328,6 +346,14 @@ private void setHmsTableParameters(
   Set obsoleteProps,
   boolean hiveEngineEnabled,
   Map summary) {
+
+setHmsParameters(

Review Comment:
   I am not able to see any impact from the re-ordering. Could you please help 
me with some examples?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-29 Thread via GitHub


nk1506 commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1410206174


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java:
##
@@ -162,8 +163,11 @@ protected void doRefresh() {
   Thread.currentThread().interrupt();
   throw new RuntimeException("Interrupted during refresh", e);
 }
-
-refreshFromMetadataLocation(metadataLocation, metadataRefreshMaxRetries);
+if (table != null && 
table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) {
+  disableRefresh();

Review Comment:
   If the corresponding metadata is of view type, `TableMetadataParser` will fail to 
parse it. The same applies to `ViewMetadataParser`. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-29 Thread via GitHub


nk1506 commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1410206641


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java:
##
@@ -250,14 +262,15 @@ protected void doCommit(TableMetadata base, TableMetadata 
metadata) {
 + "iceberg.hive.lock-heartbeat-interval-ms.",
 le);
   } catch (org.apache.hadoop.hive.metastore.api.AlreadyExistsException e) {
-throw new AlreadyExistsException(e, "Table already exists: %s.%s", 
database, tableName);
-
+throw new org.apache.iceberg.exceptions.AlreadyExistsException(

Review Comment:
   I have reverted this to keep it consistent with other places. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-29 Thread via GitHub


nk1506 commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1410207565


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java:
##
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.util.Locale;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.concurrent.atomic.AtomicReference;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.IMetaStoreClient;
+import org.apache.hadoop.hive.metastore.TableType;
+import org.apache.hadoop.hive.metastore.api.InvalidObjectException;
+import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.iceberg.BaseMetastoreTableOperations;
+import org.apache.iceberg.ClientPool;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.CommitFailedException;
+import org.apache.iceberg.exceptions.CommitStateUnknownException;
+import org.apache.iceberg.exceptions.NoSuchViewException;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.hive.HiveCatalogUtil.CommitStatus;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.view.BaseViewOperations;
+import org.apache.iceberg.view.ViewMetadata;
+import org.apache.thrift.TException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/** Hive implementation of Iceberg ViewOperations. */
+final class HiveViewOperations extends BaseViewOperations implements 
HiveOperationsBase {
+  private static final Logger LOG = 
LoggerFactory.getLogger(HiveViewOperations.class);
+
+  private final String fullName;
+  private final String database;
+  private final String viewName;
+  private final FileIO fileIO;
+  private final ClientPool metaClients;
+  private final long maxHiveTablePropertySize;
+
+  HiveViewOperations(
+  Configuration conf,
+  ClientPool metaClients,
+  FileIO fileIO,
+  String catalogName,
+  TableIdentifier viewIdentifier) {
+String dbName = viewIdentifier.namespace().level(0);
+this.metaClients = metaClients;
+this.fileIO = fileIO;
+this.fullName = catalogName + "." + dbName + "." + viewIdentifier.name();
+this.database = dbName;
+this.viewName = viewIdentifier.name();
+this.maxHiveTablePropertySize =
+conf.getLong(HIVE_TABLE_PROPERTY_MAX_SIZE, 
HIVE_TABLE_PROPERTY_MAX_SIZE_DEFAULT);
+  }
+
+  @Override
+  public void doRefresh() {
+String metadataLocation = null;
+Table table = null;
+
+try {
+  table = metaClients.run(client -> client.getTable(database, viewName));
+  HiveOperationsBase.validateTableIsIceberg(table, fullName);
+  metadataLocation =
+  
table.getParameters().get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP);
+
+} catch (NoSuchObjectException e) {
+  if (currentMetadataLocation() != null) {
+throw new NoSuchViewException("View does not exist: %s.%s", database, 
viewName);
+  }
+} catch (TException e) {
+  String errMsg =
+  String.format("Failed to get view info from metastore %s.%s", 
database, viewName);
+  throw new RuntimeException(errMsg, e);
+} catch (InterruptedException e) {
+  Thread.currentThread().interrupt();
+  throw new RuntimeException("Interrupted during refresh", e);
+}
+
+if (table != null && 
!table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) {
+  disableRefresh();

Review Comment:
   If the table is not of type view, `ViewMetadataParser` will fail to parse the 
corresponding JSON file. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

[PR] Core: Suppress exceptions in case of dropTableData [iceberg]

2023-11-29 Thread via GitHub


nk1506 opened a new pull request, #9184:
URL: https://github.com/apache/iceberg/pull/9184

   With 
[dropTableData](https://github.com/apache/iceberg/blob/d247b20f166ccb0b92443d4b05330b1e0d9c5d49/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L86)
 we aim to delete as many orphaned files as possible. With a partial failure, a 
snapshot can be corrupted. The intent here is to ignore the corrupted snapshot, 
continue with the remaining snapshots, and delete their related files. 
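   
   For context, a minimal sketch of the best-effort deletion pattern this 
relies on (mirroring the `Tasks` usage already in core; the helper name is 
illustrative):
   
   ```java
   import org.apache.iceberg.io.FileIO;
   import org.apache.iceberg.util.Tasks;
   import org.slf4j.Logger;
   import org.slf4j.LoggerFactory;
   
   class DropTableDataSketch {
     private static final Logger LOG = LoggerFactory.getLogger(DropTableDataSketch.class);
   
     // Best-effort deletion: a path that fails (for example because its snapshot's
     // metadata is corrupted) is logged and skipped, and the loop continues.
     static void deleteFilesBestEffort(FileIO io, Iterable<String> paths) {
       Tasks.foreach(paths)
           .suppressFailureWhenFinished()
           .onFailure((path, exc) -> LOG.warn("Failed to delete file: {}", path, exc))
           .run(io::deleteFile);
     }
   }
   ```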


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Use avro compression properties from table properties when writing manifests and manifest lists [iceberg]

2023-11-29 Thread via GitHub


wypoon commented on PR #6799:
URL: https://github.com/apache/iceberg/pull/6799#issuecomment-1833208488

   @nastra @aokolnychyi I have rebased on main and resolved the conflicts with 
the `RewriteManifestsSparkAction` refactoring. As I mentioned before, I have 
introduced `ManifestWriter.Options` and `ManifestListWriter.Options` to be 
passed to `ManifestFiles.write` and `ManifestLists.write`. These `Options` 
classes provide defined options (currently just compression codec and level, 
but extensible in the future) that may be passed.
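   
   To make the shape concrete, a hedged sketch of how such options might be 
passed; every method name below is my guess from the description above, not 
the actual API in this PR:
   
   ```java
   // Hypothetical builder names, inferred from the PR description:
   ManifestWriter.Options options =
       ManifestWriter.Options.builder()
           .compressionCodec("zstd") // e.g. resolved from write.avro.compression-codec
           .compressionLevel(3)      // e.g. resolved from a matching level property
           .build();
   
   // Hypothetical overload carrying the options through to the Avro writer:
   ManifestWriter<DataFile> writer =
       ManifestFiles.write(formatVersion, spec, outputFile, snapshotId, options);
   ```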


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-29 Thread via GitHub


nk1506 commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1410207565


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java:
##
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.util.Locale;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.concurrent.atomic.AtomicReference;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.IMetaStoreClient;
+import org.apache.hadoop.hive.metastore.TableType;
+import org.apache.hadoop.hive.metastore.api.InvalidObjectException;
+import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.iceberg.BaseMetastoreTableOperations;
+import org.apache.iceberg.ClientPool;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.CommitFailedException;
+import org.apache.iceberg.exceptions.CommitStateUnknownException;
+import org.apache.iceberg.exceptions.NoSuchViewException;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.hive.HiveCatalogUtil.CommitStatus;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.view.BaseViewOperations;
+import org.apache.iceberg.view.ViewMetadata;
+import org.apache.thrift.TException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/** Hive implementation of Iceberg ViewOperations. */
+final class HiveViewOperations extends BaseViewOperations implements 
HiveOperationsBase {
+  private static final Logger LOG = 
LoggerFactory.getLogger(HiveViewOperations.class);
+
+  private final String fullName;
+  private final String database;
+  private final String viewName;
+  private final FileIO fileIO;
+  private final ClientPool metaClients;
+  private final long maxHiveTablePropertySize;
+
+  HiveViewOperations(
+  Configuration conf,
+  ClientPool metaClients,
+  FileIO fileIO,
+  String catalogName,
+  TableIdentifier viewIdentifier) {
+String dbName = viewIdentifier.namespace().level(0);
+this.metaClients = metaClients;
+this.fileIO = fileIO;
+this.fullName = catalogName + "." + dbName + "." + viewIdentifier.name();
+this.database = dbName;
+this.viewName = viewIdentifier.name();
+this.maxHiveTablePropertySize =
+conf.getLong(HIVE_TABLE_PROPERTY_MAX_SIZE, 
HIVE_TABLE_PROPERTY_MAX_SIZE_DEFAULT);
+  }
+
+  @Override
+  public void doRefresh() {
+String metadataLocation = null;
+Table table = null;
+
+try {
+  table = metaClients.run(client -> client.getTable(database, viewName));
+  HiveOperationsBase.validateTableIsIceberg(table, fullName);
+  metadataLocation =
+  
table.getParameters().get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP);
+
+} catch (NoSuchObjectException e) {
+  if (currentMetadataLocation() != null) {
+throw new NoSuchViewException("View does not exist: %s.%s", database, 
viewName);
+  }
+} catch (TException e) {
+  String errMsg =
+  String.format("Failed to get view info from metastore %s.%s", 
database, viewName);
+  throw new RuntimeException(errMsg, e);
+} catch (InterruptedException e) {
+  Thread.currentThread().interrupt();
+  throw new RuntimeException("Interrupted during refresh", e);
+}
+
+if (table != null && 
!table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) {
+  disableRefresh();

Review Comment:
   If the table is not of type view, `ViewMetadataParser` will fail to parse the 
corresponding JSON file. 
   
   Stacktrace without this change:
   ```
   java.lang.IllegalArgumentException: Cannot parse missing string: view-uuid
       at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:218)
       at org.apache.iceberg.util.JsonUtil.get
   ```

Re: [PR] feat: replace 'Builder' with 'TypedBuilder' for 'Snapshot' [iceberg-rust]

2023-11-29 Thread via GitHub


xiaoyang-sde commented on PR #110:
URL: https://github.com/apache/iceberg-rust/pull/110#issuecomment-1833246705

   cc @Fokko


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org