Re: [I] doc: rust.iceberg.apache.org is not resolved [iceberg-rust]

2024-01-04 Thread via GitHub


Xuanwo commented on issue #137:
URL: https://github.com/apache/iceberg-rust/issues/137#issuecomment-1876676128

   https://rust.iceberg.apache.org/ is online!





Re: [I] doc: rust.iceberg.apache.org is not resolved [iceberg-rust]

2024-01-04 Thread via GitHub


Xuanwo closed issue #137: doc: rust.iceberg.apache.org is not resolved
URL: https://github.com/apache/iceberg-rust/issues/137





[PR] docs: Change homepage to rust.i.a.o [iceberg-rust]

2024-01-04 Thread via GitHub


Xuanwo opened a new pull request, #146:
URL: https://github.com/apache/iceberg-rust/pull/146

   (no comment)





Re: [PR] Build: Bump sqlalchemy from 2.0.24 to 2.0.25 [iceberg-python]

2024-01-04 Thread via GitHub


Fokko merged PR #250:
URL: https://github.com/apache/iceberg-python/pull/250





Re: [PR] Bug fix falsy value of zero [iceberg-python]

2024-01-04 Thread via GitHub


Fokko merged PR #249:
URL: https://github.com/apache/iceberg-python/pull/249





Re: [I] bug: The current snapshot with id 0 will be skip. [iceberg-python]

2024-01-04 Thread via GitHub


Fokko closed issue #232: bug: The current snapshot with id 0 will be skip. 
URL: https://github.com/apache/iceberg-python/issues/232





Re: [PR] docs: Change homepage to rust.i.a.o [iceberg-rust]

2024-01-04 Thread via GitHub


Fokko merged PR #146:
URL: https://github.com/apache/iceberg-rust/pull/146





[PR] Flink: Disable classloader check in TestIcebergSourceWithWatermarkExtractor to fix flakiness [iceberg]

2024-01-04 Thread via GitHub


manuzhang opened a new pull request, #9408:
URL: https://github.com/apache/iceberg/pull/9408

   (no comment)





[PR] docs: Add release guide for iceberg-rust [iceberg-rust]

2024-01-04 Thread via GitHub


Xuanwo opened a new pull request, #147:
URL: https://github.com/apache/iceberg-rust/pull/147

   Part of https://github.com/apache/iceberg-rust/issues/81





[PR] Core: Remove partition statistics files during purge table [iceberg]

2024-01-04 Thread via GitHub


ajantha-bhat opened a new pull request, #9409:
URL: https://github.com/apache/iceberg/pull/9409

   follow up from https://github.com/apache/iceberg/pull/9305
   
   cc: @dramaticlly, @amogh-jahagirdar, @nastra, @RussellSpitzer





Re: [I] How to release [iceberg-rust]

2024-01-04 Thread via GitHub


Xuanwo commented on issue #81:
URL: https://github.com/apache/iceberg-rust/issues/81#issuecomment-1876737649

   ## Tasks
   
   - [ ] Add release docs: https://github.com/apache/iceberg-rust/pull/147
   - [ ] Add verify docs
   - [ ] Add CI to upload crates





Re: [PR] Parquet: Support reading INT96 column in row group filter [iceberg]

2024-01-04 Thread via GitHub


manuzhang commented on PR #8988:
URL: https://github.com/apache/iceberg/pull/8988#issuecomment-1876754004

   I submitted https://github.com/apache/iceberg/pull/9408 to fix the failing
flaky test.





Re: [I] Flink API rewriteDataFile How to set up scanning based on file size [iceberg]

2024-01-04 Thread via GitHub


GuoZhaoY commented on issue #9386:
URL: https://github.com/apache/iceberg/issues/9386#issuecomment-1876786214

   Thank you for your answer; I understand it now.





Re: [I] Flink API rewriteDataFile How to set up scanning based on file size [iceberg]

2024-01-04 Thread via GitHub


GuoZhaoY closed issue #9386: Flink API rewriteDataFile How to set up
scanning based on file size
URL: https://github.com/apache/iceberg/issues/9386





[PR] docs: Add link for iceberg rust [iceberg-docs]

2024-01-04 Thread via GitHub


suyanhanx opened a new pull request, #300:
URL: https://github.com/apache/iceberg-docs/pull/300

   
![image](https://github.com/apache/iceberg-docs/assets/24221472/d6546902-9920-4e51-a37c-85a735535bd6)
   





Re: [PR] Flink: Read parquet BINARY column as String for expected [iceberg]

2024-01-04 Thread via GitHub


advancedxy commented on PR #8808:
URL: https://github.com/apache/iceberg/pull/8808#issuecomment-1876803507

   Spark adds a configuration to control whether to treat binary as string in
the Parquet reader, see:
https://github.com/apache/spark/blob/c8137960a0ba725d1633795a057c68f2bbef414b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L981

   For this case, a similar approach should be adopted rather than promoting
binary to string unconditionally. That would be safer, and it leaves the
decision up to the users. WDYT @RussellSpitzer
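
   As a sketch of the opt-in pattern being proposed here (Python pseudocode; the
option name `read.parquet.binary-as-string` and the reader shape are
hypothetical illustrations, not an existing Iceberg or Flink API):

```python
# Hypothetical illustration of config-gated type promotion; the option name
# and reader structure are invented for this sketch.
from dataclasses import dataclass, field
from typing import Dict, Union

@dataclass
class ReadConf:
    options: Dict[str, str] = field(default_factory=dict)

    def binary_as_string(self) -> bool:
        # Defaults to False: without the opt-in, BINARY stays as raw bytes.
        return self.options.get("read.parquet.binary-as-string", "false") == "true"

def read_binary_column(raw: bytes, conf: ReadConf) -> Union[str, bytes]:
    # Promote BINARY to str only when the user explicitly opted in.
    return raw.decode("utf-8") if conf.binary_as_string() else raw

print(read_binary_column(b"abc", ReadConf()))  # b'abc'
print(read_binary_column(b"abc", ReadConf({"read.parquet.binary-as-string": "true"})))  # abc
```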





[I] Failed to assign splits due to the serialized split size [iceberg]

2024-01-04 Thread via GitHub


javrasya opened a new issue, #9410:
URL: https://github.com/apache/iceberg/issues/9410

   ### Apache Iceberg version
   
   1.4.2 (latest release)
   
   ### Query engine
   
   Flink
   
   ### Please describe the bug 🐞
   
   Hi there, I am trying to consume records from an Iceberg table in my Flink
application, and I am running into the following issue:
   
   ```
   Caused by: org.apache.flink.util.FlinkRuntimeException: Failed to serialize splits.
       at org.apache.flink.runtime.source.coordinator.SourceCoordinatorContext.lambda$assignSplits$4(SourceCoordinatorContext.java:223)
       at java.base/java.util.HashMap.forEach(HashMap.java:1337)
       at org.apache.flink.runtime.source.coordinator.SourceCoordinatorContext.lambda$assignSplits$5(SourceCoordinatorContext.java:213)
       at org.apache.flink.runtime.source.coordinator.SourceCoordinatorContext.callInCoordinatorThread(SourceCoordinatorContext.java:428)
       ... 14 more
   Caused by: java.io.UTFDataFormatException: Encoded string is too long: 123214
       at org.apache.flink.core.memory.DataOutputSerializer.writeUTF(DataOutputSerializer.java:257)
       at org.apache.iceberg.flink.source.split.IcebergSourceSplit.serializeV2(IcebergSourceSplit.java:150)
       at org.apache.iceberg.flink.source.split.IcebergSourceSplitSerializer.serialize(IcebergSourceSplitSerializer.java:42)
       at org.apache.iceberg.flink.source.split.IcebergSourceSplitSerializer.serialize(IcebergSourceSplitSerializer.java:25)
       at org.apache.flink.runtime.source.event.AddSplitEvent.<init>(AddSplitEvent.java:44)
       at org.apache.flink.runtime.source.coordinator.SourceCoordinatorContext.lambda$assignSplits$4(SourceCoordinatorContext.java:220)
       ... 17 more
   ```
   
   I am not really sure why it gets too big, but when I looked at the source code
[here](https://github.com/apache/iceberg/blob/27e8c421358378bd80bed8b328d5b69e884b7484/flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/source/split/IcebergSourceSplit.java#L148-L151),
it might be because there are too many file scan tasks in one split, and that
is why this is happening.
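
   For context, a minimal illustration of the failure mode, assuming the
serializer follows the standard `DataOutput.writeUTF` contract of a 16-bit
length prefix (which caps a single encoded string at 65535 bytes):

```python
# Sketch of the size check that trips here; 123214 is the size reported in the
# stack trace above, well past the 65535-byte cap of a 16-bit length prefix.
def write_utf(payload: str) -> bytes:
    encoded = payload.encode("utf-8")
    if len(encoded) > 0xFFFF:
        raise ValueError(f"Encoded string is too long: {len(encoded)}")
    return len(encoded).to_bytes(2, "big") + encoded

write_utf("x" * 123214)  # raises, matching the exception in the log
```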
   
   





Re: [I] how to integrations object storage ceph ? [iceberg]

2024-01-04 Thread via GitHub


jkl0898 commented on issue #7158:
URL: https://github.com/apache/iceberg/issues/7158#issuecomment-1876852333

   > Hi @jkl0898,
   > 
   > I am looking for a solution to use Ceph with Iceberg. Currently I use
MinIO, but we are looking for an alternative solution to replace it. Could you
share technical details on how to configure Ceph with Iceberg? Thanks
   
   Hi, please refer to amogh-jahagirdar's reply above!





Re: [I] Failed to assign splits due to the serialized split size [iceberg]

2024-01-04 Thread via GitHub


nastra commented on issue #9410:
URL: https://github.com/apache/iceberg/issues/9410#issuecomment-1876875354

   @stevenzwu @pvary could you guys take a look please?





Re: [PR] API: New API For sequential / streaming updates [iceberg]

2024-01-04 Thread via GitHub


jasonf20 commented on PR #9323:
URL: https://github.com/apache/iceberg/pull/9323#issuecomment-1876954557

   @rdblue Sure. I added support for setting the sequence number explicitly per
file in `MergingSnapshotProducer`. This was almost supported already (it didn't
support the per-file level for added files). This is then used in a similar way
to how `Rewrite` used it, but the main difference is that we don't set the
sequence number when the user adds the file.

   Instead, we just collect an ordered list of files and only assign the
sequence numbers when `apply` is called. Since `apply` is called with the
`base` metadata, we can calculate the sequence numbers from there. If the
commit fails due to a conflict, `apply` is called again and we can re-calculate
the new sequence numbers. Looking again now, I think I might need to remove the
`requiresApply` boolean from the `apply` method for retries to work properly.
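
   A minimal Python sketch of the deferred-assignment idea described above; the
class and method names are illustrative, not the actual Iceberg API:

```python
# Illustrative sketch only: defer sequence-number assignment until apply(), so
# a retry against newer base metadata simply recomputes the numbers.
from typing import List, Tuple

class StreamingUpdate:
    def __init__(self) -> None:
        self._ordered_files: List[str] = []  # files in user insertion order

    def add_file(self, data_file: str) -> "StreamingUpdate":
        # No sequence number is assigned here; we only remember the order.
        self._ordered_files.append(data_file)
        return self

    def apply(self, base_last_sequence_number: int) -> List[Tuple[str, int]]:
        # Derive sequence numbers from the base metadata at apply time; if the
        # commit conflicts, apply() runs again with the new base and the
        # numbers are recalculated.
        return [
            (data_file, base_last_sequence_number + offset + 1)
            for offset, data_file in enumerate(self._ordered_files)
        ]

update = StreamingUpdate().add_file("a.parquet").add_file("b.parquet")
print(update.apply(base_last_sequence_number=41))  # [('a.parquet', 42), ('b.parquet', 43)]
```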
   
   





Re: [PR] Spark: Add support for reading Iceberg views [iceberg]

2024-01-04 Thread via GitHub


nastra commented on code in PR #9340:
URL: https://github.com/apache/iceberg/pull/9340#discussion_r1441669272


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkView.java:
##
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.source;
+
+import static org.apache.iceberg.TableProperties.FORMAT_VERSION;
+
+import java.util.Map;
+import java.util.Set;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.iceberg.types.Types;
+import org.apache.iceberg.view.BaseView;
+import org.apache.iceberg.view.SQLViewRepresentation;
+import org.apache.iceberg.view.View;
+import org.apache.iceberg.view.ViewOperations;
+import org.apache.spark.sql.types.StructType;
+
+public class SparkView implements org.apache.spark.sql.connector.catalog.View {
+
+  private static final Set<String> RESERVED_PROPERTIES =
+      ImmutableSet.of("provider", "location", FORMAT_VERSION);
+
+  private final View icebergView;
+  private final String catalogName;
+  private StructType lazySchema = null;
+
+  public SparkView(String catalogName, View icebergView) {
+    this.catalogName = catalogName;
+    this.icebergView = icebergView;
+  }
+
+  public View view() {
+    return icebergView;
+  }
+
+  @Override
+  public String name() {
+    return icebergView.toString();
+  }
+
+  @Override
+  public String query() {
+    SQLViewRepresentation sqlRepr = icebergView.sqlFor("spark");
+    Preconditions.checkState(sqlRepr != null, "Cannot load SQL for view %s", name());
+    return sqlRepr.sql();
+  }
+
+  @Override
+  public String currentCatalog() {
+    return null != icebergView.currentVersion().defaultCatalog()
+        ? icebergView.currentVersion().defaultCatalog()
+        : catalogName;
+  }
+
+  @Override
+  public String[] currentNamespace() {
+    return icebergView.currentVersion().defaultNamespace().levels();
+  }
+
+  @Override
+  public StructType schema() {
+    if (null == lazySchema) {
+      this.lazySchema = SparkSchemaUtil.convert(icebergView.schema());
+    }
+
+    return lazySchema;
+  }
+
+  @Override
+  public String[] queryColumnNames() {
+    return new String[0];
+  }
+
+  @Override
+  public String[] columnAliases() {

Review Comment:
   I'll be handling column aliases in the PR that introduces view creation via
SQL.






Re: [PR] Parquet: Support reading INT96 column in row group filter [iceberg]

2024-01-04 Thread via GitHub


manuzhang commented on PR #8988:
URL: https://github.com/apache/iceberg/pull/8988#issuecomment-1877008303

   @nastra can we move forward with this PR?





Re: [PR] Build: Bump actions/labeler from 4 to 5 [iceberg]

2024-01-04 Thread via GitHub


nastra merged PR #9331:
URL: https://github.com/apache/iceberg/pull/9331





Re: [PR] Spark: Add support for reading Iceberg views [iceberg]

2024-01-04 Thread via GitHub


nastra commented on code in PR #9340:
URL: https://github.com/apache/iceberg/pull/9340#discussion_r1441709373


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java:
##
@@ -810,4 +823,123 @@ private Catalog.TableBuilder newBuilder(Identifier ident, Schema schema) {
   public Catalog icebergCatalog() {
     return icebergCatalog;
   }
+
+  @Override
+  public Identifier[] listViews(String... namespace) {
+    if (null != asViewCatalog) {
+      return asViewCatalog.listViews(Namespace.of(namespace)).stream()
+          .map(ident -> Identifier.of(ident.namespace().levels(), ident.name()))
+          .toArray(Identifier[]::new);
+    }
+
+    return new Identifier[0];
+  }
+
+  @Override
+  public View loadView(Identifier ident) throws NoSuchViewException {
+    if (null != asViewCatalog) {
+      try {
+        org.apache.iceberg.view.View view = asViewCatalog.loadView(buildIdentifier(ident));
+        return new SparkView(catalogName, view);
+      } catch (org.apache.iceberg.exceptions.NoSuchViewException e) {
+        throw new NoSuchViewException(ident);
+      }
+    }
+
+    throw new NoSuchViewException(ident);
+  }
+
+  @Override
+  public View createView(
+      Identifier ident,
+      String sql,
+      String currentCatalog,
+      String[] currentNamespace,
+      StructType schema,
+      String[] queryColumnNames,

Review Comment:
   I'll be handling this in the PR that introduces view creation via SQL.






Re: [PR] Partition Evolution [iceberg-python]

2024-01-04 Thread via GitHub


Fokko commented on code in PR #245:
URL: https://github.com/apache/iceberg-python/pull/245#discussion_r1441716417


##
pyiceberg/partitioning.py:
##
@@ -215,3 +230,118 @@ def assign_fresh_partition_spec_ids(spec: PartitionSpec, old_schema: Schema, fre
 )
 )
 return PartitionSpec(*partition_fields, spec_id=INITIAL_PARTITION_SPEC_ID)
+
+T = TypeVar("T")
+class PartitionSpecVisitor(Generic[T], ABC):
+    @abstractmethod
+    def identity(field_id: int, source_name: str, source_id: int) -> T:
+        """
+        Visit identity partition field
+        """
+
+    @abstractmethod
+    def bucket(field_id: int, source_name: str, source_id: int, num_buckets: int) -> T:
+        """
+        Visit bucket partition field
+        """
+
+    @abstractmethod
+    def truncate(field_id: int, source_name: str, source_id: int, width: int) -> T:
+        """
+        Visit truncate partition field
+        """
+
+    @abstractmethod
+    def year(field_id: int, source_name: str, source_id: int) -> T:
+        """
+        Visit year partition field
+        """
+
+    @abstractmethod
+    def month(field_id: int, source_name: str, source_id: int) -> T:
+        """
+        Visit month partition field
+        """
+
+    @abstractmethod
+    def day(field_id: int, source_name: str, source_id: int) -> T:
+        """
+        Visit day partition field
+        """
+
+    @abstractmethod
+    def hour(field_id: int, source_name: str, source_id: int) -> T:
+        """
+        Visit hour partition field
+        """
+
+    @abstractmethod
+    def always_null(field_id: int, source_name: str, source_id: int) -> T:
+        """
+        Visit void partition field
+        """
+
+    @abstractmethod
+    def unknown(field_id: int, source_name: str, source_id: int, transform: str) -> T:
+        """
+        Visit unknown partition field
+        """
+        raise ValueError(f"Unknown transform {transform} is not supported")
+
+class _PartitionNameGenerator(PartitionSpecVisitor[str]):
+    def identity(field_id: int, source_name: str, source_id: int) -> str:
+        return source_name
+
+    def bucket(field_id: int, source_name: str, source_id: int, num_buckets: int) -> str:
+        return source_name + "_bucket_" + num_buckets
+
+    def truncate(field_id: int, source_name: str, source_id: int, width: int) -> str:
+        return source_name + "_trunc_" + width
+
+    def year(field_id: int, source_name: str, source_id: int) -> str:
+        return source_name + "_year"
+
+    def month(field_id: int, source_name: str, source_id: int) -> str:
+        return source_name + "_month"
+
+    def day(field_id: int, source_name: str, source_id: int) -> str:
+        return source_name + "_day"
+
+    def hour(field_id: int, source_name: str, source_id: int) -> str:
+        return source_name + "_hour"
+
+    def always_null(field_id: int, source_name: str, source_id: int) -> str:
+        return source_name + "_null"
+
+R = TypeVar("R")
+
+@staticmethod
+def _visit(spec: PartitionSpec, schema: Schema, visitor: PartitionSpecVisitor[R]) -> [R]:
+    results = []
+    for field in spec.fields:
+        results.append(_visit(schema, field, visitor))
+    return results
+
+@staticmethod
+def _visit(schema: Schema, field: PartitionField, visitor: PartitionSpecVisitor[R]) -> [R]:

Review Comment:
   It would be better to use `singledispatch` here. For example:

   https://github.com/apache/iceberg-python/blob/452238e33f2be86b9076be78db2441aea043afb8/pyiceberg/expressions/visitors.py#L139

   This is faster (which surprised me at first, but it pushes the dispatch down
to the C level) and looks nicer.
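
   A minimal sketch of the `singledispatch` style being suggested; the transform
classes here are simplified stand-ins for pyiceberg's transform types:

```python
# Sketch of a singledispatch-based partition-name visitor: dispatch happens on
# the transform's type instead of a hand-written visitor class hierarchy.
from dataclasses import dataclass
from functools import singledispatch

@dataclass
class BucketTransform:
    num_buckets: int

@dataclass
class YearTransform:
    pass

@singledispatch
def partition_name(transform, source_name: str) -> str:
    raise ValueError(f"Unknown transform {transform} is not supported")

@partition_name.register(BucketTransform)
def _(transform: BucketTransform, source_name: str) -> str:
    return f"{source_name}_bucket_{transform.num_buckets}"

@partition_name.register(YearTransform)
def _(transform: YearTransform, source_name: str) -> str:
    return f"{source_name}_year"

print(partition_name(BucketTransform(16), "id"))  # id_bucket_16
print(partition_name(YearTransform(), "ts"))      # ts_year
```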






Re: [PR] Partition Evolution [iceberg-python]

2024-01-04 Thread via GitHub


Fokko commented on code in PR #245:
URL: https://github.com/apache/iceberg-python/pull/245#discussion_r1441723506


##
pyiceberg/table/__init__.py:
##
@@ -1904,3 +1913,200 @@ def _generate_snapshot_id() -> int:
 snapshot_id = snapshot_id if snapshot_id >= 0 else snapshot_id * -1
 
 return snapshot_id
+
+class UpdateSpec:
+    _table: Table
+    _schema: Schema
+    _spec: PartitionSpec
+    _name_to_field: Dict[str, PartitionField]
+    _name_to_added_field: Dict[str, PartitionField]
+    _transform_to_field: Dict[Tuple[int, str], PartitionField]
+    _transform_to_added_field: Dict[Tuple[int, str], PartitionField]
+    _case_sensitive: bool
+    _adds: List[PartitionField]
+    _deletes: Set[int]
+    _last_assigned_partition_id: int
+    _renames = Dict[str, str]
+
+    def __init__(
+        self,
+        table: Table,
+        transaction: Optional[Transaction] = None,
+        case_sensitive = True
+    ) -> None:
+        self._table = table
+        self._schema = table.schema()
+        self._spec = table.spec()
+        self._name_to_field = {field.name: field for field in self._spec.fields}
+        self._name_to_added_field = {}
+        self._transform_to_field = {(field.sourceId, repr(field.transform)): field for field in self._spec.fields}
+        self._transform_to_added_field = {}
+        self._adds = []
+        self._deletes = {}
+        self._last_assigned_partition_id = table.spec().last_assigned_field_id
+        self._renames = {}
+        self._transaction = transaction
+        self._case_sensitive = case_sensitive
+
+    def add_field(self, name: str, transform: Transform) -> UpdateSpec:
+        ref = Reference(name)
+        bound_ref = ref.bind(self._schema, self._case_sensitive)
+        # verify transform can actually bind it
+        output_type = bound_ref.field.field_type
+        if not transform.can_transform(output_type):
+            raise ValueError(f"{transform} cannot transform {output_type} values from {bound_ref.field.name}")
+        transform_key = (bound_ref.field.field_id, transform)
+        existing_partition_field = self._transform_to_field.get(transform)
+        if existing_partition_field and self._is_duplicate_partition(transform_key[1], existing_partition_field):
+            raise ValueError(f"Duplicate partition field for ${ref.name}=${ref}, ${existing_partition_field} already exists")
+        added = self._transform_to_added_field.get(transform_key)
+        if added:
+            raise ValueError(f"Already added partition {added}")
+        new_field = self._partition_field(transform_key, name)
+        if not new_field.name:
+            new_field.name = _visit(self._schema, new_field, _PartitionNameGenerator())
+
+        self._check_redundant_partitions(new_field)
+        self._transform_to_added_field[transform_key] = new_field
+
+        existing_partition_field = self._name_to_field.get(new_field.name)
+        if existing_partition_field and not new_field.field_id in self._deletes:
+            if isinstance(existing_partition_field.transform, VoidTransform):
+                self.rename_field(existing_partition_field.name, existing_partition_field.name + "_" + existing_partition_field.field_id)
+            else:
+                raise ValueError(f"Cannot add duplicate partition field name: {existing_partition_field.name}")
+
+        self._name_to_added_field[new_field.name] = new_field
+        self._adds.append(new_field)
+        return self
+
+    def remove_field(self, name: str) -> UpdateSpec:
+        added = self._name_to_added_field.get(name)
+        if added:
+            raise ValueError(f"Cannot delete newly added field {name}")
+        renamed = self._renames.get(name)
+        if renamed:
+            raise ValueError(f"Cannot rename and delete field: {name}")
+        field = self._name_to_field.get(name)
+        if not field:
+            raise ValueError(f"No such partition field: {name}")
+
+        self._deletes.add(field.field_id)
+        return self
+
+    def rename_field(self, name: str, new_name: str) -> UpdateSpec:
+        existing_field = self._name_to_field.get(new_name)
+        if existing_field and isinstance(existing_field.transform, VoidTransform):
+            return self.rename_field(name, name + "_" + existing_field.field_id)
+        added = self._name_to_added_field.get(name)
+        if added:
+            raise ValueError(f"Cannot rename recently added partitions")
+        field = self._name_to_field.get(name)
+        if not field:
+            raise ValueError(f"Cannot find partition field {name}")
+        if field.field_id in self._deletes:
+            raise ValueError(f"Cannot delete and rename partition field {name}")
+        self._renames[name] = new_name
+        return self
+
+    def commit(self):

Review Comment:
   For the schema, we have a context-manager, which adds a nice touch to the
API. [the Python example was truncated in the archive]
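
   For reference, a sketch of that context-manager pattern applied to the
partition-spec case; the `update_spec()` entry point and the class below are
assumptions for illustration, not the merged pyiceberg API:

```python
# Hypothetical sketch: commit the accumulated partition-spec changes only when
# the with-block exits cleanly, mirroring the update_schema() pattern.
class UpdateSpecContext:
    def __init__(self, update_spec):
        self._update_spec = update_spec

    def __enter__(self):
        return self._update_spec

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is None:  # only commit when no exception escaped the block
            self._update_spec.commit()

# Intended usage, assuming a table.update_spec() entry point:
#
# with table.update_spec() as update:
#     update.add_field("ts", YearTransform())
#     update.remove_field("legacy_bucket")
```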

Re: [PR] Partition Evolution [iceberg-python]

2024-01-04 Thread via GitHub


Fokko commented on code in PR #245:
URL: https://github.com/apache/iceberg-python/pull/245#discussion_r1441724507


   (This message quotes the same `UpdateSpec` hunk from
`pyiceberg/table/__init__.py` shown above; the review comment itself was
truncated in the archive.)

Re: [PR] Partition Evolution [iceberg-python]

2024-01-04 Thread via GitHub


Fokko commented on code in PR #245:
URL: https://github.com/apache/iceberg-python/pull/245#discussion_r1441725509


   (This message quotes the same `UpdateSpec` hunk from
`pyiceberg/table/__init__.py` shown above; the review comment itself was
truncated in the archive.)

Re: [PR] Python: Add support for Python 3.12 [iceberg-python]

2024-01-04 Thread via GitHub


MehulBatra commented on PR #35:
URL: https://github.com/apache/iceberg-python/pull/35#issuecomment-1877068534

   Looking into this





Re: [PR] Partition Evolution [iceberg-python]

2024-01-04 Thread via GitHub


Fokko commented on code in PR #245:
URL: https://github.com/apache/iceberg-python/pull/245#discussion_r1441727718


   (This message quotes the same `UpdateSpec` hunk from
`pyiceberg/table/__init__.py` shown above; the review comment itself was
truncated in the archive.)

Re: [PR] Core: Add a util to read and write partition stats [iceberg]

2024-01-04 Thread via GitHub


ajantha-bhat commented on code in PR #9170:
URL: https://github.com/apache/iceberg/pull/9170#discussion_r1407500365


##
core/src/main/java/org/apache/iceberg/Partition.java:
##
@@ -0,0 +1,355 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.Objects;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+import org.apache.iceberg.types.Types;
+
+public class Partition implements IndexedRecord {
+  private PartitionData partitionData;

Review Comment:
   Fields are based on the spec. 
   
https://github.com/apache/iceberg/blob/main/format/spec.md#partition-statistics-file



##
core/src/main/java/org/apache/iceberg/Partition.java:
##
@@ -0,0 +1,355 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.Objects;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+import org.apache.iceberg.types.Types;
+
+public class Partition implements IndexedRecord {
+  private PartitionData partitionData;
+  private int specId;
+  private long dataRecordCount;
+  private int dataFileCount;
+  private long dataFileSizeInBytes;
+  private long posDeleteRecordCount;
+  private int posDeleteFileCount;
+  private long eqDeleteRecordCount;
+  private int eqDeleteFileCount;
+  // Optional accurate count of records in a partition after applying the delete files if any
+  private long totalRecordCount;
+  // Commit time of snapshot that last updated this partition
+  private long lastUpdatedAt;
+  // ID of snapshot that last updated this partition
+  private long lastUpdatedSnapshotId;
+
+  public enum Column {
+    PARTITION_DATA,
+    SPEC_ID,
+    DATA_RECORD_COUNT,
+    DATA_FILE_COUNT,
+    DATA_FILE_SIZE_IN_BYTES,
+    POSITION_DELETE_RECORD_COUNT,
+    POSITION_DELETE_FILE_COUNT,
+    EQUALITY_DELETE_RECORD_COUNT,
+    EQUALITY_DELETE_FILE_COUNT,
+    TOTAL_RECORD_COUNT,
+    LAST_UPDATED_AT,
+    LAST_UPDATED_SNAPSHOT_ID
+  }
+
+  public Partition() {}
+
+  public static Builder builder() {
+    return new Builder();
+  }
+
+  public PartitionData partitionData() {
+    return partitionData;
+  }
+
+  public int specId() {
+    return specId;
+  }
+
+  public long dataRecordCount() {
+    return dataRecordCount;
+  }
+
+  public int dataFileCount() {
+    return dataFileCount;
+  }
+
+  public long dataFileSizeInBytes() {
+    return dataFileSizeInBytes;
+  }
+
+  public Long posDeleteRecordCount() {
+    return posDeleteRecordCount;
+  }
+
+  public Integer posDeleteFileCount() {
+    return posDeleteFileCount;
+  }
+
+  public Long eqDeleteRecordCount() {
+    return eqDeleteRecordCount;
+  }
+
+  public Integer eqDeleteFileCount() {
+    return eqDeleteFileCount;
+  }
+
+  public Long totalRecordCount() {
+    return totalRecordCount;
+  }
+
+  public Long lastUpdatedAt() {
+    return lastUpdatedAt;
+  }
+
+  public Long lastUpdatedSnapshotId() {
+    return lastUpdatedSnapshotId;
+  }
+
+  @Override
+  public void put(int i, Object v) {

Review Comment:
   Extends `IndexedRecord` to use the existing Parquet/ORC Avro readers and
writers from the `iceberg-parquet` and `iceberg-orc` modules.



   [third quoted hunk and its review comment truncated in the archive]

Re: [PR] Core: Add a util to read and write partition stats [iceberg]

2024-01-04 Thread via GitHub


ajantha-bhat commented on code in PR #9170:
URL: https://github.com/apache/iceberg/pull/9170#discussion_r1407505119


##
core/src/main/java/org/apache/iceberg/Partition.java:
##
@@ -0,0 +1,355 @@
   (Identical to the `Partition.java` hunk quoted in the previous message,
through the getters; the quote continues below with the `put`/`get`
implementations.)
+
+  @Override
+  public void put(int i, Object v) {
+    switch (i) {
+      case 0:
+        this.partitionData = (PartitionData) v;
+        return;
+      case 1:
+        this.specId = (int) v;
+        return;
+      case 2:
+        this.dataRecordCount = (long) v;
+        return;
+      case 3:
+        this.dataFileCount = (int) v;
+        return;
+      case 4:
+        this.dataFileSizeInBytes = (long) v;
+        return;
+      case 5:
+        this.posDeleteRecordCount = (long) v;
+        return;
+      case 6:
+        this.posDeleteFileCount = (int) v;
+        return;
+      case 7:
+        this.eqDeleteRecordCount = (long) v;
+        return;
+      case 8:
+        this.eqDeleteFileCount = (int) v;
+        return;
+      case 9:
+        this.totalRecordCount = (long) v;
+        return;
+      case 10:
+        this.lastUpdatedAt = (long) v;
+        return;
+      case 11:
+        this.lastUpdatedSnapshotId = (long) v;
+        return;
+      default:
+        throw new UnsupportedOperationException("Unknown field ordinal: " + i);
+    }
+  }
+
+  @Override
+  public Object get(int i) {
+    switch (i) {
+      case 0:
+        return partitionData;
+      case 1:
+        return specId;
+      case 2:
+        return dataRecordCount;
+      case 3:
+        return dataFileCount;
+      case 4:
+        return dataFileSizeInBytes;
+      case 5:
+        return posDeleteRecordCount;
+      case 6:
+        return posDeleteFileCount;
+      case 7:
+        return eqDeleteRecordCount;
+      case 8:
+        return eqDeleteFileCount;
+      case 9:
+        return totalRecordCount;
+      case 10:
+        return lastUpdatedAt;
+      case 11:
+        return lastUpdatedSnapshotId;

   [remainder of the message truncated in the archive]

Re: [PR] Core: Add a util to read and write partition stats [iceberg]

2024-01-04 Thread via GitHub


ajantha-bhat commented on code in PR #9170:
URL: https://github.com/apache/iceberg/pull/9170#discussion_r1441737877


##
core/src/main/java/org/apache/iceberg/PartitionEntry.java:
##
@@ -0,0 +1,361 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.Objects;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+import org.apache.iceberg.types.Types;
+
+public class PartitionEntry implements IndexedRecord {
+  private PartitionData partitionData;

Review Comment:
   Fields are based on the spec.
   
https://github.com/apache/iceberg/blob/main/format/spec.md#partition-statistics-file



##
core/src/main/java/org/apache/iceberg/PartitionEntry.java:
##
@@ -0,0 +1,361 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.Objects;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+import org.apache.iceberg.types.Types;
+
+public class PartitionEntry implements IndexedRecord {
+  private PartitionData partitionData;
+  private int specId;
+  private long dataRecordCount;
+  private int dataFileCount;
+  private long dataFileSizeInBytes;
+  private long posDeleteRecordCount;
+  private int posDeleteFileCount;
+  private long eqDeleteRecordCount;
+  private int eqDeleteFileCount;
+  // Optional accurate count of records in a partition after applying the delete files, if any
+  private long totalRecordCount;
+  // Commit time of snapshot that last updated this partition
+  private long lastUpdatedAt;
+  // ID of snapshot that last updated this partition
+  private long lastUpdatedSnapshotId;
+
+  public enum Column {
+PARTITION_DATA,
+SPEC_ID,
+DATA_RECORD_COUNT,
+DATA_FILE_COUNT,
+DATA_FILE_SIZE_IN_BYTES,
+POSITION_DELETE_RECORD_COUNT,
+POSITION_DELETE_FILE_COUNT,
+EQUALITY_DELETE_RECORD_COUNT,
+EQUALITY_DELETE_FILE_COUNT,
+TOTAL_RECORD_COUNT,
+LAST_UPDATED_AT,
+LAST_UPDATED_SNAPSHOT_ID
+  }
+
+  private PartitionEntry() {}
+
+  public static Builder builder() {
+return new Builder();
+  }
+
+  public PartitionData partitionData() {
+return partitionData;
+  }
+
+  public int specId() {
+return specId;
+  }
+
+  public long dataRecordCount() {
+return dataRecordCount;
+  }
+
+  public int dataFileCount() {
+return dataFileCount;
+  }
+
+  public long dataFileSizeInBytes() {
+return dataFileSizeInBytes;
+  }
+
+  public long posDeleteRecordCount() {
+return posDeleteRecordCount;
+  }
+
+  public int posDeleteFileCount() {
+return posDeleteFileCount;
+  }
+
+  public long eqDeleteRecordCount() {
+return eqDeleteRecordCount;
+  }
+
+  public int eqDeleteFileCount() {
+return eqDeleteFileCount;
+  }
+
+  public long totalRecordCount() {
+return totalRecordCount;
+  }
+
+  public long lastUpdatedAt() {
+return lastUpdatedAt;
+  }
+
+  public long lastUpdatedSnapshotId() {
+return lastUpdatedSnapshotId;
+  }
+
+  @Override
+  public void put(int i, Object v) {
+switch (i) {
+  case 0:
+this.partitionData = (PartitionData) v;
+return;
+  case 1:
+this.specId = (int) v;
+return;
+  case 2:
+this.dataRecordCount = (long) v;
+return;
+  case 3:
+this.dataFileCount = (int) v;

Re: [PR] Core: Add a util to read and write partition stats [iceberg]

2024-01-04 Thread via GitHub


ajantha-bhat commented on code in PR #9170:
URL: https://github.com/apache/iceberg/pull/9170#discussion_r1441749823


##
core/src/main/java/org/apache/iceberg/PartitionEntry.java:
##
@@ -0,0 +1,361 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.Objects;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.iceberg.avro.AvroSchemaUtil;
+import org.apache.iceberg.types.Types;
+
+public class PartitionEntry implements IndexedRecord {
+  private PartitionData partitionData;
+  private int specId;
+  private long dataRecordCount;
+  private int dataFileCount;
+  private long dataFileSizeInBytes;
+  private long posDeleteRecordCount;
+  private int posDeleteFileCount;
+  private long eqDeleteRecordCount;
+  private int eqDeleteFileCount;
+  // Optional accurate count of records in a partition after applying the delete files, if any
+  private long totalRecordCount;
+  // Commit time of snapshot that last updated this partition
+  private long lastUpdatedAt;
+  // ID of snapshot that last updated this partition
+  private long lastUpdatedSnapshotId;
+
+  public enum Column {
+PARTITION_DATA,
+SPEC_ID,
+DATA_RECORD_COUNT,
+DATA_FILE_COUNT,
+DATA_FILE_SIZE_IN_BYTES,
+POSITION_DELETE_RECORD_COUNT,
+POSITION_DELETE_FILE_COUNT,
+EQUALITY_DELETE_RECORD_COUNT,
+EQUALITY_DELETE_FILE_COUNT,
+TOTAL_RECORD_COUNT,
+LAST_UPDATED_AT,
+LAST_UPDATED_SNAPSHOT_ID
+  }
+
+  private PartitionEntry() {}
+
+  public static Builder builder() {
+return new Builder();
+  }
+
+  public PartitionData partitionData() {
+return partitionData;
+  }
+
+  public int specId() {
+return specId;
+  }
+
+  public long dataRecordCount() {
+return dataRecordCount;
+  }
+
+  public int dataFileCount() {
+return dataFileCount;
+  }
+
+  public long dataFileSizeInBytes() {
+return dataFileSizeInBytes;
+  }
+
+  public long posDeleteRecordCount() {
+return posDeleteRecordCount;
+  }
+
+  public int posDeleteFileCount() {
+return posDeleteFileCount;
+  }
+
+  public long eqDeleteRecordCount() {
+return eqDeleteRecordCount;
+  }
+
+  public int eqDeleteFileCount() {
+return eqDeleteFileCount;
+  }
+
+  public long totalRecordCount() {
+return totalRecordCount;
+  }
+
+  public long lastUpdatedAt() {
+return lastUpdatedAt;
+  }
+
+  public long lastUpdatedSnapshotId() {
+return lastUpdatedSnapshotId;
+  }
+
+  @Override
+  public void put(int i, Object v) {
+switch (i) {
+  case 0:
+this.partitionData = (PartitionData) v;
+return;
+  case 1:
+this.specId = (int) v;
+return;
+  case 2:
+this.dataRecordCount = (long) v;
+return;
+  case 3:
+this.dataFileCount = (int) v;
+return;
+  case 4:
+this.dataFileSizeInBytes = (long) v;
+return;
+  case 5:
+this.posDeleteRecordCount = (long) v;
+return;
+  case 6:
+this.posDeleteFileCount = (int) v;
+return;
+  case 7:
+this.eqDeleteRecordCount = (long) v;
+return;
+  case 8:
+this.eqDeleteFileCount = (int) v;
+return;
+  case 9:
+this.totalRecordCount = (long) v;
+return;
+  case 10:
+this.lastUpdatedAt = (long) v;
+return;
+  case 11:
+this.lastUpdatedSnapshotId = (long) v;
+return;
+  default:
+throw new UnsupportedOperationException("Unknown field ordinal: " + i);
+}
+  }
+
+  @Override
+  public Object get(int i) {
+switch (i) {
+  case 0:
+return partitionData;
+  case 1:
+return specId;
+  case 2:
+return dataRecordCount;
+  case 3:
+return dataFileCount;
+  case 4:
+return dataFileSizeInBytes;
+  case 5:
+return posDeleteRecordCount;
+  case 6:
+return posDeleteFileCount;
+  case 7:
+return eqDeleteRecordCount;
+  case 8:
+return eqDeleteFileCount;
+  case 9:
+return totalRecordCount;
+  case 10:
+return lastUpdatedAt;
+  case 11:
+return lastUpdatedSnapshotId;

Re: [PR] Deliver key metadata for encryption of data files [iceberg]

2024-01-04 Thread via GitHub


ggershinsky commented on code in PR #9359:
URL: https://github.com/apache/iceberg/pull/9359#discussion_r1441763348


##
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkAppenderFactory.java:
##
@@ -161,7 +162,12 @@ private StructType lazyPosDeleteSparkType() {
   }
 
   @Override
-  public FileAppender<InternalRow> newAppender(OutputFile file, FileFormat fileFormat) {
+  public FileAppender<InternalRow> newAppender(OutputFile outputFile, FileFormat format) {
+    return newAppender(EncryptionUtil.plainAsEncryptedOutput(outputFile), format);
+  }
+
+  @Override
+  public FileAppender<InternalRow> newAppender(EncryptedOutputFile file, FileFormat fileFormat) {

Review Comment:
   Sounds good, I'll move them to 3.5.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Flink: Disable classloader check in TestIcebergSourceWithWatermarkExtractor to fix flakiness [iceberg]

2024-01-04 Thread via GitHub


pvary merged PR #9408:
URL: https://github.com/apache/iceberg/pull/9408


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Flink: Disable classloader check in TestIcebergSourceWithWatermarkExtractor to fix flakiness [iceberg]

2024-01-04 Thread via GitHub


pvary commented on PR #9408:
URL: https://github.com/apache/iceberg/pull/9408#issuecomment-1877190593

   Thanks @manuzhang for the fix.
   CC: @stevenzwu


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Use SupportsPrefixOperations for Remove OrphanFile Procedure [iceberg]

2024-01-04 Thread via GitHub


domonkosbalogh-seon commented on PR #7914:
URL: https://github.com/apache/iceberg/pull/7914#issuecomment-1877212931

   Ran into a similar issue (same as in https://github.com/apache/iceberg/issues/8368) using the Glue Catalog. Is there maybe a workaround for this, or would this PR be the only fix?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Failed to assign splits due to the serialized split size [iceberg]

2024-01-04 Thread via GitHub


pvary commented on issue #9410:
URL: https://github.com/apache/iceberg/issues/9410#issuecomment-1877233309

   @stevenzwu: After a quick check, I have found this:
   
https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-core/src/main/java/org/apache/flink/core/memory/DataOutputSerializer.java#L256-L260
   ```
   if (utflen > 65535) {
       throw new UTFDataFormatException("Encoded string is too long: " + utflen);
   } else if (this.position > this.buffer.length - utflen - 2) {
       resize(utflen + 2);
   }
   ```
   
   This means that anything above 64k cannot be serialized by `DataOutputSerializer.writeUTF`, which seems like a rather arbitrary limit to me.
   
   We could use `DataOutputSerializer.writeChars`, which uses `DataOutputSerializer.writeChar`. The downside is that it is less efficient for simple chars (between `0x0001` and `0x007F`). See:
https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-core/src/main/java/org/apache/flink/core/memory/DataOutputSerializer.java#L245C1-L254C10
   ```
   for (int i = 0; i < strlen; i++) {
       c = str.charAt(i);
       if ((c >= 0x0001) && (c <= 0x007F)) {
           utflen++;
       } else if (c > 0x07FF) {
           utflen += 3;
       } else {
           utflen += 2;
       }
   }
   ```
   
   The upside is that it should work regardless of the size of the string.
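   
   For illustration, a minimal sketch of a length-prefixed alternative (my assumption, not a fix that has been agreed on): writing the string's UTF-8 bytes behind a 4-byte length prefix sidesteps `writeUTF`'s 2-byte length field entirely, at the cost of a slightly larger header:
   
   ```java
   import java.io.IOException;
   import java.nio.charset.StandardCharsets;
   import org.apache.flink.core.memory.DataInputDeserializer;
   import org.apache.flink.core.memory.DataOutputSerializer;

   public class LongStringSerde {
     // Write the string as raw UTF-8 behind a 4-byte length prefix, avoiding
     // writeUTF's 65535-byte modified-UTF-8 ceiling.
     public static void writeLongUTF(DataOutputSerializer out, String value) throws IOException {
       byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
       out.writeInt(bytes.length);
       out.write(bytes);
     }

     public static String readLongUTF(DataInputDeserializer in) throws IOException {
       byte[] bytes = new byte[in.readInt()];
       in.readFully(bytes);
       return new String(bytes, StandardCharsets.UTF_8);
     }
   }
   ```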


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Set log level to WARN for rewrite task failure with partial progress [iceberg]

2024-01-04 Thread via GitHub


manuzhang commented on PR #9400:
URL: https://github.com/apache/iceberg/pull/9400#issuecomment-1877320865

   @ajantha-bhat please help review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Create JUnit5 version of TableTestBase [iceberg]

2024-01-04 Thread via GitHub


lisirrx commented on PR #9217:
URL: https://github.com/apache/iceberg/pull/9217#issuecomment-1877336341

   @nastra Sorry for continuing this work so late. I got COVID last month and was working on my graduate school applications.
   I have added the `TestBase` class and given 2 examples using it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2024-01-04 Thread via GitHub


pvary commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1441980492


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java:
##
@@ -0,0 +1,287 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.util.Collections;
+import java.util.Locale;
+import java.util.Objects;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.IMetaStoreClient;
+import org.apache.hadoop.hive.metastore.TableType;
+import org.apache.hadoop.hive.metastore.api.InvalidObjectException;
+import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.iceberg.BaseMetastoreCatalog;
+import org.apache.iceberg.BaseMetastoreTableOperations;
+import org.apache.iceberg.ClientPool;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.CommitFailedException;
+import org.apache.iceberg.exceptions.CommitStateUnknownException;
+import org.apache.iceberg.exceptions.NoSuchIcebergViewException;
+import org.apache.iceberg.exceptions.NoSuchViewException;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.util.MetastoreOperationsUtil;
+import org.apache.iceberg.view.BaseViewOperations;
+import org.apache.iceberg.view.ViewMetadata;
+import org.apache.thrift.TException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/** Hive implementation of Iceberg ViewOperations. */
+final class HiveViewOperations extends BaseViewOperations implements HiveOperationsBase {
+  private static final Logger LOG = LoggerFactory.getLogger(HiveViewOperations.class);
+
+  private final String fullName;
+  private final String database;
+  private final String viewName;
+  private final FileIO fileIO;
+  private final ClientPool<IMetaStoreClient, TException> metaClients;
+  private final long maxHiveTablePropertySize;
+
+  HiveViewOperations(
+      Configuration conf,
+      ClientPool<IMetaStoreClient, TException> metaClients,
+      FileIO fileIO,
+      String catalogName,
+      TableIdentifier viewIdentifier) {
+    String dbName = viewIdentifier.namespace().level(0);
+    this.metaClients = metaClients;
+    this.fileIO = fileIO;
+    this.fullName = BaseMetastoreCatalog.fullTableName(catalogName, viewIdentifier);
+    this.database = dbName;
+    this.viewName = viewIdentifier.name();
+    this.maxHiveTablePropertySize =
+        conf.getLong(HIVE_TABLE_PROPERTY_MAX_SIZE, HIVE_TABLE_PROPERTY_MAX_SIZE_DEFAULT);
+  }
+
+  @Override
+  public void doRefresh() {
+    String metadataLocation = null;
+    Table table = null;
+
+    try {
+      table = metaClients.run(client -> client.getTable(database, viewName));
+      HiveOperationsBase.validateTableOrViewIsIceberg(table, fullName);
+      metadataLocation =
+          table.getParameters().get(BaseMetastoreTableOperations.METADATA_LOCATION_PROP);
+
+    } catch (NoSuchObjectException e) {
+      if (currentMetadataLocation() != null) {
+        throw new NoSuchViewException("View does not exist: %s.%s", database, viewName);
+      }
+    } catch (TException e) {
+      String errMsg =
+          String.format("Failed to get view info from metastore %s.%s", database, viewName);
+      throw new RuntimeException(errMsg, e);
+    } catch (InterruptedException e) {
+      Thread.currentThread().interrupt();
+      throw new RuntimeException("Interrupted during refresh", e);
+    }
+
+    if (table != null && !table.getTableType().equalsIgnoreCase(TableType.VIRTUAL_VIEW.name())) {
+      disableRefresh();
+    } else {
+      refreshFromMetadataLocation(metadataLocation);
+    }
+  }
+
+  @SuppressWarnings("checkstyle:CyclomaticComplexity")

Review Comment:
   Would it be worth refactoring this method to decrease the `CyclomaticComplexity`?



##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
##
@@ -264,6 +257,159 @@ public void renameTable(TableIdentifier from, TableIdentifier originalTo) {
 }
   }
 
+  @Override
+  public boolean dropView(TableIdentifier identifier) {

[I] Iceberg Glue Concurrent Update can result in missing metadata_location [iceberg]

2024-01-04 Thread via GitHub


shaeqahmed opened a new issue, #9411:
URL: https://github.com/apache/iceberg/issues/9411

   ### Apache Iceberg version
   
   1.4.2 (latest release)
   
   ### Query engine
   
   None
   
   ### Please describe the bug 🐞
   
   A similar issue that I found, which was supposed to be fixed in an older version:
https://github.com/apache/iceberg/issues/7151
   
   We have Java Iceberg code that processes from a FIFO queue and commits to Iceberg in a single-threaded fashion. I have confirmed that we are not committing to a table from more than one place at the same time. However, when doing a few commits back to back, at some point we encountered the following WARN log indicating that Glue detected a concurrent update and was retrying:
   
   ```
   Retrying task after failure: Cannot commit 
glue_catalog.matano.cloudflare_http_request because Glue detected concurrent 
update org.apache.iceberg.exceptions.CommitFailedException: Cannot commit 
glue_catalog.matano.cloudflare_http_request because Glue detected concurrent 
update at 
org.apache.iceberg.aws.glue.GlueTableOperations.handleAWSExceptions(GlueTableOperations.java:355)
 ~[output.jar:?] at 
org.apache.iceberg.aws.glue.GlueTableOperations.doCommit(GlueTableOperations.java:180)
 
   ...
   ```
   
   But immediately after this log, while attempting to refresh the Iceberg metadata, there is an Iceberg NotFoundException, as the current metadata location does not exist or no longer exists.
   
   ```
   INFO BaseMetastoreTableOperations - Refreshing table metadata from new 
version: 
s3://redacted-bucket/lake/cloudflare_http_request/metadata/xxx-e3e8a38dbdc4.metadata.json
   
   ERROR IcebergMetadataWriter - 
org.apache.iceberg.exceptions.NotFoundException: Location does not exist: 
s3://redacted-bucket/lake/cloudflare_http_request/metadata/xxx-e3e8a38dbdc4.metadata.json
   ```
   
   **This has resulted in our table becoming corrupt and the availability of our data lake service being affected until we manually fixed the table by referencing the Glue `previous_metadata_location` and overriding the invalid current `metadata_location` with it.**
   
   It looks to me like a CommitFailedException (CFE) is retried internally and in any case should not result in a corrupt table, even if all retries fail. Our code looks as follows; we catch all exceptions:
   
   ```
   // tableObj is our class, and a thin wrapper containing the Iceberg Java Table class
   
   logger.info("Committing for tables: ${tableObjs.keys}")
   start = System.currentTimeMillis()
   runBlocking {
       for (tableObj in tableObjs.values) {
           launch(Dispatchers.IO) {
               try {
                   if (tableObj.isInitalized()) {
                       tableObj.getAppendFiles().commit()
                   }
               } catch (e: Exception) {
                   logger.error(e.message)
                   e.printStackTrace()
                   failures.addAll(tableObj.sqsMessageIds)
               }
           }
       }
   }
   
   logger.info("Committed tables in ${System.currentTimeMillis() - start} ms")
   ```
   
   Is this a bug in the Glue Iceberg code, or how should we protect ourselves 
from a situation where the Iceberg table is left pointing to an invalid 
location because of failed commits due to concurrent modifications thrown by 
Glue?
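   
   As an editorial aside, one defensive pattern worth sketching (a hedged sketch, not part of the original report): treat `CommitStateUnknownException` differently from a clean `CommitFailedException`, since only the former means the commit may have landed on the catalog side despite the error:
   
   ```java
   import org.apache.iceberg.AppendFiles;
   import org.apache.iceberg.Table;
   import org.apache.iceberg.exceptions.CommitFailedException;
   import org.apache.iceberg.exceptions.CommitStateUnknownException;

   class DefensiveCommit {
     // Hedged sketch: `table` and `appendFiles` stand in for the wrapper above.
     static void commitDefensively(Table table, AppendFiles appendFiles) {
       try {
         appendFiles.commit();
       } catch (CommitStateUnknownException e) {
         // The commit may actually have succeeded; do not clean up metadata or
         // retry blindly. Refresh and inspect the current state first.
         table.refresh();
         // ... compare table.currentSnapshot() with the expected result ...
       } catch (CommitFailedException e) {
         // Clean failure: the table metadata was not changed, so it is safe to
         // rebuild the append from a fresh base and retry.
       }
     }
   }
   ```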
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark: Add support for reading Iceberg views [iceberg]

2024-01-04 Thread via GitHub


rdblue merged PR #9340:
URL: https://github.com/apache/iceberg/pull/9340


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark: Add support for reading Iceberg views [iceberg]

2024-01-04 Thread via GitHub


rdblue commented on PR #9340:
URL: https://github.com/apache/iceberg/pull/9340#issuecomment-1877413431

   Merged! Thanks for getting this working @nastra!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Deliver key metadata for encryption of data files [iceberg]

2024-01-04 Thread via GitHub


rdblue merged PR #9359:
URL: https://github.com/apache/iceberg/pull/9359


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Deliver key metadata for encryption of data files [iceberg]

2024-01-04 Thread via GitHub


rdblue commented on PR #9359:
URL: https://github.com/apache/iceberg/pull/9359#issuecomment-1877443032

   Merged! Thanks, @ggershinsky for getting this done!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Flink: Create JUnit5 version of TestFlinkScan [iceberg]

2024-01-04 Thread via GitHub


pvary commented on PR #9185:
URL: https://github.com/apache/iceberg/pull/9185#issuecomment-1877465254

   Do we have a way to check whether the new tests ran on the CI, with the correct parameters?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Flink: Create JUnit5 version of TestFlinkScan [iceberg]

2024-01-04 Thread via GitHub


nastra commented on PR #9185:
URL: https://github.com/apache/iceberg/pull/9185#issuecomment-1877466639

   > Do we have a way to check whether the new tests ran on the CI, with the correct parameters?
   
   I don't think so, but I verified them all locally.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core, Spark 3.5: Parallelize reading of deletes and cache them on executors [iceberg]

2024-01-04 Thread via GitHub


aokolnychyi commented on code in PR #8755:
URL: https://github.com/apache/iceberg/pull/8755#discussion_r1442126743


##
core/src/main/java/org/apache/iceberg/deletes/EmptyPositionDeleteIndex.java:
##
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.deletes;
+
+class EmptyPositionDeleteIndex implements PositionDeleteIndex {
+
+  private static final EmptyPositionDeleteIndex INSTANCE = new EmptyPositionDeleteIndex();
+
+  private EmptyPositionDeleteIndex() {}
+
+  static EmptyPositionDeleteIndex get() {
+return INSTANCE;
+  }
+
+  @Override
+  public void delete(long position) {
+throw new UnsupportedOperationException("Cannot modify " + getClass().getName());
+  }
+
+  @Override
+  public void delete(long posStart, long posEnd) {
+throw new UnsupportedOperationException("Cannot modify " + getClass().getName());
+  }
+
+  @Override
+  public boolean isDeleted(long position) {
+return false;
+  }
+
+  @Override
+  public boolean isEmpty() {
+return true;
+  }
+
+  @Override
+  public String toString() {
+return "PositionDeleteIndex{}";

Review Comment:
   Agreed, it would be nice to have a reasonable `toString()` implementation in the others as well.
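   
   For illustration, a hedged sketch of what such a `toString()` might look like in a bitmap-backed implementation (the class and field names here are assumptions, not the PR's actual code):
   
   ```java
   import org.roaringbitmap.longlong.Roaring64Bitmap;

   // Hypothetical sketch: report the cardinality instead of the bare class name.
   class BitmapPositionDeleteIndexSketch {
     private final Roaring64Bitmap roaring64Bitmap = new Roaring64Bitmap();

     @Override
     public String toString() {
       return "PositionDeleteIndex{cardinality=" + roaring64Bitmap.getLongCardinality() + "}";
     }
   }
   ```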



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark: Support renaming views [iceberg]

2024-01-04 Thread via GitHub


rdblue commented on code in PR #9343:
URL: https://github.com/apache/iceberg/pull/9343#discussion_r1442151540


##
spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveViews.scala:
##
@@ -53,6 +57,17 @@ case class ResolveViews(spark: SparkSession) extends Rule[LogicalPlan] with LookupCatalog
   loadView(catalog, ident)
 .map(createViewRelation(parts, _))
 .getOrElse(u)
+
+case u@UnresolvedTableOrView(CatalogAndIdentifier(catalog, ident), _, _) =>
+  loadView(catalog, ident)
+.map(_ => ResolvedV2View(catalog.asViewCatalog, ident))
+.getOrElse(u)
+
+case RenameTable(ResolvedV2View(oldCatalog, oldIdent), CatalogAndIdentifier(newCatalog, newIdent), true) =>

Review Comment:
   What is `true`? Can we add a name to this so that it is more readable?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark: Support renaming views [iceberg]

2024-01-04 Thread via GitHub


rdblue commented on code in PR #9343:
URL: https://github.com/apache/iceberg/pull/9343#discussion_r1442152596


##
spark/v3.5/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveViews.scala:
##
@@ -53,6 +57,17 @@ case class ResolveViews(spark: SparkSession) extends Rule[LogicalPlan] with LookupCatalog
   loadView(catalog, ident)
 .map(createViewRelation(parts, _))
 .getOrElse(u)
+
+case u@UnresolvedTableOrView(CatalogAndIdentifier(catalog, ident), _, _) =>
+  loadView(catalog, ident)
+.map(_ => ResolvedV2View(catalog.asViewCatalog, ident))
+.getOrElse(u)
+
+case RenameTable(ResolvedV2View(oldCatalog, oldIdent), CatalogAndIdentifier(newCatalog, newIdent), true) =>

Review Comment:
   Is it possible for the rule on line 61 to resolve the multi-part `to` 
identifier before this rule runs?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark: Support renaming views [iceberg]

2024-01-04 Thread via GitHub


rdblue commented on code in PR #9343:
URL: https://github.com/apache/iceberg/pull/9343#discussion_r1442153222


##
spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestViews.java:
##
@@ -590,10 +591,7 @@ public void fullFunctionIdentifier() {
   @Test
   public void fullFunctionIdentifierNotRewrittenLoadFailure() {
 String viewName = "fullFunctionIdentifierNotRewrittenLoadFailure";
-String sql =
-String.format(
-"SELECT spark_catalog.system.bucket(100, 'a') AS bucket_result, 'a' AS value",
-catalogName);
+String sql = "SELECT spark_catalog.system.bucket(100, 'a') AS bucket_result, 'a' AS value";

Review Comment:
   :sweat_smile: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] AWS: Add S3 Access Grants Integration [iceberg]

2024-01-04 Thread via GitHub


jackye1995 merged PR #9385:
URL: https://github.com/apache/iceberg/pull/9385


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] AWS: Add S3 Access Grants Integration [iceberg]

2024-01-04 Thread via GitHub


jackye1995 commented on PR #9385:
URL: https://github.com/apache/iceberg/pull/9385#issuecomment-1877659221

   Seems like this is not getting much traction from other reviewers. Given it is a pretty straightforward plugin integration, I will go ahead and merge it. We can fix things later if necessary. This also opens the question of whether there are other valuable plugins we should consider integrating; we can do a search afterwards. Thanks for the work!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Set log level to WARN for rewrite task failure with partial progress [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar commented on code in PR #9400:
URL: https://github.com/apache/iceberg/pull/9400#discussion_r1442227949


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java:
##
@@ -345,7 +345,7 @@ private Result doExecuteWithPartialProgress(
 .noRetry()
 .onFailure(
 (fileGroup, exception) -> {
-  LOG.error("Failure during rewrite group {}", fileGroup.info(), exception);
+  LOG.warn("Failure during rewrite group {}", fileGroup.info(), exception);

Review Comment:
   Hm, I think I get the rationale: during partial progress it could be too noisy if we error-logged every failure. But the issue is, say most of the commit tasks are failures and we only see warn logs; we would not really see any indication in the logs of which partitions failed (at least I don't think so, based off reading the code). The only case where we error-log is when all tasks fail, which seems like insufficient logging in the failure case.
   
   I actually think error is fine here. Are you seeing a lot of noisiness from this in any systems you may be running?
   
   I also noticed that the non-partial-progress case is also warn, but tbh that also doesn't seem right. What do you think @manuzhang, also cc @szehon-ho @RussellSpitzer



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Set log level to WARN for rewrite task failure with partial progress [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar commented on code in PR #9400:
URL: https://github.com/apache/iceberg/pull/9400#discussion_r1442227949


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java:
##
@@ -345,7 +345,7 @@ private Result doExecuteWithPartialProgress(
 .noRetry()
 .onFailure(
 (fileGroup, exception) -> {
-  LOG.error("Failure during rewrite group {}", fileGroup.info(), exception);
+  LOG.warn("Failure during rewrite group {}", fileGroup.info(), exception);

Review Comment:
   Hm, I think I get the rationale: during partial progress it could be too noisy if we error-logged every failure. But the issue is, say most of the commit tasks are failures and we only see warn logs, which are easily masked. The only case where we error-log is when all tasks fail, which seems like insufficient logging in the failure case.
   
   I actually think error is fine here. Are you seeing a lot of noisiness from this in any systems you may be running?
   
   I also noticed that the non-partial-progress case is also warn, but tbh that also doesn't seem right. What do you think @manuzhang, also cc @szehon-ho @RussellSpitzer



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Spark 3.5: Set log level to WARN for rewrite task failure with partial progress [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar commented on code in PR #9400:
URL: https://github.com/apache/iceberg/pull/9400#discussion_r1442236190


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java:
##
@@ -345,7 +345,7 @@ private Result doExecuteWithPartialProgress(
 .noRetry()
 .onFailure(
 (fileGroup, exception) -> {
-  LOG.error("Failure during rewrite group {}", fileGroup.info(), exception);
+  LOG.warn("Failure during rewrite group {}", fileGroup.info(), exception);

Review Comment:
   Another idea is that we log an error if the commit results size is less than some threshold rather than completely empty. This would eliminate the task-level noise from errors but at the same time indicate to logging systems that the compaction isn't super effective.
   
   ```
   if (commitResults.size() == 0) {
     LOG.error(
         "{} is true but no rewrite commits succeeded. Check the logs to determine why the individual "
             + "commits failed. If this is persistent it may help to increase {} which will break the rewrite operation "
             + "into smaller commits.",
         PARTIAL_PROGRESS_ENABLED,
         PARTIAL_PROGRESS_MAX_COMMITS);
   }
   ```
   
   to
   
   ```
   if (commitResults.size() / commitAttempts <= SOME_THRESHOLD) {
     LOG.error(
         "{} is true but less than SOME_THRESHOLD rewrite commits succeeded. Check the logs to determine why the individual "
             + "commits failed. If this is persistent it may help to increase {} which will break the rewrite operation "
             + "into smaller commits.",
         PARTIAL_PROGRESS_ENABLED,
         PARTIAL_PROGRESS_MAX_COMMITS);
   }
   ```
   
   but maybe determining commitAttempts is overcomplicated.
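   
   As an editorial aside, a minimal self-contained sketch of that ratio check (all names here are placeholders, not the action's real fields); note the cast to `double`, since integer division would silently truncate the ratio to 0 or 1:
   
   ```java
   // Hedged sketch with placeholder names; MIN_COMMIT_SUCCESS_RATIO is assumed.
   class PartialProgressLoggingSketch {
     private static final double MIN_COMMIT_SUCCESS_RATIO = 0.5; // placeholder threshold

     // Returns true when few enough commits succeeded that an error log is warranted.
     static boolean shouldLogError(int succeededCommits, int attemptedCommits) {
       // Cast before dividing: integer division would lose the fractional ratio.
       double successRatio = (double) succeededCommits / attemptedCommits;
       return successRatio <= MIN_COMMIT_SUCCESS_RATIO;
     }
   }
   ```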
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Remove deprecated method from BaseMetadataTable [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar commented on code in PR #9298:
URL: https://github.com/apache/iceberg/pull/9298#discussion_r1442241575


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/BaseFileRewriteCoordinator.java:
##
@@ -72,18 +70,12 @@ public void clearRewrite(Table table, String fileSetId) {
 
   public Set<String> fetchSetIds(Table table) {
 return resultMap.keySet().stream()
-.filter(e -> e.first().equals(tableUUID(table)))
+.filter(e -> e.first().equals(Spark3Util.tableUUID(table)))

Review Comment:
   Nit: I think it would be a bit cleaner to statically import `Spark3Util.tableUUID` and then just use `tableUUID` here (the diff on line 73 and in other places would essentially go away in favor of just a new import statement)



##
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java:
##
@@ -948,6 +950,17 @@ public static org.apache.spark.sql.catalyst.TableIdentifier toV1TableIdentifier(
 return org.apache.spark.sql.catalyst.TableIdentifier.apply(table, database);
   }
 
+  static String tableUUID(org.apache.iceberg.Table table) {
+if (table instanceof HasTableOperations) {
+  TableOperations ops = ((HasTableOperations) table).operations();
+  return ops.current().uuid();
+} else if (table instanceof BaseMetadataTable) {
+  return ((BaseMetadataTable) table).table().operations().current().uuid();
+} else {
+  return null;

Review Comment:
   I think this should probably throw an exception instead of returning null.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Remove deprecated method from BaseMetadataTable [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar commented on code in PR #9298:
URL: https://github.com/apache/iceberg/pull/9298#discussion_r1442246021


##
spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java:
##
@@ -948,6 +950,17 @@ public static org.apache.spark.sql.catalyst.TableIdentifier toV1TableIdentifier(
 return org.apache.spark.sql.catalyst.TableIdentifier.apply(table, database);
   }
 
+  static String tableUUID(org.apache.iceberg.Table table) {
+if (table instanceof HasTableOperations) {
+  TableOperations ops = ((HasTableOperations) table).operations();
+  return ops.current().uuid();
+} else if (table instanceof BaseMetadataTable) {
+  return ((BaseMetadataTable) table).table().operations().current().uuid();
+} else {
+  return null;

Review Comment:
   Also, instead of `tableUUID`, maybe `baseTableUUID`, since that's what we're really getting.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Remove deprecated method from BaseMetadataTable [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar commented on code in PR #9298:
URL: https://github.com/apache/iceberg/pull/9298#discussion_r1442241575


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/BaseFileRewriteCoordinator.java:
##
@@ -72,18 +70,12 @@ public void clearRewrite(Table table, String fileSetId) {
 
   public Set<String> fetchSetIds(Table table) {
 return resultMap.keySet().stream()
-.filter(e -> e.first().equals(tableUUID(table)))
+.filter(e -> e.first().equals(Spark3Util.tableUUID(table)))

Review Comment:
   Nit: I think it would be a bit cleaner to statically import `Spark3Util.tableUUID` and then just use `tableUUID` here (the diff on line 73 and in other places would essentially go away in favor of just a new import statement). But that's nbd; if we do the method rename like I suggested, we lose this benefit anyway.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Remove partition statistics files during purge table [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar commented on PR #9409:
URL: https://github.com/apache/iceberg/pull/9409#issuecomment-1877783387

   Thanks @dramaticlly for the review, merging.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Remove partition statistics files during purge table [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar merged PR #9409:
URL: https://github.com/apache/iceberg/pull/9409


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Correct schema behavior [iceberg-python]

2024-01-04 Thread via GitHub


Fokko commented on code in PR #247:
URL: https://github.com/apache/iceberg-python/pull/247#discussion_r1442296384


##
pyiceberg/table/__init__.py:
##
@@ -942,15 +942,16 @@ def snapshot(self) -> Optional[Snapshot]:
 return self.table.current_snapshot()
 
 def projection(self) -> Schema:
-snapshot_schema = self.table.schema()
-if snapshot := self.snapshot():
-if snapshot.schema_id is not None:
-snapshot_schema = self.table.schemas()[snapshot.schema_id]
+current_schema = self.table.schema()
+if self.snapshot_id is not None:
+snapshot = self.table.snapshot_by_id(self.snapshot_id)

Review Comment:
   Oof, that's a good one. I think we should check whether the snapshot-id is valid earlier in the process. I've added a check now, but I'll follow up with another PR to make this stricter.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Add a Function variable as an extra partition filter to filter DataFiles [iceberg]

2024-01-04 Thread via GitHub


github-actions[bot] closed issue #6871: Add a Function variable as an extra partition filter to filter DataFiles
URL: https://github.com/apache/iceberg/issues/6871


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Add a Function variable as an extra partition filter to filter DataFiles [iceberg]

2024-01-04 Thread via GitHub


github-actions[bot] commented on issue #6871:
URL: https://github.com/apache/iceberg/issues/6871#issuecomment-1877934782

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] [Big manifest] If the manifest file is very big, the decode cost time is very long [iceberg]

2024-01-04 Thread via GitHub


github-actions[bot] commented on issue #6868:
URL: https://github.com/apache/iceberg/issues/6868#issuecomment-1877934801

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Cache delete files when reading v2 format with merge-on-read mode [iceberg]

2024-01-04 Thread via GitHub


github-actions[bot] closed issue #6865: Cache delete files when reading v2 
format with merge-on-read mode
URL: https://github.com/apache/iceberg/issues/6865


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Cache delete files when reading v2 format with merge-on-read mode [iceberg]

2024-01-04 Thread via GitHub


github-actions[bot] commented on issue #6865:
URL: https://github.com/apache/iceberg/issues/6865#issuecomment-1877934826

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] [Big manifest] If the manifest file is very big, the decode cost time is very long [iceberg]

2024-01-04 Thread via GitHub


zombee0 closed issue #6868: [Big manifest] If the manifest file is very big, 
the decode cost time is very long
URL: https://github.com/apache/iceberg/issues/6868


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] check_github_labeler_v5 [iceberg]

2024-01-04 Thread via GitHub


panbingkun closed pull request #9413: check_github_labeler_v5
URL: https://github.com/apache/iceberg/pull/9413


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] docs: Add release guide for iceberg-rust [iceberg-rust]

2024-01-04 Thread via GitHub


Xuanwo commented on PR #147:
URL: https://github.com/apache/iceberg-rust/pull/147#issuecomment-1878030924

   Hi @Fokko, if you think this PR is too large, I can split it up. For 
instance, I could add the scripts in separate PRs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Failed to assign splits due to the serialized split size [iceberg]

2024-01-04 Thread via GitHub


javrasya commented on issue #9410:
URL: https://github.com/apache/iceberg/issues/9410#issuecomment-1878085557

   It seems this bug was introduced by version 1.4.0, which is kind of new. I tried fixing it by tweaking the SplitAssignerFactory I pass down to the IcebergSource, but even though I reduced the number of FileScanTasks per split to one, it still exceeds that 65K limit. So I unfortunately ended up downgrading to 1.3 until it is fixed in 1.4.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] docs: Add release guide for iceberg-rust [iceberg-rust]

2024-01-04 Thread via GitHub


liurenjie1024 commented on code in PR #147:
URL: https://github.com/apache/iceberg-rust/pull/147#discussion_r1442448609


##
crates/catalog/hms/DEPENDENCIES.rust.tsv:
##


Review Comment:
   What are these tsv files used for?



##
website/src/release.md:
##
@@ -0,0 +1,383 @@
+
+
+This document mainly introduces how the release manager releases a new version 
in accordance with the Apache requirements.
+
+## Introduction
+
+`Source Release` is the key point which Apache values, and is also necessary 
for an ASF release.
+
+Please remember that publishing software has legal consequences.
+
+This guide complements the foundation-wide policies and guides:
+
+- [Release Policy](https://www.apache.org/legal/release-policy.html)
+- [Release Distribution Policy](https://infra.apache.org/release-distribution)
+- [Release Creation Process](https://infra.apache.org/release-publishing.html)
+
+## Some Terminology of release
+
+In the context of our release, we use several terms to describe different 
stages of the release process.
+
+Here's an explanation of these terms:
+
+- `iceberg_version`: the version of Iceberg to be released, like `0.2.0`.
+- `release_version`: the version of release candidate, like `0.2.0-rc.1`.
+- `rc_version`: the minor version for voting round, like `rc.1`.
+
+## Preparation
+
+
+
+This section is the requirements for individuals who are new to the role of 
release manager.
+
+
+
+Refer to [Setup GPG Key](reference/setup_gpg.md) to make sure the GPG key has 
been set up.
+
+## Start a tracking issue about the next release
+
+Start a tracking issue on GitHub for the upcoming release to track all tasks 
that need to be completed.
+
+Title:
+
+```
+Tracking issues of Iceberg Rust ${iceberg_version} Release
+```
+
+Content:
+
+```markdown
+This issue is used to track tasks of the iceberg rust ${iceberg_version} 
release.
+
+## Tasks
+
+### Blockers
+
+> Blockers are the tasks that must be completed before the release.
+
+### Build Release
+
+ GitHub Side
+
+- [ ] Bump version in project
+- [ ] Update docs
+- [ ] Generate dependencies list
+- [ ] Push release candidate tag to GitHub
+
+ ASF Side
+
+- [ ] Create an ASF Release
+- [ ] Upload artifacts to the SVN dist repo
+
+### Voting

Review Comment:
   How about moving this before `ASF Side` as the first step?



##
website/src/download.md:
##


Review Comment:
   `iceberg-rust` mainly serves as a library, and I think we already have 
`install.md` to tell users how to use it, so maybe we don't need this? If this 
is about telling users to download the source from the Apache website, adding 
one section to `install.md` would be enough?



##
website/src/release.md:
##
@@ -0,0 +1,383 @@
+
+
+This document mainly introduces how the release manager releases a new version 
in accordance with the Apache requirements.
+
+## Introduction
+
+`Source Release` is the key point which Apache values, and is also necessary 
for an ASF release.
+
+Please remember that publishing software has legal consequences.
+
+This guide complements the foundation-wide policies and guides:
+
+- [Release Policy](https://www.apache.org/legal/release-policy.html)
+- [Release Distribution Policy](https://infra.apache.org/release-distribution)
+- [Release Creation Process](https://infra.apache.org/release-publishing.html)
+
+## Some Terminology of release
+
+In the context of our release, we use several terms to describe different 
stages of the release process.
+
+Here's an explanation of these terms:
+
+- `iceberg_version`: the version of Iceberg to be released, like `0.2.0`.
+- `release_version`: the version of release candidate, like `0.2.0-rc.1`.
+- `rc_version`: the minor version for voting round, like `rc.1`.
+
+## Preparation
+
+
+
+This section is the requirements for individuals who are new to the role of 
release manager.
+
+
+
+Refer to [Setup GPG Key](reference/setup_gpg.md) to make sure the GPG key has 
been set up.
+
+## Start a tracking issue about the next release
+
+Start a tracking issue on GitHub for the upcoming release to track all tasks 
that need to be completed.
+
+Title:
+
+```
+Tracking issues of Iceberg Rust ${iceberg_version} Release
+```
+
+Content:
+
+```markdown
+This issue is used to track tasks of the iceberg rust ${iceberg_version} 
release.
+
+## Tasks
+
+### Blockers
+
+> Blockers are the tasks that must be completed before the release.
+
+### Build Release
+
+ GitHub Side
+
+- [ ] Bump version in project
+- [ ] Update docs
+- [ ] Generate dependencies list
+- [ ] Push release candidate tag to GitHub
+
+ ASF Side
+
+- [ ] Create an ASF Release
+- [ ] Upload artifacts to the SVN dist repo
+
+### Voting
+
+- [ ] Start VOTE at iceberg community
+
+### Official Release

Review Comment:
   I think one missing part is updating crates.io?



##
website/src/release.md:
##
@@ -0,0 +1,383 @@
+
+
+This document mainly introduces how the release manager releases a new version 
in accordance with the Apache requirements.

Re: [PR] docs: Add release guide for iceberg-rust [iceberg-rust]

2024-01-04 Thread via GitHub


Xuanwo commented on code in PR #147:
URL: https://github.com/apache/iceberg-rust/pull/147#discussion_r1442472242


##
crates/catalog/hms/DEPENDENCIES.rust.tsv:
##


Review Comment:
   The ASF release requires users to have a complete list of our dependencies, 
to ensure compliance with the Apache 2.0 license.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] docs: Add release guide for iceberg-rust [iceberg-rust]

2024-01-04 Thread via GitHub


Xuanwo commented on code in PR #147:
URL: https://github.com/apache/iceberg-rust/pull/147#discussion_r1442472860


##
website/src/release.md:
##
@@ -0,0 +1,383 @@
+
+
+This document mainly introduces how the release manager releases a new version 
in accordance with the Apache requirements.
+
+## Introduction
+
+`Source Release` is the key point which Apache values, and is also necessary 
for an ASF release.
+
+Please remember that publishing software has legal consequences.
+
+This guide complements the foundation-wide policies and guides:
+
+- [Release Policy](https://www.apache.org/legal/release-policy.html)
+- [Release Distribution Policy](https://infra.apache.org/release-distribution)
+- [Release Creation Process](https://infra.apache.org/release-publishing.html)
+
+## Some Terminology of release
+
+In the context of our release, we use several terms to describe different 
stages of the release process.
+
+Here's an explanation of these terms:
+
+- `iceberg_version`: the version of Iceberg to be released, like `0.2.0`.
+- `release_version`: the version of release candidate, like `0.2.0-rc.1`.
+- `rc_version`: the minor version for voting round, like `rc.1`.
+
+## Preparation
+
+
+
+This section is the requirements for individuals who are new to the role of 
release manager.
+
+
+
+Refer to [Setup GPG Key](reference/setup_gpg.md) to make sure the GPG key has 
been set up.
+
+## Start a tracking issue about the next release
+
+Start a tracking issue on GitHub for the upcoming release to track all tasks 
that need to be completed.
+
+Title:
+
+```
+Tracking issues of Iceberg Rust ${iceberg_version} Release
+```
+
+Content:
+
+```markdown
+This issue is used to track tasks of the iceberg rust ${iceberg_version} 
release.
+
+## Tasks
+
+### Blockers
+
+> Blockers are the tasks that must be completed before the release.
+
+### Build Release
+
+ GitHub Side
+
+- [ ] Bump version in project
+- [ ] Update docs
+- [ ] Generate dependencies list
+- [ ] Push release candidate tag to GitHub
+
+ ASF Side
+
+- [ ] Create an ASF Release
+- [ ] Upload artifacts to the SVN dist repo

Review Comment:
   Yes. Only source releases in ASF SVN are considered official; all other 
releases are provided merely for user convenience. ASF adheres to this approach 
to ensure that ASF releases are always accessible and not restricted by 
specific vendors.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] docs: Add release guide for iceberg-rust [iceberg-rust]

2024-01-04 Thread via GitHub


Xuanwo commented on code in PR #147:
URL: https://github.com/apache/iceberg-rust/pull/147#discussion_r1442473048


##
website/src/release.md:
##
@@ -0,0 +1,383 @@
+
+
+This document mainly introduces how the release manager releases a new version 
in accordance with the Apache requirements.
+
+## Introduction
+
+`Source Release` is the key point which Apache values, and is also necessary 
for an ASF release.
+
+Please remember that publishing software has legal consequences.
+
+This guide complements the foundation-wide policies and guides:
+
+- [Release Policy](https://www.apache.org/legal/release-policy.html)
+- [Release Distribution Policy](https://infra.apache.org/release-distribution)
+- [Release Creation Process](https://infra.apache.org/release-publishing.html)
+
+## Some Terminology of release
+
+In the context of our release, we use several terms to describe different 
stages of the release process.
+
+Here's an explanation of these terms:
+
+- `iceberg_version`: the version of Iceberg to be released, like `0.2.0`.
+- `release_version`: the version of release candidate, like `0.2.0-rc.1`.
+- `rc_version`: the minor version for voting round, like `rc.1`.
+
+## Preparation
+
+
+
+This section is the requirements for individuals who are new to the role of 
release manager.
+
+
+
+Refer to [Setup GPG Key](reference/setup_gpg.md) to make sure the GPG key has 
been set up.
+
+## Start a tracking issue about the next release
+
+Start a tracking issue on GitHub for the upcoming release to track all tasks 
that need to be completed.
+
+Title:
+
+```
+Tracking issues of Iceberg Rust ${iceberg_version} Release
+```
+
+Content:
+
+```markdown
+This issue is used to track tasks of the iceberg rust ${iceberg_version} 
release.
+
+## Tasks
+
+### Blockers
+
+> Blockers are the tasks that must be completed before the release.
+
+### Build Release
+
+ GitHub Side
+
+- [ ] Bump version in project
+- [ ] Update docs
+- [ ] Generate dependencies list
+- [ ] Push release candidate tag to GitHub
+
+ ASF Side
+
+- [ ] Create an ASF Release
+- [ ] Upload artifacts to the SVN dist repo
+
+### Voting

Review Comment:
   We will vote on the artifacts that have been uploaded.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] docs: Add release guide for iceberg-rust [iceberg-rust]

2024-01-04 Thread via GitHub


Xuanwo commented on code in PR #147:
URL: https://github.com/apache/iceberg-rust/pull/147#discussion_r1442473406


##
website/src/release.md:
##
@@ -0,0 +1,383 @@
+
+
+This document mainly introduces how the release manager releases a new version 
in accordance with the Apache requirements.
+
+## Introduction
+
+`Source Release` is the key point which Apache values, and is also necessary 
for an ASF release.
+
+Please remember that publishing software has legal consequences.
+
+This guide complements the foundation-wide policies and guides:
+
+- [Release Policy](https://www.apache.org/legal/release-policy.html)
+- [Release Distribution Policy](https://infra.apache.org/release-distribution)
+- [Release Creation Process](https://infra.apache.org/release-publishing.html)
+
+## Some Terminology of release
+
+In the context of our release, we use several terms to describe different 
stages of the release process.
+
+Here's an explanation of these terms:
+
+- `iceberg_version`: the version of Iceberg to be released, like `0.2.0`.
+- `release_version`: the version of release candidate, like `0.2.0-rc.1`.
+- `rc_version`: the minor version for voting round, like `rc.1`.
+
+## Preparation
+
+
+
+This section is the requirements for individuals who are new to the role of 
release manager.
+
+
+
+Refer to [Setup GPG Key](reference/setup_gpg.md) to make sure the GPG key has 
been set up.
+
+## Start a tracking issue about the next release
+
+Start a tracking issue on GitHub for the upcoming release to track all tasks 
that need to be completed.
+
+Title:
+
+```
+Tracking issues of Iceberg Rust ${iceberg_version} Release
+```
+
+Content:
+
+```markdown
+This issue is used to track tasks of the iceberg rust ${iceberg_version} 
release.
+
+## Tasks
+
+### Blockers
+
+> Blockers are the tasks that must be completed before the release.
+
+### Build Release
+
+ GitHub Side
+
+- [ ] Bump version in project
+- [ ] Update docs
+- [ ] Generate dependencies list
+- [ ] Push release candidate tag to GitHub
+
+ ASF Side
+
+- [ ] Create an ASF Release
+- [ ] Upload artifacts to the SVN dist repo
+
+### Voting
+
+- [ ] Start VOTE at iceberg community
+
+### Official Release
+
+- [ ] Push the release git tag
+- [ ] Publish artifacts to SVN RELEASE branch
+- [ ] Change Iceberg Rust Website download link
+- [ ] Send the announcement
+
+For details of each step, please refer to: 
https://rust.iceberg.apache.org/release
+```
+
+## GitHub Side
+
+### Bump version in project
+
+Bump all components' version in the project to the new iceberg version.
+Please note that this version is the exact version of the release, not the 
release candidate version.
+
+- rust core: bump version in `Cargo.toml`
+
+### Update docs
+
+- Update `CHANGELOG.md`, refer to [Generate Release 
Note](reference/generate_release_note.md) for more information.
+
+### Generate dependencies list
+
+Download and setup `cargo-deny`. You can refer to 
[cargo-deny](https://embarkstudios.github.io/cargo-deny/cli/index.html).
+
+Running `python3 ./scripts/dependencies.py generate` to update the 
dependencies list of every package.
+
+### Push release candidate tag
+
+After bump version PR gets merged, we can create a GitHub release for the 
release candidate:
+
+- Create a tag at `main` branch on the `Bump Version` / `Patch up version` 
commit: `git tag -s "v0.2.0-rc.1"`, please correctly check out the 
corresponding commit instead of directly tagging on the main branch.
+- Push tags to GitHub: `git push --tags`.
+
+## ASF Side
+
+If any step in the ASF Release process fails and requires code changes,
+we will abandon that version and prepare for the next one.
+Our release page will only display ASF releases instead of GitHub Releases.
+
+### Create an ASF Release
+
+After GitHub Release has been created, we can start to create ASF Release.
+
+- Checkout to released tag. (e.g. `git checkout v0.2.0-rc.1`, tag is created 
in the previous step)
+- Use the release script to create a new release: 
`ICEBERG_VERSION=<iceberg_version> ICEBERG_VERSION_RC=<rc_version> 
./scripts/release.sh` (e.g. `ICEBERG_VERSION=0.2.0 ICEBERG_VERSION_RC=rc.1 
./scripts/release.sh`)
+- This script will do the following things:
+- Create a new branch named by `release-${release_version}` from the 
tag
+- Generate the release candidate artifacts under `dist`, including:
+- `apache-iceberg-rust-${release_version}-src.tar.gz`
+- `apache-iceberg-rust-${release_version}-src.tar.gz.asc`
+- `apache-iceberg-rust-${release_version}-src.tar.gz.sha512`
+- Check the header of the source code. This step needs docker to run.
+- Push the newly created branch to GitHub
+
+This script will create a new release under `dist`.
+
+For example:
+
+```shell
+> tree dist
+dist
+├── apache-iceberg-rust-0.2.0-src.tar.gz
+├── apache-iceberg-rust-0.2.0-src.tar.gz.asc
+└── apache-iceberg-rust-0.2.0-src.tar.gz.sha512
+```
+
+### Upload artifacts to the SVN dist repo
+
+SVN i

Re: [PR] docs: Add release guide for iceberg-rust [iceberg-rust]

2024-01-04 Thread via GitHub


Xuanwo commented on code in PR #147:
URL: https://github.com/apache/iceberg-rust/pull/147#discussion_r1442473747


##
website/src/release.md:
##
@@ -0,0 +1,383 @@
+
+
+This document mainly introduces how the release manager releases a new version 
in accordance with the Apache requirements.
+
+## Introduction
+
+`Source Release` is the key point which Apache values, and is also necessary 
for an ASF release.
+
+Please remember that publishing software has legal consequences.
+
+This guide complements the foundation-wide policies and guides:
+
+- [Release Policy](https://www.apache.org/legal/release-policy.html)
+- [Release Distribution Policy](https://infra.apache.org/release-distribution)
+- [Release Creation Process](https://infra.apache.org/release-publishing.html)
+
+## Some Terminology of release
+
+In the context of our release, we use several terms to describe different 
stages of the release process.
+
+Here's an explanation of these terms:
+
+- `iceberg_version`: the version of Iceberg to be released, like `0.2.0`.
+- `release_version`: the version of release candidate, like `0.2.0-rc.1`.
+- `rc_version`: the minor version for voting round, like `rc.1`.
+
+## Preparation
+
+
+
+This section is the requirements for individuals who are new to the role of 
release manager.
+
+
+
+Refer to [Setup GPG Key](reference/setup_gpg.md) to make sure the GPG key has 
been set up.
+
+## Start a tracking issue about the next release
+
+Start a tracking issue on GitHub for the upcoming release to track all tasks 
that need to be completed.
+
+Title:
+
+```
+Tracking issues of Iceberg Rust ${iceberg_version} Release
+```
+
+Content:
+
+```markdown
+This issue is used to track tasks of the iceberg rust ${iceberg_version} 
release.
+
+## Tasks
+
+### Blockers
+
+> Blockers are the tasks that must be completed before the release.
+
+### Build Release
+
+ GitHub Side
+
+- [ ] Bump version in project
+- [ ] Update docs
+- [ ] Generate dependencies list
+- [ ] Push release candidate tag to GitHub
+
+ ASF Side
+
+- [ ] Create an ASF Release
+- [ ] Upload artifacts to the SVN dist repo
+
+### Voting
+
+- [ ] Start VOTE at iceberg community
+
+### Official Release

Review Comment:
   Publishing to crates.io is optional and should be handled by GitHub CI. I'll 
include a section on this once the crates.io publishing CI setup is complete.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] docs: Add release guide for iceberg-rust [iceberg-rust]

2024-01-04 Thread via GitHub


Xuanwo commented on code in PR #147:
URL: https://github.com/apache/iceberg-rust/pull/147#discussion_r1442474047


##
website/src/download.md:
##


Review Comment:
   > I think maybe we don't need this
   
   ASF release policy requires that users have the ability to download and 
build from source.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] API, Core: Add withUpdatedColumnDoc API to Schema [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar opened a new pull request, #9414:
URL: https://github.com/apache/iceberg/pull/9414

   When working on Trino Iceberg view support, I realized that updating column 
comments across engines follows the same logic:
   1. Get the view schema
   2. Find the field to update, and update it
   3. Return a new Schema (preserving all the other attributes, like 
identifier fields and aliases)
   4. Replace the schema on the view
   
   This PR puts steps 1-3 into a separate API so engines don't have to 
duplicate this logic; a hedged sketch of what that could look like follows 
below. I don't think this matters for Spark, since there column comments are 
updated via a `REPLACE VIEW`, which needs to visit the schema separately for 
any changes. But Trino and other engines support operations that only update 
column comments, and I think having an Iceberg API that does exactly that 
would be convenient.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core: Add withUpdatedColumnDoc API to Schema [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar commented on code in PR #9414:
URL: https://github.com/apache/iceberg/pull/9414#discussion_r1442512815


##
core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java:
##
@@ -1739,4 +1739,61 @@ public void testSqlForInvalidArguments() {
 .isInstanceOf(IllegalArgumentException.class)
 .hasMessage("Invalid dialect: (empty string)");
   }
+
+  @Test
+  public void testReplaceViewWithUpdatedColumnDoc() {
+TableIdentifier identifier = TableIdentifier.of("ns", "view");
+
+if (requiresNamespaceCreate()) {
+  catalog().createNamespace(identifier.namespace());
+}
+
+View view =
+catalog()
+.buildView(identifier)
+.withSchema(SCHEMA)
+.withDefaultNamespace(identifier.namespace())
+.withDefaultCatalog(catalog().name())
+.withQuery("spark", "select * from ns.tbl")
+.create();
+
+view.replaceVersion()
+.withDefaultCatalog(view.currentVersion().defaultCatalog())
+.withDefaultNamespace(identifier.namespace())
+.withSchema(view.schema().withColumnDoc("data", "some docs for data"))
+.withQuery("spark", view.sqlFor("spark").sql())
+.commit();

Review Comment:
   We can actually take this a step further by adding an API to View, 
`view.updateColumnDoc`, which returns a `ReplaceViewVersion` that is already 
initialized with the namespace/catalog/query details, since those will all be 
the same for someone who just wants to update the doc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core: Add withUpdatedColumnDoc API to Schema [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar commented on code in PR #9414:
URL: https://github.com/apache/iceberg/pull/9414#discussion_r1442516050


##
core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java:
##
@@ -1739,4 +1739,61 @@ public void testSqlForInvalidArguments() {
 .isInstanceOf(IllegalArgumentException.class)
 .hasMessage("Invalid dialect: (empty string)");
   }
+
+  @Test
+  public void testReplaceViewWithUpdatedColumnDoc() {
+TableIdentifier identifier = TableIdentifier.of("ns", "view");
+
+if (requiresNamespaceCreate()) {
+  catalog().createNamespace(identifier.namespace());
+}
+
+View view =
+catalog()
+.buildView(identifier)
+.withSchema(SCHEMA)
+.withDefaultNamespace(identifier.namespace())
+.withDefaultCatalog(catalog().name())
+.withQuery("spark", "select * from ns.tbl")
+.create();
+
+view.replaceVersion()
+.withDefaultCatalog(view.currentVersion().defaultCatalog())
+.withDefaultNamespace(identifier.namespace())
+.withSchema(view.schema().withColumnDoc("data", "some docs for data"))
+.withQuery("spark", view.sqlFor("spark").sql())
+.commit();

Review Comment:
   Actually I think that's more than just a nice-to-have, since we need to make 
sure all dialects are preserved in the replace. It's good to have that be in 
the API implementation itself rather than have different engine behaviors.



##
core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java:
##
@@ -1739,4 +1739,61 @@ public void testSqlForInvalidArguments() {
 .isInstanceOf(IllegalArgumentException.class)
 .hasMessage("Invalid dialect: (empty string)");
   }
+
+  @Test
+  public void testReplaceViewWithUpdatedColumnDoc() {
+TableIdentifier identifier = TableIdentifier.of("ns", "view");
+
+if (requiresNamespaceCreate()) {
+  catalog().createNamespace(identifier.namespace());
+}
+
+View view =
+catalog()
+.buildView(identifier)
+.withSchema(SCHEMA)
+.withDefaultNamespace(identifier.namespace())
+.withDefaultCatalog(catalog().name())
+.withQuery("spark", "select * from ns.tbl")
+.create();
+
+view.replaceVersion()
+.withDefaultCatalog(view.currentVersion().defaultCatalog())
+.withDefaultNamespace(identifier.namespace())
+.withSchema(view.schema().withColumnDoc("data", "some docs for data"))
+.withQuery("spark", view.sqlFor("spark").sql())
+.commit();

Review Comment:
   Actually I think that's more than just a nice-to-have, since we need to make 
sure all dialects are preserved in the replace. It's important to have that be 
in the API implementation itself rather than have different engine behaviors.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] docs: Add release guide for iceberg-rust [iceberg-rust]

2024-01-04 Thread via GitHub


liurenjie1024 commented on code in PR #147:
URL: https://github.com/apache/iceberg-rust/pull/147#discussion_r1442519521


##
website/src/release.md:
##
@@ -0,0 +1,383 @@
+
+
+This document mainly introduces how the release manager releases a new version 
in accordance with the Apache requirements.
+
+## Introduction
+
+`Source Release` is the key point which Apache values, and is also necessary 
for an ASF release.
+
+Please remember that publishing software has legal consequences.
+
+This guide complements the foundation-wide policies and guides:
+
+- [Release Policy](https://www.apache.org/legal/release-policy.html)
+- [Release Distribution Policy](https://infra.apache.org/release-distribution)
+- [Release Creation Process](https://infra.apache.org/release-publishing.html)
+
+## Some Terminology of release
+
+In the context of our release, we use several terms to describe different 
stages of the release process.
+
+Here's an explanation of these terms:
+
+- `iceberg_version`: the version of Iceberg to be released, like `0.2.0`.
+- `release_version`: the version of release candidate, like `0.2.0-rc.1`.
+- `rc_version`: the minor version for voting round, like `rc.1`.
+
+## Preparation
+
+
+
+This section is the requirements for individuals who are new to the role of 
release manager.
+
+
+
+Refer to [Setup GPG Key](reference/setup_gpg.md) to make sure the GPG key has 
been set up.
+
+## Start a tracking issue about the next release
+
+Start a tracking issue on GitHub for the upcoming release to track all tasks 
that need to be completed.
+
+Title:
+
+```
+Tracking issues of Iceberg Rust ${iceberg_version} Release
+```
+
+Content:
+
+```markdown
+This issue is used to track tasks of the iceberg rust ${iceberg_version} 
release.
+
+## Tasks
+
+### Blockers
+
+> Blockers are the tasks that must be completed before the release.
+
+### Build Release
+
+ GitHub Side
+
+- [ ] Bump version in project
+- [ ] Update docs
+- [ ] Generate dependencies list
+- [ ] Push release candidate tag to GitHub
+
+ ASF Side
+
+- [ ] Create an ASF Release
+- [ ] Upload artifacts to the SVN dist repo
+
+### Voting
+
+- [ ] Start VOTE at iceberg community
+
+### Official Release
+
+- [ ] Push the release git tag
+- [ ] Publish artifacts to SVN RELEASE branch
+- [ ] Change Iceberg Rust Website download link
+- [ ] Send the announcement
+
+For details of each step, please refer to: 
https://rust.iceberg.apache.org/release
+```
+
+## GitHub Side
+
+### Bump version in project
+
+Bump all components' version in the project to the new iceberg version.
+Please note that this version is the exact version of the release, not the 
release candidate version.
+
+- rust core: bump version in `Cargo.toml`
+
+### Update docs
+
+- Update `CHANGELOG.md`, refer to [Generate Release 
Note](reference/generate_release_note.md) for more information.
+
+### Generate dependencies list
+
+Download and setup `cargo-deny`. You can refer to 
[cargo-deny](https://embarkstudios.github.io/cargo-deny/cli/index.html).
+
+Running `python3 ./scripts/dependencies.py generate` to update the 
dependencies list of every package.
+
+### Push release candidate tag
+
+After bump version PR gets merged, we can create a GitHub release for the 
release candidate:
+
+- Create a tag at `main` branch on the `Bump Version` / `Patch up version` 
commit: `git tag -s "v0.2.0-rc.1"`, please correctly check out the 
corresponding commit instead of directly tagging on the main branch.
+- Push tags to GitHub: `git push --tags`.
+
+## ASF Side
+
+If any step in the ASF Release process fails and requires code changes,
+we will abandon that version and prepare for the next one.
+Our release page will only display ASF releases instead of GitHub Releases.
+
+### Create an ASF Release
+
+After GitHub Release has been created, we can start to create ASF Release.
+
+- Checkout to released tag. (e.g. `git checkout v0.2.0-rc.1`, tag is created 
in the previous step)
+- Use the release script to create a new release: 
`ICEBERG_VERSION=<iceberg_version> ICEBERG_VERSION_RC=<rc_version> 
./scripts/release.sh` (e.g. `ICEBERG_VERSION=0.2.0 ICEBERG_VERSION_RC=rc.1 
./scripts/release.sh`)
+- This script will do the following things:
+- Create a new branch named by `release-${release_version}` from the 
tag
+- Generate the release candidate artifacts under `dist`, including:
+- `apache-iceberg-rust-${release_version}-src.tar.gz`
+- `apache-iceberg-rust-${release_version}-src.tar.gz.asc`
+- `apache-iceberg-rust-${release_version}-src.tar.gz.sha512`
+- Check the header of the source code. This step needs docker to run.
+- Push the newly created branch to GitHub
+
+This script will create a new release under `dist`.
+
+For example:
+
+```shell
+> tree dist
+dist
+├── apache-iceberg-rust-0.2.0-src.tar.gz
+├── apache-iceberg-rust-0.2.0-src.tar.gz.asc
+└── apache-iceberg-rust-0.2.0-src.tar.gz.sha512
+```
+
+### Upload artifacts to the SVN dist repo
+

Re: [PR] docs: Add release guide for iceberg-rust [iceberg-rust]

2024-01-04 Thread via GitHub


liurenjie1024 commented on code in PR #147:
URL: https://github.com/apache/iceberg-rust/pull/147#discussion_r1442520171


##
website/src/download.md:
##


Review Comment:
   > ASF release policy requires that users have the ability to download and 
build from source.
   
   Then how about adding one section in `install.md` with the title `Build from 
source`? I think that is just another way of installing?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] docs: Add release guide for iceberg-rust [iceberg-rust]

2024-01-04 Thread via GitHub


liurenjie1024 commented on code in PR #147:
URL: https://github.com/apache/iceberg-rust/pull/147#discussion_r1442520271


##
website/src/release.md:
##
@@ -0,0 +1,383 @@
+
+
+This document mainly introduces how the release manager releases a new version 
in accordance with the Apache requirements.
+
+## Introduction
+
+`Source Release` is the key point which Apache values, and is also necessary 
for an ASF release.
+
+Please remember that publishing software has legal consequences.
+
+This guide complements the foundation-wide policies and guides:
+
+- [Release Policy](https://www.apache.org/legal/release-policy.html)
+- [Release Distribution Policy](https://infra.apache.org/release-distribution)
+- [Release Creation Process](https://infra.apache.org/release-publishing.html)
+
+## Some Terminology of release
+
+In the context of our release, we use several terms to describe different 
stages of the release process.
+
+Here's an explanation of these terms:
+
+- `iceberg_version`: the version of Iceberg to be released, like `0.2.0`.
+- `release_version`: the version of release candidate, like `0.2.0-rc.1`.
+- `rc_version`: the minor version for voting round, like `rc.1`.
+
+## Preparation
+
+
+
+This section is the requirements for individuals who are new to the role of 
release manager.
+
+
+
+Refer to [Setup GPG Key](reference/setup_gpg.md) to make sure the GPG key has 
been set up.
+
+## Start a tracking issue about the next release
+
+Start a tracking issue on GitHub for the upcoming release to track all tasks 
that need to be completed.
+
+Title:
+
+```
+Tracking issues of Iceberg Rust ${iceberg_version} Release
+```
+
+Content:
+
+```markdown
+This issue is used to track tasks of the iceberg rust ${iceberg_version} 
release.
+
+## Tasks
+
+### Blockers
+
+> Blockers are the tasks that must be completed before the release.
+
+### Build Release
+
+ GitHub Side
+
+- [ ] Bump version in project
+- [ ] Update docs
+- [ ] Generate dependencies list
+- [ ] Push release candidate tag to GitHub
+
+ ASF Side
+
+- [ ] Create an ASF Release
+- [ ] Upload artifacts to the SVN dist repo
+
+### Voting
+
+- [ ] Start VOTE at iceberg community
+
+### Official Release

Review Comment:
   Got it, thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core: Add Schema#withUpdatedDoc and View#updateColumnDoc APIs [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar commented on code in PR #9414:
URL: https://github.com/apache/iceberg/pull/9414#discussion_r1442535706


##
core/src/main/java/org/apache/iceberg/view/BaseView.java:
##
@@ -128,4 +128,23 @@ public SQLViewRepresentation sqlFor(String dialect) {
 
 return closest;
   }
+
+  @Override
+  @SuppressWarnings("checkstyle:HiddenField")
+  public ReplaceViewVersion updateColumnDoc(String name, String doc) {

Review Comment:
   Alternatively we could implement `UpdateSchema` for Views, where the only 
supported operation is `updateColumnDoc` and everything else throws 
Unsupported. I was -1 on that, since even returning `UpdateSchema` to callers 
would make it seem like some sort of schema evolution on views is possible, 
but it's not.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core: Add Schema#withUpdatedDoc and View#updateColumnDoc APIs [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar commented on code in PR #9414:
URL: https://github.com/apache/iceberg/pull/9414#discussion_r1442535706


##
core/src/main/java/org/apache/iceberg/view/BaseView.java:
##
@@ -128,4 +128,23 @@ public SQLViewRepresentation sqlFor(String dialect) {
 
 return closest;
   }
+
+  @Override
+  @SuppressWarnings("checkstyle:HiddenField")
+  public ReplaceViewVersion updateColumnDoc(String name, String doc) {

Review Comment:
   Alternatively we could implement `UpdateSchema` for Views, where the only 
supported operation is `updateColumnDoc` and everything else throws 
Unsupported. I was -1 on that, since even returning `UpdateSchema` to callers 
would make it seem like some sort of schema evolution on views is possible, 
but it's not (and shouldn't 
be)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core: Add Schema#withUpdatedDoc and View#updateColumnDoc APIs [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar commented on code in PR #9414:
URL: https://github.com/apache/iceberg/pull/9414#discussion_r1442540195


##
core/src/main/java/org/apache/iceberg/view/BaseView.java:
##
@@ -128,4 +128,23 @@ public SQLViewRepresentation sqlFor(String dialect) {
 
 return closest;
   }
+
+  @Override
+  @SuppressWarnings("checkstyle:HiddenField")
+  public ReplaceViewVersion updateColumnDoc(String name, String doc) {

Review Comment:
   Ah the awkward thing here is method chaining
   
   ```view.updateColumnDoc("col1", "docs").updateColumnDoc("col2", 
"docs").commit()``` will only update `col2` since we're not really doing any 
change tracking.
   
   Maybe this should be updated to allow multiple column/docs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core: Add Schema#withUpdatedDoc and View#updateColumnDoc APIs [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar commented on code in PR #9414:
URL: https://github.com/apache/iceberg/pull/9414#discussion_r1442540195


##
core/src/main/java/org/apache/iceberg/view/BaseView.java:
##
@@ -128,4 +128,23 @@ public SQLViewRepresentation sqlFor(String dialect) {
 
 return closest;
   }
+
+  @Override
+  @SuppressWarnings("checkstyle:HiddenField")
+  public ReplaceViewVersion updateColumnDoc(String name, String doc) {

Review Comment:
   Ah the awkward thing here is method chaining
   
   view.updateColumnDoc("col1", "docs").updateColumnDoc("col2", 
"docs").commit() will only update `col2`.
   
   Maybe this should be updated to allow multiple column/docs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core: Add Schema#withUpdatedDoc and View#updateColumnDoc APIs [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar commented on code in PR #9414:
URL: https://github.com/apache/iceberg/pull/9414#discussion_r1442541492


##
core/src/main/java/org/apache/iceberg/view/BaseView.java:
##
@@ -128,4 +128,23 @@ public SQLViewRepresentation sqlFor(String dialect) {
 
 return closest;
   }
+
+  @Override
+  @SuppressWarnings("checkstyle:HiddenField")
+  public ReplaceViewVersion updateColumnDoc(String name, String doc) {

Review Comment:
   Ah the awkward thing is method chaining...
   
   ```
   view.updateColumnDoc("col1", "doc1").updateColumnDoc("col2", "doc2").commit()
   ```
   
   This will only update col2 since we're not doing any change tracking or 
re-using any previous replaces.
   
   Maybe this API instead can accept multiple col/docs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core: Add Schema#withUpdatedDoc and View#updateColumnDoc APIs [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar commented on code in PR #9414:
URL: https://github.com/apache/iceberg/pull/9414#discussion_r1442541492


##
core/src/main/java/org/apache/iceberg/view/BaseView.java:
##
@@ -128,4 +128,23 @@ public SQLViewRepresentation sqlFor(String dialect) {
 
 return closest;
   }
+
+  @Override
+  @SuppressWarnings("checkstyle:HiddenField")
+  public ReplaceViewVersion updateColumnDoc(String name, String doc) {

Review Comment:
   Ah the awkward thing is method chaining...
   
   ```
   view.updateColumnDoc("col1", "doc1").updateColumnDoc("col2", "doc2").commit()
   ```
   
   This will only update col2 since we're not doing any change tracking or 
re-using any previous replaces.
   
   Maybe this API instead can accept multiple col/docs, but probably if we go 
down this route it's best to just follow the existing pattern and do some 
change tracking



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Nessie: Strip trailing slash for warehouse location [iceberg]

2024-01-04 Thread via GitHub


ajantha-bhat commented on code in PR #9415:
URL: https://github.com/apache/iceberg/pull/9415#discussion_r1442549368


##
nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java:
##
@@ -183,7 +184,8 @@ private String validateWarehouseLocation(String name, 
Map catalo
   catalogOptions);
   throw new IllegalStateException("Parameter 'warehouse' not set, Nessie 
can't store data.");
 }
-return warehouseLocation;
+
+return LocationUtil.stripTrailingSlash(warehouseLocation);

Review Comment:
   Just making this in line with other catalogs. 
   We can look up the callers of `LocationUtil.stripTrailingSlash`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Nessie: Strip trailing slash for warehouse location [iceberg]

2024-01-04 Thread via GitHub


ajantha-bhat commented on PR #9415:
URL: https://github.com/apache/iceberg/pull/9415#issuecomment-1878214701

   cc: @snazy, @dimas-b , @adutra   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Correct schema behavior [iceberg-python]

2024-01-04 Thread via GitHub


Fokko merged PR #247:
URL: https://github.com/apache/iceberg-python/pull/247


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API, Core: Add Schema#withUpdatedDoc and View#updateColumnDoc APIs [iceberg]

2024-01-04 Thread via GitHub


amogh-jahagirdar commented on PR #9414:
URL: https://github.com/apache/iceberg/pull/9414#issuecomment-1878270693

   Hm, after thinking about this a bit more, I think we can remove the Schema 
API and add a `withUpdatedColumnDoc` API to `ReplaceViewVersion`. This API can 
do change tracking and handle all the schema-changing logic internally. The 
Schema API doesn't really need to be public, since this logic is only needed 
for views. A hedged sketch of that change tracking is below.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org